April 23, 2019

Reproducibility in Computational Analysis

Geraldine Van der Auwera

As Lara Mangravite and John Wilbanks noted in their opening salvo, “open science” is such a multifaceted concept that it defies consensus definition — and so I have found it particularly interesting to hear and read the individual definitions that have come out of the workshop and the first round of this blog series. I find myself celebrating with Titus Brown the progress already made in building communities of practice around portability, preprint culture and the Carpentries, and vigorously agreeing with Cyndi Grossman’s call for greater social justice. I feel enlightened by Brian Nosek’s discussion of preregistration of analysis plans, which I had somehow managed to remain ignorant of until now, and Irene Pasquetto’s meticulous reconnaissance of how much open data is actually reused (or not). Now that it’s time for me to add my stone to the cairn of CAOS (instant metal band name!), I’m going to focus on the facet of open science that keeps me up at night: reproducibility in computational analysis.

In the spirit of clear definitions, let’s distinguish reproducibility from replication. Reproducibility focuses on the analysis process. When we say we’re reproducing an analysis, we’re trying to take the same inputs, put them through the same processing, and hopefully get the same results. This is something we may want to do for training purposes or to build on someone else’s technical work.
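To make that definition concrete, here is a minimal Python sketch. It assumes a toy analysis step: the `analyze` and `fingerprint` functions, the seed, and the input values are all hypothetical and do not come from any real pipeline. It puts the same inputs through the same processing twice and compares checksums of the results.

```python
import hashlib
import json
import random


def analyze(values, seed):
    """Toy stand-in for an analysis step: a seeded bootstrap estimate of the mean."""
    rng = random.Random(seed)  # local seeded generator, independent of global state
    means = []
    for _ in range(100):
        resample = [rng.choice(values) for _ in values]
        means.append(sum(resample) / len(resample))
    return {"bootstrap_mean": round(sum(means) / len(means), 6), "n": len(values)}


def fingerprint(result):
    """Checksum of a result, serialized with sorted keys so the bytes are stable."""
    payload = json.dumps(result, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


inputs = [4.1, 5.0, 3.7, 4.4, 5.2]  # hypothetical measurements
first = fingerprint(analyze(inputs, seed=42))
second = fingerprint(analyze(inputs, seed=42))
print("reproduced" if first == second else "results differ")
```

The point of the sketch is that matching checksums only tell us the plumbing works. They say nothing about whether the estimate reflects how nature behaves, which is the job of replication.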

Replication, on the other hand, is all about confirming (or infirming? sorry Karl) the insights derived from experimental results. To replicate the findings of a study, we typically want to apply orthogonal approaches, preferably to data collected independently, and see whether the results lead us to draw the same conclusions. It basically comes down to plumbing vs. truth: in one case we are trying to verify that “the thing runs as expected” and in the other, “yes this is how this bit of nature works.”

My personal mission is to drive progress on the reproducibility front, because it is critical for enabling the research community to share and reuse tools and methods effectively. Without great reproducibility, we are condemned to waste effort endlessly reimplementing the same wheels. Unfortunately, whether we’re talking about training or building on published work, reproducibility is still an uphill battle. Anyone who has ever tried and failed to get someone else’s hand-rolled, organic, locally sourced Python package to behave nicely knows what I mean: it’s nice to have the code on GitHub, but it’s hardly ever enough.

Certainly, as Brown noted, the bioinformatics community has made great progress toward adopting technological pieces like notebooks, containers, and standards that increase portability. Yet, there is still a sizeable gap between intended portability (“you could run this anywhere”) and actual ease of use for the average computational life scientist (“you can run this where you are”).

Beyond mechanisms for sharing code, we also need a technological environment that supports the easy convergence of data, code and compute. And we need this environment to be open and emphasize interoperability, both for the FAIRness of data and so that researchers never find themselves locked into silos as they seek to collaborate with other groups and move to new institutional homes throughout their research careers.

That is why I am particularly excited by the Data Biosphere* vision laid out by Benedict Paten et al., which introduces an open ecosystem of interoperable platform components that can be bundled into environments with interfaces tailored to specific audiences. Within a given environment, you can create a sandbox containing all code, data, and configurations necessary to reproduce an analysis, so that anyone can go in and reproduce the work as originally performed – or tweak the settings to test whether they can do better! It’s the ultimate methods supplement or interactive technical manual. (* Disclaimer: I work with Anthony Philippakis, a coauthor of the Data Biosphere blog post.)

The final major obstacle to promoting reproducibility is data access. Just looking at genomic data, much of the data generated for human biomedical research is heavily protected, for good reason. There is some open-access data, but it is often neither sufficient nor appropriate for reproducing specific analyses, which means that we often can’t train researchers in key methodologies until after they have been granted access to specific datasets – if they can get access at all. This is a major barrier for students and trainees in general, and has the potential to hold back entire countries from developing the capabilities to participate in important research.

The good news is that we can solve this by generating synthetic datasets. Recently, my team participated in a FAIR Data hackathon hosted by NCBI, running a project to prototype some resources for democratizing access to synthetic data generation as a tool for reproducibility. The level of interest in the project was highly encouraging. I am now looking for opportunities to nucleate a wider community effort to build out a collection of synthetic datasets that researchers can use off the shelf for general purposes, like testing and training, accompanied by user-friendly tooling to spike mutations of interest into the synthetic data for reproducing specific studies. We did something similar with an early version of our tooling in a previous workshop project presented at ASHG 2018.
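For readers unfamiliar with the idea of spiking in a mutation, here is a toy Python sketch. It is not the tooling from the hackathon or the ASHG workshop, which operates on real sequencing data formats. The function names, sequence length, seed, and position are all hypothetical, chosen only to show the principle: generate a synthetic sequence reproducibly, then introduce a known single-base change at a chosen position.

```python
import random

BASES = "ACGT"


def synthetic_sequence(length, seed):
    """Generate a random DNA string; the seed makes it reproducible."""
    rng = random.Random(seed)
    return "".join(rng.choice(BASES) for _ in range(length))


def spike_in_snv(sequence, position, alt):
    """Return (mutated copy, reference base) for a substitution at a 0-based position."""
    if not 0 <= position < len(sequence):
        raise ValueError("position is outside the sequence")
    if alt not in BASES:
        raise ValueError("alt must be one of A, C, G, T")
    ref = sequence[position]
    if ref == alt:
        raise ValueError("alt allele matches the reference base")
    mutated = sequence[:position] + alt + sequence[position + 1:]
    return mutated, ref


reference = synthetic_sequence(60, seed=2019)
position = 25  # hypothetical site of interest
alt = next(b for b in BASES if b != reference[position])
mutated, ref = spike_in_snv(reference, position, alt)
print(f"spiked {ref}>{alt} at position {position}")
```

Because the synthetic sequence carries no personal information and the spiked-in change is known in advance, anyone can check whether an analysis recovers it without needing access to protected data.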

The technical work that would be involved in making this resource a reality would be fairly modest. The most important lift will be corralling community inputs so that we can make sure we are building the right solution (the right data and the right tooling) to address most researchers’ needs, and, of course, determining who is the right custodian to host and maintain these resources for the long term.

About this series: In February 2019, Sage Bionetworks hosted a workshop called Critical Assessment of Open Science (CAOS). This series of blog posts by some of the participants delves into a few of the themes that drove discussions – and debates – during the workshop. Read the series.


Geraldine Van der Auwera

Geraldine Van der Auwera is the Associate Director of Outreach and Communications in the Data Sciences Platform at the Broad Institute. She received her PhD in biological engineering from the Université Catholique de Louvain (UCL) in Louvain-la-Neuve, Belgium.