Bringing Open Science to Neuroinformatics

The amazing people at Sage Bionetworks have kindly asked me to contribute to their series of posts about open science. The first thing that occurred to me was that it’s nice to be recognized as part of the open-science movement. The second thing that occurred was that more people should be part of it.

When I was a grad student and postdoc many years ago, open science wasn’t really something we talked about. We collected our data, wrote our papers, rewrote our papers, celebrated when they were published, and that was pretty much it. In 2010, I started working at INCF (International Neuroinformatics Coordinating Facility), and suddenly, data sharing, interoperability, and standards were all we were talking about. I remember that first discussion with then executive director Sten Grillner, where he outlined all the reasons why people should share their data and all the hurdles we needed to jump in order to optimize the process. All I could think was why on earth are there not central facilities doing this for all the other scientific domains, as well?

In the last decade, several major brain initiatives around the world have been launched and are starting to produce vast amounts of data. In order to integrate these diverse data and address issues of transparency and reproducibility, widely adopted standards and best practices around data sharing will be key to achieving an infrastructure that supports open and reproducible neuroscience. At INCF, we’re working at a fundamental level of open science: how to make data shareable, tools interoperable, and researchers at all career levels trained in data management. One of our main activities is to vet and endorse FAIR (Findable, Accessible, Interoperable, Reusable) standards and best practices for neuroscience data (“neuroinformatics” is in our name, after all) [1]. We also support the development of new standards, as well as the extension of existing standards to support additional data types.

Standards should serve as aspirations and be accessible to this and future generations of scientists as tools for thriving in an open-science environment. Support for an open-science environment in turn facilitates collaborations and idea exchange, which enable mutual growth in striving toward scientific goals. Different kinds of expertise can be brought to bear on difficult problems, leading to new solutions and helping to train a more robust scientific enterprise.

I think we can all agree that there’s zero success in announcing a “standard” and thinking that it will be widely adopted (cue the herding-cats analogy). Open science is about choice and providing the mechanisms to facilitate open collaborations – as Irene Pasquetto pointed out in another post in this series: “collaborations are the holy grail of reuse.” One of Pasquetto’s main observations is that reputation, trust, and pre-existing networks have as much impact on reuse as how well the data are curated. In our experience, this perspective extends to standards and best practices as well. With this in mind, we spent some time in 2017 working out a process for vetting and endorsing standards and best practices where the community itself did the vetting and endorsing [2]. The process was opened for submissions in early 2018, and we currently have three endorsed standards with eight more in the pipeline.

To promote uptake of standards and best practices, and implementation of other neuroinformatics methods, INCF has built TrainingSpace, an online hub which provides informatics educational resources for the global neuroscience community. TrainingSpace offers multimedia content from courses, conference lectures, and laboratory exercises from some of the world’s leading neuroscience institutes and societies. As complements to TrainingSpace, INCF also manages NeuroStars, an online Q&A forum, and KnowledgeSpace, a neuroscience encyclopedia that provides users with access to over a million publicly available datasets and links to literature references and scientific abstracts.

We at INCF hope that our efforts to promote open neuroscience will spark some interest in those of you who are just now hearing about us, and that you’ll join in the fun of developing, vetting, and endorsing FAIR standards and best practices. After all, herding cats is easier if there’s more food to choose from!

This post was written with helpful comments from Randy McIntosh, neuroinformatics expert and brainstormer extraordinaire.

References:

[1] Abrams, Mathew, et al. “A Standards Organization for Open and FAIR Neuroscience: The International Neuroinformatics Coordinating Facility.” OSF Preprints, 17 Jan. 2019. https://doi.org/10.31219/osf.io/3rt9b

[2] Martone, M. “The importance of community feedback in the INCF standards and best practices endorsement process” [version 1; not peer reviewed]. F1000Research 2018, 7:1452 (document). https://doi.org/10.7490/f1000research.1116069.1

About this series: In February 2019, Sage Bionetworks hosted a workshop called Critical Assessment of Open Science (CAOS). This series of blog posts by some of the participants delves into a few of the themes that drove discussions – and debates – during the workshop. Read the series.

On the Ethics of Open Science

Ethics is an essential consideration for any research initiative that collects, uses, or reuses human data or samples. In data activities of these types, ethics concerns – whether raised by institutional review boards (IRBs) tasked with reviewing projects, or in the published literature – seem to focus almost entirely on questions of adequate privacy practices and security protections. Questions are raised about what fields are kept, what permissions have been provided for future access, and whether potential participants have adequately understood relevant parameters.

And yet considerations of ethics go beyond who gives permission for or has access to individuals’ data. Open science promises considerable ethical good: speeding up medical discovery, avoiding unnecessary duplication, creating efficiencies, and encouraging more democratic science. These are unquestionably ethical goods. But leaving discussions of ethics and open science simply to the good that can come and the need for privacy protections is concerningly narrow.

Open-science projects should also actively evaluate how the potential benefit of helping others actually will be realized, and what they can build into their structures to increase the chance that this will occur. Will open science simply allow users to do what they want with data as long as those whose data are included have been informed that the data will be widely shared? Or will the ethics of open-science support work to help ensure that goals, such as benefit and fairness, are more likely to be realized?

This might lead to a few questions that each open-science program might consider:

How open do you want open-data sharing to be?

Are open-science data repositories open by definition for use by everyone regardless of motives or intent? Or is any vetting of the user, the purpose, the commercial interests, or any screening for something nefarious ever part of the equation? Clearly there are risks and benefits from being truly open vs. open with strings attached. But the key is recognizing that there are risks and benefits to each strategy. Ignoring either of these does a disservice to considering the ethics of the enterprise.

What benefits come from open science?

The potential ethical good of open science lies in allowing a stunning increase in the number of discoveries that can be made, and the efficiency with which they can occur relative to traditional science. A critical empirical question is whether anyone is keeping track of the degree to which the hypothesized great potential is really occurring. And whose job is it, not only to set up open data, but to help make sure that the benefits – better diagnoses, better treatments, etc. – of having such data be open are actually realized, and not just that more analyses are conducted? Because open science is designed with fewer guard rails, it is essential to ensure that the potential benefit of working in this manner is actively facilitated and not left to chance. While individual participants may choose to opt in or opt out of various initiatives, thinking intentionally about the ethics of an activity itself is a form of meta-ethics that is central to open science.

Who benefits from open science? Does that end up being fair? And can we put small structures in place to increase the chance that broad commitments to benefit and fairness are realized?

Without deliberate attention, it is unlikely that the benefits of open science will result in better care for those who are most disenfranchised, a narrowing of health inequalities, or a targeting of at least a few conditions that disproportionately affect those at the bottom of the barrel. It’s not that anyone is unsympathetic to those questions or would ever suggest that such questions not be addressed through data available on open science platforms. But without a structure or system that ensures at least some percentage of the work that emerges must focus on questions of this sort, it likely won’t happen in any large, concerted way. In the same way that the national genome project required a small percentage of its federal dollars to go to research questions on the ethical, legal, and social implications of genetics research, should open science ever require that some proportion of open-science analyses focus on questions of social justice? The options are countless and, again, leaving this to chance may help to guarantee that disparities will continue to grow wider rather than narrower.

Are there or should there be community partners?

Is there any relevance for having, or encouraging, community partners for at least some numbers of questions that might be asked of open-science data? Is it worth having some pilot projects to see if doing so changes the nature of the questions, the ways in which data are analyzed, or what happens with analyses once the technical component is completed? And partnerships may also need technical experts — someone who can make some sense out of patterns that emerge from data, to help distinguish the gold from the noise.

Open science is a movement to further discovery, increase collaboration, and, ultimately, to wildly magnify the likelihood and frequency of benefit. Building on that vision with additional commitments to achieving benefit – not only allowing platforms for benefit – and to ensuring that there are always a few ongoing projects that relate to inequity or that work on the needs of those with the worst health outcomes or highest needs, could potentially deepen the vision further. It could also lead to some of those individuals asked to provide their permissions with greater personal interest in doing so.


Reproducibility in Computational Analysis

As Lara Mangravite and John Wilbanks noted in their opening salvo, “open science” is such a multifaceted concept that it defies consensus definition — and so I have found it particularly interesting to hear and read the individual definitions that have come out of the workshop and the first round of this blog series. I find myself celebrating with Titus Brown the progress already made in building communities of practice around portability, preprint culture and the Carpentries, and vigorously agreeing with Cyndi Grossman’s call for greater social justice. I feel enlightened by Brian Nosek’s discussion of preregistration of analysis plans, which I had somehow managed to remain ignorant of until now, and Irene Pasquetto’s meticulous reconnaissance of how much open data is actually reused (or not). Now that it’s time for me to add my stone to the cairn of CAOS (instant metal band name!), I’m going to focus on the facet of open science that keeps me up at night: reproducibility in computational analysis.

In the spirit of clear definitions, let’s distinguish reproducibility from replication. Reproducibility focuses on the analysis process. When we say we’re reproducing an analysis, we’re trying to take the same inputs, put them through the same processing, and hopefully get the same results. This is something we may want to do for training purposes or to build on someone else’s technical work.

Replication, on the other hand, is all about confirming (or infirming? sorry Karl) the insights derived from experimental results. To replicate the findings of a study, we typically want to apply orthogonal approaches, preferably to data collected independently, and see whether the results lead us to draw the same conclusions. It basically comes down to plumbing vs. truth: in one case we are trying to verify that “the thing runs as expected” and in the other, “yes this is how this bit of nature works.”

My personal mission is to drive progress on the reproducibility front, because it is critical for enabling the research community to share and reuse tools and methods effectively. Without great reproducibility, we are condemned to waste effort endlessly reimplementing the same wheels. Unfortunately, whether we’re talking about training or building on published work, reproducibility is still an uphill battle. Anyone who has ever tried and failed to get someone else’s hand-rolled, organic, locally-sourced python package to behave nicely knows what I mean: it’s nice to have the code in GitHub, but it’s hardly ever enough.

Certainly, as Brown noted, the bioinformatics community has made great progress toward adopting technological pieces like notebooks, containers, and standards that increase portability. Yet, there is still a sizeable gap between intended portability (“you could run this anywhere”) and actual ease of use for the average computational life scientist (“you can run this where you are”).

Beyond mechanisms for sharing code, we also need a technological environment that supports the easy convergence of data, code and compute. And we need this environment to be open and emphasize interoperability, both for the FAIRness of data and so that researchers never find themselves locked into silos as they seek to collaborate with other groups and move to new institutional homes throughout their research careers.

That is why I am particularly excited by the Data Biosphere* vision laid out by Benedict Paten et al., which introduces an open ecosystem of interoperable platform components that can be bundled into environments with interfaces tailored to specific audiences. Within a given environment, you can create a sandbox containing all the code, data, and configurations necessary to reproduce an analysis, so that anyone can go in and reproduce the work as originally performed – or tweak the settings to test whether they can do better! It’s the ultimate methods supplement or interactive technical manual. (* Disclaimer: I work with Anthony Philippakis, a coauthor of the Data Biosphere blog post.)

The final major obstacle to promoting reproducibility is data access. Just looking at genomic data, much of what is generated for human biomedical research is heavily protected, for good reason. There is some open-access data, but it is often neither sufficient nor appropriate for reproducing specific analyses, which means that we often can’t train researchers in key methodologies until after they have been granted access to specific datasets – if they can get access at all. This is a major barrier for students and trainees in general, and has the potential to hold back entire countries from developing the capabilities to participate in important research.

The good news is that we can solve this by generating synthetic datasets. Recently, my team participated in a FAIR Data hackathon hosted by NCBI, running a project to prototype some resources for democratizing access to synthetic data generation as a tool for reproducibility. The level of interest in the project was highly encouraging. I am now looking for opportunities to nucleate a wider community effort to build out a collection of synthetic datasets that researchers can use off the shelf for general purposes, like testing and training, accompanied by user-friendly tooling to spike mutations of interest into the synthetic data for reproducing specific studies. We did something similar with an early version of our tooling in a previous workshop project presented at ASHG 2018.
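To make the idea concrete, here is a deliberately simplified, hypothetical sketch of what “spiking” a known variant into a synthetic sequence could look like. This is not our actual tooling; the function names and parameters are invented purely for illustration.

```python
# Toy illustration of the "spike-in" idea: plant a known single-nucleotide
# variant into a synthetic reference sequence so that a pipeline can be
# exercised against data with a ground-truth answer. Names are hypothetical.
import random


def make_synthetic_reference(length: int, seed: int = 42) -> str:
    """Generate a random nucleotide sequence to stand in for real data."""
    rng = random.Random(seed)
    return "".join(rng.choice("ACGT") for _ in range(length))


def spike_in_snv(sequence: str, position: int, alt_base: str) -> str:
    """Return a copy of `sequence` with `alt_base` substituted at `position` (0-based)."""
    if sequence[position] == alt_base:
        raise ValueError("alt base must differ from the reference base")
    return sequence[:position] + alt_base + sequence[position + 1:]


if __name__ == "__main__":
    ref = make_synthetic_reference(60)
    alt = "T" if ref[30] != "T" else "A"
    mutated = spike_in_snv(ref, position=30, alt_base=alt)
    print("ref:", ref)
    print("alt:", mutated)
```

Real tooling would of course operate on standard formats (for example, FASTA references and VCF variant records) rather than bare strings, but the principle is the same: the synthetic data carry a known answer that any reproduced analysis should recover.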

The technical work that would be involved in making this resource a reality would be fairly modest. The most important lift will be corralling community inputs so that we can make sure we are building the right solution (the right data and the right tooling) to address most researchers’ needs, and, of course, determining who is the right custodian to host and maintain these resources for the long term.


From Open Systems to Trusted Systems: New Approaches to Data Commons

At Sage, we’re always reflecting on the work we’ve done, and how it might need to evolve. Conversations with partners and colleagues at events such as CAOS are a key part of this process. Over the last couple of years, a common theme has been the role that actively collaborating communities play in successful large-scale research projects, and the factors that accelerate or inhibit their formation.

Like any research organization, Sage experiments with different ideas and works in different areas over time. This challenges our technology platform team to build general-purpose systems that not only support the diversity of today’s research, but also pave the way for new types of science to be performed in the future.

Synapse, our flagship platform, allows us to develop data commons using a set of open APIs, a web portal hosted in the cloud, and programmatic tools that allow integration of Sage’s services into any analytical environment. Together, these components make it easier for researchers to aggregate, organize, analyze, and share scientific data, code, and insights.

Initially, our model for Synapse was GitHub – the software platform that has been the locus of the open source software movement over the past decade. Our thinking was that if we just made enough scientific knowledge open and accessible, scientists around the world would organize themselves and just start doing better science. In part, we saw our role as unlocking the potential of junior researchers who were digital natives and, perhaps, willing to work in ways less natural to the established PIs in the context of the established research ecosystem. Our assumption was that a pure technology solution would be sufficient to accelerate progress.

The reality wasn’t as straightforward as we thought.

Over the course of eight years, we’ve had a lot of large scientific collaborations operate on Synapse, some quite successfully and others less so. The main determinant of success has proven to be the level of alignment of incentives among the participating scientists, and their degree of trust in each other. Further, consortium-wide objectives must be aligned with individual and lab-level priorities. If these elements exist, the right technology can catalyze a powerful collaboration across institutional boundaries that would be otherwise difficult to execute. But without these elements, progress stalls while the exact same technology sits unused.

In a recent talk on the panel Evolving Challenges And Directions In Data Commons at BioIT World West (slides here), I shared several case studies to illustrate the aspects of the platform that were most successful in enabling high-impact science, and the characteristics that contributed to that success:

Digital Mammography Dream Challenge

In the Digital Mammography Dream Challenge, we hosted close to 10TB of medical images with clinical annotations in the cloud, and organized an open challenge for anyone in the world to submit machine learning models to predict the need for follow-up screening. Due to patient privacy concerns, we couldn’t directly release this data publicly. Instead, we built a system in which data scientists could submit models runnable in Docker containers, executed training and prediction runs in the AWS and IBM clouds, and returned output summaries. This was a huge shift in workflow for the challenge participants, who are more accustomed to downloading data to their own systems than uploading models to operate on data they cannot see.
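For readers unfamiliar with this “model-to-data” pattern, the sketch below shows the general shape of the contract a submitted container might follow: read inputs from a directory mounted by the platform, write predictions to an output directory, and never ship the raw data anywhere. The directory names and the CSV output format here are assumptions for illustration, not the actual challenge specification.

```python
# Hypothetical sketch of a model-to-data submission: the container reads from
# a mounted input directory and writes scores to an output directory, so the
# participant's code runs next to the data without ever exposing it.
import csv
import pathlib

INPUT_DIR = pathlib.Path("/input")    # assumed mount point provided by the platform
OUTPUT_DIR = pathlib.Path("/output")  # assumed location collected after the run


def predict(image_path: pathlib.Path) -> float:
    """Placeholder for a real model; returns a dummy risk score."""
    return 0.5


def main() -> None:
    OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
    with open(OUTPUT_DIR / "predictions.csv", "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["image_id", "score"])
        for image_path in sorted(INPUT_DIR.glob("*.dcm")):  # e.g., DICOM images
            writer.writerow([image_path.stem, predict(image_path)])


if __name__ == "__main__":
    main()
```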

The technical infrastructure, developed under the leadership of my colleagues Bruce Hoff and Thomas Schaffter, is one of the more impressive things we’ve built in the last couple of years. Imposing such a shift in workflow on the data scientists risked being a barrier. That proved not to be the case: the incentive structure and publicity generated by DREAM created enormous interest, and we ended up supporting hundreds of thousands of workflows generated by over a thousand different scientists.

mPower Parkinson’s Study

In the area of digital health, Sage has run mPower, a three-year observational study (led by Lara Mangravite and Larsson Omberg) of Parkinson’s disease conducted in a completely remote manner through a smartphone app. This study produced a more open-ended challenge: how to effectively learn from novel datasets, such as phone accelerometry and gyro data, collected while study participants balanced in place or walked. The study leveraged both Synapse as the ultimate repository for mPower data, as well as Bridge – another Sage technology designed to support real-time data collection from studies run through smartphone apps distributed to a remote study population.

We organized a DREAM challenge to compare analytical approaches. This time, we focused on feature extraction rather than machine learning. Challenge participants were able to directly query, access, and analyze a mobile health dataset collected over six months of observations on tens of thousands of individuals. Again, access to novel data and to a scientifically challenging, clinically relevant problem was the key to catalyzing a collaboration of several hundred scientists.

Colorectal Cancer Genomic Subtyping

Our computational oncology team, led by Justin Guinney, helped to organize a synthesis of genomic data on colon cancer originally compiled by six different clinical research teams. Each of these groups had published analyses breaking the disease into biologically distinct subpopulations, but it was impossible to understand how the respective results related to each other or how to use the analyses to guide clinical work.

Unlike the previous two examples, this was an unsupervised learning problem, and it required a lot of effort to curate these originally distinct datasets into a unified training set of over 4,000 samples. However, the effort paid off when the teams were able to identify consensus subtypes of colon cancer, linking patterns in genomic data to distinct biological mechanisms of tumor growth. The project initially operated with only the participation of the teams that conducted the original clinical studies – and it was only within the confines of this private group that researchers were really willing to talk openly about issues with their data. It also helped that each group contributed part of the combined dataset, so everyone felt that all the groups were adding something to the effort. With the publication of the initial consensus classification system, the data and methods have been opened up and have seeded further work by a broader set of researchers relating the subtypes to distinct clinical outcomes.

Towards Community-Specific Data Commons

What do these three examples have in common? From a scientific standpoint, not much. The data types, analytical approaches, and scientific contexts are all completely different. In retrospect, it’s perhaps obvious that there’s not much chance of data, code, or other low level assets being used across these projects. The fact that all three projects were supported on the same underlying platform is evidence that we’ve developed some generally-useful services. But, our monolithic GitHub-style front end has not been an effective catalyst for cross-project fertilization.

What has been a common indicator of success is effective scientific leadership that gives structure and support to the hands-on work of more junior team members. This is even more important when these projects are carried out by highly distributed teams that haven’t previously worked together. Developing this sense of trust and building a functional community is often easier to do in smaller, controlled groups, rather than in a completely open system that, at best, is saturated with irrelevant noise, and, at worst, can be hijacked by actors with bad intentions. Starting small and increasing the “circle of trust” over time is an effective strategy.

It’s becoming clearer that the same factors matter even in software development. Despite what you might think from some of the open-source rhetoric, most of the really large-scale, impactful open-source projects benefit from strong leadership that gives a sense of purpose and organization to a distributed group of developers. And, even GitHub itself is now a part of Microsoft – who would have bet money on that outcome 10 years ago?

In the past year, the Synapse team has been piloting the development of new web interfaces to our services that repackage these capabilities into more focused, community-specific interfaces. With the recent launches of the AMP-AD Data Portal and the NF Data Portal, the first couple of these experiments are now public. I’m excited to see how our platform continues to evolve as we enter Sage’s second decade, and even more excited to see what new science emerges on top of it.


How Data Commons Can Support Open Science

In discussions about open science, we often refer to the need for data commons. What are data commons, and why might a community develop one? I offer a brief introduction and describe how data commons can support open science.

Data commons are used by projects and communities to create open resources that accelerate the rate of discovery and increase the impact of the data they host. Notice what data commons aren’t: they are not a place for an individual researcher working on an isolated project to dump data, ignoring FAIR principles, simply to satisfy data management and data sharing requirements.

More formally, data commons are software platforms that co-locate: 1) data, 2) cloud-based computing infrastructure, and 3) commonly used software applications, tools and services to create a resource for managing, analyzing and sharing data with a community.

The key ways that data commons support open science include:

  1. Data commons make data available so that they are open and can be easily accessed and analyzed.
  2. Unlike a data lake, data commons curate data using one or more common data models and harmonize them by processing them with a common set of pipelines so that different datasets can be more easily integrated and analyzed together. In this sense, data commons reduce the cost and effort required for the meaningful analysis of research data.
  3. Data commons save time for researchers by integrating and supporting commonly used software tools, applications, and services. Data commons use different strategies for this: the commons itself can include workspaces that support data analysis; other cloud-based resources, such as the NCI Cloud Resources that support the GDC, can be used to support the analysis; or data analysis can be done via third-party applications, such as Jupyter notebooks, that access data through APIs exposed by the data commons.
  4. Data commons also save money and resources for a research community, since each research group in the community doesn’t have to create its own computing environment and host the same data. Since operating a data commons can be expensive, a model that is becoming popular is not to charge for accessing data in a commons, but either to provide cloud-based credits or allotments to those interested in analyzing the data, or to pass the charges for data analysis on to the users.

A good example of how data commons can support open science is the Genomic Data Commons (GDC) that was launched in 2016 by the National Cancer Institute (NCI). The GDC has over 2.7 PB of harmonized genomic and associated clinical data and is used by over 100,000 researchers each year. In an average month, 1-2 PB or more of data are downloaded or accessed from it.

The GDC supports an open data ecosystem that includes large scale cloud-based workspaces, as well as Jupyter notebooks, RStudio notebooks, and more specialized applications that access GDC data via the GDC API. The GDC saves the research community time and effort since research groups have access to harmonized data that have been curated with respect to a common data model and run with a set of common bioinformatics pipelines. By using a centralized cloud-based infrastructure, the GDC also reduces the total cost for the cancer researchers to work with large genomics data since each research group does not need to set up and operate their own large-scale computing infrastructure.
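As a concrete illustration of that last point, here is a minimal sketch of how a researcher might pull project metadata from the GDC’s public API from a notebook or script. The fields requested are examples, and the exact field names and response layout should be checked against the current GDC API documentation.

```python
# Minimal sketch of programmatic access to a data commons: list a few projects
# from the public GDC API. Requires the third-party `requests` package.
import requests

GDC_PROJECTS_ENDPOINT = "https://api.gdc.cancer.gov/projects"

params = {
    "fields": "project_id,name,primary_site",  # example fields
    "size": "5",
    "format": "json",
}

response = requests.get(GDC_PROJECTS_ENDPOINT, params=params, timeout=30)
response.raise_for_status()

for hit in response.json()["data"]["hits"]:
    print(hit["project_id"], "-", hit.get("name", ""))
```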

Based upon this success, a number of other communities are building their own data commons or considering it.

For more information about data commons and data ecosystems that can be built around them, see:

  • Robert L. Grossman, Data Lakes, Clouds and Commons: A Review of Platforms for Analyzing and Sharing Genomic Data, Trends in Genetics 35 (2019) pp. 223-234, doi.org/10.1016/j.tig.2018.12.006. Also see: arXiv:1809.01699
  • Robert L. Grossman, Progress Towards Cancer Data Ecosystems, The Cancer Journal: The Journal of Principles and Practice of Oncology, May/June 2018, Volume 24 Number 3, pages 122-126. doi: 10.1097/PPO.0000000000000318


Voices from the Open Science Movement

Open science is an umbrella term used by many people to represent a diverse set of research methods designed to increase the speed, value, and reproducibility of scientific output. In general, these approaches work to achieve their goals through increased sharing of research assets or transparency in research methods. Our own work in this field promotes the effective integration of computational analysis into the life sciences. The challenge: While the advancements in technology now support the generation and analysis of large-scale biological data from human samples in a way that can meaningfully expand our understanding of human biology, the established processes for implementation and independent evaluation of research outcomes are not particularly well suited to these emerging forms of scientific inquiry.

The scientific method defines a scientist as one who seeks new knowledge by developing and testing hypotheses. In this frame, the scientist is a neutral observer who is equally satisfied when a hypothesis is either proven or disproven. However, as with any application of theory, the practical implementation of the scientific method is impacted by the conditions in which it is applied.

A Different Era

The U.S. scientific system has a well-established set of processes that were developed in post-war America with the intention of advancing the kinds of science of that era. This system promotes the pursuit of scientific inquiry within the context of research universities, using funding from the government and distributing knowledge across the research community through papers published in journals and patents acquired by technology transfer offices. While this system can be highly effective, it also incentivizes scientists in a manner that directly impacts the outputs of their work.

The current scientific system rewards our scientists for new discoveries. This is the criterion that is used to gate their ability to obtain funding, their ability to advance their own careers and those of their colleagues, and, in some cases, their ability to remain employed. For this reason, we sometimes skew our experiments towards those that prove rather than disprove the hypothesis. We fall into self-assessment bias, in which we tend to overvalue the impact and validity of our own outputs.

Now, all is not lost: we have a well-established system of peer review that uses independent evaluation to assess the appropriateness of research conclusions. To this aim, we as a community are meant to evaluate the evidence presented, determine the validity of an experiment, and understand how that experiment may support the general hypothesis. The task of turning an individual observation into general knowledge may be led by an individual scientific team, but it is the responsibility of the entire field.

Growing Pains

This system is noble and often quite effective. It’s also been strained by the scale and complexity of the research that is currently being pursued – including the integration of computational sciences into biology. The system is so good at encouraging publication in peer-reviewed journals that more than 400,000 papers on biology were published in 2017. This causes a host of problems.

First, it’s a strain for anyone in the scientific community to balance the time it takes to perform this important task with many other demands. Second, the complexity of our modern experiments is not easily conveyed through the traditional means of scholarly communication, making it difficult for independent scientists to meaningfully evaluate each experiment. Third, the full set of evidence needed to evaluate a general hypothesis is usually spread across a series of communications, making it difficult to perform independent evaluation at the level of that broader hypothesis.

This last point can be particularly problematic, as conflicting evidence can arise across papers in a manner that is difficult to resolve through comparative evaluation. These issues have exploded into a much-publicized replication crisis, making it hard to translate science into medicine.

Open Methods

So what does this all have to do with open science? The acknowledgement of these imperfections in our current system has led to a desire – across many fronts – for an adapted system that can better solve these problems. Open science contains many elements of such a new scientific system. For computational research in the life sciences, it works on the cloud, where we can document our experimental choices with greater granularity. It provides evidence of the scientific process that helps us decide which papers out of those 400,000 to trust – the ones where we can see the work, and the ones where machines can help us read them.

In our own work, we have seen how the use of open methods can strengthen the justification of research claims. Working inside a series of scientific programs, we have been able to extract general principles and solutions to support this goal. These are our interventions – ways to support scientists in making real progress towards well-justified research outcomes.

These approaches have been encouraged by the scientists, funders, and policy makers involved in these programs, who are seeking ways to increase the translational impact of their work. We have seen cases across the field where these approaches have allowed exactly that. But these are sometimes at odds with the broader system, causing conflict and reducing their adoption. It may be time to contemplate a more complete, systemic redesign of the life sciences that supports our scientists in their quest for knowledge and that has the potential to directly improve our ability to promote human health.

Recognizing the Successes of Open Science

By Titus Brown

In my view, open science fundamentally depends on tools, infrastructure and practice for making the research process more open, transparent, and reproducible. Any progress on the open tooling and open practices front (almost) invariably redounds to the larger benefit of open science, and thus science more generally.

So at the recent Critical Assessment of Open Science (CAOS) meeting in New Orleans, I found myself a bit frustrated by the overall mood of doom and gloom. Sure, open science thinking has thus far failed to magically transform the scientific enterprise into a wonderland of openness and collaboration; the negatives of openness are becoming clearer as we explore them; and existing closed systems are surprisingly robust and adaptable in practice. But I think there’s lots of good news, too.

The good news is that, in the last 10 years, we have seen tremendous adoption of openness in scientific communities. For example, the widespread adoption of Jupyter and R notebook technologies means that data analysis workflows are being made explicit in a way that many can understand, share, and remix. Moreover, these open technologies are being incorporated into essentially every data science stack everywhere. Preprints in biology have taken off and there’s no going back. The majority of tools for bioinformatics are now open source. Sites like GitHub, Zenodo, Figshare, and the Open Science Framework, make it trivial to share content, mint DOIs, and openly integrate digital artifacts into the literature. The rise of cloud means that, increasingly, workflows are portable between groups. FAIR data principles have taken off. And the Carpentries training community has spread like wildfire and teaches, as one of its underlying philosophies, more effective sharing through all of the above mechanisms.

But, we don’t really stop and celebrate these wins in the open science community, because we’re relentlessly focused on the next steps. There’s plenty more to be done, and many disappointments and challenges, even with the successful approaches. The relentless academic focus on the unsolved problems prevents us from properly celebrating the amazing achievements that we’ve already got in the bag.

So, stop and smell the roses! Sit back and appreciate our wins, over a beverage of your choice, in a comfortable community space. And start every presentation and workshop with an optimistic statement about what has already worked. I’m not sure how else to best celebrate, but please consider this a call for suggestions.

What’s next?

With full awareness of the irony, I would like to now ask: what’s next? For me, one of the main challenges moving forward is how to more effectively spread the practices above. Scientific practice tends to shift slowly, for good and bad reasons. Can we accelerate adoption of open practices that demonstrably work?

To a large extent, I think adoption of more open practices is just going to happen: data science is an increasingly large, intrinsic part of science, and notebooks make too much sense to ignore. Preprints and open source are, likewise, deeply embedded in some fields and we just need to wait for the obstacles to retire. Sharing mechanisms aren’t going away. Cloud isn’t going away. FAIR is seeing adoption by funding agencies. And the training done by the Carpentries (and friends) seems increasingly likely to become embedded in undergraduate training, because it’s how data science is done.

But there are a lot of methodologies and practices that take a bit of work. For example, at a recent SIAM CSE minisymposium, many of the talks focused on the better ways we already have of working on and with software: we have good techniques for building and supporting software via community engagement, successful business models for long-term research software support, peer code review techniques that work, robust software citation mechanisms, and good continuous integration systems, with improvements on the way.

The main remaining challenge (in my view) is that of adoption: The future is already here – it’s just not evenly distributed. And distributing skills more evenly is hard, as is adapting them to the on-the-ground needs of each scientific community. In my experience, the most effective way of doing this is by fostering the organic development of communities of practice that adopt and solidify good practice, ultimately making this practice normative within their enclosing scientific communities.

So, what are my main takeaways? I’ll stick with three:

  1. Open has been really successful in ways that, 10 years ago, we would have found hard to believe. Celebrate!
  2. The leading edge of “open” has identified lots of good and effective practice. We should figure out how to spread and solidify this practice broadly, and not just work on the next exciting unsolved problem.
  3. It’s all about communities of practice, maaaan! Invest now! And let’s talk about how to make them more inclusive and welcoming!

Comments welcome,

–titus

Thanks to the CAOS organizers for running a great meeting, to the minisymposium speakers for their great talks, and especially to Daina Bouqin for the enthusiastic discussion about making good software citation behavior normative.


Dr. C. Titus Brown is an Associate Professor at the University of California, Davis. He runs the Data Intensive Biology Lab at UC Davis, where his team tackles questions surrounding biological data analysis, data integration, and data sharing.


Open Science is About the Outcomes, Content, and Process of Research

By Brian Nosek

Much of the open science movement focuses on improving transparency and accessibility of the contents and outputs of research — papers, data, code, and research materials. The benefits to advancing science of increasing the openness of such resources are easy to comprehend.

For example, if the literature were more accessible, then science would be more inclusive of the people who otherwise would not be able to read the literature. They could be better informed about potential applications in scholarly work, commercial development, and translation into policy or practice. Likewise, having open data and materials makes it easier to demonstrate that reanalysis of data reproduces the reported findings, and to reuse data and materials for novel research applications. However, the narrow focus of open science on contents and outputs of research misses one of the most critical rationales for open science — improving the credibility of published findings.

In a recent paper from my lab, we reported evidence that reducing biases in decision-making may require explicitly pointing out the potential source of bias (Axt, Casola, & Nosek, 2019 [open version]). You can visit the Open Science Framework (OSF) and obtain the original code and data for the experiments and verify the outcomes we reported in the paper. That’s good. But, is that sufficient for the claims to be credible? No. Just being able to reanalyze the data does not give you the insight that you need to effectively evaluate the credibility of the research.

For example, what if we had run 50 experiments on the problem and reported only the five that “worked”? And what if we had analyzed the data in a variety of ways and reported only the outcomes that generated the most interesting findings? Because of our outstanding ability to rationalize behavior when it is in our self-interest, we may not even recognize when we are dismissing a negative result as a flawed experimental design rather than as counterevidence to our claims. Open data and code don’t help you (or me) identify these potential reporting biases.

The solution is open science — but open science about the process of conducting the research.  For you to effectively evaluate the credibility of my findings, you need to be able to see the lifecycle of how those findings came to be. If I can show you that the analyses we reported were planned in advance rather than post-hoc, you will be more confident in interpreting the statistical inferences. If I can show you all the studies that were conducted as part of the research, regardless of whether they made it into the final paper, you can assess any reporting biases that may have occurred. Opening the lifecycle of research facilitates the self-corrective processes that we lionize as critical for scientific progress.

The primary mechanisms for improving research credibility by opening the research process are registration of studies and preregistration of analysis plans. Registration of studies ensures that all studies conducted are discoverable, regardless of publication status. Preregistration of analysis plans clarifies the distinction between confirmatory (hypothesis testing) and exploratory (hypothesis generating) outcomes. Confusing the two is one of the key contributors to irreproducibility and low credibility of published findings.

Registration is well known, and still improving, in clinical trials research. But it is still rare in pre-clinical and basic research across disciplines. That is changing. For example, since OSF’s launch in 2012, the number of registrations has approximately doubled each year. Now there are in excess of 20,000 — almost all of them for basic research applications. That is still a drop in the bucket against the yearly volume of research. But the promising trajectory, coupled with a reform-minded community pushing for improving transparency and credibility of research, is a positive indicator of continuing improvement in making accessible not just the outcomes and content of research, but the process of discovery too.


About: Brian Nosek is co-Founder and Executive Director of the Center for Open Science (http://cos.io/) that operates the Open Science Framework (http://osf.io/). COS is enabling open and reproducible research practices worldwide. Brian is also a Professor in the Department of Psychology at the University of Virginia. He received his Ph.D. from Yale University in 2002.


Do Scientists Reuse Open Data?

Yes, but not how you think. And it’s all right.

 

By Irene Pasquetto

Open science can mean many different things for different people. The way I see it, open-science practices and policies have two fundamental, intertwined goals: to increase science credibility and efficiency. To reach these goals, open-science advocates and practitioners promote the release of science products and means of production – such as manuscripts, data, code, processes, and tools – under open access.

As Titus Brown pointed out in another post of this series, the open-science community – at least in the biomedical sciences – has made extraordinary advances in the last few years. These include the widespread adoption of notebook technologies, the release of preprints, and the diffusion of open-source tools for bioinformatics, to mention just a few.

Similarly, today several groups promote and adopt norms and practices for data sharing and curation in biomedical research. As we know, depositing curated research data in open repositories contributes to making data analyses more reproducible, and, as a result, science more trustworthy. Without data, there is no evidence; without evidence, there is no science; without science, there are no facts. There is no doubt that ensuring reproducibility is itself a sufficient argument for data sharing.

There is one aspect of data-sharing practices that remains widely misunderstood, which is how scientists reuse open data to produce novel knowledge (not to validate others’ analyses). This is the thing: expectations for how much, how, and by whom biomedical data will be re-purposed once released under open access are often misplaced, to say the least.

Over the last four years, I spent most of my working hours interviewing scientists about their data reuse practices. I visited more than ten labs geographically distributed in seven cities, and I asked scientists to show me how, when, and why they reuse research data hosted on open databases and repositories. I talked with both senior and junior faculty members, graduate students and technicians, working in many subspecialties. They included human geneticists, computational biologists, developmental and evolutionary biologists, surgeons, and even clinicians. What a fun job I have, I know.

This is what I found

Let me be clear: scientists are certainly reusing open data to produce novel knowledge. Practices of data reuse, however, vary between groups, workflows, and types of data. Overall, science data are reused in many ways and at different speeds and rates.

What do scientists reuse open data for?

  • Scientists commonly reuse open data for control, comparison, calibration, or (more rarely) to conduct meta-analyses and train or test algorithms.
  • Setting aside a few notable exceptions, scientists rarely reuse open datasets to ask novel research questions (i.e., for knowledge discovery). At least in biomedicine, researchers seem to still prefer to use data that they personally collect to conduct novel analyses.

Which kinds of open data are being reused?

  • Typically, among all the datasets hosted in an open repository, a few selected datasets become very popular over time. Researchers tend to consult these well-known datasets almost daily. However, the majority of the datasets tend to be reused only occasionally. Data reuse practices seem to mirror citation patterns for scientific publications.
  • Data curation is necessary but not sufficient for reuse. Releasing curated, high-quality data does not necessarily enable reuse. Scientists reuse data that they find useful and instrumental to their own research agenda and workflows.
  • Data generated by a single lab in a peripheral field are the hardest to reuse: the epistemic costs of learning about the data and the science behind it are often too high – specialized knowledge takes time to be internalized and cannot be easily formalized in metadata and ontologies.
  • In contrast, the most widely reused datasets are those intentionally generated for a specific use, with a specific research audience in mind (e.g., TCGA cancer data).

How do scientists evaluate open data for reuse?

  • As researchers, we easily trust open data that have been reused before, over and over again, by our colleagues. Newly released open data with no record of reuse will need time to conquer scientists’ hearts. The adoption curve could take months or even years.
  • Plus, researchers simply tend to reuse open data collected by people and institutions that they like. Reputation, trust, and pre-existing networks impact reuse as much as curation practices do.

What is truly needed to enable reuse of open data?

  • Collaborations are the holy grail of reuse. For all the reasons mentioned above, the most successful cases of reuse originated from multi-lab collaborations that involved both data creators and new users. Often these collaborations resulted in co-authorship of one or multiple articles.

What to expect from data reuse practices

Briefly, these are my recommendations for those of you who are engaging in open data/data sharing efforts for the purpose of reuse:

  • First, no matter what data you collect, keep in mind that reuse is only one reason for data sharing. Data should be released for transparency as much as for reuse.
  • Give up on the idea that all the data you are collecting, curating, and releasing will be widely reused. Some will, some will not, and some will but in unexpected ways.
  • If you truly want to maximize reuse, first assess potential for reuse, then start data collection. Open datasets can be reused in many ways, by different sets of users. What can your data be reused for and by whom?
  • Hire or consult with data curators who understand the curation needs of your potential users, and (equally important) their science workflows, agendas, and interests.
  • Do not try to curate the data “for the entire world.” First, focus on the needs of your immediate users.
  • Facilitate the formation of a community of practice around your data. Once you have identified potential users, bring them together by promoting community norms, encouraging collaboration, and adopting ad hoc curation practices. But, remember that communities of practice are not built out of the blue. Potential users should share a pre-existing interest in a kind of data, or in a specific method, sub-discipline, process, etc.
  • Once you have identified which datasets might be reused for which goals, you can assign different levels of curation and access, accordingly.
  • Encourage collaboration (and co-authorship) between data creators, data curators, and data re-users.

To sum up, open data supports the credibility and efficiency of science by promoting transparent research practices and, potentially, wide reuse of data. However, science drives reuse of open data, not the other way around. Open data can be shared, curated, and reused in many ways. In order to maximize reuse, the “trick of the trade” is to analyze and define the “science needs” of your potential re-users.

It is not a given what kinds of data people might want or need. You can only estimate the potential for data reuse, and, to do that, you need a thorough analysis of not only where the demand for data is today, but also of where it is heading.


About: Irene Pasquetto is a postdoctoral fellow at the Shorenstein Center on Media, Politics, and Public Policy, at the Harvard Kennedy School.


Open Science: To What End?

By Cyndi Grossman

I have a hard time defining open science. This was confirmed for me during a Critical Assessment of Open Science (CAOS) meeting hosted by Sage Bionetworks where a definition was proposed and elicited strong, immediate pushback from attendees. It even sparked disagreement about what constitutes “science.”

At least I am in good company.

That said, when it comes to how science is disseminated to and consumed by the public, I can easily define what I would like to see fixed:

  • Publication paywalls that limit access to cutting-edge findings
  • Scientific meetings that lack diversity among speakers
  • Abstracts that fail to describe findings in language accessible to non-scientists or non-topic area experts
  • Inefficiencies in the process of discovering and addressing false scientific claims or misconduct

These issues not only reflect an unhealthy exclusivity that pervades science, they also contribute to public mistrust in science.

As a career-long advocate for engaging communities and individuals in research, I have seen research designed by youth to support the mental health needs of their peers, sex workers advocate for HIV vaccine research, and parents with children who are living with a rare disease discover their child’s genetic mutation. Public engagement in science, especially life sciences, is essential to its impact on society. Conducting science more openly is an important component of fostering greater engagement, but the focus of open science is too much on engaging other scientists and not enough on engaging the larger community of non-scientists.

During the CAOS meeting, we discussed the difference between “bolted-on” and “built-in” solutions. The current approaches to address publication paywalls, structure datasets for reuse, and share code and algorithms are important yet bolted-on solutions. They address each element in isolation rather than redesigning how science is incentivized, conducted, and disseminated so that the perspectives and needs of non-scientists are built in wherever science intersects with society.

Open science could offer a new way of conducting science in the 21st Century where incentives are restructured toward greater openness and sharing, collaboration and purposeful competition, and structural support for dissemination of scientific tools and results. But this new system must be designed by bringing scientists and non-scientists together if the barriers between science and society are to be broken down.

I don’t know if this is putting too much onto open science, but there are some organizations, like 500 Women Scientists, that support open science, diversity, and social justice, and connect scientists to society through education and volunteerism. We know that the scientific enterprise reflects elements of societal inequity, yet there are relatively few efforts aimed at self-reflection and self-correction. My hope is that open science, however we define it, can be an example of a more inclusive way to conduct science.


About: Cynthia (Cyndi) Grossman is a social and behavioral scientist by training. Most recently, she was director at FasterCures, a center of the Milken Institute where she led efforts to integrate patients’ perspectives in biomedical research and medical product development. She has spent her career supporting research to address unmet needs such as mental health, stigma, and other social determinants of health. She is currently obsessed with the potential of health data to advance research and well-being by connecting individuals, communities and systems.
