The Sage Perspective on Data Management and Sharing

Response to Request for Public Comments on a DRAFT NIH Policy for Data Management and Sharing and Supplemental DRAFT Guidance

// Submitted by Lara Mangravite and John Wilbanks on Behalf of Sage Bionetworks

Editor’s Note: This response to the draft policy was submitted by pasting text into the fields of a web form limited to 8000 characters per field, hence the format. For reference: the AMIA response in 2018 mentioned in the Sage response.

Section I: Purpose (limit: 8000 characters)


Recommendation I.a: Make ‘Timely’ More Specific

The policy states that “shared data should be made accessible in a timely manner…”. “Timely” should be defined so that researchers understand the baselines expected of them and have a boundary beyond which they must share data. These baselines and boundaries should be reflected in the templated DMSPs we recommend elsewhere. More details are provided in our recommendations in the Requirements section, and we further recommend in Section VI that such DMSPs be scored elements of applications.

Recommendation I.b: Elevate the Importance of Data Management

We applaud the mention of data management in both the purpose section and even in the title, but it is not given adequate attention in this section. The Purpose text does not address the lessons learned from data sharing within NIH-funded collaborations. From the Cancer Genome Atlas to Clinical Translational Science Awards to the Accelerating Medicines Partnerships, NIH has committed billions annually to such programs. Data sharing sits at the heart of these collaborative networks, but our experience indicates that simply “sharing” data sets is not sufficient to meet the stated purpose. Data management is rarely elevated to a role commensurate with its importance in data reuse. As such, we recommend adding the following text to the end of the first paragraph to delineate the importance of management to achieving the purpose of the policy:

“Data management is an ongoing process that starts well before and goes on well after the deposit of a file under FAIR principles, and NIH encourages practices that have been demonstrated to succeed at promoting data sharing and reuse in previous awards.”

Section II: Definitions (limit: 8000 characters)


Recommendation II.a: Amend the Definition of Data Management and Sharing Plan

The definition of the Data Management and Sharing Plan does not sufficiently capture how DMS is integral to the research process. This Policy should make it clear that data sharing is not an add-on or checkbox, but an ongoing management process that is integrated into the scientific research process. We recommend adding the following text to the definition:

“The plan should describe clearly how scientific data will be managed across the entirety of a research grant and provide specific descriptions of how and when resulting data will be shared, including descriptions of the NIH-approved repositories in which they will be deposited (or, if depositing outside this group, how the proposed repository will be sufficient to meet the requirements).”

Recommendation II.b: Replace the Definition of Data Management

The definition of Data Management does not sufficiently reflect the true extent to which data management must permeate the research process, nor why it is important. Data management is a massive undertaking that improves the quality of shared data. We endorse the 2018 AMIA definition of data management and recommend that the NIH adopt it, replacing the current definition text with the following:

“The upstream management of scientific data that documents actions taken in making research observations, collecting research data, describing data (including relationships between datasets), processing data into intermediate forms as necessary for analysis, integrating distinct datasets, and creating metadata descriptions. Specifically, those actions that would likely have impact on the quality of data analyzed, published, or shared.”

Recommendation II.c: Add a Definition for Scientific Software Artifacts

The stated purpose of this policy is “to promote effective and efficient data management and data sharing.” Per our recommended additions to the Scope section, below, the policy should make clear that what must be managed and shared are not only the “scientific data” and “metadata” created in the course of research, but also the scientific software artifacts created, such as the code underlying the algorithms and models that process data. Accordingly, we echo AMIA’s call for definitions of “scientific software artifacts” and recommend NIH include in this policy the following definition:

“Scientific software artifacts: the code, analytic programs, and other digital, data-related knowledge artifacts created in the conduct of research. These can include quantitative models for prediction or simulation, coded functions written within off-the-shelf software packages such as Matlab, or annotations concerning data or algorithm use as documented in ‘readme’ files.”

Recommendation II.d: Add a Definition for “Covered Period”

Making data available for others to use can pose a significant burden, per the supplemental guidance on Allowable Costs. Investigators will need clear definitions of exactly what will be required of them for data hosting in the short, medium, and long term. As such, we recommend that NIH include a definition in this section for “covered period,” providing as much detail as possible on the expectations for the length of time that investigators must make their data available, including differences in requirements for research awards and data sets (including scientific software artifacts) of different scales.


Section III: Scope (limit: 8000 characters)


Recommendation III.a: Include Scientific Software Artifacts as an Asset to be Managed/Shared

The first sentence in this Policy notes “NIH’s longstanding commitment to making the results and outputs of the research that it funds and conducts available to the public.” Scientific software artifacts (as defined in the response to the Definitions section, above) are outputs as much as data, equally determinative of research findings. Thus, managing and sharing the means of manipulating data from one form to another, transforming raw inputs into valuable outputs, is also important to the end goal of rigorous, reproducible, and reusable science.

Furthermore, it is possible to technically share data while withholding key artifacts necessary to make those data valuable for reuse. These key artifacts could then be exchanged for authorship, position on proposals, or other scientific currencies, thus circumventing a major desired outcome of this policy: removing the unfair advantages of already funded investigators. As such, we recommend that the Scope section include the following statement:

“NIH funded research produces new scientific data and metadata, as well as new scientific software artifacts (e.g. the code of algorithms and models used to manipulate data). Software artifacts are outputs of research as much as data, and it is just as important to manage and share them in the interest of rigor, reproducibility, and re-use. NIH’s commitment to responsible sharing of data extends to scientific software artifacts. As such, throughout this policy, the use of the term “data” should be understood to include scientific software artifacts, per the definition established in Section II.”


Section IV: Effective Date(s) (limit: 8000 characters)

No recommendations planned

Section V: Requirements (limit: 8000 characters)


Recommendation V.a: Tier the Sharing Date Requirement

This policy will require cultural and practice changes for most funded researchers, as well as a nimble reaction to the realities of implementations by NIH. Failing to anticipate the implications of those changes could cause a severe backlash to the policy, undermining its purpose. As such, investigators of those projects least able to redistribute resources necessary to abide by this policy should be given more time to do so. We recommend that NIH adopt AMIA’s 2018 tiered proposal for establishing sharing date requirements based on the size of funding. Projects funded over $500,000 per year would have to comply within one year of approval of the DMSP, those between $250,000-$500,000 within two years, and those below $250,000 within three years.

Recommendation V.b: Create DMSP Templates

We do not expect most researchers to know how to structure a Data Management and Sharing Plan. Furthermore, grants fall into different categories: the funding mechanisms behind the Cancer Genome Atlas and the All of Us Research Program are different from early career researcher grants and most R01s. We therefore recommend the ICs create templates for at least four categories of funding: grants intended to create reference resources for the scientific community, grants that create collaborative networks of multiple laboratories, grants that form “traditional” research but integrate at least two institutions, and grants that only flow to a single institution.

These templates will facilitate understanding of the DMSP obligations by researchers (a form of learning by bootstrapping), as well as facilitate review by standardizing the essential elements and layout of the DMSPs across submissions. Researchers who do not use the standard template would not be penalized, but any DMSP they submit should clearly mark how and where their essential elements map onto the templates provided by NIH. Segmenting these templates by class of resources expected to be shared will make it easier for researchers to understand expectations (and can be tied to kinds of funding mechanism, e.g. U24) and will also make like-to-like evaluation easier for the NIH over time.


Section VI: Data Management and Sharing Plans (limit: 8000 characters)


Recommendation VI.a: Make Data Sharing a Requirement

This section states, “NIH encourages shared scientific data to be made available as long as it is deemed useful to the research community or the public.” However, the future utility of data is often unknown at the time it would be required for deposit, and it is unclear who would be responsible for deeming data as useful. We recommend that NIH require, not encourage, data to be shared. The NIH should also provide both alternate “sharing” mechanisms and opt-out processes for the situations when data sharing is either impossible or inadvisable (e.g. when sharing data would compromise participant privacy or harm a vulnerable group).

Alternate mechanisms could include a private cloud where users “visit” the data and are surveilled in their uses or “model-to-data” approaches where a data steward runs models on behalf of the community. Opt-outs should be rare but achievable, and patterns of opt-out usage should be tracked at the researcher and institution level to assist in evaluation of their use and impact.

Recommendation VI.b: Distinguish Between Purposes of Sharing

The requirements for data sharing should be different for data whose value to the community is realized in different ways. There is a difference between data that are generated with the explicit intent of creating a shared resource for the research community (e.g. TCGA), and data that are generated within the context of an investigator-initiated research project and are to be shared to promote transparency, rigor, and to support emergent long-term reuse. In the former case, a description of a detailed curation, integration, synthesis, and knowledge artifact plan should be present. In the latter case, a description of file format, simple annotation, and long-term storage should be front and center. We recommend that this section explicitly distinguish between these two purposes of sharing, and that different formats be used for developing and assessing DMSPs with respect to these different purposes.

Recommendation VI.c: Require the DMSP as a Scorable Part of the Application

In this policy, the DMSP will be submitted on a Just-in-Time basis. This signals that the plan is not a valued part of the application and is, in fact, an afterthought. NIH should factor the quality of the DMSP in its funding decision process. We recommend that the DMSP be required as a scorable part of the application so that appropriate sharing costs can be budgeted for at the time of application, and the plan can be included as part of the review process.

Recommendation VI.d: Make DMSPs Publicly Available

This section states that “NIH may make Plans publicly available.” We believe that NIH should ensure transparency with the public that has funded the work, and take advantage of transparency as a means for encouraging compliance. As such, we recommend that this section state that “NIH will make Plans publicly available.”


Section VII: Compliance and Enforcement (limit: 8000 characters)


Recommendation VII.a: Give Investigators Time to Share

Judging an application based on performance on past DMSPs is only fair if the investigators have had sufficient time to implement those plans. Per Recommendation V.a (above) to tier the sharing date requirement, we recommend that application reviewers begin using evidence of past data and software artifact sharing starting between one and three years after the adoption of the DMSP, depending on the size of the prior award. Those with a prior award of over $500,000 per year could be judged one year after approval of the DMSP, those with a prior award between $250,000 and $500,000 after two years, and those with a prior award below $250,000 after three years.

Recommendation VII.b: Use Existing Annual Review Forms for Proof of Compliance

Compliance with this policy should be integrated with current annual review processes for funded research projects. Proof of compliance should not require more than a single line in existing documentation; otherwise, proof of compliance itself becomes an unnecessary burden. We recommend that NIH add a line for a URL to a FAIR data file in annual review forms, alongside those lines for publications resulting from the data. This would provide an incentive to encourage a broad array of DMS practices and make it as simple as “filling in the blank” on the form. We also recommend that NIH create an evaluation checklist as part of DMSP annual review to be filled out by the investigator and shared alongside the existing annual review forms.

Recommendation VII.c: Certify “Safe Spaces” for DMSP Compliance

Compliance and enforcement will also be significantly easier if NIH develops a process to certify data commons, knowledge bases, and repositories as “safe spaces” for DMSP compliance.

Such a process could analyze the long-term sustainability of a database, its capacity to support analytic or other reuse, its support of FAIR principles, and more. Such a network would significantly “raise the floor” for the broad swath of researchers unfamiliar with FAIR concepts, for researchers at institutions without significant local resources to make data FAIRly available, and more. Accordingly, we recommend that this section include language detailing an NIH certification process for these resources.

Recommendation VII.d: Add Data Sharing and Management Experts to Review Panels

The composition of review panels is a key part of using DMSPs in award decisions. Ensuring that data sharing and management expertise is represented as part of baseline review panel competency will both strengthen initial review and encourage long-term compliance with the key goals of DMSPs.


Supplemental DRAFT Guidance: Allowable Costs for Data Management and Sharing (limit: 8000 characters)


Recommendation VIII.a: Detail the Duration of Covered Costs for Preservation

The funding period for a research project is relatively short compared to the period after the research is complete wherein its outputs might be replicated or reused. Ideally, research outputs would be preserved indefinitely, but preservation has costs. The draft guidance does not specify whether costs to preserve data beyond the duration of the funded grant are allowed or encouraged. We recommend that this section provide detail as to whether NIH will cover data preservation costs after the funding period and, if so, for how long.

Recommendation VIII.b: Detail the Covered Costs for Personnel

DMS costs are not limited to the acquisition of tools, infrastructure, and the procurement of services; they also entail the time and effort of research staff internal to the investigating institution. The draft guidance does not specify whether personnel costs are allowable expenses related to data sharing. We recommend that this section provide detail as to whether NIH will cover such personnel costs – data sharing and management, done well, imposes a short term cost in anticipation of longer term benefit. NIH should clarify where that cost comes from as part of the Policy.

Recommendation VIII.c: Detail How Cost Levels Will Affect Funding Decisions

The Policy does not state whether a higher cost for better DMS might penalize (or advantage) a proposal in an IC’s funding decisions. If potential recipients A and B propose to do the same research with the same traditional research costs, but A budgets for a robust “Cadillac” DMS plan, whereas B budgets for a bare-minimum “Chevy” plan, which does NIH choose? All things equal, should they choose the costlier, more robust option? Is it OK that it is a “tax” on the research proper? Is there an ideal ratio of traditional research costs to DMS costs? Is there a standard way to compare costs with benefits? We recommend that NIH provide detail in this section regarding how and if DMS costs will affect funding decisions.


Supplemental DRAFT Guidance: Elements of a NIH Data Management and Sharing Plan (limit: 8000 characters)


Recommendation IX.a: Address Different ‘Community Practices’ Across Disciplines

Section 1 of this supplemental guidance states that, “Providing a rationale for decisions about which scientific data are to be preserved and made available for sharing, taking into consideration…consistency with community practices.” However, different disciplinary fields can have different community standards. Some disciplines have a culture of sharing more, while in others it is less or not at all. Should all disciplines be held to the same DMS standards, or will investigators of different disciplines be expected to adhere to different community practices? If the former, how will this standard be established and what are the ramifications for compliance in disciplines currently outside of this standard? We recommend that NIH provide additional detail in this section (or, if necessary, in separate supplemental guidance) as to what the DMS expectations are within and across scientific disciplines.

Recommendation IX.b: Direct the Use of Existing Repositories

Section 4 of this supplemental guidance states, “If an existing data repository(ies) will not be used, consider indicating why not…” We recommend that the word “consider” be removed. This policy should recommend the use of established repositories and, if this is not feasible, then the investigator should justify their decision with a specific reason. We understand that many scientists are unaware of the infrastructure already in place, so we also recommend that NIH provide a list of existing data repositories with a certification of compliance to increase their use. Additionally, NIH may wish to provide guidance and build associated resources to help investigators choose which of these repositories to use. If there are repositories that they must use (e.g., or that NIH would prefer them to use, or for which NIH has no preference (i.e., it would like the “market” to arrive at the best option), then NIH should make these degrees of requirement plain to investigators and make tools and infrastructure available to help them decide.

Recommendation IX.c: Clarify Sharing Requirements for Data at Different Degrees of Processing and Curation

Section 1 requires investigators to describe “the degree of data processing that has occurred (i.e., how raw or processed the data will be).” This raises the question as to whether the investigator can choose the level of processing and/or curation of the data to share, or if the investigator must share data at all levels of processing/curation. For purposes of reproducibility, we should encourage, or require, not only the sharing of data, but also descriptions of data processing at each level (per Section 2: Related Tools, Software and/or Code). This may, of course, increase the costs of DMS, so additional guidance would also be needed on what thresholds there may be, and NIH should designate where the investigator has freedom to choose the levels of data shared and how the investigator should make tradeoffs.

Recommendation IX.d: Expand the Requirements and Guidance for Rationale

Section 1 requires a rationale of which data to preserve or share based on the criteria of “scientific utility, validation of results, availability of suitable data repositories, privacy and confidentiality, cost, consistency with community practices, and data security.” This rationale is limited to the choice of which data to share, while there are other important DMS decisions that warrant rationales. We recommend NIH require a rationale on where to share the data and how long they will be available (Section 4), in what format they are shared (Section 3), and what other things might be shared, such as algorithms (per Section 2). As with the choice of which data to preserve and share, NIH should offer criteria for decisions in each of these areas as well.

For choices regarding data preservation and sharing, as well as these other choices, if NIH has any preferences on how to weigh and balance criteria, we recommend it make those plain through additional guidance. Further, it should develop tools and infrastructure to help investigators to weigh and balance them, and conduct periodic audits/evaluations to understand how investigators across fields, over time, are making these judgements, if those judgements are in the best interest of the scientific community, and what additional incentives/requirements might be put in place.


Other Considerations Relevant to this DRAFT Policy Proposal (limit: 8000 characters)


Recommendation X.a: Detail how NIH will Monitor and Evaluate the Implementation of this Policy

A planning mechanism without an evaluation mechanism is only half complete. This policy should establish an adaptive system that improves DMS over time through feedback and learning. We recommend that this policy contain a new section that details how NIH will monitor and evaluate performance toward individual DMSPs during the funding period and after, to the extent that data are planned to be preserved after. Further, we recommend this new section also detail how NIH will monitor and evaluate implementation of this policy across all DMSPs, using evidence to illustrate how its purpose is or is not being achieved and what changes might be made to improve it. Policy-wide monitoring and evaluation information and reports should be made publicly available. Publicizing measures (e.g., usage rates and impact of previously shared data) is also a way to promote a culture where investigators are incentivized to produce datasets that are valuable, reusable, and available.

The Sage Perspective on the American Research Environment

tags: biomedical science, open science, computational science

A Response to the Request for Information on the American Research Environment Issued by the Office of Science and Technology Policy

// Submitted by Lara Mangravite and John Wilbanks on Behalf of Sage Bionetworks



Innovations in digital technology expand the means by which researchers can collect data, create algorithms, and make scientific inferences. The potential benefit is enormous: we can develop scientific knowledge more quickly and precisely. But there are risks. These new capabilities do not, by themselves, create reliable scientific insights; researchers can easily run afoul of data rights, misuse data and algorithms, and get lost in a sea of potential resources; and the larger scientific community can barricade themselves into silos of particular interests.

Improving discovery and innovation through the sharing of data and code requires new forms of practice, refined through real world experience. Science is a practice of collective sense-making, and updates to our tools demand updates to our sense-making practices. At Sage Bionetworks, we believe that these practices are a part of the American Research Environment. As such, our response to this Request for Information (RFI) focuses on implementing scientific practices that promote responsible resource sharing and the objective, independent evaluation of research claims.

We begin with two vignettes that illustrate the power of open science practices to deal with Alzheimer’s Disease and colorectal cancer. Next, we assess the American Research Environment, given our aims: more (and more responsible) sharing; data and algorithms that can be trusted; and evidence collection that is practical and ethical.

Finally, we offer recommendations under the Research Rigor and Integrity section of the RFI. Our conclusion is that to improve digital practices across the scientific community, we must explicitly support transdisciplinary practices as important efforts in their own right, while integrating them into domain-specific scientific projects.

Summary of recommendations

  • Develop — and fund over time — platforms for storing, organizing, and making discoverable a wide variety of types of data to a wide variety of stakeholders.
  • Change the institutional incentives towards using cloud platforms over local high-performance computing (HPC).
  • Develop clear community standards for implementing, evaluating, and articulating algorithm benchmarks.
  • Create or acquire training, workshops, or other forms of education on the ethics of computational science.
  • Develop systemic practices for identifying risks, potential harms, benefits, and other key elements of conducting studies with data from mobile devices.
  • Require federal researchers to preregister their research prior to conducting work (e.g., via to ensure their results are published, even if their hypotheses are not validated.



Research — including its reproduction — can be a complex, systems-of-systems phenomenon. Incentives, impediments, and opportunities exist at multiple interacting layers. It is often helpful to understand issues such as these in context. The following two examples show how technology-centered collaborative practices can yield stronger scientific claims, which in turn increase returns on investment in science.

Accelerating Medicines Partnership for Alzheimer’s Disease
Alzheimer’s disease (AD) and dementia are a public health crisis. The financial toll of dementia is already staggering. In the U.S. alone, the costs of caring for people over age 70 with dementia were estimated to be as high as $215 billion in 2010. Drugs for dementia are hard to find, such that the cost of finding even an ineffective medicine for AD sits at $5.7 billion.

The question is, what can we do to make it easier? One way is to change our scientific practice – the way we discover drugs and their targets. The Accelerating Medicines Partnership for Alzheimer’s Disease is a test of this idea. Twelve companies joined the National Institutes of Health (NIH) in this pre-competitive collaboration, forming a community of scientists that use the cloud to work together, share early and often, and improve both public and private returns on investments in AD drug discovery.

Within AMP-AD, Sage Bionetworks coordinates the Target Discovery and Preclinical Validation project. The project’s goal is to shorten the time between the discovery of potential drug targets and the development of new drugs for Alzheimer’s treatment and prevention. It brings together six multi-institution academic teams, four industry partners, and four non-profit organizations. The project tests the use of Artificial Intelligence/Machine Learning (AI/ML) analysis on high-dimensional human brain data to identify AD drug targets. Because these methods were untested, the AMP-AD groups work together to identify effective research methods and outcomes. In this way, expert groups take independent approaches to solving this problem and then collectively identify repeatable observations. This requires early sharing of data, methods, and results. All the scientists operate inside Synapse, a Sage-built cloud platform with services that document the data science process. Using Synapse makes data and code widely reusable, with quarterly data releases to the public. Another Sage portal, Agora, allows any researcher to explore curated genomic analyses and target nominations from AMP-AD and associated consortia.

AMP-AD has already paid off. Over five years, AMP-AD identified over 500 new drug targets for Alzheimer’s disease for under $100 million. The next phase is already underway, with Alzheimer Centers for the Discovery of New Medicines set to diversify and reinvigorate the Alzheimer’s disease drug development pipeline at a cost of just $73 million.

Colorectal Cancer Subtyping Consortium
Colorectal cancer (CRC) is a frequently lethal disease with complex, mixed outcomes and drug responses. In the early 2010s, a number of independent groups reported different genetic “subtypes” for CRC — these subtypes were designed to help doctors understand how different kinds of colorectal cancer will respond to different drugs.

Subtyping is harder than it needs to be because different researchers and labs process data differently, use different data to create their algorithms, and more. Even the way researchers convert tumor samples into digital data affects the process. So, to actually benefit patients, the colorectal cancer research community needed to bring it all together and compare notes.

The Colorectal Cancer Subtyping Consortium (CRCSC) was formed to identify a consensus among the divergent scientific results through large scale data sharing and meta-analysis. The CRCSC began with 6 academic groups from 15+ institutions. It collected and analyzed more than 30 patient groups with gene expression data, spanning multiple platforms and sample preparation methods. Each of the 6 AI/ML models was applied to the collection of public and proprietary datasets encompassing over 4,000 samples, mostly stage II-III cancer. An independent team centrally assessed the accuracy of subtype calls and associations with clinical, molecular and pathway features. Compared to how long it would take for each research team to publish a peer reviewed paper, read the papers of the other teams, and conduct additional research, this process produced results at an incredible rate.

Despite significant diversity in patients studied and AI/ML methods, the consortium came to a clear consensus on 4 CRC molecular subtypes (CMS1-4), with significant interconnectivity among the work from the participating groups. This was the first example of a large-scale, community-based comparison of cancer subtypes, and we consider the outcome the most robust way to classify colorectal cancer for targeted drugs based on genetics. It is the kind of work that typically can take a decade or more to reach consensus in the field through publication and conferences – whereas our approach led to publication of the consensus model within three years of the first of the divergent papers being published. Furthermore, our aim was to establish an important scientific practice for collaborative, community-based cancer subtyping that will facilitate the translation of molecular subtypes into the clinic.

Assessment of the American Research Environment 


Medical progress is hindered by many challenges. Consider the fact that health conditions are often defined – imprecisely – by symptoms rather than by biology, or that disease onset and treatment responses vary across populations, or our inability to effectively tailor care to the needs of individuals. Advances in information technology have provided us with an opportunity to address limitations such as these. Over the past two decades, new tools have emerged to collect, share, and combine data of many different types and scales, as have the algorithms to process them to uncover new knowledge in a wide variety of domains. The growing power, affordability, and ubiquity of computational tools in biomedical science have made them an indispensable component of the research environment.

Yet computational discovery has suffered from the same failures of translation and reproducibility that have plagued traditional approaches to discovery. We have new tools to generate and process vast quantities of information, but we often lack validated practices to turn that information into reliable insights. We need methodologies, processes, and baseline data that reliably and reproducibly generate trustable knowledge out of large-scale data. The AMP-AD and CRC vignettes above demonstrate how this can reduce the cost and the time of creating the reliable scientific insights on which treatments are based.

Unfortunately, there are market failures and public value failures around new scientific practices. Most incentives instead lead towards data withholding and black-box algorithms, and force reliable knowledge to emerge over artificially long time periods. Businesses fund research that results in private, appropriable intellectual property; they tend not to fund work with results that anyone can use, including the meta-work on how data science affects research reliability. Even when research is publicly funded, the individuals and institutions conducting it have the incentive to bolster their reputations by keeping data and code to themselves. The scientific funding, publishing, and promotion systems prefer papers claiming insights over methods development, and original research over replication. These perverse incentives prevent the scientific community from sharing effectively across organizations to perform effective computational research. They make it more likely that innovation will create value for research producers than for patients.

Open science practices can address these market failures and public value failures. As we saw in the AMP-AD example, the secret to the lower cost and higher throughput is the implementation of collaborative practices, mediated through a scientific research software platform. The transparency, replication, and reuse of data and code can be increased by an evolving set of rules and cultural norms to promote appropriate interpretation of data and code and to speed information flow. These practices are essential for rigorous science, given the tools we have at our disposal and the unique complexities that have been introduced by computational research.

Sharing Research Data

Over the past 10 years, the scale and scope of data used for biomedical research has expanded. We have observed an explosion in community-based data sharing practices to allow independent teams across institutions to access each other’s data, to generate federated data resources that combine databases, and to integrate large-scale, multimodal data, including many from non-traditional sources such as electronic health records and real-world data streams from increasingly pervasive smart devices. There is a great opportunity to improve the quality, reproducibility, and replicability of research by making these practices widely known, and these data resources interoperable. As was shown in the CRC vignette above, large scale data sharing and meta-analysis across more than 15 institutions yielded extraordinary results in a fraction of the time of a publication-mediated process. Science progresses faster, farther, and more surely through the wisdom of crowds – including crowds of researchers connected by technology.

However, there are impediments to realizing these benefits: data scale, data rights, and data omission. These impediments are magnified when science occurs across organizational boundaries, i.e. between federal agencies, universities, and private firms. The sheer size and diversity of data sets can limit their effective use. Also impeding use are the complexities of data protection; proprietary and/or sensitive data (e.g., patient clinical records) are only allowed to exist on certain networks — for good reasons like protecting privacy or preventing harm, they’re out of reach for those on other networks.

Finally, data that are not codified in any system in the first place cannot be shared; those who collect data do not always publish all of the data they collect, which can distort scientific claims through a perceived absence of evidence.

To overcome these limitations, and mitigate the costs of overcoming them, two approaches have emerged. In the sandbox approach, data are secured in a private environment to which only qualified researchers gain access. Analysis happens inside the sandbox, so that data cannot be misused externally. In the model-to-data approach, qualified researchers may send algorithms to be run in protected environments with data that they cannot access, which can allow for crowd-based access to data that is itself never exposed. Increasingly, the field is also considering federated extensions to these sharing models for situations where data must remain under the external control of data contributors. These types of solutions balance collaboration with the needs of various parties to control resources.
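The model-to-data approach can be made concrete with a minimal Python sketch. Everything in it is hypothetical: the `ProtectedEnvironment` class, its `evaluate` method, and the toy data are illustrative names, not the interface of Synapse or any real platform. The point is only that the researcher's algorithm travels to the data, and only an aggregate score travels back.

```python
from typing import Callable, Sequence

# Hypothetical type: a submitted model maps one feature vector to a predicted label.
Model = Callable[[Sequence[float]], int]


class ProtectedEnvironment:
    """Illustrative stand-in for a data steward's secure environment."""

    def __init__(self, features, labels):
        # Private data: held inside the environment, never returned to submitters.
        self._features = list(features)
        self._labels = list(labels)

    def evaluate(self, model: Model) -> float:
        """Run the submitted model on the private data and return accuracy only."""
        correct = sum(
            1 for x, y in zip(self._features, self._labels) if model(x) == y
        )
        return correct / len(self._labels)


def threshold_model(x):
    """A researcher's toy algorithm: predict 1 when the first feature exceeds 0.5."""
    return 1 if x[0] > 0.5 else 0


# The data steward sets up the environment with made-up example data.
env = ProtectedEnvironment(
    features=[[0.9], [0.2], [0.7], [0.4]],
    labels=[1, 0, 0, 0],
)

# The researcher submits an algorithm and receives only a summary score.
score = env.evaluate(threshold_model)
print(f"accuracy = {score:.2f}")
```

A real deployment would also need to isolate the submitted code (for example, by running it in a container without network access) so that the algorithm itself cannot copy the data out; this sketch leaves that out.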

Benchmarking Algorithms

Just as there are potential pitfalls of sharing data, so too are there potential pitfalls for sharing the code used to build quantitative models. In typical practice, algorithm evaluations are conducted by those who developed them. Thus, most developers fall into the self-assessment trap, such that their algorithms outperform others at a rate that suggests that all methods are better than average. This can be inadvertent — a result of information leaks from, or over-fitting to, the data at hand — or it can be intentional — a result of selective reporting, where authors choose the metric or the data in which their algorithm shines, but hide those metrics and data that show sub-par performance.

The risks from using the wrong algorithm at the wrong time can be more arcane to the casual observer than the risks of bad data, but they are every bit as significant.

Algorithms make predictions, and the self-assessment trap means a lot of those predictions will be wrong. Making the wrong predictions can cost scientists – and the taxpayer who funds them – years of misdirected research. If we don’t have a standard way to decide if an algorithm is above, at, or below average, we won’t even know how to start when faced with a new one. We believe that the self-assessment trap is a major block for algorithms that hope to translate into actually helping patients. We therefore need frameworks inside the research environment that can separate the algorithm’s developer from its evaluator – to see if it works the way it’s supposed to work.
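A toy Python sketch, using made-up data and a deliberately over-fitted model, illustrates why separating the developer from the evaluator matters. The data, the `memorizing_model` function, and the "label is 1 when x is even" rule are all assumptions for illustration, not drawn from any real benchmark.

```python
def accuracy(model, examples):
    """Fraction of (x, y) examples that the model labels correctly."""
    return sum(model(x) == y for x, y in examples) / len(examples)


# Hypothetical toy data: the true rule is "label is 1 when x is even".
developer_data = [(2, 1), (3, 0), (4, 1), (7, 0)]
held_out_data = [(6, 1), (9, 0), (10, 1), (11, 0)]  # visible only to the evaluator

# An over-fitted algorithm: it memorizes the developer's examples
# and guesses 0 for any input it has not seen before.
memory = dict(developer_data)


def memorizing_model(x):
    return memory.get(x, 0)


# The developer scores the model on the same data used to build it.
self_assessed = accuracy(memorizing_model, developer_data)

# An independent evaluator scores it on data the developer never saw.
independent = accuracy(memorizing_model, held_out_data)

print(f"self-assessed accuracy: {self_assessed:.2f}")
print(f"independent accuracy:   {independent:.2f}")
```

The self-assessed score looks perfect because the model is graded on data it has memorized, while the independent score falls to the level of guessing. A benchmarking framework aims to make the second number the one that gets reported.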

Using Real World Evidence

Digital devices make it very easy to collect data from a vastly larger group of people than was possible before. This can blur the line between traditional research study and consumer data collection methods. Real world evidence (RWE) is data that are collected out in the wild, and their collection will increasingly be driven by mobile devices and sensors. Much RWE will indeed come from devices that people own – bought in the context of consumer technology, not regulated research.

But consumer tools prioritize adoption. They use one-click buttons to obtain consent, and don’t worry about bioethics concepts like autonomy, respect for persons, beneficence. Compared to consumer devices and apps, ethical collection of RWE will require slowness and attention from both researchers and potential participants. This may hurt raw enrollment numbers compared to consumer technology, which creates temptation to abandon bioethics in favor of consumer surveillance approaches.

Our research environment needs to acknowledge this reality: we need consumer technology to collect RWE, but consumer technology is often legally and contractually at odds with ethical research protections. As a result, few stakeholders in the space build ethical, practical governance for RWE. The increasing availability of RWE thus creates the need for new research ethics protections for the digital surveillance era.



Different organizations across different sectors have different strengths, and open science practices should help them make the most of their strengths individually, and collectively. Some organizations have the resources that others do not. Some have a comparative advantage in producing quality data and code, while others have an advantage in access to facilities and equipment. Some organizations have fast networks with ample storage, while others have to budget their computing resources more strictly. Some organizations are moving towards an open approach from closed approaches, while others are moving there from very (possibly irresponsibly) open approaches.

Given the complexity of data sharing across the biomedical field, and the different starting points of different organizations, we require a flexible spectrum of open science approaches.

As such, there are no one-size-fits-all recommendations. Each organization and research domain must be addressed as a unique case. However, given the incentives, impediments, and opportunities described above, we offer the following general recommendations in response to questions 1, 2, 3, and 4 in the “Research Rigor and Integrity” section of the Request for Information.

Q1. What actions can Federal agencies take to facilitate the reproducibility, replicability, and quality of research? What incentives currently exist to (1) conduct and report research so that it can be reproduced, replicated, or generalized more readily, and (2) reproduce and replicate or otherwise confirm or generalize publicly reported research findings?

Develop, and fund over time, platforms for storing, organizing, and making discoverable a wide variety of types of data to a wide variety of stakeholders. For example, Synapse and Agora (highlighted in the AMP-AD vignette above) allow researchers to share data, evaluate hypotheses, and make collective decisions about research directions. These sharing platforms should support efficient and responsible data sharing through integrated approaches for data governance, data management, and data access. They should be able to accommodate large numbers of users, adapt to heterogeneous and complex data types and compute environments, and incentivize wider participation in a data and benchmarking ecosystem. Finally, they should be designed to capitalize on the power of cognitive diversity resident in the American research environment by drawing upon the perspectives and experiences of the researchers who will use them, and upon the lessons of the emerging science of team science.

Change the institutional incentives toward using cloud platforms over local high-performance computing (HPC). Many institutions have built local HPC resources over time. These resources support scientists locally but can serve as a disincentive for researchers to move into cloud platforms that facilitate collaboration and reuse of data. Funding should shift from supporting local HPC to supporting standard cloud platforms, and specific funds, separate from research grants, should be dedicated to supporting public clouds run as utilities, in addition to supporting research computing on corporate clouds at Amazon, Google, and so on. Public cloud utilities would act as a nimble form of market regulator, keeping prices low and creating user-friendly features that might not line up with corporate revenue maximization.

Q2. How can Federal agencies best work with the academic community, professional societies, and the private sector to enhance research quality, reproducibility, and replicability? What are current impediments and how can institutions, other stakeholders, and Federal agencies collaboratively address them?

Develop clear community standards for implementing, evaluating, and articulating algorithm benchmarks. An emerging paradigm for the development and unbiased assessment of tools and algorithms is crowd-sourced challenge-based benchmarking. By distributing problems to large communities of expert volunteers, complex questions can be addressed efficiently and quickly, while incentivizing adoption of new standards. Challenges provide a successful resolution to the “self-assessment trap” through robust and objective benchmarks. Moreover, a successful challenge model can be an effective way of motivating research teams to solve complex problems.

Q3. How do we ensure that researchers, including students, are aware of the ethical principles of integrity that are fundamental to research?

Create or acquire training, workshops, or other forms of education on the ethics of computational science. Computational biomedicine will only improve human health when conducted in a reliable and responsible manner. It is, therefore, critical to establish and implement community norms for responsible data sharing and reliable computational data analysis. Training and workshops can help instill in researchers the knowledge, and the conscience, needed to effectively and ethically navigate the evolving landscape of computational science. Educational modules should cover topics including: 1) efficient and responsible methods for sharing of biomedical research data; 2) industry standards for objective benchmarking of algorithms used to derive insight from that data; and 3) the reliable and responsible integration of real-world evidence (RWE), from electronic health records and smart devices, into research programs.

Develop systemic practices for identifying risks, harms, benefits, and other key elements of conducting studies with data from mobile devices. This necessarily involves understanding how to design clinical protocols, informed consent, and data sharing processes for anything from low risk surveys up to full genomes and biospecimens. It could also involve developing a methodology that borrows from software development, including version control, analytic dashboards, user experience design, and more to support efficiency increases in protocol management.

Q4. What incentives can Federal agencies provide to encourage reporting of null or negative research findings? How can agencies best work with publishers to facilitate reporting of null or negative results and refutations, constraints on reporting experimental methods, failure to fully report caveats and limitations of published research, and other issues that compromise reproducibility and replicability?

Require federal researchers to preregister their research prior to conducting work, to ensure their results are published even if their hypotheses are not validated. If 9 out of 10 studies do not validate a hypothesis, but the only one that does gets published, then the scientific community will have an inaccurate record of evidence to substantiate a claim. Moreover, what are negative results for the hypotheses of the researcher initiating the study may be positive results for the hypotheses of other researchers in the community.
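The arithmetic behind the 9-out-of-10 example can be written out as a short Python sketch. The study outcomes are hypothetical, and the assumption that only supportive studies reach publication is a deliberate simplification for illustration.

```python
# Hypothetical record of ten studies testing the same hypothesis:
# True means the study supported it, False means a null or negative result.
all_studies = [True] + [False] * 9

# Simplifying assumption: without preregistration, only supportive studies are published.
published = [result for result in all_studies if result]


def support_rate(results):
    """Share of studies in a record that support the hypothesis."""
    return sum(results) / len(results)


apparent = support_rate(published)   # what a reader of the literature sees
actual = support_rate(all_studies)   # what the full record shows

print(f"apparent support: {apparent:.0%}, actual support: {actual:.0%}")
```

Under these assumptions the published record suggests unanimous support, while the full record shows support in only one study out of ten.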




Thank you for the opportunity to provide our perspective on how to improve the American research environment. We believe that open computational science practices can vastly improve the speed and efficacy of the research enterprise and must be applied responsibly. Furthermore, to improve digital practices across the scientific community, we must explicitly support these transdisciplinary practices as important efforts in their own right, while integrating them into domain-specific scientific projects. They should not be ancillary efforts, tagged onto research primarily aimed at particular discoveries.

In this response, we focused on the present state of the enterprise, but it is also helpful to consider the future. The growth and trajectory of AI and machine learning guarantee that new challenges and possibilities with sharing data and code will emerge as time passes. The assessment and recommendations offered here address the impediments and opportunities we currently face, but they also set us up to avoid the worst consequences of increasingly powerful information and knowledge technologies, and to more aptly seize the chances they provide.

The Value of Team Science in Alzheimer’s Disease Research


The Sage Perspective

Silos in research are slowing us down. This isn’t a revelation, but it is a rallying call for many of us who hope to overcome barriers to advancing research, especially for a disease like Alzheimer’s.

In the study of Alzheimer’s, there has been a spectacular failure in the development of therapies. All the drugs that are allegedly disease-modifying have failed in late-stage clinical trials. The thinking around what causes the disease has not moved beyond a few hypotheses that have taken root.

This has occurred because the scientific community has fallen for the streetlight effect: We continue to expend resources to generate new data on hypotheses that have existing promising data because it is viewed as a safe bet. Given the repeated failure of clinical applications of these hypotheses (e.g. the Amyloid hypothesis), we face the stark reality that the true nature of the disease is a quagmire of uncertainty.

Fundamental shift

Yet there are rational strategies that have been successful in other domains such as finance that the community can use to mitigate that uncertainty. Instead of continuing to accrue data on what isn’t working, we ought to systematically explore the boundaries of our collective knowledge about Alzheimer’s Disease and balance the distribution of resources across low-, medium-, and high-risk ideas. This requires a fundamental shift in how we think about doing science, because no individual contributor can perform all of the tasks necessary to expand our collective knowledge in a meaningful manner.

There are so many silos that a lot of data, new ideas, and hypotheses don’t get shared. There also is some level of distrust in the community by researchers who want to guard proprietary information for the sake of a “magic bullet.” But there is no magic bullet. If we don’t collaborate strategically and diversify our research portfolio, we will continue to fail.

We are at a critical stage in Alzheimer’s Disease research where the community and individual researchers must put aside their individual reservations and work together. We have to let go of what’s not working and acknowledge that there are potentially other factors that affect how the disease behaves. It’s imperative that fresh ideas be given adequate space to succeed, and that we disrupt current structures to facilitate this exchange. We have to hedge our bets.

Radically open

At Sage, I lead a team that works across several programs that are identifying new drug targets to treat Alzheimer’s disease. There are many different academic institutions that are generating high-dimensional molecular data that can be used to try to identify new genes and pathways that could be fresh drug targets. We, in the spirit of open science, help orchestrate the analytic and data coordination efforts associated with that endeavor.

Our goal is to use a data-driven approach to better understand the underlying molecular mechanisms of the disease. It’s not something that any individual group would have the resources to do effectively. So it really requires a community-driven approach. Sage is positioned to conduct the scientific coordination that can help researchers work more effectively to get at these new ideas that might lead to a successful treatment.

Our primary project is AMP-AD (Accelerating Medicines Partnership in Alzheimer’s Disease), which is a public-private partnership supported by the National Institute on Aging. We serve as a hub for all the data that’s being generated across the project. It’s a radically open model where all the data become open once they have gone through quality control. You don’t have any publication embargoes or restrictions on data use – aside from adhering to governance standards associated with sensitive human data.

We play a role in trying to increase the transparency of all the analyses that become available. We’re also building partnerships with academic investigators to streamline how we reach a consensus about what the data are telling us about the potential causes of this disease. We want to make sure that any conclusions are consistent across different research teams, because the more generalizable a solution is, the more likely it will lead to a successful treatment.

The long view

In addition to this scientific coordination work, my group is also performing original research on Alzheimer’s Disease. In all of our research, we operate under the same open model as all of our collaborators. Practicing this open approach in our own work is important at Sage. By holding ourselves to the same standard that we ask the community to live by, we can understand and work through any pain points. In this way, we hope to lead by example. At Sage, we do have the benefit of a culture and incentive structure that emphasize the long view versus, say, maximizing revenue in the short term. Being able to think on a longer time scale affords us the ability to make decisions that improve science more materially than if we were to focus on solo – and siloed – projects.

Any approach to tackling how science is done needs to be systematic in order to have long-lasting impact. For Alzheimer’s disease, we have an opportunity to improve how therapeutic development happens. Our vision and hope is that any future compounds that may result from open research we support would be achieved faster and more efficiently, and be made available in an affordable and equitable manner.

Being radically open and collaborative isn’t easy, but operating in a silo won’t get us far enough. We have to be more intentional about team science. Lives depend on it.

BLOG: Reflections from the Mobile Health App Developer Workshop

Editor’s note: This post originally appeared on Sept. 16, 2019.

By Kimberly Milla
Elektra Labs

Sage Bionetworks co-hosted their first Mobile Research App Developer Workshop on Sept. 12 at the New York Genome Center. The event, supported by an NIH grant for ethical, legal, and social implications (ELSI) in health research using mobile technologies, sought to engage stakeholders (app developers, patient advocate groups, and security researchers) to begin a conversation on the governance, reliability, and privacy of data collected from connected digital tools (i.e. digital specimens). In this post, I’ll share a summary of the workshop proceedings along with reflections and resources (e.g., developer toolkits and legal resources).

John Wilbanks, CCO at Sage, and Mark Rothstein, Founding Director of the Institute for Bioethics, Health Policy, and Law at the University of Louisville School of Medicine, introduced the workshop by highlighting that, under current practices, apps collecting data for research are not overseen by governing bodies such as an Institutional Review Board (IRB). This lack of governance can lead to harms from reliance on incorrect information, as well as privacy risks such as apps selling information to third parties.

Andy Coravos gave a keynote talk that addressed the gap in overseeing digital products and regulating data governance & rights, discussing how governing organizations such as the FDA regulate based on what a company claims its products do, not necessarily based on what the products actually do.

The Sage Bionetworks team then held a series of talks addressing consent, privacy, design, and engagement & enrollment. Megan Doerr talked about consent and the importance of using accessible vocabulary to inform people, as well as the ethical imperative to use the consent process to inform rather than as a pseudo-contract. This led to the creation of Sage’s consent toolkit, geared to guide researchers, app developers, and designers.

Vanessa Barone reviewed a data privacy study that identified people’s perceptions of agency in data sharing. The results showed that perceptions fall into three groups: people who feel they have no option other than providing their data (no agency), people who feel their data is already out there and there is nothing they can do about it (apathy), and people who feel they have an active role in deciding what data is shared and how (agency). Based on this study, Sage created another toolkit to help biomedical researchers, app developers, and designers improve data privacy practices in digital studies. Check it out! Thursday marked the public launch of the Sage Privacy Toolkit, an excellent resource for both start-ups and big companies.

Source: Sage’s Privacy Toolkit

Likewise, Woody MacDuffie presented on digital systems for mobile research, describing how the Sage design team standardized their app design systems so that a given action, such as clicking “next”, is always executed the same way, and how they use animations rather than lengthy text to consent participants and explain procedures in digital studies. Abhi Pratap then presented preliminary data from his research on engagement & enrollment in digital health studies, which will be publicly shared later this month. Be on the lookout for the report!

Woody MacDuffie presenting on digital systems for mobile health.

An expert panel featuring Deborah Estrin, JP Pollak, Adrian Gropper, and Maria Ebling discussed privacy considerations in app development. The conversation focused on challenges such as accessible end-user license agreements, separation of concerns, selective sharing of data, and agency vs. paternalism. Cindy Geoghegan highlighted the importance of helping people understand what role is played by each entity that has access to their data; for example, some organizations might only be responsible for storing data rather than using it.

Panelists discuss current issues in data privacy. Left to right: JP Pollak, Deborah Estrin, Adrian Gropper, Maria Ebling, and John Wilbanks.

Next up were four lightning talks from workshop participants. Talks covered difficulties accessing pre-processed data from commercial connected tools, adapting current wearable devices to specific populations, and Miso Technologies’s work developing the “first research AI assistant”. Elektra Labs team member, Christine Manta, covered the current landscape of health IT standards for the digital medicine community, sparking conversation on whether popular standards such as FHIR are appropriate for the format and exchange of data collected on digital tools. In particular, FHIR may not (yet) be capable of storing data sampled at greater than 1Hz (1 sample per second), which is critical for remote monitoring (h/t Jameson Rogers and Mark Shervey).

Christine Manta discusses the current standards in health IT. To learn more, check out Digital Medicine Standards 101.

Lastly, closing remarks by John Wilbanks and Mark Rothstein emphasized the importance of education and collaboration between security and biomedical researchers, app developers, and designers to protect the privacy of people’s data and uphold ethical practices in collecting and using such data. As someone new to the field of digital medicine, I was left with a sense of urgency to shift our paradigm towards being data stewards, not data owners, particularly while the gap in regulation from governing organizations continues.

Elektra Labs and Sage Bionetworks team members: Left to right: Megan Doerr, Mark Shervey, Andrea Coravos, Christine Manta, John Wilbanks, Kimberly Milla.


Bringing Open Science to Neuroinformatics

The amazing people at Sage Bionetworks have kindly asked me to contribute to their series of posts about open science. The first thing that occurred to me was that it’s nice to be recognized as part of the open-science movement. The second thing that occurred was that more people should be part of it.

When I was a grad student and postdoc many years ago, open science wasn’t really something we talked about. We collected our data, wrote our papers, rewrote our papers, celebrated when they were published, and that was pretty much it. In 2010, I started working at INCF (International Neuroinformatics Coordinating Facility), and suddenly, data sharing, interoperability, and standards were all we were talking about. I remember that first discussion with then executive director Sten Grillner, where he outlined all the reasons why people should share their data and all the hurdles we needed to jump in order to optimize the process. All I could think was why on earth are there not central facilities doing this for all the other scientific domains, as well?

In the last decade, several major brain initiatives around the world have been launched and are starting to produce a vast amount of data. In order to integrate these diverse data and address issues of transparency and reproducibility, widely adopted standards and best practices around data sharing will be key to achieving an infrastructure that supports open and reproducible neuroscience. At INCF, we’re working at a fundamental level of open science: how to make data shareable, tools interoperable, and researchers at all career levels trained in data management. One of our main activities is to vet and endorse FAIR (Findable, Accessible, Interoperable, Reusable) standards and best practices for neuroscience data (“neuroinformatics” is in our name, after all) [1]. We also support the development of new standards, as well as the extension of existing standards to support additional data types.

Standards should serve as aspirations and be accessible to this and future generations of scientists as tools for thriving in an open-science environment. Support for an open-science environment in turn facilitates collaborations and idea exchange, which enable mutual growth in striving toward scientific goals. Different expertise can be brought to bear on difficult problems, leading to new solutions, and also training a new and more robust scientific enterprise.

I think we can all agree that there’s zero success in announcing a “standard” and thinking that it will be widely adopted (cue the herding cats analogy). Open science is about choice and providing the mechanisms to facilitate open collaborations – as Irene Pasquetto pointed out in another post in this series: “collaborations are the holy grail of reuse.” One of Pasquetto’s main observations is that reputation, trust, and pre-existing networks have as much impact on reuse as how well the data is curated. In our experience, this perspective extends to standards and best practices, as well. With this in mind, we spent some time in 2017 working out a process for vetting and endorsing standards and best practices where the community itself did the vetting and endorsing [2]. The process was opened for submissions in early 2018, and we currently have three endorsed standards with eight more in the pipeline.

To promote uptake of standards and best practices, and implementation of other neuroinformatics methods, INCF has built TrainingSpace, an online hub which provides informatics educational resources for the global neuroscience community. TrainingSpace offers multimedia content from courses, conference lectures, and laboratory exercises from some of the world’s leading neuroscience institutes and societies. As complements to TrainingSpace, INCF also manages NeuroStars, an online Q&A forum, and KnowledgeSpace, a neuroscience encyclopedia that provides users with access to over a million publicly available datasets and links to literature references and scientific abstracts.

We at INCF hope that our efforts to promote open neuroscience will spark some interest in those of you who are just now hearing about us, and that you will join in the fun of developing, vetting, and endorsing FAIR standards and best practices. After all, herding cats is easier if there’s more food to choose from!

This post was written with helpful comments from Randy McIntosh, neuroinformatics expert and brainstormer extraordinaire.


1 Abrams, Mathew, et al. “A Standards Organization for Open and FAIR Neuroscience: The International Neuroinformatics Coordinating Facility.” OSF Preprints, 17 Jan. 2019.

2 Martone M. The importance of community feedback in the INCF standards and best practices endorsement process [version 1; not peer reviewed]. F1000Research 2018, 7:1452 (document)

About this series: In February 2019, Sage Bionetworks hosted a workshop called Critical Assessment of Open Science (CAOS). This series of blog posts by some of the participants delves into a few of the themes that drove discussions – and debates – during the workshop. Read the series.

On the Ethics of Open Science

Ethics is an essential consideration for any research initiative that collects, uses, or reuses human data or samples. In data activities of these types, ethics concerns – whether raised by institutional review boards (IRBs) tasked with reviewing projects, or in the published literature – seem to focus almost entirely on questions of adequate privacy practices and security protections. Questions are raised about what fields are kept, what permissions have been provided for future access, and whether potential participants have adequately understood relevant parameters.

And yet considerations of ethics go beyond who gives permission for or has access to individuals’ data. Open science promises considerable ethical good: speeding up medical discovery, avoiding unnecessary duplication, creating efficiencies, and encouraging more democratic science. These are unquestionably ethical goods. But leaving discussions of ethics and open science simply to the good that can come and the need for privacy protections is concerningly narrow.

Open-science projects should also actively evaluate how the potential benefit of helping others actually will be realized, and what they can build into their structures to increase the chance that this will occur. Will open science simply allow users to do what they want with data as long as those whose data are included have been informed that the data will be widely shared? Or will the ethics of open-science support work to help ensure that goals, such as benefit and fairness, are more likely to be realized?

This might lead to a few questions that each open-science program might consider:

How open do you want open-data sharing to be?

Are open-science data repositories open by definition for use by everyone regardless of motives or intent? Or is any vetting of the user, the purpose, the commercial interests, or any screening for something nefarious ever part of the equation? Clearly there are risks and benefits from being truly open vs. open with strings attached. But the key is recognizing that there are risks and benefits to each strategy. Ignoring either of these does a disservice to considering the ethics of the enterprise.

What benefits come from open science?

The potential ethical good of open science lies in allowing a stunning increase in the number of discoveries that can be made, and the efficiency with which they can occur relative to traditional science. A critical empirical question is whether anyone is keeping track of the degree to which the hypothesized great potential is really occurring. And whose job is it, not only to set up open data, but to help make sure that the benefit – better diagnoses, better treatments, etc. – of having such data be open is actually realized, and not just that more analyses are conducted? Because open science is designed with fewer guard rails, it is essential to ensure that the potential benefit of working in this manner is actively facilitated and not left to chance. While individual participants may choose to opt in to or opt out of various initiatives, thinking intentionally about the ethics of an activity itself is a form of meta ethics that is central to open science.

Who benefits from open science? Does that end up being fair? And can we put small structures in place to increase the chance that broad commitments to benefit and fairness are realized?

Without deliberate attention, it is unlikely that the benefits of open science will result in better care for those who are most disenfranchised, a narrowing of health inequalities, or a targeting of at least a few conditions that disproportionately affect those who are worst off. It’s not that anyone is unsympathetic to those questions or would ever suggest that such questions not be addressed through data available through open science platforms. But without a structure or system that ensures at least some percentage of the work that emerges must focus on questions of this sort, it likely won’t happen in any large, concerted way. In the same way that the national genome project required a small percentage of its federal dollars to go to research questions on the ethical, legal, and social implications of genetics research, should open science ever require that some proportion of open-science analyses focus on questions of social justice? The options are countless and, again, leaving this to chance may help to guarantee that disparities will continue to grow wider rather than narrower.

Are there or should there be community partners?

Is there any value in having, or encouraging, community partners for at least some of the questions that might be asked of open-science data? Is it worth having some pilot projects to see if doing so changes the nature of the questions, the ways in which data are analyzed, or what happens with analyses once the technical component is completed? And partnerships may also need technical experts: someone who can make some sense out of patterns that emerge from data, to help distinguish the gold from the noise.

Open science is a movement to further discovery, increase collaboration, and, ultimately, to wildly magnify the likelihood and frequency of benefit. Building on that vision with additional commitments to achieving benefit – not only allowing platforms for benefit – and to ensuring that there are always a few ongoing projects that relate to inequity or that work on the needs of those with the worst health outcomes or highest needs, could potentially deepen the vision further. It could also give some of the individuals asked to provide their permissions a greater personal interest in doing so.

Reproducibility in Computational Analysis

As Lara Mangravite and John Wilbanks noted in their opening salvo, “open science” is such a multifaceted concept that it defies consensus definition — and so I have found it particularly interesting to hear and read the individual definitions that have come out of the workshop and the first round of this blog series. I find myself celebrating with Titus Brown the progress already made in building communities of practice around portability, preprint culture and the Carpentries, and vigorously agreeing with Cyndi Grossman’s call for greater social justice. I feel enlightened by Brian Nosek’s discussion of preregistration of analysis plans, which I had somehow managed to remain ignorant of until now, and Irene Pasquetto’s meticulous reconnaissance of how much open data is actually reused (or not). Now that it’s time for me to add my stone to the cairn of CAOS (instant metal band name!), I’m going to focus on the facet of open science that keeps me up at night: reproducibility in computational analysis.

In the spirit of clear definitions, let’s distinguish reproducibility from replication. Reproducibility focuses on the analysis process. When we say we’re reproducing an analysis, we’re trying to take the same inputs, put them through the same processing, and hopefully get the same results. This is something we may want to do for training purposes or to build on someone else’s technical work.

Replication, on the other hand, is all about confirming (or infirming? sorry Karl) the insights derived from experimental results. To replicate the findings of a study, we typically want to apply orthogonal approaches, preferably to data collected independently, and see whether the results lead us to draw the same conclusions. It basically comes down to plumbing vs. truth: in one case we are trying to verify that “the thing runs as expected” and in the other, “yes this is how this bit of nature works.”
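As a small illustration of the reproducibility half of this distinction, here is a minimal sketch that reruns a toy analysis on the same inputs with the same settings and checks that the results match. The function names (`run_analysis`, `result_digest`) and the bootstrap example are hypothetical, chosen only for illustration; a real pipeline would also need to pin software versions and the compute environment.

```python
import hashlib
import json
import random


def run_analysis(values, seed):
    """Toy analysis: bootstrap estimate of the mean, seeded for determinism."""
    rng = random.Random(seed)  # local generator, so global state is untouched
    means = []
    for _ in range(200):
        sample = [rng.choice(values) for _ in values]
        means.append(sum(sample) / len(sample))
    return {"bootstrap_mean": round(sum(means) / len(means), 6)}


def result_digest(result):
    """Stable fingerprint of a result dictionary."""
    payload = json.dumps(result, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


inputs = [2.1, 3.4, 1.9, 5.0, 4.2]
first = result_digest(run_analysis(inputs, seed=42))
second = result_digest(run_analysis(inputs, seed=42))
print("reproduced:", first == second)  # prints "reproduced: True"
```

Matching digests only tell us that “the thing runs as expected.” They say nothing about whether the conclusion is true, which is the job of replication.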

My personal mission is to drive progress on the reproducibility front, because it is critical for enabling the research community to share and reuse tools and methods effectively. Without great reproducibility, we are condemned to waste effort endlessly reimplementing the same wheels. Unfortunately, whether we’re talking about training or building on published work, reproducibility is still an uphill battle. Anyone who has ever tried and failed to get someone else’s hand-rolled, organic, locally-sourced Python package to behave nicely knows what I mean: it’s nice to have the code in GitHub, but it’s hardly ever enough.

Certainly, as Brown noted, the bioinformatics community has made great progress toward adopting technological pieces like notebooks, containers, and standards that increase portability. Yet, there is still a sizeable gap between intended portability (“you could run this anywhere”) and actual ease of use for the average computational life scientist (“you can run this where you are”).

Beyond mechanisms for sharing code, we also need a technological environment that supports the easy convergence of data, code and compute. And we need this environment to be open and emphasize interoperability, both for the FAIRness of data and so that researchers never find themselves locked into silos as they seek to collaborate with other groups and move to new institutional homes throughout their research careers.

That is why I am particularly excited by the Data Biosphere* vision laid out by Benedict Paten et al., which introduces an open ecosystem of interoperable platform components that can be bundled into environments with interfaces tailored to specific audiences. Within a given environment, you can create a sandbox containing all code, data, and configurations necessary to reproduce an analysis, so that anyone can go in and reproduce the work as originally performed – or tweak the settings to test whether they can do better! It’s the ultimate methods supplement or interactive technical manual. (* Disclaimer: I work with Anthony Philippakis, a coauthor of the Data Biosphere blog post.)

The final major obstacle to promoting reproducibility is data access. Just looking at genomic data, much of the data generated for human biomedical research is heavily protected, for good reason. There is some open-access data, but it is often neither sufficient nor appropriate for reproducing specific analyses, which means that we often can’t train researchers in key methodologies until after they have been granted access to specific datasets – if they can get access at all. This is a major barrier for students and trainees in general, and has the potential to hold back entire countries from developing the capabilities to participate in important research.

The good news is that we can solve this by generating synthetic datasets. Recently, my team participated in a FAIR Data hackathon hosted by NCBI, running a project to prototype some resources for democratizing access to synthetic data generation as a tool for reproducibility. The level of interest in the project was highly encouraging. I am now looking for opportunities to nucleate a wider community effort to build out a collection of synthetic datasets that researchers can use off the shelf for general purposes, like testing and training, accompanied by user-friendly tooling to spike mutations of interest into the synthetic data for reproducing specific studies. We did something similar with an early version of our tooling in a previous workshop project presented at ASHG 2018.
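To make the idea concrete, here is a deliberately tiny sketch of generating a synthetic sequence and spiking in a single-base change. It is not the tooling from the hackathon or the ASHG workshop; the function names (`make_synthetic_reference`, `spike_in_snv`) are hypothetical, and real tooling works on sequencing reads and is far more involved.

```python
import random

BASES = "ACGT"


def make_synthetic_reference(length, seed):
    """Generate a random DNA sequence; no real genome data is involved."""
    rng = random.Random(seed)
    return "".join(rng.choice(BASES) for _ in range(length))


def spike_in_snv(sequence, position, alt_base):
    """Return a copy of the sequence with one base substituted (0-based position)."""
    if not 0 <= position < len(sequence):
        raise IndexError("position is outside the sequence")
    if alt_base not in BASES:
        raise ValueError("alt_base must be one of A, C, G, T")
    if sequence[position] == alt_base:
        raise ValueError("alt_base matches the reference base; nothing to spike in")
    return sequence[:position] + alt_base + sequence[position + 1:]


reference = make_synthetic_reference(length=60, seed=7)
ref_base = reference[30]
alt_base = next(b for b in BASES if b != ref_base)  # pick any different base
mutated = spike_in_snv(reference, position=30, alt_base=alt_base)

# The only difference between the two sequences is the spiked-in position.
differences = [i for i, (a, b) in enumerate(zip(reference, mutated)) if a != b]
print("variant positions:", differences)  # prints "variant positions: [30]"
```

Because the sequence is generated from a seed, anyone can regenerate exactly the same test data without needing access to a protected dataset.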

The technical work that would be involved in making this resource a reality would be fairly modest. The most important lift will be corralling community inputs so that we can make sure we are building the right solution (the right data and the right tooling) to address most researchers’ needs, and, of course, determining who is the right custodian to host and maintain these resources for the long term.

From Open Systems to Trusted Systems: New Approaches to Data Commons

At Sage, we’re always reflecting on the work we’ve done, and how it might need to evolve. Conversations with partners and colleagues at events such as CAOS are a key part of this process. Over the last couple of years, a common theme has been the role that actively collaborating communities play in successful large-scale research projects, and the factors that accelerate or inhibit their formation.

Like any research organization, Sage experiments with different ideas and works in different areas over time. This challenges our technology platform team to build general-purpose systems that not only support the diversity of today’s research, but also pave the way for new types of science to be performed in the future.

Synapse, our flagship platform, allows us to develop data commons using a set of open APIs and a web portal hosted in the cloud, and programmatic tools that allow integration of Sage’s services into any analytical environment. Together, this platform makes it easier for researchers to aggregate, organize, analyze, and share scientific data, code, and insights.

Initially, our model for Synapse was GitHub – the software platform that has been the locus of the open source software movement over the past decade. Our thinking was that if we just made enough scientific knowledge open and accessible, scientists around the world would organize themselves and just start doing better science. In part, we saw our role as unlocking the potential of junior researchers who were digital natives and, perhaps, willing to work in ways less natural to the established PIs in the context of the established research ecosystem. Our assumption was that a pure technology solution would be sufficient to accelerate progress.

The reality wasn’t as straightforward as we thought.

Over the course of eight years, we’ve had a lot of large scientific collaborations operate on Synapse, some quite successfully and others less so. The main determinant of success has proven to be the level of alignment of incentives among the participating scientists, and their degree of trust in each other. Further, consortium-wide objectives must be aligned with individual and lab-level priorities. If these elements exist, the right technology can catalyze a powerful collaboration across institutional boundaries that would be otherwise difficult to execute. But without these elements, progress stalls while the exact same technology sits unused.

In a recent talk on the panel Evolving Challenges And Directions In Data Commons at BioIT World West (slides here), I shared several case studies to illustrate the aspects of the platform that were most successful in enabling high-impact science, and the characteristics that contributed to that success:

Digital Mammography DREAM Challenge

In the Digital Mammography DREAM Challenge, we hosted close to 10TB of medical images with clinical annotations in the cloud, and organized an open challenge for anyone in the world to submit machine learning models to predict the need for follow-up screening. Due to patient privacy concerns, we couldn’t directly release this data publicly. Instead, we built a system in which data scientists could submit models runnable in Docker containers; the system then executed training and prediction runs in the AWS and IBM clouds and returned output summaries. This was a huge shift in workflow for the challenge participants, who are more accustomed to downloading data to their own systems than uploading models to operate on data they cannot see.
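The model-to-data workflow described above can be sketched in a few lines. This is only an illustration of the pattern, not the challenge infrastructure: the records, the `evaluate_submission` function, and the toy model are all hypothetical, and the real system ran Docker containers in the cloud rather than Python callables in one process.

```python
# Stand-in for the protected data; it lives only on the host side.
_PRIVATE_RECORDS = [
    {"features": [0.2, 0.9], "label": 1},
    {"features": [0.7, 0.1], "label": 0},
    {"features": [0.4, 0.8], "label": 1},
    {"features": [0.9, 0.3], "label": 0},
]


def evaluate_submission(predict):
    """Run a submitted model on the hidden data and return only a summary."""
    correct = [
        int(predict(record["features"]) == record["label"])
        for record in _PRIVATE_RECORDS
    ]
    # Only aggregate metrics leave the host; no records are returned.
    return {"n_cases": len(correct), "accuracy": sum(correct) / len(correct)}


def submitted_model(features):
    """A participant's model: flags a case when the second feature dominates."""
    return 1 if features[1] > features[0] else 0


summary = evaluate_submission(submitted_model)
print(summary)
```

The participant writes `submitted_model` without ever seeing the records, and gets back a summary rather than the data itself.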

The technical infrastructure, developed under the leadership of my colleagues Bruce Hoff and Thomas Schaffter, is one of the more impressive things we’ve built in the last couple of years. Imposing such a shift in workflow on the data scientists risked being a barrier. That proved not to be the case: the incentive structure and publicity provided by DREAM generated enormous interest, and we ended up supporting hundreds of thousands of workflows generated by over a thousand different scientists.

mPower Parkinson’s Study

In the area of digital health, Sage has run mPower, a three-year observational study (led by Lara Mangravite and Larsson Omberg) of Parkinson’s disease conducted in a completely remote manner through a smartphone app. This study produced a more open-ended challenge: how to effectively learn from novel datasets, such as phone accelerometry and gyro data, collected while study participants balanced in place or walked. The study leveraged both Synapse, as the ultimate repository for mPower data, and Bridge – another Sage technology designed to support real-time data collection from studies run through smartphone apps distributed to a remote study population.

We organized a DREAM challenge to compare analytical approaches. This time, we focused on feature extraction rather than machine learning. Challenge participants were able to directly query, access, and analyze a mobile health dataset collected over six months of observations on tens of thousands of individuals. Again, the access to novel data, and to a scientifically challenging and clinically relevant problem was the key to catalyzing a collaboration of several hundred scientists.

Colorectal Cancer Genomic Subtyping

Our computational oncology team, led by Justin Guinney, helped to organize a synthesis of genomic data on colon cancer originally compiled by six different clinical research teams. Each of these groups had published analyses breaking the disease into biologically distinct sub-populations, but it was impossible to understand how the respective results related to each other or how to use the analyses to guide clinical work.

Unlike the previous two examples, this was an unsupervised learning problem, and it required a lot of effort to curate these originally distinct datasets into a unified training set of over 4,000 samples. However, the effort paid off when the teams were able to identify consensus subtypes of colon cancer, linking patterns in genomic data to distinct biological mechanisms of tumor growth. This project operated initially with only the participation of the teams that conducted the initial clinical studies – and it was only in the confines of this initially private group that researchers were really willing to talk openly about issues with their data. It also helped that each group was contributing part of the combined dataset and therefore everyone felt that all the groups were contributing something to the effort. With the publication of the initial consensus classification system, the data and methods have been opened up and seeded further work by a broader set of researchers relating the subtypes to distinct clinical outcomes.

Towards Community-Specific Data Commons

What do these three examples have in common? From a scientific standpoint, not much. The data types, analytical approaches, and scientific contexts are all completely different. In retrospect, it’s perhaps obvious that there’s not much chance of data, code, or other low level assets being used across these projects. The fact that all three projects were supported on the same underlying platform is evidence that we’ve developed some generally-useful services. But, our monolithic GitHub-style front end has not been an effective catalyst for cross-project fertilization.

What has been a common indicator of success is effective scientific leadership that gives structure and support to the hands-on work of more junior team members. This is even more important when these projects are carried out by highly distributed teams that haven’t previously worked together. Developing this sense of trust and building a functional community is often easier to do in smaller, controlled groups, rather than in a completely open system that, at best, is saturated with irrelevant noise, and, at worst, can be hijacked by actors with bad intentions. Starting small and increasing the “circle of trust” over time is an effective strategy.

It’s becoming clearer that these sorts of factors matter even in software development. Despite what you might think from some of the open-source rhetoric, most of the really large-scale, impactful open-source projects benefit from strong leadership that gives a sense of purpose and organization to a distributed group of developers. And even GitHub itself is now a part of Microsoft – who would have bet money on that outcome 10 years ago?

In the past year, the Synapse team has been piloting the development of new web interfaces to our services that repackage how we present these capabilities to different communities into more focused, community-specific interfaces. With the recent launch of the AMP-AD Data Portal and the NF Data Portal, the first couple of these experiments are now public. I’m excited to see how our platform continues to evolve as we enter Sage’s second decade, and even more excited to see what new science emerges on top of it.

How Data Commons Can Support Open Science

In the discussion about open science, we refer to the need for having data commons. What are data commons and why might a community develop one? I offer a brief introduction and describe how data commons can support open science.

Data commons are used by projects and communities to create open resources to accelerate the rate of discovery and increase the impact of the data they host. Notice what data commons aren’t: Data commons are not designed for an individual researcher working on an isolated project to ignore FAIR principles and to dump their data to satisfy data management and data sharing requirements.

More formally, data commons are software platforms that co-locate: 1) data, 2) cloud-based computing infrastructure, and 3) commonly used software applications, tools and services to create a resource for managing, analyzing and sharing data with a community.

The key ways that data commons support open science include:

  1. Data commons make data available so that they are open and can be easily accessed and analyzed.
  2. Unlike a data lake, data commons curate data using one or more common data models and harmonize them by processing them with a common set of pipelines so that different datasets can be more easily integrated and analyzed together. In this sense, data commons reduce the cost and effort required for the meaningful analysis of research data.
  3. Data commons save time for researchers by integrating and supporting commonly used software tools, applications and services. Data commons use different strategies for this: the commons themselves can include workspaces that support data analysis; other cloud-based resources can be used to support the data analysis, such as the NCI Cloud Resources that support the GDC; or data analysis can be done via third party applications, such as Jupyter notebooks, that access data through APIs exposed by the data commons.
  4. Data commons also save money and resources for a research community since each research group in the community doesn’t have to create their computing environment and host the same data. Since operating data commons can be expensive, a model that is becoming popular is not charging for accessing data in a commons, but either providing cloud-based credits or allotments to those interested in analyzing data in the commons or passing the charges for data analysis to the users.
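The curation step in point 2 can be pictured with a minimal sketch that maps records from two studies onto one common data model so they can be queried together. The study names, field names, and the `harmonize` function are hypothetical and are not drawn from any actual data commons; real harmonization also involves controlled vocabularies and common processing pipelines.

```python
# Hypothetical field mappings from two source studies to one common model.
FIELD_MAPS = {
    "study_a": {"patient": "case_id", "dx": "diagnosis", "age_years": "age_at_diagnosis"},
    "study_b": {"subject_id": "case_id", "diagnosis_name": "diagnosis", "age": "age_at_diagnosis"},
}


def harmonize(record, source):
    """Rename source-specific fields to the common model and normalize values."""
    mapping = FIELD_MAPS[source]
    unified = {common: record[original] for original, common in mapping.items()}
    unified["diagnosis"] = unified["diagnosis"].strip().lower()
    unified["age_at_diagnosis"] = int(unified["age_at_diagnosis"])
    unified["source"] = source
    return unified


raw = [
    ("study_a", {"patient": "A-001", "dx": "Colon Adenocarcinoma ", "age_years": "61"}),
    ("study_b", {"subject_id": "B-17", "diagnosis_name": "colon adenocarcinoma", "age": 58}),
]
commons = [harmonize(record, source) for source, record in raw]

# Once harmonized, records from both studies can be queried together.
matches = [r["case_id"] for r in commons if r["diagnosis"] == "colon adenocarcinoma"]
print(matches)  # prints "['A-001', 'B-17']"
```

Without the mapping step, the same query would have to be written once per study, which is the cost that a shared data model is meant to remove.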

A good example of how data commons can support open science is the Genomic Data Commons (GDC) that was launched in 2016 by the National Cancer Institute (NCI). The GDC has over 2.7 PB of harmonized genomic and associated clinical data and is used by over 100,000 researchers each year. In an average month, 1-2 PB or more of data are downloaded or accessed from it.

The GDC supports an open data ecosystem that includes large scale cloud-based workspaces, as well as Jupyter notebooks, RStudio notebooks, and more specialized applications that access GDC data via the GDC API. The GDC saves the research community time and effort since research groups have access to harmonized data that have been curated with respect to a common data model and run with a set of common bioinformatics pipelines. By using a centralized cloud-based infrastructure, the GDC also reduces the total cost for the cancer researchers to work with large genomics data since each research group does not need to set up and operate their own large-scale computing infrastructure.

Based upon this success, a number of other communities are building their own data commons or considering it.

For more information about data commons and data ecosystems that can be built around them, see:

  • Robert L. Grossman, Data Lakes, Clouds and Commons: A Review of Platforms for Analyzing and Sharing Genomic Data, Trends in Genetics 35 (2019) pp. 223-234, Also see: arXiv:1809.01699
  • Robert L. Grossman, Progress Towards Cancer Data Ecosystems, The Cancer Journal: The Journal of Principles and Practice of Oncology, May/June 2018, Volume 24 Number 3, pages 122-126. doi: 10.1097/PPO.0000000000000318

Voices from the Open Science Movement

Open science is an umbrella term used by many people to represent a diverse set of research methods designed to increase the speed, value, and reproducibility of scientific output. In general, these approaches work to achieve their goals through increased sharing of research assets or transparency in research methods. Our own work in this field promotes the effective integration of computational analysis into the life sciences. The challenge: While the advancements in technology now support the generation and analysis of large-scale biological data from human samples in a way that can meaningfully expand our understanding of human biology, the established processes for implementation and independent evaluation of research outcomes are not particularly well suited to these emerging forms of scientific inquiry.

The scientific method defines a scientist as one who seeks new knowledge by developing and testing hypotheses. In this frame, the scientist is a neutral observer who is equally satisfied when a hypothesis is either proven or disproven. However, as with any application of theory, the practical implementation of the scientific method is impacted by the conditions in which it is applied.

A Different Era

The U.S. scientific system has a well-established set of processes that were developed in post-war America with the intention of advancing the kinds of science of that era. This system promotes the pursuit of scientific inquiry within the context of research universities, using funding from the government and distributing knowledge across the research community through papers published in journals and patents acquired by technology transfer offices. While this system can be highly effective, it also incentivizes scientists in a manner that directly impacts the outputs of their work.

The current scientific system rewards our scientists for new discoveries. This is the criterion that is used to gate their ability to obtain funding, their ability to advance their own careers and those of their colleagues, and, in some cases, their ability to remain employed. For this reason, we sometimes skew our experiments towards those that prove rather than disprove the hypothesis. We enter into the self-assessment bias – in which we tend to overvalue the impact and validity of our own outputs.

Now, all is not lost: we have a well-established system of peer review that uses independent evaluation to assess the appropriateness of research conclusions. To this end, we, as a community, are meant to evaluate the evidence presented, determine the validity of an experiment, and understand how that experiment may support the general hypothesis. The task of turning an individual observation into general knowledge may be led by an individual scientific team, but it is the responsibility of the entire field.

Growing Pains

This system is noble and often quite effective. It’s also been strained by the scale and complexity of the research that is currently being pursued – including the integration of computational sciences into biology. The system is so good at encouraging publication in peer-reviewed journals that more than 400,000 papers on biology were published in 2017. This causes a host of problems.

First, it’s a strain for anyone in the scientific community to balance the time it takes to perform this important task with many other demands. Second, the complexity of our modern experiments is not easily conveyed within the traditional means for scholarly communication, making it difficult for independent scientists to meaningfully evaluate each experiment. Third, the full set of evidence needed to evaluate a general hypothesis is usually spread across a series of communications, making it difficult to perform independent evaluation at the level of that broader hypothesis.

This last point can be particularly problematic, as conflicting evidence can arise across papers in a manner that is difficult to resolve through comparative evaluation. These issues have exploded into a much-publicized replication crisis, making it hard to translate science into medicine.

Open Methods

So what does this all have to do with open science? The acknowledgement of these imperfections in our current system has led to a desire – across many fronts – for an adapted system that can better solve these problems. Open science contains lots of elements of a new scientific system. For computational research in life sciences, it works on the cloud where we can document our experimental choices with greater granularity. It provides evidence of the scientific process that helps us decide which papers out of that 400,000 to trust – the ones where we can see the work, and the ones where machines can help us read them.

In our own work, we have seen how the use of open methods can increase the justification of research claims. Working inside a series of scientific programs, we have been able to extract general principles and solutions to support this goal. These are our interventions – ways to support scientists in making real progress towards well-justified research outcomes.

These approaches have been encouraged by the scientists, funders, and policy makers involved in these programs, who are seeking ways to increase the translational impact of their work. We have seen cases across the field where these approaches have allowed exactly that. But these are sometimes at odds with the broader system, causing conflict and reducing their adoption. It may be time to contemplate a more complete, systemic redesign of the life sciences that supports our scientists in their quest for knowledge and that has the potential to directly improve our ability to promote human health.