Bringing Structure and Design to Data Governance

Before COVID-19 took over the world, the Governance team at Sage Bionetworks had started working on an analysis of data governance structures and systems to be published as a “green paper” in late 2020. Today we’re happy to publicly release that paper, Mechanisms to Govern Responsible Sharing of Open Data: A Progress Report.

In the paper, we provide a landscape analysis of models of governance for open data sharing based on our observations in the biomedical sciences. We offer an overview of those observations and show areas where we think this work can be expanded to further support open data sharing outside the sciences.

The central argument of this paper is that the “right” system of governance is determined by first understanding the nature of the collaborative activities intended. These activities map to types of governance structures, which in turn can be built out of standardized parts — what we call governance design patterns. In this way, governance for data science can be easy to build, follow key laws and ethics regimes, and enable innovative models of collaboration. We provide an initial survey of structures and design patterns, as well as examples of how we leverage this approach to rapidly build out ethics-centered governance in biomedical research.

While there is no one-size-fits-all solution, we argue for learning from ongoing data science collaborations and building on existing standards and tools. And in so doing, we argue for data governance as a discipline worthy of expertise, attention, standards, and innovation.

We chose to call this report a “green paper” in recognition of its maturity and coverage: it’s a snapshot of our data governance ecosystem in biomedical research, not the world of all data governance, and the entire field of data governance is in its infancy. We have licensed the paper under CC-BY 4.0 and published it on GitHub via Manubot in hopes that the broader data governance community might fill in holes we left, correct mistakes we made, add references and toolkits and reference implementations, and generally treat this as a framework for talking about how we share data.

Sage Perspective: Retention in Remote Digital Health Studies

Editor’s note: This is a Twitter thread from John Wilbanks, Sage’s chief commons officer.


New from Abhishek Pratap and a few more of us – Indicators of retention in remote digital health studies: a cross-study evaluation of 100,000 participants

A few thoughts on the paper:

  1. Hurrah for data that’s open enough to cross-compare.
  2. When someone shows you overall enrollment in a digital health study, ask about engagement % on day 2. It’s a way better metric.
  3. Over-recruit the under-represented with intent from the start or your sample won’t be anywhere close to diverse enough.
  4. Design your studies for broad, shallow engagement – your protocol and analytics will be better matched.
  5. Payment for participation and clinician involvement both make a huge difference. Follow @hollylynchez who writes very clearly on the payment topic.
  6. Clinician engagement is going to need some COI norms because whew it’s easy to see where that can go sideways.
  7. When your study is flattened down to an app on a screen, the competition is savage for attention and you’ll get deleted really quickly if there isn’t some sense of value emerging from the study.
  8. Meta-conclusion: perhaps start with the question: how does this give value to the participant when the app is in airplane mode?
  9. On “pay to participate” – the first time I ever talked to @FearLoathingBTX, he immediately foresaw studies providing a “free” phone for participation, but cutting service off for low engagement. That is, sadly, definitely on track absent some intervention.

Related content and resources:


Democratizing data access

Open Data Sharing in the 21st Century: Sage Bionetworks’ Qualified Research Program and Its Application in mHealth Data Release

As a leading advocate for open science practices, informed consent, and data privacy in biomedical research, Sage actively pilots and tests innovative tools and resources to maximize the scientific value derived from datasets while still ensuring basic contractual protections for research participants. This paper details the rationale, features, and application of Sage’s novel framework for qualifying a diverse pool of solvers from around the world for accessing health and biomedical data within Synapse. The three conceptual mechanisms guiding the development of the framework—transactional cost, exposure, and openness—are identified and illustrated via the case example of mPower, the first study to pilot the qualified researcher framework in 2015. This paper concludes with a cross-sectional snapshot of the current pool of qualified researchers and reveals key challenges and future directions for optimizing the qualified researcher framework in the years to come.

Read white paper…

The Sage Perspective on Data Management and Sharing

Response to Request for Public Comments on a DRAFT NIH Policy for Data Management and Sharing and Supplemental DRAFT Guidance

// Submitted by Lara Mangravite and John Wilbanks on Behalf of Sage Bionetworks

Editor’s Note: This response to the draft policy was submitted by pasting reply copy into fields of a web form with an 8,000-character limit per field, hence the format. For reference: the 2018 AMIA response mentioned in the Sage response.

Section I: Purpose (limit: 8000 characters)


Recommendation I.a: Make ‘Timely’ More Specific

The policy states that “shared data should be made accessible in a timely manner…”. “Timely” should be defined, so that researchers understand the baselines expected of them and have a boundary beyond which they must share data. These baselines and boundaries should be reflected in the templated DMSPs we recommend elsewhere. More details are provided in our recommendations in the Requirements section, and we further recommend in Section VI that such DMSPs be scored elements of applications.

Recommendation I.b: Elevate the Importance of Data Management

We applaud the mention of data management in both the Purpose section and the title, but it is not given adequate attention in this section. The Purpose text does not address the lessons learned from data sharing within NIH-funded collaborations. From the Cancer Genome Atlas to the Clinical and Translational Science Awards to the Accelerating Medicines Partnerships, NIH has committed billions annually to such programs. Data sharing sits at the heart of these collaborative networks, but our experience indicates that simply “sharing” data sets is not sufficient to meet the stated purpose. Data management is rarely elevated to a role commensurate with its importance in data reuse. As such, we recommend adding the following text to the end of the first paragraph to delineate the importance of management to achieving the purpose of the policy:

“Data management is an ongoing process that starts well before and goes on well after the deposit of a file under FAIR principles, and NIH encourages practices that have been demonstrated to succeed at promoting data sharing and reuse in previous awards.”

Section II: Definitions (limit: 8000 characters)


Recommendation II.a: Amend the Definition of Data Management and Sharing Plan

The definition of the Data Management and Sharing Plan does not sufficiently capture how DMS is integral to the research process. This Policy should make clear that data sharing is not an add-on or checkbox, but an ongoing management process that is integrated into the scientific research process. We recommend adding the following text to the definition:

“The plan should describe clearly how scientific data will be managed across the entirety of a research grant and provide specific descriptions of how and when resulting data will be shared, including descriptions of the NIH-approved repositories in which they will be deposited (or, if depositing outside this group, how the proposed repository will be sufficient to meet the requirements).”

Recommendation II.b: Replace the Definition of Data Management

The definition of Data Management does not sufficiently reflect the true extent to which data management must permeate the research process, nor why it is important. Data management is a massive undertaking that improves the quality of shared data. We endorse the 2018 AMIA definition of data management and recommend that the NIH adopt it, replacing the current definition text with the following:

“The upstream management of scientific data that documents actions taken in making research observations, collecting research data, describing data (including relationships between datasets), processing data into intermediate forms as necessary for analysis, integrating distinct datasets, and creating metadata descriptions. Specifically, those actions that would likely have impact on the quality of data analyzed, published, or shared.”

Recommendation II.c: Add a Definition for Scientific Software Artifacts

The stated purpose of this policy is “to promote effective and efficient data management and data sharing.” Per our recommended additions to the Scope section, below, the policy should make clear that what must be managed and shared are not only the “scientific data” and “metadata” created in the course of research, but also the scientific software artifacts created, such as the code underlying the algorithms and models that process data. Accordingly, we echo AMIA’s call for definitions of “scientific software artifacts” and recommend NIH include in this policy the following definition:

“Scientific software artifacts: the code, analytic programs, and other digital, data-related knowledge artifacts created in the conduct of research. These can include quantitative models for prediction or simulation, coded functions written within off-the-shelf software packages such as Matlab, or annotations concerning data or algorithm use as documented in ‘readme’ files.”

Recommendation II.d: Add a Definition for “Covered Period”

Making data available for others to use can pose a significant burden, per the supplemental guidance on Allowable Costs. Investigators will need clear definitions of exactly what will be required of them for data hosting in the short, medium, and long term. As such, we recommend that NIH include a definition in this section for “covered period,” providing as much detail as possible on the expectations for the length of time that investigators must make their data available, including differences in requirements for research awards and data sets (including scientific software artifacts) of different scales.


Section III: Scope (limit: 8000 characters)


Recommendation III.a: Include Scientific Software Artifacts as an Asset to be Managed/Shared

The first sentence in this Policy notes “NIH’s longstanding commitment to making the results and outputs of the research that it funds and conducts available to the public.” Scientific software artifacts (as defined in the response to the Definitions section, above) are outputs as much as data, equally determinative of research findings. Thus, managing and sharing the means of manipulating data from one form to another, transforming raw inputs into valuable outputs, is also important to the end goal of rigorous, reproducible, and reusable science.

Furthermore, it is possible to technically share data while withholding key artifacts necessary to make those data valuable for reuse. These key artifacts could then be exchanged for authorship, position on proposals, or other scientific currencies, thus circumventing a major desired outcome of this policy: removing the unfair advantages of already funded investigators. As such, we recommend that the Scope section include the following statement:

“NIH funded research produces new scientific data and metadata, as well as new scientific software artifacts (e.g. the code of algorithms and models used to manipulate data). Software artifacts are outputs of research as much as data, and it is just as important to manage and share them in the interest of rigor, reproducibility, and re-use. NIH’s commitment to responsible sharing of data extends to scientific software artifacts. As such, throughout this policy, the use of the term “data” should be understood to include scientific software artifacts, per the definition established in Section II.”


Section IV: Effective Date(s) (limit: 8000 characters)

No recommendations planned

Section V: Requirements (limit: 8000 characters)


Recommendation V.a: Tier the Sharing Date Requirement

This policy will require cultural and practice changes for most funded researchers, as well as nimble reactions by NIH to the realities of implementation. Failing to anticipate the implications of those changes could cause a severe backlash to the policy, undermining its purpose. As such, investigators of those projects least able to redistribute the resources necessary to abide by this policy should be given more time to do so. We recommend that NIH adopt AMIA’s 2018 tiered proposal for establishing sharing date requirements based on the size of funding. Projects funded over $500,000 per year would have to comply within one year of approval of the DMSP, those between $250,000 and $500,000 within two years, and those below $250,000 within three years.
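As a quick illustration, the tiering described above is a simple rule keyed to annual funding. In this sketch the function name is our own, and the handling of awards that fall exactly on a boundary (e.g., exactly $500,000) is an assumption, since the tier descriptions do not specify it:

```python
def dmsp_compliance_years(annual_funding_usd: float) -> int:
    """Years allowed to comply after DMSP approval, per the AMIA-style tiers.

    Illustrative only: boundary handling (>= vs >) is an assumption.
    """
    if annual_funding_usd > 500_000:
        return 1  # largest projects: comply within one year
    elif annual_funding_usd >= 250_000:
        return 2  # mid-size projects: within two years
    else:
        return 3  # smallest projects: within three years
```

For example, a project funded at $600,000 per year would fall in the one-year tier, while one at $100,000 per year would have three years.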

Recommendation V.b: Create DMSP Templates

We do not expect most researchers to know how to structure a Data Management and Sharing Plan. Furthermore, grants fall into different categories: the funding mechanisms behind the Cancer Genome Atlas and the All of Us Research Program are different from those behind early-career researcher grants and most R01s. We therefore recommend the ICs create templates for at least four categories of funding: grants intended to create reference resources for the scientific community, grants that create collaborative networks of multiple laboratories, grants that fund “traditional” research but integrate at least two institutions, and grants that flow to a single institution.

These templates will help researchers understand their DMSP obligations (a form of learning by bootstrapping), and will facilitate review by standardizing the essential elements and layout of DMSPs across submissions. Researchers who do not use the standard template would not be penalized, but any DMSP they submit should clearly mark how and where its essential elements map onto the templates provided by NIH. Segmenting these templates by the class of resources expected to be shared will make it easier for researchers to understand expectations (and can be tied to kinds of funding mechanism, e.g. U24), and will also make like-to-like evaluation easier for NIH over time.


Section VI: Data Management and Sharing Plans (limit: 8000 characters)


Recommendation VI.a: Make Data Sharing a Requirement

This section states, “NIH encourages shared scientific data to be made available as long as it is deemed useful to the research community or the public.” However, the future utility of data is often unknown at the time it would be required for deposit, and it is unclear who would be responsible for deeming data as useful. We recommend that NIH require, not encourage, data to be shared. The NIH should also provide both alternate “sharing” mechanisms and opt-out processes for the situations when data sharing is either impossible or inadvisable (e.g., when sharing data would compromise participant privacy or harm a vulnerable group).

Alternate mechanisms could include a private cloud where users “visit” the data and are surveilled in their uses or “model-to-data” approaches where a data steward runs models on behalf of the community. Opt-outs should be rare but achievable, and patterns of opt-out usage should be tracked at the researcher and institution level to assist in evaluation of their use and impact.

Recommendation VI.b: Distinguish Between Purposes of Sharing

The requirements for data sharing should be different for data whose value to the community is realized in different ways. There is a difference between data that are generated with the explicit intent of creating a shared resource for the research community (e.g. TCGA), and data that are generated within the context of an investigator-initiated research project and are to be shared to promote transparency, rigor, and to support emergent long-term reuse. In the former case, a description of a detailed curation, integration, synthesis, and knowledge artifact plan should be present. In the latter case, a description of file format, simple annotation, and long-term storage should be front and center. We recommend that this section explicitly distinguish between these two purposes of sharing, and that different formats be used for developing and assessing DMSPs with respect to these different purposes.

Recommendation VI.c: Require the DMSP as a Scorable Part of the Application

In this policy, the DMSP will be submitted on a Just-in-Time basis. This signals that the plan is not a valued part of the application and is, in fact, an afterthought. NIH should factor the quality of the DMSP in its funding decision process. We recommend that the DMSP be required as a scorable part of the application so that appropriate sharing costs can be budgeted for at the time of application, and the plan can be included as part of the review process.

Recommendation VI.d: Make DMSPs Publicly Available

This section states that, “NIH may make Plans publicly available.” We believe that NIH should ensure transparency with the public who has funded the work, and take advantage of transparency as a means for encouraging compliance. As such, we recommend that this section state that “NIH will make Plans publicly available.”


Section VII: Compliance and Enforcement (limit: 8000 characters)


Recommendation VII.a: Give Investigators Time to Share

Judging an application based on performance on past DMSPs is only fair if the investigators have had sufficient time to implement those plans. Per Recommendation V.a (above) to tier the sharing date requirement, we recommend that application reviewers begin using evidence of past data and software artifact sharing starting between one and three years after the adoption of the DMSP, depending on the size of the prior award. Those with a prior award over $500,000 per year could be judged one year after approval of the DMSP, those with a prior award between $250,000 and $500,000 after two years, and those with a prior award below $250,000 after three years.

Recommendation VII.b: Use Existing Annual Review Forms for Proof of Compliance

Compliance with this policy should be integrated with current annual review processes for funded research projects. Proof of compliance should not require more than a single line in existing documentation; otherwise proof of compliance itself becomes an unnecessary burden of compliance. We recommend that NIH add a URL to a FAIR data file in annual review forms, alongside the lines for publications resulting from the data. This would provide an incentive encouraging a broad array of DMS practices and make compliance as simple as “filling in the blank” on the form. We also recommend that NIH create an evaluation checklist as part of DMSP annual review, to be filled out by the investigator and shared alongside the existing annual review forms.

Recommendation VII.c: Certify “Safe Spaces” for DMSP Compliance

Compliance and enforcement will also be significantly easier if NIH develops a process to certify data commons, knowledge bases, and repositories as “safe spaces” for DMSP compliance.

Such a process could analyze the long-term sustainability of a database, its capacity to support analytic or other reuse, its support of FAIR principles, and more. Such a network would significantly “raise the floor” for the broad swath of researchers unfamiliar with FAIR concepts, for researchers at institutions without significant local resources to make data FAIRly available, and more. Accordingly, we recommend that this section include language detailing an NIH certification process for these resources.

Recommendation VII.d: Add Data Sharing and Management Experts to Review Panels

The composition of review panels is a key part of using DMSPs in award decisions. Ensuring that data sharing and management expertise is part of baseline review panel competency will improve initial review and encourage long-term compliance with the key goals of DMSPs.


Supplemental DRAFT Guidance: Allowable Costs for Data Management and Sharing (limit: 8000 characters)


Recommendation VIII.a: Detail the Duration of Covered Costs for Preservation

The funding period for a research project is relatively short compared to the period after the research is complete wherein its outputs might be replicated or reused. Ideally, research outputs would be preserved indefinitely, but preservation has costs. The draft guidance does not specify whether costs to preserve data beyond the duration of the funded grant are allowed or encouraged. We recommend that this section provide detail as to whether NIH will cover data preservation costs after the funding period and, if so, for how long.

Recommendation VIII.b: Detail the Covered Costs for Personnel

DMS costs are not limited to the acquisition of tools and infrastructure and the procurement of services; they also entail the time and effort of research staff internal to the investigating institution. The draft guidance does not specify whether personnel costs are allowable expenses related to data sharing. We recommend that this section provide detail as to whether NIH will cover such personnel costs; data sharing and management, done well, imposes a short-term cost in anticipation of longer-term benefit, and NIH should clarify where that cost comes from as part of the Policy.

Recommendation VIII.c: Detail How Cost Levels Will Affect Funding Decisions

The Policy does not state whether a higher cost for better DMS might penalize (or advantage) a proposal in an IC’s funding decisions. If potential recipients A and B propose to do the same research with the same traditional research costs, but A budgets for a robust “Cadillac” DMS plan, whereas B budgets for a bare-minimum “Chevy” plan, which does NIH choose? All things equal, should they choose the costlier, more robust option? Is it OK that it is a “tax” on the research proper? Is there an ideal ratio of traditional research costs to DMS costs? Is there a standard way to compare costs with benefits? We recommend that NIH provide detail in this section regarding how and if DMS costs will affect funding decisions.


Supplemental DRAFT Guidance: Elements of a NIH Data Management and Sharing Plan (limit: 8000 characters)


Recommendation IX.a: Address Different ‘Community Practices’ Across Disciplines

Section 1 of this supplemental guidance calls for “[p]roviding a rationale for decisions about which scientific data are to be preserved and made available for sharing, taking into consideration…consistency with community practices.” However, different disciplinary fields can have different community standards. Some disciplines have a strong culture of sharing; in others, sharing is less common or absent. Should all disciplines be held to the same DMS standards, or will investigators of different disciplines be expected to adhere to different community practices? If the former, how will this standard be established and what are the ramifications for compliance in disciplines currently outside of this standard? We recommend that NIH provide additional detail in this section (or, if necessary, in separate supplemental guidance) as to what the DMS expectations are within and across scientific disciplines.

Recommendation IX.b: Direct the Use of Existing Repositories

Section 4 of this supplemental guidance states, “If an existing data repository(ies) will not be used, consider indicating why not…” We recommend that the word “consider” be removed. This policy should recommend the use of established repositories and, if this is not feasible, require the investigator to justify their decision with a specific reason. We understand that many scientists are unaware of the infrastructure already in place, so we also recommend that NIH provide a list of existing data repositories with a certification of compliance to increase their use. Additionally, NIH may wish to provide guidance and build associated resources to help investigators choose which of these repositories to use. Whether there are repositories that investigators must use, that NIH would prefer they use, or about which NIH has no preference (i.e., it would like the “market” to arrive at the best option), NIH should make these degrees of requirement plain to investigators and make tools and infrastructure available to help them decide.

Recommendation IX.c: Clarify Sharing Requirements for Data at Different Degrees of Processing and Curation

Section 1 requires investigators to describe “the degree of data processing that has occurred (i.e., how raw or processed the data will be).” This raises the question as to whether the investigator can choose the level of processing and/or curation of the data to share, or if the investigator must share data at all levels of processing/curation. For purposes of reproducibility, we should encourage — or require — not only the sharing of data, but descriptions of data processing at each level (per Section 2: Related Tools, Software and/or Code). This may, of course, increase the costs of DMS, so additional guidance would also be needed on what thresholds may apply; NIH should designate where the investigator has freedom to choose the levels of data shared and how the investigator should make tradeoffs.

Recommendation IX.d: Expand the Requirements and Guidance for Rationale

Section 1 requires a rationale for which data to preserve or share based on the criteria of “scientific utility, validation of results, availability of suitable data repositories, privacy and confidentiality, cost, consistency with community practices, and data security.” This rationale is limited to the choice of which data to share, while there are other important DMS decisions that warrant rationales. We recommend NIH require a rationale for where data are shared and how long they will be available (Section 4), in what format they are shared (Section 3), and what else might be shared, such as algorithms (per Section 2). As with the choice of which data to preserve and share, NIH should offer criteria for decisions in each of these areas as well.

For choices regarding data preservation and sharing, as well as these other choices, if NIH has any preferences on how to weigh and balance criteria, we recommend it make those plain through additional guidance. Further, it should develop tools and infrastructure to help investigators to weigh and balance them, and conduct periodic audits/evaluations to understand how investigators across fields, over time, are making these judgements, if those judgements are in the best interest of the scientific community, and what additional incentives/requirements might be put in place.


Other Considerations Relevant to this DRAFT Policy Proposal (limit: 8000 characters)


Recommendation X.a: Detail how NIH will Monitor and Evaluate the Implementation of this Policy

A planning mechanism without an evaluation mechanism is only half complete. This policy should establish an adaptive system that improves DMS over time through feedback and learning. We recommend that this policy contain a new section that details how NIH will monitor and evaluate performance toward individual DMSPs during the funding period and after, to the extent that data are planned to be preserved after. Further, we recommend this new section also detail how NIH will monitor and evaluate implementation of this policy across all DMSPs, using evidence to illustrate how its purpose is or is not being achieved and what changes might be made to improve it. Policy-wide monitoring and evaluation information and reports should be made publicly available. Publicizing measures (e.g., usage rates and impact of previously shared data) is also a way to promote a culture where investigators are incentivized to produce datasets that are valuable, reusable, and available.

The Sage Perspective on the American Research Environment

tags: biomedical science, open science, computational science

A Response to the Request for Information on the American Research Environment Issued by the Office of Science and Technology Policy

// Submitted by Lara Mangravite and John Wilbanks on Behalf of Sage Bionetworks



Innovations in digital technology expand the means by which researchers can collect data, create algorithms, and make scientific inferences. The potential benefit is enormous: we can develop scientific knowledge more quickly and precisely. But there are risks. These new capabilities do not, by themselves, create reliable scientific insights; researchers can easily run afoul of data rights, misuse data and algorithms, and get lost in a sea of potential resources; and the larger scientific community can barricade themselves into silos of particular interests.

Improving discovery and innovation though the sharing of data and code requires new forms of practice, refined through real world experience. Science is a practice of collective sense-making, and updates to our tools demand updates to our sense-making practices. At Sage Bionetworks, we believe that these practices are a part of the American Research Environment. As such, our response to this Request for Information (RFI) focuses on implementing scientific practices that promote responsible resource sharing and the objective, independent evaluation of research claims.

We begin with two vignettes that illustrate the power of open science practices to deal with Alzheimer’s Disease and colorectal cancer. Next, we assess the American Research Environment, given our aims: more (and more responsible) sharing; data and algorithms that can be trusted; and evidence collection that is practical and ethical.

Finally, we offer recommendations under the Research Rigor and Integrity section of the RFI. Our conclusion is that to improve digital practices across the scientific community, we must explicitly support transdisciplinary practices as important efforts in their own right, while integrating them into domain-specific scientific projects.

Summary of recommendations

  • Develop — and fund over time — platforms for storing, organizing, and making discoverable a wide variety of types of data to a wide variety of stakeholders.
  • Change the institutional incentives towards using cloud platforms over local high-performance computing (HPC).
  • Develop clear community standards for implementing, evaluating, and articulating algorithm benchmarks.
  • Create or acquire training, workshops, or other forms of education on the ethics of computational science.
  • Develop systemic practices for identifying risks, potential harms, benefits, and other key elements of conducting studies with data from mobile devices.
  • Require federal researchers to preregister their research prior to conducting work, to ensure their results are published even if their hypotheses are not validated.



Research — including its reproduction — can be a complex, systems-of-systems phenomenon. Incentives, impediments, and opportunities exist at multiple interacting layers. It is often helpful to understand issues such as these in context. The following two examples show how technology-centered collaborative practices can yield stronger scientific claims, which in turn increase returns on investment in science.

Accelerating Medicines Partnership for Alzheimer’s Disease
Alzheimer’s disease (AD) and dementia are a public health crisis. The financial toll of dementia is already staggering. In the U.S. alone, the costs of caring for people over age 70 with dementia were estimated to be as high as $215 billion in 2010. Drugs for dementia are hard to find, such that the cost of finding even an ineffective medicine for AD sits at $5.7 billion.

The question is, what can we do to make it easier? One way is to change our scientific practice – the way we discover drugs and their targets. The Accelerating Medicines Partnership for Alzheimer’s Disease (AMP-AD) is a test of this idea. Twelve companies joined the National Institutes of Health (NIH) in this pre-competitive collaboration, forming a community of scientists that use the cloud to work together, share early and often, and improve both public and private returns on investments in AD drug discovery.

Within AMP-AD, Sage Bionetworks coordinates the Target Discovery and Preclinical Validation project. The project’s goal is to shorten the time from discovery of potential drug targets to development of new drugs for Alzheimer’s treatment and prevention. It brings together six multi-institution academic teams, four industry partners, and four non-profit organizations. The project tests the use of Artificial Intelligence/Machine Learning (AI/ML) analysis on high-dimensional human brain data to identify AD drug targets. Because these methods were untested, these AMP-AD groups work together to identify effective research methods — and outcomes. In this way, expert groups take independent approaches to solving this problem and then collectively identify repeatable observations. This requires early sharing of data, methods, and results. All the scientists operate inside Synapse, a Sage-built cloud platform with services that document the data science process. Using Synapse makes data and code widely reusable, with quarterly data releases to the public. Another Sage portal, Agora, allows any researcher to explore curated genomic analyses and target nominations from AMP-AD and associated consortia.

AMP-AD has already paid off. Over five years, AMP-AD identified over 500 new drug targets for Alzheimer’s disease for under $100 million. The next phase is already underway, with Alzheimer Centers for the Discovery of New Medicines set to diversify and reinvigorate the Alzheimer’s disease drug development pipeline at a cost of just $73 million.

Colorectal Subtyping Consortium
Colorectal cancer (CRC) is a frequently lethal disease with complex, mixed outcomes and drug responses. In the early 2010s, a number of independent groups reported different genetic “subtypes” for CRC — these subtypes were designed to help doctors understand how different kinds of colorectal cancer will respond to different drugs.

Subtyping is harder than it needs to be because different researchers and labs process data differently, use different data to create their algorithms, and more. Even the way researchers convert tumor samples into digital data affects the process. So, to actually benefit patients, the colorectal cancer research community needed to bring it all together and compare notes.

The Colorectal Cancer Subtyping Consortium (CRCSC) was formed to identify a consensus among the divergent scientific results through large-scale data sharing and meta-analysis. The CRCSC began with 6 academic groups from 15+ institutions. It collected and analyzed more than 30 patient cohorts with gene expression data, spanning multiple platforms and sample preparation methods. Each of the 6 groups’ AI/ML models was applied to the collection of public and proprietary datasets encompassing over 4,000 samples, mostly stage II-III cancer. An independent team centrally assessed the accuracy of subtype calls and associations with clinical, molecular, and pathway features. Compared to how long it would take for each research team to publish a peer-reviewed paper, read the papers of the other teams, and conduct additional research, this process produced results at an incredible rate.

Despite significant diversity in patients studied and AI/ML methods, the consortium came to a clear consensus on 4 CRC molecular subtypes (CMS1-4), with significant interconnectivity among the work from the participating groups. This was the first example of a large-scale, community-based comparison of cancer subtypes, and we consider the outcome the most robust way to classify colorectal cancer for targeted drugs based on genetics. Work like this typically takes a decade or more to reach consensus in the field through publications and conferences – whereas our approach led to publication of the consensus model within three years of the first of the divergent papers. Furthermore, our aim was to establish an important scientific practice for collaborative, community-based cancer subtyping that will facilitate the translation of molecular subtypes into the clinic.

Assessment of the American Research Environment 


Medical progress is hindered by many challenges. Consider the fact that health conditions are often defined – imprecisely – by symptoms rather than by biology, that disease onset and treatment responses vary across populations, and that we often cannot effectively tailor care to the needs of individuals. Advances in information technology have provided us with an opportunity to address limitations such as these. Over the past two decades, new tools have emerged to collect, share, and combine data of many different types and scales, as have the algorithms to process them to uncover new knowledge in a wide variety of domains. The growing power, affordability, and ubiquity of computational tools in biomedical science has made them an indispensable component of the research environment.

Yet computational discovery has suffered from the same failures of translation and reproducibility that have plagued traditional approaches to discovery. We have new tools to generate and process vast quantities of information, but we often lack validated practices to turn that information into reliable insights. We need methodologies, processes, and baseline data that reliably and reproducibly generate trustable knowledge out of large-scale data. The AMP-AD and CRC vignettes above demonstrate how this can reduce the cost and the time of creating the reliable scientific insights on which treatments are based.

Unfortunately, there are market failures and public value failures around new scientific practices. Most incentives instead lead toward data withholding and black-box algorithms, and force reliable knowledge to emerge over artificially long time periods. Businesses fund research that results in private, appropriable intellectual property; they tend not to fund work with results that anyone can use, including the meta-work on how data science affects research reliability. Even when research is publicly funded, the individuals and institutions conducting it have the incentive to bolster their reputations by keeping data and code to themselves. The scientific funding, publishing, and promotion systems prefer papers claiming insights over methods development, and original research over replication. These perverse incentives prevent the scientific community from sharing effectively across organizations to perform effective computational research. They make it more likely that innovation will create value for research producers than for patients.

Open science practices can address these market failures and public value failures. As we saw in the AMP-AD example, the secret to the lower cost and higher throughput is the implementation of collaborative practices, mediated through a scientific research software platform. The transparency, replication, and reuse of data and code can be increased by an evolving set of rules and cultural norms to promote appropriate interpretation of data and code and to speed information flow. These practices are essential for rigorous science, given the tools we have at our disposal and the unique complexities that have been introduced by computational research.

Sharing Research Data

Over the past 10 years, the scale and scope of data used for biomedical research has expanded. We have observed an explosion in community-based data sharing practices to allow independent teams across institutions to access each other’s data, to generate federated data resources that combine databases, and to integrate large-scale, multimodal data — including many from non-traditional sources, such as electronic health records and real-world data streams from increasingly pervasive smart devices. There is a great opportunity to improve the quality, reproducibility, and replicability of research by making these practices widely known, and these data resources interoperable. As was shown in the CRC vignette above, large scale data sharing and meta-analysis across more than 15 institutions yielded extraordinary results in a fraction of the time of a publication-mediated process. Science progresses faster, farther, and more surely through the wisdom of crowds – including crowds of researchers connected by technology.

However, there are impediments to realizing these benefits: data scale, data rights, and data omission. These impediments are magnified when science occurs across organizational boundaries, i.e., between federal agencies, universities, and private firms. The sheer size and diversity of data sets can limit their effective use. Also impeding use are the complexities of data protection; proprietary and/or sensitive data (e.g., patient clinical records) are only allowed to exist on certain networks for good reasons, like protecting privacy or preventing harm, which puts them out of reach for researchers on other networks.

Finally, data that are not codified in any system in the first place cannot be shared; those who collect data do not always publish all of the data they collect, which can distort scientific claims through a perceived absence of evidence.

To overcome these limitations, and mitigate the costs of overcoming them, two approaches have emerged. In the sandbox approach, data are secured in a private environment to which only qualified researchers gain access. Analysis happens inside the sandbox, so that data cannot be misused externally. In the model-to-data approach, qualified researchers may send algorithms to be run in protected environments with data that they cannot access, which can allow for crowd-based access to data that is itself never exposed. Increasingly, the field is also considering federated extensions to these sharing models for situations where data must remain under the external control of data contributors. These types of solutions balance collaboration with the needs of various parties to control resources.
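As a concrete illustration, the model-to-data pattern can be sketched in a few lines of Python. The function and variable names below are hypothetical, invented for this sketch; this is not Synapse’s actual API:

```python
from typing import Callable, Sequence, Tuple

def evaluate_in_enclave(
    model: Callable[[Sequence[float]], int],
    protected_data: Sequence[Tuple[Sequence[float], int]],
) -> dict:
    """Run a submitted model inside the protected environment.

    The submitter never sees individual records; only aggregate
    summary statistics leave the enclave.
    """
    correct = sum(1 for features, label in protected_data
                  if model(features) == label)
    # Return only aggregate metrics, never row-level data or predictions.
    return {"n": len(protected_data),
            "accuracy": correct / len(protected_data)}

# A submitter's model: a trivial threshold rule on the first feature.
def submitted_model(features):
    return int(features[0] > 0.5)

# Data held inside the protected environment (illustrative values).
hidden_data = [([0.9], 1), ([0.2], 0), ([0.7], 1), ([0.4], 1)]

print(evaluate_in_enclave(submitted_model, hidden_data))
```

A sandbox inverts only the access boundary: the researcher works interactively inside the protected environment rather than submitting code to it, but in both approaches row-level data never leaves the controlled perimeter.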

Benchmarking Algorithms

Just as there are potential pitfalls of sharing data, so too are there potential pitfalls for sharing the code used to build quantitative models. In typical practice, algorithm evaluations are conducted by those who developed them. Thus, most developers fall into the self-assessment trap, such that their algorithms outperform others at a rate that suggests that all methods are better than average. This can be inadvertent — a result of information leaks from, or over-fitting to, the data at hand — or it can be intentional — a result of selective reporting, where authors choose the metric or the data in which their algorithm shines, but hide those metrics and data that show sub-par performance.

The risks from using the wrong algorithm at the wrong time can be more arcane to the casual observer than the risks of bad data, but they are every bit as significant.

Algorithms make predictions, and the self-assessment trap means a lot of those predictions will be wrong. Making the wrong predictions can cost scientists – and the taxpayer who funds them – years of misdirected research. If we don’t have a standard way to decide if an algorithm is above, at, or below average, we won’t even know how to start when faced with a new one. We believe that the self-assessment trap is a major block for algorithms that hope to translate into actually helping patients. We therefore need frameworks inside the research environment that can separate the algorithm’s developer from its evaluator – to see if it works the way it’s supposed to work.
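One way out of the self-assessment trap is a challenge-style harness in which an independent evaluator scores every submitted algorithm against the same held-out labels using a single metric declared in advance. A minimal sketch, with hypothetical names and toy numbers:

```python
def rmse(predictions, truth):
    """Root-mean-square error: the single metric fixed before any
    submissions are scored, so no team can pick a flattering one later."""
    return (sum((p - t) ** 2 for p, t in zip(predictions, truth))
            / len(truth)) ** 0.5

def benchmark(submissions, hidden_truth):
    """Score every submission on the same held-out labels and return a
    leaderboard of (team, score) pairs, best (lowest error) first."""
    scores = {team: rmse(preds, hidden_truth)
              for team, preds in submissions.items()}
    return sorted(scores.items(), key=lambda item: item[1])

# Labels held by the independent evaluator; submitters never see them.
hidden_truth = [1.0, 2.0, 3.0]
submissions = {
    "team_a": [1.1, 2.1, 2.9],  # small errors everywhere
    "team_b": [0.0, 2.0, 4.0],  # larger errors at the extremes
}
print(benchmark(submissions, hidden_truth))
```

Because the truth labels stay with the evaluator and the metric is fixed in advance, neither the information-leak route nor the selective-reporting route into the trap remains open.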

Using Real World Evidence

Digital devices make it very easy to collect data from a vastly larger group of people than was possible before. This can blur the line between traditional research study and consumer data collection methods. Real world evidence (RWE) is data that are collected out in the wild, and their collection will increasingly be driven by mobile devices and sensors. Much RWE will indeed come from devices that people own – bought in the context of consumer technology, not regulated research.

But consumer tools prioritize adoption. They use one-click buttons to obtain consent, and don’t worry about bioethics concepts like autonomy, respect for persons, and beneficence. Compared to consumer devices and apps, ethical collection of RWE will require slowness and attention from both researchers and potential participants. This may hurt raw enrollment numbers compared to consumer technology, which creates a temptation to abandon bioethics in favor of consumer surveillance approaches.

Our research environment needs to acknowledge this reality: we need consumer technology to collect RWE, but the legal terms and ethical assumptions under which consumer technology operates are often at odds with ethical research protections. As a result, few stakeholders in the space build ethical, practical governance for RWE. The increasing availability of RWE thus creates the need for new research ethics protections for the digital surveillance era.



Different organizations across different sectors have different strengths, and open science practices should help them make the most of their strengths individually, and collectively. Some organizations have the resources that others do not. Some have a comparative advantage in producing quality data and code, while others have an advantage in access to facilities and equipment. Some organizations have fast networks with ample storage, while others have to budget their computing resources more strictly. Some organizations are moving towards an open approach from closed approaches, while others are moving there from very (possibly irresponsibly) open approaches.

Given the complexity of biomedical data sharing across the biomedical field, and the different starting points of different organizations, we require a flexible spectrum of open science approaches.

As such, there are no one-size-fits-all recommendations. Each organization and research domain must be addressed as a unique case. However, given the incentives, impediments, and opportunities described above, we offer the following general recommendations in response to questions 1, 2, 3, and 4 in the “Research Rigor and Integrity” section of the Request for Information.

Q1. What actions can Federal agencies take to facilitate the reproducibility, replicability, and quality of research? What incentives currently exist to (1) conduct and report research so that it can be reproduced, replicated, or generalized more readily, and (2) reproduce and replicate or otherwise confirm or generalize publicly reported research findings?

Develop — and fund over time — platforms for storing, organizing, and making discoverable a wide variety of types of data to a wide variety of stakeholders. For example, Synapse and Agora (highlighted in the AMP-AD vignette above) allow researchers to share data, evaluate hypotheses, and make collective decisions about research directions. These sharing platforms should support efficient and responsible data sharing through integrated approaches for data governance, data management, and data access. They should be able to accommodate large numbers of users, adapt to heterogeneous and complex data types and compute environments, and incentivize wider participation in a data and benchmarking ecosystem. Finally, they should be designed to capitalize on the power of cognitive diversity resident in the American research environment by drawing upon the perspectives and experiences of the researchers who will use them, and upon the lessons of the emerging science of team science.

Change the institutional incentives toward using cloud platforms over local high-performance computing (HPC). Many institutions have built local HPC resources over time. These resources support scientists locally but can serve as a disincentive for researchers to move onto cloud platforms that facilitate collaboration and reuse of data. Funding should shift from supporting local HPC to supporting standard cloud platforms, and specific funds — separate from research grants — should be dedicated to supporting public clouds run as utilities, in addition to supporting research computing on corporate clouds at Amazon, Google, and so on. Public cloud utilities would act as a nimble form of market regulator, keeping prices low and creating user-friendly features that might not line up with corporate revenue maximization.

Q2. How can Federal agencies best work with the academic community, professional societies, and the private sector to enhance research quality, reproducibility, and replicability? What are current impediments and how can institutions, other stakeholders, and Federal agencies collaboratively address them?

Develop clear community standards for implementing, evaluating, and articulating algorithm benchmarks. An emerging paradigm for the development and unbiased assessment of tools and algorithms is crowd-sourced, challenge-based benchmarking. By distributing problems to large communities of expert volunteers, complex questions can be addressed efficiently and quickly, while incentivizing adoption of new standards. Challenges provide a successful resolution to the “self-assessment trap” through robust and objective benchmarks. Moreover, a successful challenge model can be an effective way of motivating research teams to solve complex problems.

Q3. How do we ensure that researchers, including students, are aware of the ethical principles of integrity that are fundamental to research?

Create or acquire training, workshops, or other forms of education on the ethics of computational science. Computational biomedicine will only improve human health when conducted in a reliable and responsible manner. It is, therefore, critical to establish and implement community norms for responsible data sharing and reliable computational data analysis. Training and workshops can help instill in researchers the knowledge — and the conscience — needed to effectively and ethically navigate the evolving landscape of computational science. Educational modules should cover topics including: 1) efficient and responsible methods for sharing of biomedical research data; 2) industry standards for objective benchmarking of the algorithms used to derive insight from that data; and 3) the reliable and responsible integration of real-world evidence (RWE) — from electronic health records and smart devices — into research programs.

Develop systemic practices for identifying risks, harms, benefits, and other key elements of conducting studies with data from mobile devices. This necessarily involves understanding how to design clinical protocols, informed consent, and data sharing processes for anything from low risk surveys up to full genomes and biospecimens. It could also involve developing a methodology that borrows from software development, including version control, analytic dashboards, user experience design, and more to support efficiency increases in protocol management.

Q4. What incentives can Federal agencies provide to encourage reporting of null or negative research findings? How can agencies best work with publishers to facilitate reporting of null or negative results and refutations, constraints on reporting experimental methods, failure to fully report caveats and limitations of published research, and other issues that compromise reproducibility and replicability?

Require federal researchers to preregister their research prior to conducting work, to ensure their results are published even if their hypotheses are not validated. If 9 out of 10 studies do not validate a hypothesis, but the only one that does gets published, then the scientific community will have an inaccurate record of evidence to substantiate a claim. Moreover, what are negative results for the hypotheses of the researcher initiating the study may be positive results for the hypotheses of other researchers in the community.
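The arithmetic of the “9 out of 10” scenario is easy to make concrete. The p-values below are invented purely for illustration:

```python
ALPHA = 0.05  # conventional significance threshold

# Ten preregistered studies of the same (in truth, false) hypothesis;
# these p-values are made up for illustration.
p_values = [0.62, 0.41, 0.03, 0.77, 0.55, 0.29, 0.88, 0.19, 0.46, 0.70]

# Without preregistration, only the nominally "significant" result
# tends to reach the literature.
published_selectively = [p for p in p_values if p < ALPHA]

print(f"selective publication: {len(published_selectively)} of {len(p_values)} studies visible")
print(f"with preregistration:  {len(p_values)} of {len(p_values)} studies visible")
```

A reader of the selectively published record sees one apparent confirmation and none of the nine disconfirmations; a reader of the preregistered record sees the evidence in its true proportions.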




Thank you for the opportunity to provide our perspective on how to improve the American research environment. We believe that open computational science practices can vastly improve the speed and efficacy of the research enterprise and must be applied responsibly. Furthermore, to improve digital practices across the scientific community, we must explicitly support these transdisciplinary practices as important efforts in their own right, while integrating them into domain-specific scientific projects. They should not be ancillary efforts, tagged onto research primarily aimed at particular discoveries.

In this response, we focused on the present state of the enterprise, but it is also helpful to consider the future. The growth and trajectory of AI and machine learning guarantee that new challenges and possibilities with sharing data and code will emerge as time passes. The assessment and recommendations offered here address the impediments and opportunities we currently face, but they also position us to avoid the worst consequences of increasingly powerful information and knowledge technologies, and to more aptly seize the chances they provide.

The Value of Team Science in Alzheimer’s Disease Research


Silos in research are slowing us down. This isn’t a revelation, but it is a rallying call for many of us who hope to overcome barriers to advancing research, especially for a disease like Alzheimer’s.

In the study of Alzheimer’s, there has been a spectacular failure in the development of therapies. All the drugs that are allegedly disease-modifying have failed in late-stage clinical trials. The thinking around what causes the disease has not moved beyond a few hypotheses that have taken root.

This has occurred because the scientific community has fallen for the streetlight effect: We continue to expend resources to generate new data on hypotheses that have existing promising data because it is viewed as a safe bet. Given the repeated failure of clinical applications of these hypotheses (e.g. the Amyloid hypothesis), we face the stark reality that the true nature of the disease is a quagmire of uncertainty.

Fundamental shift

Yet there are rational strategies that have been successful in other domains such as finance that the community can use to mitigate that uncertainty. Instead of continuing to accrue data on what isn’t working, we ought to systematically explore the boundaries of our collective knowledge about Alzheimer’s Disease and balance the distribution of resources across low-, medium-, and high-risk ideas. This requires a fundamental shift in how we think about doing science, because no individual contributor can perform all of the tasks necessary to expand our collective knowledge in a meaningful manner.

There are so many silos that a lot of data, new ideas, and hypotheses don’t get shared. There also is some level of distrust in the community by researchers who want to guard proprietary information for the sake of a “magic bullet.” But there is no magic bullet. If we don’t collaborate strategically and diversify our research portfolio, we will continue to fail.

We are at a critical stage in Alzheimer’s Disease research where the community and individual researchers must put aside their individual reservations and work together. We have to let go of what’s not working and acknowledge that there are potentially other factors that affect how the disease behaves. It’s imperative that fresh ideas are given adequate space to succeed, and that current structures are disrupted to facilitate this exchange. We have to hedge our bets.

Radically open

At Sage, I lead a team that works across several programs that are identifying new drug targets to treat Alzheimer’s disease. There are many different academic institutions that are generating high-dimensional molecular data that can be used to try to identify new genes and pathways that could be fresh drug targets. We, in the spirit of open science, help orchestrate the analytic and data coordination efforts associated with that endeavor.

Our goal is to use a data-driven approach to better understand the underlying molecular mechanisms of the disease. It’s not something that any individual group would have the resources to do effectively. So it really requires a community-driven approach. Sage is positioned to conduct the scientific coordination that can help researchers work more effectively to get at these new ideas that might lead to a successful treatment.

Our primary project is AMP-AD (the Accelerating Medicines Partnership for Alzheimer’s Disease), a public-private partnership supported by the National Institute on Aging. We serve as a hub for all the data that’s being generated across the project. It’s a radically open model where all the data become open once they have gone through quality control. You don’t have any publication embargoes or restrictions on data use – aside from adhering to governance standards associated with sensitive human data.

We play a role in trying to increase the transparency of all the analyses that become available. We’re also building partnerships with academic investigators to streamline how we reach a consensus about what the data are telling us about the potential causes of this disease. We want to make sure that any conclusions are consistent across different research teams, because the more generalizable a solution is, the more likely it will lead to a successful treatment.

The long view

In addition to this scientific coordination work, my group is also performing original research on Alzheimer’s Disease. In all of our research, we operate under the same open model as all of our collaborators. Practicing this open approach in our own work is important at Sage. By holding ourselves to the same standard that we ask the community to live by, we can understand and work through any pain points. In this way, we hope to lead by example. At Sage, we do have the benefit of a culture and incentive structure that emphasize the long view versus, say, maximizing revenue in the short term. Being able to think on a longer time scale affords us the ability to make decisions that improve science more materially than if we were to focus on solo – and siloed – projects.

Any approach to tackling how science is done needs to be systematic in order to have long-lasting impact. For Alzheimer’s disease, we have an opportunity to improve how therapeutic development happens. Our vision and hope is that any future compounds that may result from open research we support would be achieved faster and more efficiently, and be made available in an affordable and equitable manner.

Being radically open and collaborative isn’t easy, but operating in a silo won’t get us far enough. We have to be more intentional about team science. Lives depend on it.

From Open Systems to Trusted Systems: New Approaches to Data Commons

At Sage, we’re always reflecting on the work we’ve done, and how it might need to evolve. Conversations with partners and colleagues at events such as CAOS are a key part of this process. Over the last couple of years, a common theme has been the role that actively collaborating communities play in successful large-scale research projects, and the factors that accelerate or inhibit their formation.

Like any research organization, Sage experiments with different ideas and works in different areas over time. This challenges our technology platform team to build general-purpose systems that not only support the diversity of today’s research, but also pave the way for new types of science to be performed in the future.


Synapse, our flagship platform, allows us to develop data commons using a set of open APIs, a web portal hosted in the cloud, and programmatic tools that allow integration of Sage’s services into any analytical environment. Together, these components make it easier for researchers to aggregate, organize, analyze, and share scientific data, code, and insights.

Initially, our model for Synapse was GitHub – the software platform that has been the locus of the open source software movement over the past decade. Our thinking was that if we just made enough scientific knowledge open and accessible, scientists around the world would organize themselves and just start doing better science. In part, we saw our role as unlocking the potential of junior researchers who were digital natives and, perhaps, willing to work in ways less natural to the established PIs in the context of the established research ecosystem. Our assumption was that a pure technology solution would be sufficient to accelerate progress.

The reality wasn’t as straightforward as we thought.

Over the course of eight years, we’ve had a lot of large scientific collaborations operate on Synapse, some quite successfully and others less so. The main determinant of success has proven to be the level of alignment of incentives among the participating scientists, and their degree of trust in each other. Further, consortium-wide objectives must be aligned with individual and lab-level priorities. If these elements exist, the right technology can catalyze a powerful collaboration across institutional boundaries that would be otherwise difficult to execute. But without these elements, progress stalls while the exact same technology sits unused.

In a recent talk on the panel Evolving Challenges And Directions In Data Commons at BioIT World West (slides here), I shared several case studies to illustrate the aspects of the platform that were most successful in enabling high-impact science, and the characteristics that contributed to that success:

Digital Mammography Dream Challenge

In the Digital Mammography Dream Challenge, we hosted close to 10TB of medical images with clinical annotations in the cloud, and organized an open challenge for anyone in the world to submit machine learning models to predict the need for follow-up screening. Due to patient privacy concerns, we couldn’t directly release this data publicly. Instead, we built a system in which data scientists could submit models runnable in Docker containers, executed training and prediction runs in the AWS and IBM clouds, and returned output summaries. This was a huge shift in workflow for the challenge participants, who are more accustomed to downloading data to their own systems than uploading models to operate on data they cannot see.
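The submission flow described above can be sketched as the command an evaluation harness might issue for one scored run. The image name and paths below are hypothetical, and the actual Sage infrastructure is considerably more involved:

```python
def submission_command(image, data_dir, out_dir):
    """Build a docker invocation that runs a participant's model with the
    protected images mounted read-only and no network access, so the only
    thing that leaves the run is what the model writes to /output."""
    return [
        "docker", "run", "--rm",
        "--network", "none",            # no route to exfiltrate data
        "-v", f"{data_dir}:/data:ro",   # protected mammograms, read-only
        "-v", f"{out_dir}:/output",     # predictions/summaries come back here
        image,
    ]

print(submission_command("team42/model:latest", "/secure/mammograms", "/runs/42"))
```

Containerizing the model rather than shipping the data is what lets participants who cannot see a single image still compete on equal footing.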

The technical infrastructure, developed under the leadership of my colleagues Bruce Hoff and Thomas Schafter, is one of the more impressive things we’ve built in the last couple of years. Imposing such a shift in workflow on the data scientists risked being a barrier. That proved not to be the case: the incentive structure and publicity around DREAM generated enormous interest, and we ended up supporting hundreds of thousands of workflow runs submitted by over a thousand different scientists.

mPower Parkinson’s Study

In the area of digital health, Sage has run mPower, a three-year observational study of Parkinson’s disease (led by Lara Mangravite and Larsson Omberg) conducted entirely remotely through a smartphone app. This study posed a more open-ended challenge: how to effectively learn from novel datasets, such as phone accelerometer and gyroscope data, collected while study participants balanced in place or walked. The study leveraged both Synapse, as the ultimate repository for mPower data, and Bridge – another Sage technology designed to support real-time data collection from studies run through smartphone apps distributed to a remote study population.

We organized a DREAM challenge to compare analytical approaches. This time, we focused on feature extraction rather than machine learning. Challenge participants were able to directly query, access, and analyze a mobile health dataset collected over six months of observations on tens of thousands of individuals. Again, access to novel data and to a scientifically challenging, clinically relevant problem was the key to catalyzing a collaboration of several hundred scientists.
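As a toy illustration of what feature extraction from such sensor streams involves, the sketch below reduces a tri-axial accelerometer trace to a few summary features. The feature names and choices are invented for illustration; actual challenge submissions used far richer signal-processing pipelines:

```python
import math

def balance_features(accel, fs=100.0):
    """Summarize a tri-axial accelerometer trace, given as a list of
    (x, y, z) samples at sampling rate fs (Hz). Returns toy postural
    features: mean and RMS of the acceleration magnitude, plus the RMS
    of its discrete derivative (jerk), a rough steadiness proxy."""
    mag = [math.sqrt(x * x + y * y + z * z) for x, y, z in accel]
    jerk = [(b - a) * fs for a, b in zip(mag, mag[1:])]  # finite difference
    n = len(mag)
    return {
        "mean_mag": sum(mag) / n,
        "rms_mag": math.sqrt(sum(m * m for m in mag) / n),
        "jerk_rms": math.sqrt(sum(j * j for j in jerk) / len(jerk)),
    }
```

Features like these, computed per balance or walking task, become the rows of a tabular dataset that downstream models can consume.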

Colorectal Cancer Genomic Subtyping

Our computational oncology team, led by Justin Guinney, helped to organize a synthesis of genomic data on colon cancer originally compiled by six different clinical research teams. Each of these groups had published an analysis breaking the disease into biologically distinct sub-populations, but it was impossible to understand how the respective results related to each other or how to use them to guide clinical work.

Unlike the previous two examples, this was an unsupervised learning problem, and it required a lot of effort to curate these originally distinct datasets into a unified training set of over 4,000 samples. The effort paid off when the teams were able to identify consensus subtypes of colon cancer, linking patterns in genomic data to distinct biological mechanisms of tumor growth. This project operated initially with only the participation of the teams that conducted the original clinical studies – and it was only in the confines of this initially private group that researchers were truly willing to talk openly about issues with their data. It also helped that each group contributed part of the combined dataset, so everyone felt that all the groups had a stake in the effort. With the publication of the initial consensus classification system, the data and methods have been opened up, seeding further work by a broader set of researchers relating the subtypes to distinct clinical outcomes.
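The idea of reconciling independent classifications can be sketched with a toy co-classification approach: link two samples whenever most teams assign them the same subtype, then read off the connected components as consensus groups. The published analysis used a considerably more sophisticated network-based method; this sketch, with invented names throughout, only illustrates the principle:

```python
from itertools import combinations

def consensus_groups(labelings, threshold=0.5):
    """Toy consensus clustering. `labelings` is a list of dicts, one per
    team, each mapping the same sample IDs to that team's subtype label.
    Two samples are linked when they share a label in more than
    `threshold` of the labelings; linked samples are merged with a
    union-find and the resulting components are returned as the
    consensus groups."""
    samples = sorted(labelings[0])
    parent = {s: s for s in samples}

    def find(s):  # union-find root with path compression
        while parent[s] != s:
            parent[s] = parent[parent[s]]
            s = parent[s]
        return s

    for a, b in combinations(samples, 2):
        agreement = sum(lab[a] == lab[b] for lab in labelings)
        if agreement / len(labelings) > threshold:
            parent[find(a)] = find(b)  # merge the two components

    groups = {}
    for s in samples:
        groups.setdefault(find(s), set()).add(s)
    return sorted(groups.values(), key=lambda g: sorted(g))
```

Note that each team's label alphabet can differ entirely; only co-membership within a team's labeling matters, which is what makes independently derived subtyping schemes comparable at all.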

Towards Community-Specific Data Commons

What do these three examples have in common? From a scientific standpoint, not much. The data types, analytical approaches, and scientific contexts are all completely different. In retrospect, it’s perhaps obvious that there’s not much chance of data, code, or other low level assets being used across these projects. The fact that all three projects were supported on the same underlying platform is evidence that we’ve developed some generally-useful services. But our monolithic GitHub-style front end has not been an effective catalyst for cross-project fertilization.

What has been a common indicator of success is effective scientific leadership that gives structure and support to the hands-on work of more junior team members. This is even more important when these projects are carried out by highly distributed teams that haven’t previously worked together. Developing this sense of trust and building a functional community is often easier to do in smaller, controlled groups, rather than in a completely open system that, at best, is saturated with irrelevant noise, and, at worst, can be hijacked by actors with bad intentions. Starting small and increasing the “circle of trust” over time is an effective strategy.

These factors seem to hold even in software development. Despite what you might think from some of the open-source rhetoric, most of the really large-scale, impactful open-source projects benefit from strong leadership that gives a sense of purpose and organization to a distributed group of developers. And even GitHub itself is now part of Microsoft – who would have bet money on that outcome 10 years ago?

In the past year, the Synapse team has been piloting new web interfaces to our services that repackage these capabilities into more focused interfaces tailored to specific communities. With the recent launch of the AMP-AD Data Portal and the NF Data Portal, the first couple of these experiments are now public. I’m excited to see how our platform continues to evolve as we enter Sage’s second decade, and even more excited to see what new science emerges on top of it.

About this series: In February 2019, Sage Bionetworks hosted a workshop called Critical Assessment of Open Science (CAOS). This series of blog posts by some of the participants delves into a few of the themes that drove discussions – and debates – during the workshop. Read the series.

Open Science: To What End?

By Cyndi Grossman

I have a hard time defining open science. This was confirmed for me during a Critical Assessment of Open Science (CAOS) meeting hosted by Sage Bionetworks where a definition was proposed and elicited strong, immediate pushback from attendees. It even sparked disagreement about what constitutes “science.”

At least I am in good company.

That said, when it comes to how science is disseminated to and consumed by the public, I can easily define what I would like to see fixed:

  • Publication paywalls that limit access to cutting-edge findings
  • Scientific meetings that lack diversity among speakers
  • Abstracts that fail to describe findings in language accessible to non-scientists or non-topic area experts
  • Inefficiencies in the process of discovering and addressing false scientific claims or misconduct

These issues not only reflect an unhealthy exclusivity that pervades science, they also contribute to public mistrust in science.

As a career-long advocate for engaging communities and individuals in research, I have seen research designed by youth to support the mental health needs of their peers, sex workers advocate for HIV vaccine research, and parents with children who are living with a rare disease discover their child’s genetic mutation. Public engagement in science, especially life sciences, is essential to its impact on society. Conducting science more openly is an important component of fostering greater engagement, but the focus of open science is too much on engaging other scientists and not enough on engaging the larger community of non-scientists.

During the CAOS meeting, we discussed the difference between “bolted-on” and “built-in” solutions. The current approaches to addressing publication paywalls, structuring datasets for reuse, and sharing code and algorithms are important yet bolted-on solutions. They address each element in isolation rather than redesigning how science is incentivized, conducted, and disseminated so as to build in the perspectives and needs of non-scientists where science intersects with society.

Open science could offer a new way of conducting science in the 21st century, where incentives are restructured toward greater openness and sharing, collaboration and purposeful competition, and structural support for dissemination of scientific tools and results. But this new system must be designed by bringing scientists and non-scientists together if the barriers between science and society are to be broken down.

I don’t know if this is putting too much onto open science, but there are some organizations, like 500 Women Scientists, that support open science, diversity, and social justice, and connect scientists to society through education and volunteerism. We know that the scientific enterprise reflects elements of societal inequity, yet there are relatively few efforts aimed at self-reflection and self-correction. My hope is that open science, however we define it, can be an example of a more inclusive way to conduct science.

About: Cynthia (Cyndi) Grossman is a social and behavioral scientist by training. Most recently, she was director at FasterCures, a center of the Milken Institute where she led efforts to integrate patients’ perspectives in biomedical research and medical product development. She has spent her career supporting research to address unmet needs such as mental health, stigma, and other social determinants of health. She is currently obsessed with the potential of health data to advance research and well-being by connecting individuals, communities and systems.


Voices from this series:


Introduction by Lara Mangravite and John Wilbanks

Voices From the Open Science Movement by Lara Mangravite and John Wilbanks

Recognizing the Successes of Open Science by Titus Brown

Open Science is About the Outcomes, Content, and Process of Research by Brian Nosek

Do Scientists Reuse Open Data? by Irene Pasquetto

Open Science: To What End? by Cyndi Grossman

A Critical Assessment of Open Science

We are creating a framework for discussing open science and we need your feedback


By Lara Mangravite and John Wilbanks

At Sage Bionetworks, we think a lot about open science. Our organization was founded explicitly to use open practices to promote integration of large-scale data analytics into the life sciences. We were guided by a very specific definition of open science: the idea that scientific “teams of teams” working together on a growing commons of open data can unleash substantial increases in scientific throughput and capacity.

We have learned a lot over the past 10 years regarding best practices for successful application of open science in this context. We are curious to understand how our observations may overlap with those from others in the field. Is there a common set of guidelines that can help support the effective use of openness in the life sciences? On the flip side, is there a set of common mistakes that keep getting repeated? We quickly realized in our work, for example, that each project has a window of time where “openness” is optimally effective. Early in a project, openness can sometimes hinder creativity as people hold back on sharing ideas that are still immature.

With this in mind, we decided to convene a few collaborators for a small workshop to critically assess openness in the life sciences. The goal wasn’t to speak for the field or to declare consensus – we know that we have an unrepresentative sample and can speak no universal truths. The goal was to get candid and clear feedback from this initial group about what’s working and what’s not.

What did we learn?

  1. Open science is a general term that is used to represent many different ideas. While most open-science approaches are based on a common premise that sharing, transparency, and/or collaboration will lead to better science, there is neither a universal definition of “open” nor a universal way to apply “open approaches.” Diverse groups use the term to represent a wide variety of activities designed to achieve distinct goals. These are not well distinguished in the language used to describe them, and this can lead to confusion amongst open-science proponents who don’t always feel represented in each other’s ideas. It becomes very difficult to define best practices or to evaluate the success of the open science movement as a whole.
  2. Open science is not an isolated movement. These open approaches are but one part of a much larger scientific – and social – ecosystem. They don’t operate in isolation, and neither should we. We will work with the workshop attendees to produce a full briefing of this meeting, to be published formally as an open-access paper later this year.

In the meantime, we’d like to use this channel to work through some of these topics with a larger group of people. We’re going to think critically about what we mean when we talk about open science in the life sciences and we urge you to engage with us – tell us what you see, how it’s working, what could be done better. We hope (and expect) you to challenge us so that together we can get to a more cohesive and compelling way of talking about open science.

We will post here regularly as we develop our thinking. Some workshop attendees will contribute guest posts, drawing out themes that are important to their work, such as reuse, transparency, collaboration, policy implications, and more. We also want to know your perspectives on open science and encourage you to write a blog post and publish on your preferred platform. Tweet us the link to your post, and stay tuned as we continue this conversation.
