January 14, 2020

The Sage Perspective on Data Management and Sharing

Response to Request for Public Comments on a DRAFT NIH Policy for Data Management and Sharing and Supplemental DRAFT Guidance

Submitted by Lara Mangravite and John Wilbanks on Behalf of Sage Bionetworks

Editor’s Note: This response to the draft policy was submitted by pasting reply copy into a web form that limits each field to 8000 characters, hence the format. For reference, the Sage response cites the AMIA response from 2018.

Section I: Purpose (limit: 8000 characters)


Recommendation I.a: Make ‘Timely’ More Specific

The policy states that “shared data should be made accessible in a timely manner…”. “Timely” should be defined so that researchers understand the baselines expected of them and have a boundary beyond which they must share data. These baselines and boundaries should be reflected in the templated DMSPs we recommend elsewhere. More details are provided in our recommendations in the Requirements section, and we further recommend in Section VI that such DMSPs be scored elements of applications.

Recommendation I.b: Elevate the Importance of Data Management

We applaud the mention of data management in both the Purpose section and the title, but it is not given adequate attention in this section. The Purpose text does not address the lessons learned from data sharing within NIH-funded collaborations. From the Cancer Genome Atlas to Clinical Translational Science Awards to the Accelerating Medicines Partnerships, NIH has committed billions annually to such programs. Data sharing sits at the heart of these collaborative networks, but our experience indicates that simply “sharing” data sets is not sufficient to meet the stated purpose. Data management is rarely elevated to a role commensurate with its importance in data reuse. As such, we recommend adding the following text to the end of the first paragraph to delineate the importance of management to achieving the purpose of the policy:

“Data management is an ongoing process that starts well before and goes on well after the deposit of a file under FAIR principles, and NIH encourages practices that have been demonstrated to succeed at promoting data sharing and reuse in previous awards.”

Section II: Definitions (limit: 8000 characters)


Recommendation II.a: Amend the Definition of Data Management and Sharing Plan

The definition of the Data Management and Sharing Plan does not sufficiently capture how DMS is integral to the research process. This Policy should make it clear that data sharing is not an add-on or checkbox, but an ongoing management process that is integrated into the scientific research process. We recommend adding the following text to the definition:

“The plan should describe clearly how scientific data will be managed across the entirety of a research grant and include specific descriptions of how and when resulting data will be shared, including the NIH-approved repositories in which they will be deposited (or, if depositing outside this group, how the proposed repository will be sufficient to meet the requirements).”

Recommendation II.b: Replace the Definition of Data Management

The definition of Data Management does not sufficiently reflect the true extent to which data management must permeate the research process, nor why it is important. Data management is a massive undertaking that improves the quality of shared data. We endorse the 2018 AMIA definition of data management and recommend that the NIH adopt it, replacing the current definition text with the following:

“The upstream management of scientific data that documents actions taken in making research observations, collecting research data, describing data (including relationships between datasets), processing data into intermediate forms as necessary for analysis, integrating distinct datasets, and creating metadata descriptions. Specifically, those actions that would likely have impact on the quality of data analyzed, published, or shared.”

Recommendation II.c: Add a Definition for Scientific Software Artifacts

The stated purpose of this policy is “to promote effective and efficient data management and data sharing.” Per our recommended additions to the Scope section, below, the policy should make clear that what must be managed and shared are not only the “scientific data” and “metadata” created in the course of research, but also the scientific software artifacts created, such as the code underlying the algorithms and models that process data. Accordingly, we echo AMIA’s call for definitions of “scientific software artifacts” and recommend NIH include in this policy the following definition:

“Scientific software artifacts: the code, analytic programs, and other digital, data-related knowledge artifacts created in the conduct of research. These can include quantitative models for prediction or simulation, coded functions written within off-the-shelf software packages such as Matlab, or annotations concerning data or algorithm use as documented in ‘readme’ files.”

Recommendation II.d: Add a Definition for “Covered Period”

Making data available for others to use can pose a significant burden, per the supplemental guidance on Allowable Costs. Investigators will need clear definitions of exactly what will be required of them for data hosting in the short, medium, and long term. As such, we recommend that NIH include a definition in this section for “covered period,” providing as much detail as possible on the expectations for the length of time that investigators must make their data available, including differences in requirements for research awards and data sets (including scientific software artifacts) of different scales.


Section III: Scope (limit: 8000 characters)


Recommendation III.a: Include Scientific Software Artifacts as an Asset to be Managed/Shared

The first sentence in this Policy notes “NIH’s longstanding commitment to making the results and outputs of the research that it funds and conducts available to the public.” Scientific software artifacts (as defined in the response to the Definitions section, above) are outputs as much as data, equally determinative of research findings. Thus, managing and sharing the means of manipulating data from one form to another, transforming raw inputs into valuable outputs, is also important to the end goal of rigorous, reproducible, and reusable science.

Furthermore, it is possible to technically share data while withholding key artifacts necessary to make those data valuable for reuse. These key artifacts could then be exchanged for authorship, position on proposals, or other scientific currencies, thus circumventing a major desired outcome of this policy: removing the unfair advantages of already funded investigators. As such, we recommend that the Scope section include the following statement:

“NIH funded research produces new scientific data and metadata, as well as new scientific software artifacts (e.g. the code of algorithms and models used to manipulate data). Software artifacts are outputs of research as much as data, and it is just as important to manage and share them in the interest of rigor, reproducibility, and re-use. NIH’s commitment to responsible sharing of data extends to scientific software artifacts. As such, throughout this policy, the use of the term “data” should be understood to include scientific software artifacts, per the definition established in Section II.”


Section IV: Effective Date(s) (limit: 8000 characters)

No recommendations planned

Section V: Requirements (limit: 8000 characters)


Recommendation V.a: Tier the Sharing Date Requirement

This policy will require cultural and practice changes for most funded researchers, as well as a nimble reaction to the realities of implementations by NIH. Failing to anticipate the implications of those changes could cause a severe backlash to the policy, undermining its purpose. As such, investigators of those projects least able to redistribute resources necessary to abide by this policy should be given more time to do so. We recommend that NIH adopt AMIA’s 2018 tiered proposal for establishing sharing date requirements based on the size of funding. Projects funded over $500,000 per year would have to comply within one year of approval of the DMSP, those between $250,000-$500,000 within two years, and those below $250,000 within three years.
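
The tiers above can be written as a simple lookup. The following is only an illustrative sketch: the function name is hypothetical, and the handling of awards at exactly $250,000 or $500,000 per year (both placed in the two-year tier here) is our assumption, since the proposal as summarized does not spell out the boundaries.

```python
def sharing_deadline_years(annual_funding_usd: float) -> int:
    """Return the number of years after DMSP approval by which data would be shared.

    Tiers follow the proposal summarized above. Treating exactly $250,000 and
    exactly $500,000 as part of the middle tier is an assumption.
    """
    if annual_funding_usd < 0:
        raise ValueError("annual funding must be non-negative")
    if annual_funding_usd > 500_000:
        return 1  # over $500,000 per year: within one year
    if annual_funding_usd >= 250_000:
        return 2  # $250,000 to $500,000 per year: within two years
    return 3      # below $250,000 per year: within three years


if __name__ == "__main__":
    for amount in (750_000, 400_000, 100_000):
        print(f"${amount:,} per year -> {sharing_deadline_years(amount)} year(s)")
```

Writing the rule out this way also shows that the final policy text would need to state which tier applies at the boundary amounts.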

Recommendation V.b: Create DMSP Templates

We do not expect most researchers to know how to structure a Data Management and Sharing Plan. Furthermore, grants fall into different categories: the funding mechanisms behind the Cancer Genome Atlas and the All of Us Research Program are different from early career researcher grants and most R01s. We therefore recommend the ICs create templates for at least four categories of funding: grants intended to create reference resources for the scientific community, grants that create collaborative networks of multiple laboratories, grants that form “traditional” research but integrate at least two institutions, and grants that only flow to a single institution.

These templates will facilitate understanding of the DMSP obligations by researchers (a form of learning by bootstrapping), as well as facilitate review by standardizing the essential elements and layout of the DMSPs across submissions. Researchers who do not use the standard template would not be penalized, but any DMSP they submit should clearly mark how and where their essential elements map onto the templates provided by NIH. Segmenting these templates by class of resources expected to be shared will make it easier for researchers to understand expectations (and can be tied to kinds of funding mechanism, e.g. U24) and will also make like-to-like evaluation easier for the NIH over time.


Section VI: Data Management and Sharing Plans (limit: 8000 characters)


Recommendation VI.a: Make Data Sharing a Requirement

This section states, “NIH encourages shared scientific data to be made available as long as it is deemed useful to the research community or the public.” However, the future utility of data is often unknown at the time it would be required for deposit, and it is unclear who would be responsible for deeming data as useful. We recommend that NIH require, not encourage, data to be shared. The NIH should also provide both alternate “sharing” mechanisms and opt-out processes for the situations when data sharing is either impossible or inadvisable (i.e. when sharing data would compromise participant privacy or harm a vulnerable group).

Alternate mechanisms could include a private cloud where users “visit” the data and are surveilled in their uses or “model-to-data” approaches where a data steward runs models on behalf of the community. Opt-outs should be rare but achievable, and patterns of opt-out usage should be tracked at the researcher and institution level to assist in evaluation of their use and impact.

Recommendation VI.b: Distinguish Between Purposes of Sharing

The requirements for data sharing should be different for data whose value to the community is realized in different ways. There is a difference between data that are generated with the explicit intent of creating a shared resource for the research community (e.g. TCGA), and data that are generated within the context of an investigator-initiated research project and are to be shared to promote transparency, rigor, and to support emergent long-term reuse. In the former case, a description of a detailed curation, integration, synthesis, and knowledge artifact plan should be present. In the latter case, a description of file format, simple annotation, and long-term storage should be front and center. We recommend that this section explicitly distinguish between these two purposes of sharing, and that different formats be used for developing and assessing DMSPs with respect to these different purposes.

Recommendation VI.c: Require the DMSP as a Scorable Part of the Application

In this policy, the DMSP will be submitted on a Just-in-Time basis. This signals that the plan is not a valued part of the application and is, in fact, an afterthought. NIH should factor the quality of the DMSP in its funding decision process. We recommend that the DMSP be required as a scorable part of the application so that appropriate sharing costs can be budgeted for at the time of application, and the plan can be included as part of the review process.

Recommendation VI.d: Make DMSPs Publicly Available

This section states that, “NIH may make Plans publicly available.” We believe that NIH should ensure transparency with the public, which has funded the work, and take advantage of transparency as a means for encouraging compliance. As such, we recommend that this section state that “NIH will make Plans publicly available.”


Section VII: Compliance and Enforcement (limit: 8000 characters)


Recommendation VII.a: Give Investigators Time to Share

Judging an application based on performance on past DMSPs is only fair if the investigators have had sufficient time to implement that plan. Per Recommendation V.a (above) to tier the sharing date requirement, we recommend that application reviewers begin using evidence of past data and software artifact sharing starting between one and three years after the adoption of the DMSP, depending on the size of the prior award. Those with a prior award of over $500,000 per year could be judged one year after approval of the DMSP, those with a prior award between $250,000-$500,000 after two years, and those with a prior award of below $250,000 after three years.

Recommendation VII.b: Use Existing Annual Review Forms for Proof of Compliance

Compliance with this policy should be integrated with current annual review processes for funded research projects. Proof of compliance should not require more than a single line in existing documentation; otherwise, proof of compliance itself becomes an unnecessary burden. We recommend that NIH add a line for a URL to a FAIR data file in annual review forms, alongside the lines for publications resulting from the data. This would provide an incentive to encourage a broad array of DMS practices and make it as simple as “filling in the blank” on the form. We also recommend that NIH create an evaluation checklist as part of DMSP annual review to be filled out by the investigator and shared alongside the existing annual review forms.

Recommendation VII.c: Certify “Safe Spaces” for DMSP Compliance

Compliance and enforcement will also be significantly easier if NIH develops a process to certify data commons, knowledge bases, and repositories as “safe spaces” for DMSP compliance.

Such a process could analyze the long-term sustainability of a database, its capacity to support analytic or other reuse, its support of FAIR principles, and more. Such a network would significantly “raise the floor” for the broad swath of researchers unfamiliar with FAIR concepts, for researchers at institutions without significant local resources to make data FAIRly available, and more. Accordingly, we recommend that this section include language detailing an NIH certification process for these resources.

Recommendation VII.d: Add Data Sharing and Management Experts to Review Panels

The composition of review panels is a key part of using DMSPs in award decisions. Ensuring that data sharing and management expertise is represented as part of baseline review panel competency will improve initial review and encourage long-term compliance with the key goals of DMSPs.


Supplemental DRAFT Guidance: Allowable Costs for Data Management and Sharing (limit: 8000 characters)


Recommendation VIII.a: Detail the Duration of Covered Costs for Preservation

The funding period for a research project is relatively short compared to the period after the research is complete wherein its outputs might be replicated or reused. Ideally, research outputs would be preserved indefinitely, but preservation has costs. The draft guidance does not specify whether costs to preserve data beyond the duration of the funded grant are allowed or encouraged. We recommend that this section provide detail as to whether NIH will cover data preservation costs after the funding period and, if so, for how long.

Recommendation VIII.b: Detail the Covered Costs for Personnel

DMS costs are not limited to the acquisition of tools, infrastructure, and the procurement of services; they also entail the time and effort of research staff internal to the investigating institution. The draft guidance does not specify whether personnel costs are allowable expenses related to data sharing. We recommend that this section provide detail as to whether NIH will cover such personnel costs. Data sharing and management, done well, imposes a short-term cost in anticipation of longer-term benefit, and NIH should clarify where that cost comes from as part of the Policy.

Recommendation VIII.c: Detail How Cost Levels Will Affect Funding Decisions

The Policy does not state whether a higher cost for better DMS might penalize (or advantage) a proposal in an IC’s funding decisions. If potential recipients A and B propose to do the same research with the same traditional research costs, but A budgets for a robust “Cadillac” DMS plan, whereas B budgets for a bare-minimum “Chevy” plan, which does NIH choose? All things equal, should it choose the costlier, more robust option? Is it OK that it is a “tax” on the research proper? Is there an ideal ratio of traditional research costs to DMS costs? Is there a standard way to compare costs with benefits? We recommend that NIH provide detail in this section regarding whether and how DMS costs will affect funding decisions.


Supplemental DRAFT Guidance: Elements of a NIH Data Management and Sharing Plan (limit: 8000 characters)


Recommendation IX.a: Address Different ‘Community Practices’ Across Disciplines

Section 1 of this supplemental guidance calls for “Providing a rationale for decisions about which scientific data are to be preserved and made available for sharing, taking into consideration…consistency with community practices.” However, different disciplinary fields can have different community standards. Some disciplines have a culture of sharing more, while others share less or not at all. Should all disciplines be held to the same DMS standards, or will investigators of different disciplines be expected to adhere to different community practices? If the former, how will this standard be established and what are the ramifications for compliance in disciplines currently outside of this standard? We recommend that NIH provide additional detail in this section (or, if necessary, in separate supplemental guidance) as to what the DMS expectations are within and across scientific disciplines.

Recommendation IX.b: Direct the Use of Existing Repositories

Section 4 of this supplemental guidance states, “If an existing data repository(ies) will not be used, consider indicating why not…” We recommend that the word “consider” be removed. This policy should recommend the use of established repositories and, if this is not feasible, then the investigator should justify their decision with a specific reason. We understand that many scientists are unaware of the infrastructure already in place, so we also recommend that NIH provide a list of existing data repositories with a certification of compliance to increase their use. Additionally, NIH may wish to provide guidance and build associated resources to help investigators choose which of these repositories to use. If there are repositories that they must use (e.g. clinicaltrials.gov), or that NIH would prefer them to use, or for which NIH has no preference (i.e., it would like the “market” to arrive at the best option), then NIH should make these degrees of requirement plain to investigators and make tools and infrastructure available to help them decide.

Recommendation IX.c: Clarify Sharing Requirements for Data at Different Degrees of Processing and Curation

Section 1 requires investigators to describe “the degree of data processing that has occurred (i.e., how raw or processed the data will be).” This raises the question as to whether the investigator can choose the level of processing and/or curation of the data to share, or if the investigator must share data at all levels of processing/curation. For purposes of reproducibility, we should encourage, or require, not only the sharing of data, but descriptions of data processing at each level (per Section 2: Related Tools, Software and/or Code). This may, of course, increase the costs of DMS, so additional guidance would also be needed on what thresholds there may be. The NIH should also designate where the investigator has freedom to choose the levels of data shared and how the investigator should make tradeoffs.

Recommendation IX.d: Expand the Requirements and Guidance for Rationale

Section 1 requires a rationale for which data to preserve or share based on the criteria of “scientific utility, validation of results, availability of suitable data repositories, privacy and confidentiality, cost, consistency with community practices, and data security.” This rationale is limited to the choice of which data to share, while there are other important DMS decisions that warrant rationales. We recommend NIH require a rationale on where to share data and how long they will be available (Section 4), in what format they are shared (Section 3), and what other things might be shared, such as algorithms (per Section 2). As with the choice of which data to preserve and share, NIH should offer criteria for decisions in each of these areas as well.

For choices regarding data preservation and sharing, as well as these other choices, if NIH has any preferences on how to weigh and balance criteria, we recommend it make those plain through additional guidance. Further, it should develop tools and infrastructure to help investigators to weigh and balance them, and conduct periodic audits/evaluations to understand how investigators across fields, over time, are making these judgements, if those judgements are in the best interest of the scientific community, and what additional incentives/requirements might be put in place.


Other Considerations Relevant to this DRAFT Policy Proposal (limit: 8000 characters)


Recommendation X.a: Detail how NIH will Monitor and Evaluate the Implementation of this Policy

A planning mechanism without an evaluation mechanism is only half complete. This policy should establish an adaptive system that improves DMS over time through feedback and learning. We recommend that this policy contain a new section that details how NIH will monitor and evaluate performance toward individual DMSPs during the funding period and after, to the extent that data are planned to be preserved after. Further, we recommend this new section also detail how NIH will monitor and evaluate implementation of this policy across all DMSPs, using evidence to illustrate how its purpose is or is not being achieved and what changes might be made to improve it. Policy-wide monitoring and evaluation information and reports should be made publicly available. Publicizing measures (e.g., usage rates and impact of previously shared data) is also a way to promote a culture where investigators are incentivized to produce datasets that are valuable, reusable, and available.