Introducing NLPSandbox.io

By Jiaxin Zheng and Thomas Schaffter 

Natural language processing (NLP) is a set of techniques that help computers interpret human language. It is particularly useful in biomedical research, where hospitals hold millions of unstructured clinical notes that must be de-identified before they can be shared with researchers. De-identifying these notes manually would put significant strain on healthcare systems, which makes the task an excellent use case for NLP.

NLP developers currently face two key challenges. One is the lack of access to biomedical data on which to test the performance of their models. Given the size and sensitivity of the data, critical patient information is typically off limits for traditional model development. The other is a lack of frameworks for assessing performance and generalizability. NLPSandbox.io can help on both fronts.

NLPSandbox.io is one of the first tool-benchmarking platforms that securely connects developers to healthcare data providers. The platform streamlines the development and assessment of tools that are reusable, reproducible, portable, and cloud-ready. The NLP Sandbox adopts the model-to-data architecture so that NLP developers can assess the performance of their tools on public and private datasets. When a developer submits a tool, data partners automatically download the tool and evaluate its performance against their private data. This architecture lets our partners keep full control of their data and ensures that no sensitive information leaves their secure environment.
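
The model-to-data flow described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the NLP Sandbox API: the span representation, the `evaluate_on_site` function, and the toy date annotator are hypothetical names introduced here, and exact span matching is an assumed scoring rule. The point it shows is that the tool travels to the data, and only aggregate scores travel back.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Set, Tuple

# Hypothetical annotation format: (start offset, end offset, PHI label).
Span = Tuple[int, int, str]


@dataclass(frozen=True)
class Score:
    precision: float
    recall: float
    f1: float


def evaluate_on_site(
    tool: Callable[[str], Set[Span]],
    private_notes: List[str],
    gold: List[Set[Span]],
) -> Score:
    """Run a submitted tool inside the data partner's environment.

    The notes and gold annotations never leave this function; only the
    aggregate metrics are returned to the developer.
    """
    tp = fp = fn = 0
    for note, expected in zip(private_notes, gold):
        predicted = tool(note)
        tp += len(predicted & expected)   # spans found and correct
        fp += len(predicted - expected)   # spans found but not in gold
        fn += len(expected - predicted)   # gold spans the tool missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return Score(precision, recall, f1)


def toy_date_annotator(note: str) -> Set[Span]:
    """Stand-in for a submitted tool: flags ISO-style dates only."""
    return {
        (m.start(), m.end(), "DATE")
        for m in re.finditer(r"\d{4}-\d{2}-\d{2}", note)
    }


if __name__ == "__main__":
    # Made-up example note and gold annotations, for illustration only.
    notes = ["Seen on 2020-01-05 by Dr. Smith."]
    gold = [{(8, 18, "DATE"), (25, 30, "NAME")}]
    print(evaluate_on_site(toy_date_annotator, notes, gold))
```

In this sketch the developer only ever sees the returned `Score`, which mirrors how a data partner can report performance without exposing any note text.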

In addition to overcoming data access hurdles, NLP Sandbox provides a competitive framework for assessing the performance of tools on various NLP tasks. The first tasks supported by NLP Sandbox are the annotation and de-identification of protected health information (PHI) in clinical notes. With the Medical College of Wisconsin onboarded as our first data provider, developers can benchmark their de-identification tools on clinical notes. Additional data from Mayo Clinic and the University of Washington will soon follow, enabling developers to evaluate how well their tools’ performance generalizes across multiple datasets.

De-identification of PHI is only one of many tasks that NLP Sandbox will support in the future. We are also partnering with Mayo Clinic to enable the community to benchmark tools that automatically extract information about COVID-19 symptoms from clinical notes. We welcome suggestions for other NLP tasks, especially from partners who can provide data to support these tasks.

To get started, please check out NLPSandbox.io, where you will find data schemas, GitHub repositories, and a link to our Tuesday Discord office hours. If you are a data provider and would like to contribute, please reach out at team@nlpsandbox.io. Lastly, we will give a live introduction to the service later this month. Register here to hold your spot.

NLP Sandbox is the result of a collaboration by Sage Bionetworks, CD2H, NCATS, MCW, and Mayo Clinic. We hope you will join our growing list of collaborators, and look forward to building and innovating with you.

Cancer Complexity Knowledge Portal Launched

The Cancer Complexity Knowledge Portal has officially launched. Funded by the National Cancer Institute Division of Cancer Biology, the portal enables researchers to submit multi-faceted queries related to the latest discoveries, data, tools, methods, and publications from three cancer research communities:

  • Cancer Systems Biology Consortium (CSBC), which aims to address the challenges of complexity in cancer research through a combination of experimental biology and computational modeling, multi-dimensional data analysis and systems engineering.
  • Physical Sciences in Oncology Network (PS-ON), which supports research programs that connect cancer biologists and oncologists with scientists from the fields of physics, mathematics, chemistry, and engineering to address some of the major questions and barriers in cancer research.
  • Cancer Tissue Engineering Collaborative (TEC), which supports the development and characterization of biomimetic tissue-engineered technologies.

Sage Bionetworks serves as the resource coordinating center for CSBC and PS-ON, and it developed and maintains the portal.

Explore the Cancer Complexity Knowledge Portal

Sage Platform Team Officially Launches Data Portals

The AMP-AD Knowledge Portal has officially launched. More than 270 researchers have already contributed multi-omic data. Congrats to the Platform, Systems Biology, and Neurodegenerative Research teams!

The Platform Team has an ongoing effort to create community-specific tools that engage researchers. The NF Data Portal and the CSBC-PSON Data Portal are two other examples.

Read Sage CTO Mike Kellen’s blog post explaining the rationale behind moving toward data commons – and portals.

Sage Bionetworks launches interactive, web-based explorer, Agora, in conjunction with NIH-led AMP-AD Target Discovery and Preclinical Validation Project

Full Article from Business Wire

Sage Bionetworks announces the launch of the Agora platform, an interactive, web-based tool that allows researchers to share and explore curated genomic analyses of Alzheimer’s Disease (AD). The analyses accessible through Agora represent the culmination of over five years of research from the dozens of scientists who are part of the NIH-led Accelerating Medicines Partnership – Alzheimer’s Disease (AMP-AD) Target Discovery and Preclinical Validation Project. AMP-AD is a precompetitive public-private partnership led by NIH’s National Institute on Aging (NIA) and managed by the Foundation for the NIH (FNIH), bringing together the government, industry, and non-profit sectors to transform the way disease-relevant therapeutic targets are discovered and validated.

The AMP-AD program has generated a wealth of genomic, RNA expression, proteomic, and metabolomic data from over 3000 human brain and plasma samples collected in several NIA-supported AD cohorts and brain banks. The raw and processed data have been made widely accessible to qualified researchers through the AMP-AD Knowledge Portal. The datasets available through the Knowledge Portal have been used by the AMP-AD consortia members to produce hundreds of novel scientific research papers. In addition to AMP-AD investigators, external researchers have benefitted from the data sharing policy mandating rapid and broad sharing of data and have made critical new observations, including a recently published study that highlighted a previously uncharacterized relationship of human herpes virus with AD.

Although use of primary data is typically limited to investigators with bioinformatic expertise, AMP-AD investigators have also generated analyses that should be useful to a broader set of researchers. The launch of the Agora portal represents the first time that the analyses have been shared outside of the AMP-AD consortia members, which should enable additional groundbreaking discoveries.

“Agora enables researchers to leverage AMP-AD analyses to advance their own scientific questions,” says Ben Logsdon, Director of Neurodegenerative Disease Research at Sage Bionetworks and the lead investigator on the Agora project. “These results were developed to answer questions posed by AMP-AD researchers but they are broadly useful. Agora provides an easy tool to enable the exploration and reuse of these results by anyone.”

“The most exciting results featured in this early release of Agora is the AMP-AD nominated targets list – a set of genes and proteins derived from unbiased computational analyses of rich human multi-omics data,” said Suzana Petanceska, Ph.D., Program Director at the NIA, overseeing the AMP-AD Target Discovery Consortium. “These molecular signals could illuminate new disease biology or serve as novel therapeutic targets,” she explained. “We are purposely releasing them at an early stage of the target evaluation process to allow us to integrate the input of external researchers and to crowdsource the follow-on evaluation,” added Petanceska.

“Because AMP-AD operates under open science principles, researchers rapidly disclose data and results and, in turn, they receive early peer review to help guide research decisions,” said Lara Mangravite, President of Sage Bionetworks and an AMP-AD Principal Investigator. “Agora extends this approach so that external investigators can get actively involved in evaluation of the AMP-AD targets as well as further their own research by evaluating the performance of their genes of interest against multiple computational meta-analyses.”

“AMP-AD is a clear response to the National Plan to Address Alzheimer’s Disease,” said Eliezer Masliah, M.D., director of the Division of Neuroscience at NIA. “It is enabling precision medicine and facilitating the principles of open science and rapid dissemination of new targets with more shots on goal for AD.”

Agora will be frequently updated to incorporate the latest analyses from AMP-AD and its affiliate AD consortia. The initial release includes differential expression and co-expression network meta-analyses across four human RNA-sequence data sets. Future releases will expand to include human proteomic and metabolomic analyses, comparative evaluations of disease signatures across species, and integration of druggability and tractability information to guide selection of targets for early drug discovery. Analyses developed within other consortia in the NIA’s AD Translational Research portfolio including MODEL-AD, M2OVE-AD, and AD Resilience will also be integrated into future iterations.
