By Jiaxin Zheng and Thomas Schaffter
Natural language processing, or NLP, is a set of techniques that help computers interpret human language. It is particularly impactful in biomedical research, where hospitals hold millions of unstructured clinical notes that must be de-identified before they can be shared with researchers. De-identifying them manually would put significant strain on healthcare systems, which makes this an excellent use case for NLP.
There are two key challenges that NLP developers currently face. One is the lack of access to biomedical data on which to test the performance of their models. Given the size and sensitivity of the data, critical patient information is typically off limits for traditional model development. Another hurdle is a lack of frameworks for assessing performance and generalizability. NLPSandbox.io can help on both fronts.
NLPSandbox.io is one of the first tool-benchmarking platforms that securely connects developers to healthcare data providers. The platform streamlines the development and assessment of tools that are reusable, reproducible, portable and cloud-ready. The NLP Sandbox adopts the model-to-data architecture to enable NLP developers to assess the performance of their tools on public and private datasets. When a developer submits a tool, data partners automatically download the tool and evaluate its performance against their private data. This architecture lets our partners keep full control of their data and ensures that no sensitive information leaves their secure environment.
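To make the model-to-data idea concrete, here is a minimal Python sketch. It is not the NLP Sandbox implementation: the names `PrivateNote`, `evaluate_on_site` and `toy_tool` are hypothetical, and the choice of precision, recall and F1 as the returned scores is an assumption made for illustration. The point it shows is that the submitted tool travels to the data, and only aggregate scores travel back.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, List, Set, Tuple

Span = Tuple[int, int]             # (start, end) character offsets
Tool = Callable[[str], Set[Span]]  # a submitted tool maps note text to PHI spans


@dataclass(frozen=True)
class PrivateNote:
    """Hypothetical record that never leaves the data provider's environment."""
    text: str
    gold_spans: FrozenSet[Span]


def evaluate_on_site(tool: Tool, notes: List[PrivateNote]) -> Dict[str, float]:
    """Run the tool next to the private data and return aggregate scores only."""
    tp = fp = fn = 0
    for note in notes:
        predicted = tool(note.text)
        tp += len(predicted & note.gold_spans)
        fp += len(predicted - note.gold_spans)
        fn += len(note.gold_spans - predicted)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    # No note text or span offsets are included in what is returned.
    return {"precision": precision, "recall": recall, "f1": f1}


def toy_tool(text: str) -> Set[Span]:
    """Hypothetical submitted tool: flags every capitalized word as PHI."""
    return {(m.start(), m.end()) for m in re.finditer(r"\b[A-Z][a-z]+\b", text)}


if __name__ == "__main__":
    private_notes = [
        PrivateNote("Patient John visited on Monday.", frozenset({(8, 12)})),
    ]
    print(evaluate_on_site(toy_tool, private_notes))
```

In this sketch the developer only ever sees the dictionary of scores; the notes and their gold annotations stay with the data provider.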
In addition to overcoming data access hurdles, NLP Sandbox also provides a competitive framework for assessing the performance of tools on various NLP tasks. The first series of tasks supported by the NLP Sandbox is the annotation and de-identification of protected health information (PHI) in clinical notes. With Medical College of Wisconsin onboarded as our first data provider, developers can benchmark their de-identification tools on clinical notes. Additional data from Mayo Clinic and University of Washington will soon follow, enabling developers to evaluate the generalizability of their tool’s performance across multiple datasets.
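As a rough illustration of what annotating and de-identifying PHI involves, the Python sketch below finds date-like strings in a note and masks them. It is a toy under stated assumptions: the regular expression, the `[DATE]` placeholder, and the function names `annotate_dates` and `mask_spans` are all hypothetical, and real tools handle many more PHI types and formats than this.

```python
import re
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets

# Assumption: dates written as M/D/YYYY or MM/DD/YYYY; real notes vary widely.
DATE_PATTERN = re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b")


def annotate_dates(text: str) -> List[Span]:
    """Annotation step: return the character spans that look like dates."""
    return [(m.start(), m.end()) for m in DATE_PATTERN.finditer(text)]


def mask_spans(text: str, spans: List[Span], placeholder: str = "[DATE]") -> str:
    """De-identification step: replace each annotated span with a placeholder."""
    # Work from right to left so earlier offsets remain valid after each edit.
    for start, end in sorted(spans, reverse=True):
        text = text[:start] + placeholder + text[end:]
    return text


if __name__ == "__main__":
    note = "Seen on 03/14/2020; follow-up on 4/2/2020."
    print(mask_spans(note, annotate_dates(note)))
```

Keeping annotation and masking as separate steps means the annotated spans can be compared against gold-standard annotations before any text is altered.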
De-identification of PHI is only one of many tasks that NLP Sandbox will support in the future. We are also partnering with Mayo Clinic to enable the community to benchmark tools that automatically extract information about COVID-19 symptoms from clinical notes. We welcome suggestions for other NLP tasks, especially from partners who can provide data to support these tasks.
To get started, visit NLPSandbox.io, where you will find data schemas, GitHub repositories, and a link to our Tuesday Discord office hours. If you are a data provider and would like to contribute, please reach out at firstname.lastname@example.org. We will also give a live introduction to the service later this month. Register here to hold your spot.
NLP Sandbox is the result of a collaboration by Sage Bionetworks, CD2H, NCATS, MCW, and Mayo Clinic. We hope you will join our growing list of collaborators, and look forward to building and innovating with you.