From Open Systems to Trusted Systems: New Approaches to Data Commons

April 23, 2019


By Michael Kellen

At Sage, we’re always reflecting on the work we’ve done, and how it might need to evolve. Conversations with partners and colleagues at events such as CAOS are a key part of this process. Over the last couple of years, a common theme has been the role that actively collaborating communities play in successful large-scale research projects, and the factors that accelerate or inhibit their formation.

Like any research organization, Sage experiments with different ideas and works in different areas over time. This challenges our technology platform team to build general-purpose systems that not only support the diversity of today’s research, but also pave the way for new types of science to be performed in the future.


Synapse, our flagship platform, allows us to develop data commons using a set of open APIs, a cloud-hosted web portal, and programmatic tools that integrate Sage’s services into any analytical environment. Together, these components make it easier for researchers to aggregate, organize, analyze, and share scientific data, code, and insights.

Initially, our model for Synapse was GitHub – the software platform that has been the locus of the open-source software movement over the past decade. Our thinking was that if we just made enough scientific knowledge open and accessible, scientists around the world would organize themselves and simply start doing better science. In part, we saw our role as unlocking the potential of junior researchers who were digital natives and, perhaps, more willing than established PIs to work in ways unconventional for the existing research ecosystem. Our assumption was that a pure technology solution would be sufficient to accelerate progress.

The reality wasn’t as straightforward as we thought.

Over the course of eight years, we’ve had a lot of large scientific collaborations operate on Synapse, some quite successfully and others less so. The main determinant of success has proven to be the level of alignment of incentives among the participating scientists, and their degree of trust in each other. Further, consortium-wide objectives must be aligned with individual and lab-level priorities. If these elements exist, the right technology can catalyze a powerful collaboration across institutional boundaries that would be otherwise difficult to execute. But without these elements, progress stalls while the exact same technology sits unused.

In a recent talk on the panel Evolving Challenges and Directions in Data Commons at BioIT World West (slides here), I shared several case studies illustrating the aspects of the platform that were most successful in enabling high-impact science, and the characteristics that contributed to that success:

Digital Mammography Dream Challenge

In the Digital Mammography Dream Challenge, we hosted close to 10TB of medical images with clinical annotations in the cloud and organized an open challenge in which anyone in the world could submit machine learning models to predict the need for follow-up screening. Due to patient privacy concerns, we couldn’t release this data publicly. Instead, we built a system in which data scientists submitted models packaged as Docker containers; we executed training and prediction runs in the AWS and IBM clouds and returned only output summaries. This was a huge shift in workflow for the challenge participants, who were more accustomed to downloading data to their own systems than to uploading models to operate on data they cannot see.
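The core of this "model-to-data" pattern can be sketched in a few lines. This is a hypothetical illustration, not the actual Synapse/DREAM infrastructure (which ran Docker containers in the AWS and IBM clouds): the harness applies a participant's model to protected records and releases only an aggregate summary, never the data itself.

```python
# Minimal sketch of the model-to-data pattern (hypothetical names and data).
# A participant submits a model; the harness runs it against records the
# participant can never see, and returns only summary statistics.

def evaluate_submission(model, hidden_records):
    """Run a submitted model over protected records; return only aggregates."""
    predictions = [model(r["features"]) for r in hidden_records]
    correct = sum(p == r["label"] for p, r in zip(predictions, hidden_records))
    return {"n": len(hidden_records), "accuracy": correct / len(hidden_records)}

# A trivial stand-in for a submitted model: flag follow-up when a
# (hypothetical) image-density feature exceeds a threshold.
submitted_model = lambda features: features["density"] > 0.5

protected = [
    {"features": {"density": 0.8}, "label": True},
    {"features": {"density": 0.2}, "label": False},
    {"features": {"density": 0.6}, "label": False},
    {"features": {"density": 0.9}, "label": True},
]

summary = evaluate_submission(submitted_model, protected)
print(summary)  # only this summary ever leaves the protected environment
```

The design choice worth noting is the interface boundary: the participant's code crosses into the data enclave, but only aggregates cross back out.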

The technical infrastructure, developed under the leadership of my colleagues Bruce Hoff and Thomas Schaffter, is one of the more impressive things we’ve built in the last couple of years. Imposing such a shift in workflow on the data scientists risked being a barrier. That proved not to be the case: the incentive structure and publicity surrounding DREAM generated enormous interest, and we ended up supporting hundreds of thousands of workflow runs submitted by over a thousand different scientists.

mPower Parkinson’s Study

In the area of digital health, Sage has run mPower, a three-year observational study (led by Lara Mangravite and Larsson Omberg) of Parkinson’s disease conducted in a completely remote manner through a smartphone app. This study produced a more open-ended challenge: how to effectively learn from novel datasets, such as phone accelerometer and gyroscope data, collected while study participants balanced in place or walked. The study leveraged both Synapse, as the ultimate repository for mPower data, and Bridge – another Sage technology designed to support real-time data collection from studies run through smartphone apps distributed to a remote study population.

We organized a DREAM challenge to compare analytical approaches. This time, we focused on feature extraction rather than machine learning. Challenge participants were able to directly query, access, and analyze a mobile health dataset collected over six months of observations on tens of thousands of individuals. Again, the access to novel data, and to a scientifically challenging and clinically relevant problem was the key to catalyzing a collaboration of several hundred scientists.
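To make the feature-extraction framing concrete, here is a minimal, hypothetical sketch: instead of feeding a raw sensor trace to a model, a participant reduces it to a handful of summary features. The feature names and the toy trace below are illustrative assumptions, not the actual mPower challenge features.

```python
import math

# Hypothetical feature extraction from a 1-D accelerometer magnitude trace,
# e.g. recorded during a balance test. The point of the challenge framing:
# participants compete on how they summarize the signal, not on the model.

def extract_features(signal):
    """Reduce a raw trace to simple summary features (mean, std, RMS)."""
    n = len(signal)
    mean = sum(signal) / n
    var = sum((x - mean) ** 2 for x in signal) / n
    rms = math.sqrt(sum(x * x for x in signal) / n)
    return {"mean": mean, "std": math.sqrt(var), "rms": rms}

# A toy, zero-mean trace standing in for real sensor data.
trace = [0.0, 0.1, -0.1, 0.2, -0.2, 0.0]
features = extract_features(trace)
print(features)
```

In practice such feature vectors, computed per participant and per task, become the rows of a tabular dataset that standard statistical tools can then compare across analytical approaches.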

Colorectal Cancer Genomic Subtyping

Our computational oncology team, led by Justin Guinney, helped to organize a synthesis of genomic data on colon cancer originally compiled by six different clinical research teams. Each of these groups had published analyses breaking the disease into biologically distinct sub-populations, but it was impossible to understand how the respective results related to each other or how to use them to guide clinical work.

Unlike the previous two examples, this was an unsupervised learning problem, and it required a lot of effort to curate these originally distinct datasets into a unified training set of over 4,000 samples. The effort paid off when the teams were able to identify consensus subtypes of colon cancer, linking patterns in genomic data to distinct biological mechanisms of tumor growth. The project initially operated with only the teams that had conducted the original clinical studies – and it was only within the confines of this private group that researchers were truly willing to talk openly about issues with their data. It also helped that each group contributed part of the combined dataset, so everyone felt invested in the shared effort. With the publication of the initial consensus classification system, the data and methods have been opened up, seeding further work by a broader set of researchers relating the subtypes to distinct clinical outcomes.
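The shape of the problem can be illustrated with a toy sketch: samples curated from several cohorts are pooled into one dataset and clustered jointly, so that subtypes are defined across groups rather than within each one. A tiny 1-D k-means on a single made-up marker value stands in here for the real genomic clustering pipeline; all names and numbers are hypothetical.

```python
import random

# Toy illustration of pooled, unsupervised subtyping: three (hypothetical)
# groups each contribute samples; clustering the pooled values yields
# subtype centroids shared across all cohorts.

def kmeans_1d(values, k=2, iters=20, seed=0):
    """Plain 1-D k-means; returns the sorted cluster centers."""
    rng = random.Random(seed)
    centers = rng.sample(values, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        centers = [sum(g) / len(g) if g else centers[i]
                   for i, g in enumerate(groups)]
    return sorted(centers)

# Expression of one marker, measured separately by three groups.
cohorts = {
    "group_a": [1.0, 1.2, 0.9],
    "group_b": [5.1, 4.8],
    "group_c": [1.1, 5.0, 4.9],
}
pooled = [v for samples in cohorts.values() for v in samples]
centers = kmeans_1d(pooled)
print(centers)  # one low- and one high-expression subtype centroid
```

The curation step the text describes is what makes this pooling legitimate: unless the cohorts' measurements are harmonized onto a common scale first, joint clustering mostly recovers batch differences between groups rather than biology.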

Towards Community-Specific Data Commons

What do these three examples have in common? From a scientific standpoint, not much. The data types, analytical approaches, and scientific contexts are all completely different. In retrospect, it’s perhaps obvious that there’s little chance of data, code, or other low-level assets being reused across these projects. The fact that all three projects ran on the same underlying platform is evidence that we’ve developed some generally useful services. But our monolithic, GitHub-style front end has not been an effective catalyst for cross-project fertilization.

What has been a common indicator of success is effective scientific leadership that gives structure and support to the hands-on work of more junior team members. This is even more important when these projects are carried out by highly distributed teams that haven’t previously worked together. Developing this sense of trust and building a functional community is often easier to do in smaller, controlled groups, rather than in a completely open system that, at best, is saturated with irrelevant noise, and, at worst, can be hijacked by actors with bad intentions. Starting small and increasing the “circle of trust” over time is an effective strategy.

These sorts of factors matter even in software development. Despite what you might think from some of the open-source rhetoric, most of the really large-scale, impactful open-source projects benefit from strong leadership that gives a sense of purpose and organization to a distributed group of developers. And even GitHub itself is now part of Microsoft – who would have bet money on that outcome 10 years ago?

Over the past year, the Synapse team has been piloting new web interfaces to our services that repackage these capabilities into more focused, community-specific experiences. With the recent launches of the AMP-AD Data Portal and the NF Data Portal, the first of these experiments are now public. I’m excited to see how our platform continues to evolve as we enter Sage’s second decade, and even more excited to see what new science emerges on top of it.

About: Michael Kellen is the chief technology officer at Sage, where he leads the technology platforms and services team. He has over 10 years of experience developing software for academic and corporate users in the life sciences. He has a doctorate in bioengineering from the University of Washington with a focus in computational biology.

About this series: In February 2019, Sage Bionetworks hosted a workshop called Critical Assessment of Open Science (CAOS). This series of blog posts by some of the participants delves into a few of the themes that drove discussions – and debates – during the workshop.