Data Engineer – Informatics Workflows
Sage Bionetworks, Seattle WA
At Sage Bionetworks, we believe that we can learn more by learning from each other. By improving the way scientists collaborate, we help to make science more effective. We partner with researchers, patients, and healthcare innovators to drive collaborative data-driven science to improve health. Making science more open, collaborative, and inclusive ultimately advances biomedicine.
Do you have expertise in data integration and a passion for mission-driven work? Do you want to be an important contributor to a team that includes computational biologists, software engineers, and data curators? If so, you could be our next Data Engineer.
As part of the Computational Oncology team, you’ll work with scientists and developers to manage genomics and informatics pipelines that support high-throughput data ingestion and standardization for cancer and clinical research. These workflows will run in cloud-based environments to process data contributed by large external research consortia. Results will be aggregated and shared through publications, portals, and interactive applications to accelerate discovery in the field.
What you’ll be doing:
- Implement software tools to enable automated ingestion of data and resources into central data repositories.
- Build genomic, proteomic, metabolomic, and other “omic” analysis pipelines using cloud-based platforms.
- Standardize and deploy pipelines as reproducible workflows for scientific compute on terabytes of data using cloud infrastructure.
- Develop scalable and secure solutions for distributed workflow execution.
- Write documentation and provide training for researchers in the use of workflows.
We’d love to hear from you if:
- You have a PhD in Computer Science, Bioinformatics, Statistics, Computer Engineering, or a related computational field, or an MS and 3+ years of relevant job experience.
- You’re enthusiastic about open science, collaboration, and reproducible research.
- You have experience developing and deploying genomics pipelines in cloud-based environments (preferably AWS).
- You have experience processing and managing large datasets, especially those generated by genomics and next-generation sequencing technologies.
- You’re proficient in scripting and experienced with package development in R or Python.
- You can work in Unix environments and with Unix-based scripting tools (sed, awk, grep, shell scripts, etc.).
- You’re proficient in container technologies such as Docker.
- You’re familiar with community-standard or domain-specific workflow languages and frameworks such as Common Workflow Language (CWL), Workflow Description Language (WDL), Nextflow, Snakemake, Galaxy, or Apache Airflow.
- You have experience with collaborative development and version control systems (e.g., Git).
- You have experience with software development life cycles and familiarity with continuous integration (CI), continuous delivery (CD), and testing frameworks.