Metapipeline-DNA Automates and Standardizes Genome Sequencing Analysis

DNA double helix [TanyaJoy / iStock / Getty Images Plus]

In a single experiment, scientists can decipher the entire genomes of many patient samples, animal models or cultured cells. To fully realize the potential to study biology at this unprecedented scale, researchers must be equipped to analyze the massive amounts of data generated by these new methods.

Scientists headed by a team at Sanford Burnham Prebys Medical Discovery Institute and the University of California Los Angeles have now reported on the construction and testing of a new computational tool for tackling massive and complex sequencing datasets. The new resource, named metapipeline-DNA, offers an automated, extensible, and cloud-compatible analysis pipeline for DNA sequencing data that can derive genetic characteristics and evolutionary features from raw sequencing reads. The new platform may also make sequencing data analysis more standardized across different research labs.

Yash Patel, MSc, a cloud and AI infrastructure architect at Sanford Burnham Prebys, is co-first author of the team’s published paper in Cell Reports Methods, titled “Metapipeline-DNA: A comprehensive germline and somatic genomics Nextflow pipeline,” in which they wrote, “Metapipeline-DNA was designed to facilitate analysis of DNA sequencing data at scale while retaining the configurability and flexibility needed in academic environments.”

The sequence of a single human genome represents about 100 gigabytes of raw data, the rough equivalent of 20,000 smartphone photos. The sheer scale of experimental data increases significantly as tens or hundreds of genomes are added into the mix.
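The scale comparison can be sanity-checked with a few lines of arithmetic. The ~5 MB-per-photo figure below is an assumption chosen to match the article's comparison, not a number from the paper:

```python
# Sanity-check the scale comparison: one genome's raw data vs. smartphone photos.
GENOME_RAW_BYTES = 100e9   # ~100 GB of raw reads per human genome (from the article)
PHOTO_BYTES = 5e6          # assumed ~5 MB per smartphone photo

photos_per_genome = GENOME_RAW_BYTES / PHOTO_BYTES
print(f"{photos_per_genome:,.0f} photos")  # 20,000 photos

# A modest cohort multiplies this quickly:
cohort = 100
print(f"{cohort * GENOME_RAW_BYTES / 1e12:.0f} TB for {cohort} genomes")  # 10 TB
```

At cohort scale, raw reads alone reach tens of terabytes before any intermediate files (alignments, variant calls) are produced, which is what motivates automated, cloud-compatible pipelines.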

Yash Patel, MSc, is a cloud and AI infrastructure architect at Sanford Burnham Prebys. [Sanford Burnham Prebys]

As the technology to produce this data has rapidly advanced over the last 10-15 years and become more affordable and accessible, many labs have built their own software to use for analysis, or customized open-access tools shared freely by colleagues.

Some of these resources only work on specific supercomputing or cloud computing systems. “The growing availability of DNA sequencing has been paralleled by rapid development and adoption of both specific algorithms and workflow software,” the authors wrote in their paper. “New discoveries often rely heavily on complex workflows comprising a mixture of established and novel algorithms.” The authors point out that this use of complex workflows is increasing the emphasis on standardization, extensibility, quality control, and compute infrastructure. “Workflow implementations routinely differ across research groups, with many groups creating their own,” the team continued.

This fragmented software landscape can also complicate collaboration across institutions, add difficulties when labs move to new institutions or institutions switch to new computing solutions, and contribute to a lack of standardization as well as challenges reproducing studies with different tools. “Bioinformatics pipelines for genomic sequencing data such as metapipeline-DNA are designed to standardize analysis of all this data to make sure it is processed in a uniform way, and in a reproducible way,” said Patel. “The goal is to automate quality control, determination of genetic variants and all the other analysis steps to make it much easier so that researchers do not need to write their own code to process their data.”

This illustration demonstrates how metapipeline-DNA processes raw genome sequencing data: it begins by aligning the sequence of DNA base pairs to a reference genome, then produces sets of detected variants and other genetic and evolutionary features. [Yash Patel, Sanford Burnham Prebys]

The team now reports on their development of metapipeline-DNA to automate genomic analyses, and address what they describe as the need for a flexible, robust framework that accommodates diverse sequencing methods and is highly scalable and adaptable across computational environments. “To address the need for a robust open-source DNA sequencing analysis pipeline, we created metapipeline-DNA,” the authors explained. The new platform can work on multiple compute systems and clouds, facilitating analyses at any scale. “This Nextflow metapipeline is highly customizable and is capable of processing data from any stage of analysis … It analyses targeted and whole-genome sequencing data from raw reads through preprocessing, feature detection by multiple algorithms, quality control, and data-visualization.”
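The workflow idea described above, a configurable chain of stages that can start from raw reads or from any intermediate product, can be sketched in a few lines. This is an illustrative toy in Python, not metapipeline-DNA's actual code or API; the stage names and data shapes are invented for the example:

```python
# Illustrative sketch of a configurable stage chain (not metapipeline-DNA's API):
# stages run in order, and analysis can resume from any intermediate stage.
from typing import Callable

# Hypothetical stage functions; each consumes and extends a dict of artifacts.
def align(data): return {**data, "bam": f"aligned({data['reads']})"}
def call_variants(data): return {**data, "vcf": f"variants({data['bam']})"}
def quality_control(data): return {**data, "qc": "qc-report"}

STAGES: list[tuple[str, Callable]] = [
    ("align", align),
    ("call_variants", call_variants),
    ("quality_control", quality_control),
]

def run_pipeline(data: dict, start_at: str = "align", skip: frozenset = frozenset()) -> dict:
    """Run stages in order, beginning at `start_at` and omitting any in `skip`."""
    started = False
    for name, stage in STAGES:
        if name == start_at:
            started = True
        if started and name not in skip:
            data = stage(data)
    return data

# Full run from raw reads:
result = run_pipeline({"reads": "sample.fastq"})
print(result["vcf"])  # variants(aligned(sample.fastq))

# Resume from an existing alignment (mirrors "processing data from any stage"):
resumed = run_pipeline({"bam": "existing.bam"}, start_at="call_variants")
```

In the real pipeline, Nextflow handles the equivalent orchestration across compute clusters and clouds, including parallelism and resumption, which is what makes the framework portable across computing environments.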

Extensive quality control, testing, and data-visualization are built into each individual step and into the full metapipeline. “Metapipeline-DNA includes a range of quality control steps and pipelines to assess data at each step, including raw reads, alignments, and variant calls,” the authors noted. The scientists emphasize the software’s ability to detect and recover from common errors.

Even with the powerful supercomputing clusters scientists use to analyze sequencing data, failed runs can cost days of computing time and delay new discoveries. “In designing the software, we focused on making sure that the choices we present to the users are fully validated before the pipeline runs,” said senior and corresponding author Paul Boutros, PhD, MBA, director and professor in the NCI-Designated Cancer Center at Sanford Burnham Prebys and senior vice president of Data Sciences. “In our lab, we don’t want to suffer a setback due to a preventable configuration error, and we don’t want it to happen to anyone using our pipelines.”

To improve the ability of metapipeline-DNA to determine where changes in the genome have occurred, the scientists worked with the Genome in a Bottle Consortium led by the U.S. Department of Commerce’s National Institute of Standards and Technology. By incorporating this public-private-academic consortium’s meticulously validated resources, the researchers reduced the rate of false positives without reducing the tool’s sensitivity in finding true genetic variants.
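The tradeoff at stake here is between precision (how many reported variants are real) and sensitivity (how many real variants are found). A small worked example makes the point; the counts below are invented for illustration, not results from the paper:

```python
# Hypothetical variant-calling counts illustrating the precision/sensitivity tradeoff
# (numbers are invented for this example, not from the paper).
def precision(tp, fp): return tp / (tp + fp)     # fraction of reported variants that are real
def sensitivity(tp, fn): return tp / (tp + fn)   # fraction of real variants that are found

# Before filtering against a validated truth set:
tp, fp, fn = 9_000, 1_000, 500
print(f"precision={precision(tp, fp):.3f}, sensitivity={sensitivity(tp, fn):.3f}")
# precision=0.900, sensitivity=0.947

# After filtering: false positives drop while true positives are preserved,
# so precision rises and sensitivity is unchanged.
tp2, fp2, fn2 = 9_000, 200, 500
print(f"precision={precision(tp2, fp2):.3f}, sensitivity={sensitivity(tp2, fn2):.3f}")
# precision=0.978, sensitivity=0.947
```

A filter validated against a truth set like Genome in a Bottle's is valuable precisely because it lets a pipeline discard spurious calls without also discarding real ones.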

The researchers also produced two case studies demonstrating the pipeline’s capabilities for cancer research. The investigators used metapipeline-DNA to analyze sequencing data from five patients who donated both normal tissue and tumor samples to the Pan-Cancer Analysis of Whole Genomes dataset, as well as another five from The Cancer Genome Atlas.

Paul Boutros, PhD, MBA, director and professor in the NCI-Designated Cancer Center at Sanford Burnham Prebys and senior vice president of Data Sciences. [Sanford Burnham Prebys]

The next step is to get metapipeline-DNA into more labs to accelerate discoveries, and to continue improving the resource with more user feedback. “This tool should enable labs to process data without needing a lot of background in computation or computer infrastructure, and without having to optimize for their specific computing environment,” said Patel. “Metapipeline-DNA fills this key niche of supporting the rapidly expanding volume of sequencing data, supporting a range of existing tools and algorithms and remaining flexible for ongoing expansion,” the authors wrote. “By facilitating the integration of diverse tools and supporting the rapid development of new methodologies, it positions itself as a versatile platform for future enhancements as novel DNA sequencing and analysis methods are developed.”

In addition, the authors plan to build upon this foundation to create automated, end-to-end solutions for analyzing sequencing of other biological molecules such as RNA and proteins. “Workflows across different biomolecules can share the architecture, automation and quality control methods of metapipeline-DNA such that improvements to any single pipeline can improve the others,” commented Boutros. “We’re excited to expand to other data-intensive high-throughput sequencing techniques to continue improving the pace and efficiency of discovery in our lab, at Sanford Burnham Prebys and throughout the research community.”

The collaborative development process included 43 contributors making 1,408 pull requests to enhance the underlying code, and 46 individuals submitting 1,124 suggestions, feature requests, and issue reports. In their paper the authors state that metapipeline-DNA is open source under the GPLv2 license. It is available at https://github.com/TheBoutrosLab/metapipeline-DNA.