Arc Institute’s AI Model Evo 2 Designs the Genetic Code Across All Domains of Life

“Today, we can for all practical purposes read, write, and edit any sequence of DNA, but we cannot compose it. Maybe we can cut and paste pieces from nature’s compositions, but we don’t know how to write the bars for a single enzymatic passage. However, evolution does.” Frances Arnold, PhD (Nobel Prize Lecture 2018)

Evo, the genome foundation model that the Arc Institute published last November, has received a major update. The model generalizes across the languages of biology (DNA, RNA, and proteins) for both predictive and generative tasks.

In a new preprint, first published on Arc’s website and not yet peer-reviewed, Evo 2 moves beyond the genomes of single-celled bacteria and archaea to include information from humans, plants, and other more complex single-celled and multicellular species in the eukaryotic domain of life.

The model’s resulting research applications span a diverse array of scientific fields including drug discovery, agriculture, industrial biotechnology, and material science. The multimodal and multiscale work is a collaboration with Nvidia along with contributors from Stanford University, UC Berkeley, and UC San Francisco. 

“The recipe for life is entirely present in the genetic information contained in our DNA,” said Kimberly Powell, vice president of healthcare at Nvidia. “We’re seeking a deeper understanding of biological complexity. Evolution has solved this problem over millions of years, and Evo 2 aims to learn from this knowledge.”

In healthcare, understanding which gene variants are tied to a disease is an invaluable tool for therapeutics. Early validation of Evo 2’s capabilities showed that the model can identify how genetic mutations affect protein, RNA, and organismal fitness. In tests with variants of BRCA1, a gene associated with breast and ovarian cancer risk, Evo 2 achieved greater than 90% accuracy in predicting which mutations are benign versus disease-causing. 

Patrick Hsu, PhD, Arc Institute co-founder and an assistant professor of bioengineering at UC Berkeley, stated that Evo 2 is the only model that can predict the effects of both coding and noncoding mutations.

“It is the second-best model for coding mutations, but it is state-of-the-art for noncoding mutations, which other variant effect prediction methods, such as AlphaMissense from DeepMind, cannot score,” said Hsu. 

Hsu also described Evo 1 as a “blurry picture of single-cell life” because it was trained on a corpus of 300 billion nucleotides derived from prokaryotic genomes. The team “wanted to be much more ambitious” in this collaboration with Nvidia. 

Evo 2 was built on NVIDIA’s DGX Cloud platform and trained on more than 9.3 trillion nucleotides from the genomes of more than 128,000 species across the tree of life. The model uses a novel architecture called StripedHyena 2, which enabled training that was “nearly three times faster than optimized transformer models,” according to Dave Burke, PhD, chief technology officer at Arc Institute. With 40 billion parameters, the model is similar in scale to the current generation of large language models released by Meta, DeepMind, or OpenAI.

Evo 2 can process DNA sequences of up to 1 million nucleotides at once, allowing it to understand relationships between distant parts of the genome. Hsu stated that this long context length unlocks multiple molecular scales, from short biological molecules, such as tRNA, and clusters of genes (e.g., operons) to entire bacterial genomes or eukaryotic chromosomes.

Arc Institute and Nvidia describe Evo 2 as the “largest publicly available AI model for biology to date.” Evo 2 is available for public use on the NVIDIA BioNeMo platform and through an interactive, user-friendly interface called Evo Designer. In addition, the authors have made its training data, training and inference code, and model weights open source.

Evo 2 is trained on over 9.3 trillion nucleotides from over 128,000 genomes across the three domains of life (visualized here as points clustered by similarity). [Arc Institute]

Biology’s app store 

Understanding biology as a “language” is not a new concept. Advances in genome sequencing have allowed us to “read” the human genome, while the invention of CRISPR technology expanded our toolbox to gene “editing.”  

In 2023, Hsu and Brian Hie, PhD, assistant professor of chemical engineering at Stanford University, began thinking about designing or “writing” biological sequences, including proteins, by starting at the foundational layer of DNA itself. “After all, proteins themselves are encoded directly by the genome,” emphasized Hsu.  

“Machine learning started to revolutionize biology, and models such as AlphaFold or ESMFold enabled protein structure prediction and design. Despite these advances, the complexity of these molecules is dwarfed by the overall complexity of an entire cell,” Hsu continued. 

Given that biological functions are not accomplished by a single protein molecule in isolation, constructing synthetic genomes can provide a valuable research tool to investigate broader biological context, a feat that Evo 2 is tackling head-on. 

“A lot of biological design until now has focused on the molecular level because that’s all that we could control. If we have a powerful model that lets us generate at the scale of complete organisms, then that unlocks a lot of downstream tasks [with a wide array of use cases],” said Hie. 

The Evo 2 preprint described three design tasks that span different levels of genomic complexity: 1) a mitochondrial genome; 2) the prokaryotic genome of Mycoplasma genitalium, a commonly used model of the minimal genome; and 3) a yeast chromosome, which represents eukaryotic organisms.

For all three design tasks, the preprint showed evidence supporting genome coherence, such as the construction of genes that code for all the components of the electron transport chain (as predicted by AlphaFold 3) in the case of the mitochondrial genome, and the presence of natural homologs and more complex genomic architecture, such as introns, in the case of the yeast chromosome. 

The preprint also presented a workflow for “generative epigenomics,” which designed DNA sequences with desirable chromatin accessibility profiles to simulate eukaryotic gene regulation.  

When asked about plans for experimental validation, Hie stated that a collaboration with experts in large-scale DNA synthesis and assembly from the University of Washington is underway to insert the chromatin accessibility designs into mouse cells for validation studies.

Looking ahead, the Arc Institute is interested in building on this biological complexity by constructing the virtual cell.  

“The bottleneck to drug discovery is that we don’t know what causes the disease to begin with,” said Hie. “If we have a very capable model of the genome and we couple this with information from the environment through RNA sequencing, gene regulatory networks, and cell signaling networks, then this combined multimodal framework will let us answer these fundamental questions about disease.”

Hie sees Evo 2 as an “operating system,” or a foundational layer, that provides a platform for broad generative functional genomics. While Evo 2 “might not solve all questions in biology,” the model offers wider applicability than task-specific predecessors, such as AlphaFold for protein structure prediction.

“We want to empower the research community to build on top of these foundation models. That’s why we put in so much effort with Nvidia to make this fully open source,” said Hsu. “We’re really looking forward to how scientists and engineers build on this ‘app store’ for biology.”

Fay Lin, PhD, is senior editor for GEN Biotechnology.