Trillion Gene Atlas Expands Evolutionary Datasets for Next-Generation AI Therapeutics

The Trillion Gene Atlas, an initiative to generate and model biological data at the trillion-gene scale, has been launched by Basecamp Research in collaboration with Anthropic, Ultima Genomics, and PacBio. Powered by NVIDIA AI infrastructure, the Trillion Gene Atlas aims to expand known evolutionary genetic diversity 100-fold by collecting genomic data from more than 100 million species across thousands of sites worldwide. The initiative, which was unveiled during the Health Track at SXSW and the NVIDIA GTC conference in San Jose, is made possible by Basecamp’s growing network of global biodiversity partners.

The initiative is built on three pillars: large-scale DNA sequencing, global data supply partnerships, and advanced computing. Together with AI systems capable of reasoning across complex data, these foundations can help turn vast datasets into therapeutic discoveries. By increasing the evolutionary data available to AI by another 100x, Basecamp Research aims to make drug design faster and more systematic.

The project’s goal is to provide the diverse training data that AI systems need to learn from evolution and design new medicines on demand. “Today’s biological AI models are trained on a narrow slice of life on Earth,” said Glen Gowers, co-founder and CEO of Basecamp Research, speaking at SXSW in Austin. “The Trillion Gene Atlas expands the known genetic universe by orders of magnitude beyond what is in public databases. Training models at this scale establishes a new paradigm for programmable therapeutic design.”

As model size and computing power increase, diverse data becomes a critical enabler of progress in AI drug development and on real-world benchmarks. All current sequence-based foundation models rely on variants of the same public repositories, with 80% of these trained on a public database containing fewer than 250 million sequences.

Basecamp Research’s EDEN foundation models, released in January, bypass the industry’s evolutionary “data wall” by training entirely on BaseData—a proprietary genomic database that is currently more than 10 times larger than all public resources combined. By learning from an unprecedented 10 billion new-to-science genes across one million newly discovered species, EDEN unlocked new scaling laws for AI in biology.

This expansion in dataset diversity moved EDEN beyond simple prediction, making it the first model capable of designing diverse therapeutics directly from a disease prompt. In wet-lab validation, EDEN demonstrated zero-shot activity in primary human T-cells without requiring any human or clinical data. The model has successfully generated hits across multiple frontier modalities, notably pioneering AI-Programmable Gene Insertion (aiPGI) to insert healthy genes and designing targeted antimicrobial peptides with a 97% hit rate against priority pathogens.

The Trillion Gene Atlas builds on this approach by greatly expanding the breadth and contextual depth of genomic data in the known “internet of biology” suitable for AI training.

“Biology has been fundamentally data-starved when compared to other fields like language or computer vision as researchers have lacked the tools required to generate data at scale,” said Gilad Almogy, PhD, founder and CEO of Ultima Genomics.

Over the past six years, Basecamp Research has built a network of scientific collaborators across 31 countries, establishing a scalable evolutionary genomics pipeline purpose-built for AI training. The company collects genomic data from ecosystems beyond the reach of traditional laboratories. As part of the Atlas launch, Basecamp is announcing new partnerships in Chile and Argentina, along with an expanded collaboration in Antarctica, further extending its global biodiversity network.

The Trillion Gene Atlas will be powered by NVIDIA’s accelerated computing infrastructure to process vast quantities of genetic data at the petabase scale. As part of this effort, Basecamp plans to leverage NVIDIA Parabricks to significantly accelerate metagenomic assembly. Through parallelized data processing, automated annotation, and large-scale model training, the partners expect to compress a task that previously would have required more than 20 years of processing time to less than two years.

Anthropic joins as part of its broader effort to add new capabilities for life sciences by connecting Claude to more scientific platforms. By combining Claude’s advanced reasoning capabilities, EDEN’s therapeutic design capabilities, and NVIDIA’s CUDA-X libraries for processing unstructured data, the initiative aims to create an integrated workflow for interpreting complex clinical data and translating it directly into therapeutic design.