Illumina has launched what it calls the largest genome‑wide genetic perturbation dataset ever assembled, a resource designed to accelerate AI‑driven drug discovery across the pharmaceutical industry. The new Billion Cell Atlas represents the first installment of a planned five‑billion‑cell atlas that the company expects to complete over the next three years. Illumina said the project will ultimately form the most comprehensive map of human disease biology generated to date.
The Atlas is being built in collaboration with AstraZeneca, Merck, and Eli Lilly, which have joined as founding participants. Together, the companies are generating a curated set of cell lines that will be used to validate drug targets, train large‑scale AI models, and probe biological mechanisms that have historically been difficult to study.
Jacob Thaysen, PhD, Illumina’s CEO, said the scale of the initiative is intended to reshape how AI is applied in early discovery. “We believe the cell atlas is a key development that will enable us to significantly scale AI for drug discovery,” he said. “We are building an unparalleled resource for training the next generation of AI models for precision medicine and drug target identification.”
How pharma partners plan to use the Atlas
Merck plans to use the dataset to support its precision medicine approaches across drug discovery pipelines and AI/ML models. The company expects the Atlas to help train its proprietary foundation models and support the development of virtual cell models aimed at improving disease‑indication prediction. “By harnessing advanced genomic patient datasets, Merck scientists are building and leveraging AI models grounded in real biological variation—not just literature text,” said Iya Khalil, PhD, vice president and head of data, AI & genome sciences at Merck. “Through our close collaboration with Illumina, we’re establishing a scalable bridge from genomic insight to therapeutic impact.”
AstraZeneca views the Atlas as a tool for connecting genetic signals to actionable biology. “Translating genetic information into a clear understanding of disease mechanisms—and then ultimately into medicines—remains a core challenge in R&D,” said Slavé Petrovski, PhD, vice president of the company’s Centre for Genomics Research. “By showing how specific genetic perturbations play out inside human cells, we can help turn genetic signals into mechanistic biology we can directly study.”
Eli Lilly emphasized the importance of scale for next‑generation AI models. “The next generation of AI‑driven drug discovery will depend on biological data at a scale never before achieved,” said Ruth Gimeno, PhD, group vice president of cardiometabolic research. “Comprehensive datasets spanning diverse cell types offer the critical foundation needed to generate meaningful insights into human disease.”
Inside the Billion Cell Atlas
The Atlas captures how one billion individual cells respond to CRISPR‑based perturbations across more than 200 disease‑relevant cell lines, spanning immune, oncologic, cardiometabolic, neurological, and rare genetic conditions. By systematically switching genes on or off across these diverse cell types, researchers can observe functional consequences at single‑cell resolution—data that can be used to characterize mechanisms of action, identify new indications, and validate genetically supported targets.
The project is the first major data product from Illumina’s new BioInsight business unit. The company is using its Single Cell 3’ RNA prep platform to capture millions of cells per experiment and expects to generate roughly 20 petabytes of transcriptomic data within the first year. The data are processed through the DRAGEN pipeline and hosted on Illumina’s Connected Analytics cloud platform for large‑scale analysis.
Illumina said the Billion Cell Atlas is the foundation for a broader effort to build multi‑billion‑cell datasets with partners over time, ultimately contributing to a five‑billion‑cell resource announced last year.


