Xaira Therapeutics Releases Largest Perturb-Seq Dataset to Power the Virtual Cell

Artificial intelligence (AI) drug developer Xaira Therapeutics launched in April 2024 with a jaw-dropping $1 billion in committed capital and a star-studded leadership lineup that includes David Baker, PhD, a 2024 Nobel Laureate in Chemistry and University of Washington/HHMI investigator; Carolyn Bertozzi, PhD, a 2022 Nobel Laureate in Chemistry at Stanford University; Scott Gottlieb, MD, former FDA commissioner; and Alex Gorsky, former CEO of Johnson & Johnson, among others. Ever since, the field has eagerly awaited equally blockbuster scientific results.

The AI unicorn has delivered, capping its first year by releasing a gift to the virtual cell community: the largest publicly available Perturb-seq dataset, termed X-Atlas/Orion, to interrogate how cells respond to external conditions, such as therapeutic interventions, at large scale.

Announced in the company’s first preprint, X-Atlas/Orion comprises eight million cells with perturbations targeting all human protein-coding genes, sequenced to a depth of more than 16,000 unique molecular identifiers (UMIs) per cell. The preprint is posted on bioRxiv and has not yet been peer reviewed.

While researchers have traditionally treated Perturb-seq gene knockdowns as an “on” or “off” switch, Xaira’s method offers an advance by detecting dose-dependent genetic effects, revealing how gene activity changes with the intensity of a given intervention. Example applications include defining the precise percent inhibition at which a drug target produces a desired therapeutic effect.

The dataset was made possible by Xaira’s concurrently introduced Fix-Cryopreserve-ScRNAseq (FiCS) Perturb-seq platform, a scalable technology designed for large-scale data generation.

To promote collaboration within the virtual cell community, X-Atlas/Orion will be made publicly available under a non-commercial license. Xaira is open to discussing data collaborations with commercial entities that express interest.

“When we put datasets like this in the hands of other computational researchers, we’re excited to see what kind of new model architecture and approaches they can come up with,” said Ci Chu, PhD, vice president of early discovery at Xaira and senior author of the preprint, in an interview with GEN Edge.

Led by CEO Marc Tessier-Lavigne, PhD, formerly the president of Stanford University and CSO of Genentech, Xaira has sites in Seattle and San Francisco. In addition to building the virtual cell, the start-up is looking to design protein-based drugs. Much of Xaira’s technology is reported to derive from the University of Washington’s Institute for Protein Design, led by Xaira co-founder and 2024 Nobel Laureate in Chemistry, David Baker, PhD.

Xaira’s additional co-founders include two of biotech’s biggest venture capitalists, Bob Nelsen of ARCH Venture Partners and Vik Bajaj, PhD, at Foresite Labs, an incubator affiliated with Foresite Capital. Other investors include F-Prime Capital, NEA, Sequoia Capital, Lux Capital, Lightspeed Venture Partners, Menlo Ventures, Two Sigma Ventures, and SV Angel.

Beyond observations

While traditional early drug discovery approaches have been limited to making bets on a handful of genes from the literature, high-performing virtual cell models have the potential to flag unwanted biological effects early, de-risking candidates before they advance through the discovery pipeline.

Many virtual cell models are trained on observational data, such as CZ CELLxGENE from Chan Zuckerberg Initiative (CZI), which is predominantly drawn from healthy human donors. While observational data are powerful for certain biological research tasks, such as cell type annotation, they fall short in predicting how cells respond to perturbations, such as treatment with a drug.

In Perturb-seq, measuring gene knockdown efficiency has historically been challenging due to the stochasticity and sparsity of single-cell datasets. To address this challenge, the Xaira researchers demonstrated that single guide RNA (sgRNA) is detected at hundreds of copies per cell, a rarity among single-cell measurements, and that sgRNA abundance offers a reliable proxy for how strongly a gene is suppressed.
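
The general idea of using sgRNA abundance as a dose proxy can be illustrated with a minimal Python sketch. This is not Xaira's analysis pipeline: the function name, the input layout (pairs of sgRNA and target-gene UMI counts), the equal-sized binning, and the toy numbers are all assumptions made here for illustration. The sketch groups perturbed cells by sgRNA count and reports how much target-gene expression remains in each group relative to control cells.

```python
from statistics import mean


def knockdown_by_sgrna_bin(cells, control_mean, n_bins=3):
    """Bin perturbed cells by sgRNA UMI count and estimate knockdown per bin.

    cells: list of (sgrna_umis, target_gene_umis) pairs (hypothetical layout).
    control_mean: mean target-gene UMIs in non-targeting control cells.
    """
    if control_mean <= 0:
        raise ValueError("control_mean must be positive")
    if n_bins < 1 or len(cells) < n_bins:
        raise ValueError("need at least one cell per bin")

    # Order cells from lowest to highest sgRNA abundance.
    ordered = sorted(cells, key=lambda c: c[0])
    results = []
    for i in range(n_bins):
        # Integer arithmetic spreads any remainder cells across the bins.
        start = i * len(ordered) // n_bins
        stop = (i + 1) * len(ordered) // n_bins
        chunk = ordered[start:stop]
        remaining = mean(t for _, t in chunk) / control_mean
        results.append({
            "sgrna_range": (chunk[0][0], chunk[-1][0]),
            "fraction_remaining": remaining,
            "percent_knockdown": 100.0 * (1.0 - remaining),
        })
    return results


# Toy, made-up counts for illustration only (not from X-Atlas/Orion).
toy_cells = [(50, 9), (60, 11), (200, 6), (220, 4), (500, 2), (520, 0)]
for row in knockdown_by_sgrna_bin(toy_cells, control_mean=10.0, n_bins=3):
    print(row)
```

In a real dataset, a dose-dependent effect would appear as percent knockdown rising across bins of increasing sgRNA abundance. A real analysis would also need to normalize for sequencing depth and account for sparsity, which this sketch ignores.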

Xaira will leverage the perturbation information from X-Atlas/Orion to build its own virtual cell models in coordination with the company’s newest hire, Bo Wang, PhD, SVP and head of biomedical AI, who joined the team in April. A renowned AI expert from the University of Toronto, Wang is known as the inventor of scGPT, a foundation model for single-cell multi-omics with downstream capabilities that include cell type annotation, perturbation response prediction, and gene network inference.

Xaira is one of many entities building the virtual cell. In April, CZI released TranscriptFormer, a generative AI model that probes cellular biology across species with therapeutic applications, such as interrogating translational properties between humans and model organisms. Concurrently, researchers from the Arc Institute announced a partnership with 10x Genomics and Ultima Genomics to build the Arc Virtual Cell Atlas.

Long-term journey

Genome-scale experiments the size of X-Atlas/Orion can be extremely time-intensive. Simply sorting to enrich for high-quality cells can take more than 10 hours. By releasing X-Atlas/Orion’s methods, Xaira aims to allow more labs to generate large-scale, high-quality, standardized Perturb-seq data. Individual labs will also be able to test specific hypotheses with large-scale data.

Chu says figuring out what types of data are most useful for building foundation models of biology requires collective thinking. X-Atlas/Orion complements existing public repositories for single-cell data, such as the Arc Institute’s scBaseCount, and is intended to support collaboration across the virtual cell field.

“We’re in the early days of the AI-powered virtual cell. This is going to be a long-term journey for the entire community,” Chu emphasized during an interview with GEN Edge.

Emma Lundberg, PhD, associate professor at Stanford University and co-director of the Human Protein Atlas, which is based at the KTH Royal Institute of Technology in Stockholm, told GEN Edge that she is a strong believer in open science and is pleased to see Xaira publicly release a large Perturb-seq dataset. She emphasized that groups around the world are trying to build AI-powered virtual cell models capable of simulating cell behavior.

“To build a robust model of any system, perturbation data is critical, and the released X-Atlas/Orion dataset marks a significant contribution to the scientific community,” said Lundberg.

Hani Goodarzi, PhD, core investigator and computational technology center research lead at the Arc Institute, concurs that tapping into the virtual cell’s full potential requires perturbational data at scales that no single organization can generate alone.

“It’s encouraging to see technology-forward companies build platforms for mega-scale and giga-scale data generation and make these resources publicly available,” Goodarzi told GEN Edge. “This dataset provides substantial resources for training foundation models across the community.”

Goodarzi also noted that the next challenge is to build AI models that can generalize beyond the two cancer cell lines described in the X-Atlas/Orion preprint. In this vein, Chu said the team is working on expanding data generation into induced pluripotent stem cells (iPSCs) and in vivo models.

Chu said he has worked in functional genomics for some time, but that scaling what works at a smaller scale is never easy.

“I’m really glad that the team has put in the work to industrialize all the steps to make this a truly scalable process,” he said. “I can’t wait to see what people will do, both on the AI side but also on the hypothesis generation side. I’m sure it will inspire our own work as well.”