Xaira’s First Virtual Cell Model Is Largest To-Date, Toward Complex Biology

Biology concept. Cell division under the microscope.
Credit: urfinguss / iStock / Getty Images Plus

Billion-dollar-backed AI drug developer Xaira Therapeutics has released the largest virtual cell model to date, built to predict how cells respond to genetic perturbations in unseen biological contexts. The team asserts that accurately predicting transcriptome-level effects is a powerful translational tool for target and mechanism-of-action discovery, patient stratification, toxicity prediction, and more.

Named X-Cell, Xaira’s model scales up to 4.9 billion parameters and is the first to demonstrate a scaling law in the virtual cell domain. Results showed that perturbation prediction follows power-law scaling with an exponent matching that of large language models.
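For readers new to scaling laws: a power law says prediction error falls as loss = a × N^(−alpha) as the parameter count N grows, which appears as a straight line on log-log axes. The sketch below is illustrative only. It fits that line to made-up numbers generated from a known power law; the function name `fit_power_law`, the model sizes, and the constants are assumptions for the example, not values from the X-Cell preprint.

```python
import math

def fit_power_law(n_params, losses):
    """Fit loss = a * N**(-alpha) by ordinary least squares on log-log values."""
    xs = [math.log(n) for n in n_params]
    ys = [math.log(l) for l in losses]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope  # (a, alpha)

# Made-up model sizes and losses generated from a known power law.
# These are NOT numbers from the X-Cell preprint.
sizes = [1e7, 1e8, 1e9, 4.9e9]
TRUE_A, TRUE_ALPHA = 5.0, 0.08
losses = [TRUE_A * n ** (-TRUE_ALPHA) for n in sizes]

a, alpha = fit_power_law(sizes, losses)
print(f"a = {a:.3f}, alpha = {alpha:.3f}")
```

Because the synthetic losses lie exactly on a power law, the fit recovers the constants used to generate them. Real measurements would scatter around the line, and the fitted exponent is the number one would compare across domains.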

Marc Tessier-Lavigne, PhD, CEO of Xaira

X-Cell achieved zero-shot prediction of T cell-inactivating perturbations and generalized to therapeutically relevant contexts not seen in the training data, including iPSC-derived melanocyte progenitors and primary human T cells from multiple donors. The work is posted as a preprint on bioRxiv and has not yet been peer reviewed.

Xaira launched in 2024 and is led by CEO Marc Tessier-Lavigne, PhD, former president of Stanford and former CSO of Genentech. The company’s star-studded leadership team includes Nobel laureates David Baker, PhD, and Carolyn Bertozzi, PhD; former FDA head Scott Gottlieb, MD; and former Johnson & Johnson CEO Alex Gorsky.

Diffusion evolution 

Many virtual cell models are primarily fueled by observational single-cell RNA-seq expression datasets. Yet, predicting how cells respond to stimuli, such as treatment with a drug, requires large-scale perturbation data that are sparse in the public domain.  

To train X-Cell, Xaira has spent its initial years building what the company describes as “the largest genome-wide CRISPRi Perturb-seq dataset ever reported.” Named X-Atlas/Pisces, the dataset is composed of 25.6 million cells across seven screens and 16 biological contexts, expanding upon X-Atlas/Orion, which was released last June. This unprecedented AI-ready, context-rich dataset enables X-Cell to achieve a parameter scale in the billions. 

X-Cell is the first virtual cell model to systematically integrate biological prior knowledge from the literature, such as information on specific genes, protein–protein interactions, and cellular morphology, using a cross-attention mechanism. 
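For readers unfamiliar with the term, the sketch below shows generic single-head scaled dot-product cross-attention: each gene token (the query) scores every prior-knowledge item (the keys) and receives a weighted mix of their content (the values). It is a minimal illustration of the general mechanism, not X-Cell’s implementation. The learned projection matrices of a real model are omitted, and all names and vectors are made up for the example.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention.

    queries: one vector per gene token (e.g., derived from expression data).
    keys, values: one vector each per prior-knowledge item.
    Returns (outputs, weights): a weighted mix of values for each query,
    plus the attention weights used to build it.
    """
    d = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)
        mixed = [sum(w * v[j] for w, v in zip(weights, values))
                 for j in range(len(values[0]))]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical toy inputs: two gene tokens attending over three prior items.
gene_tokens = [[1.0, 0.0], [0.0, 1.0]]
prior_keys = [[4.0, 0.0], [0.0, 4.0], [1.0, 1.0]]
prior_values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

outputs, weights = cross_attention(gene_tokens, prior_keys, prior_values)
for row in weights:
    print([round(w, 3) for w in row])
```

In this toy setup, each gene token puts most of its weight on the prior item whose key points in the same direction, which is how a model can pull in the most relevant outside knowledge for each gene.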

Architecturally, Xaira’s first virtual cell model is a diffusion language model that iteratively refines its predictions by replacing control gene expression values with perturbed values. The method contrasts with the autoregressive training approach used by previous-generation models, including scGPT for single-cell multiomics, which was developed by Bo Wang, PhD, senior vice president and head of biomedical AI at Xaira.

As an analogy, Wang explained that autoregressive training is like typing from left to right. If the model makes a mistake, the rest of the sentence can fall apart.

Bo Wang, PhD, SVP and Head of Biomedical AI at Xaira

In contrast, diffusion language models function like editing a draft. The model starts with a foundation, such as “I like coffee.” The sentence is iteratively refined to “I like decaf coffee,” then “I like finely ground decaf coffee.” With each pass, the model adjusts the output to better match the underlying data distribution. 
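The draft-editing idea can be sketched as a simple loop: start from the control expression profile and, on each pass, overwrite the positions where a predictor is most confident. This is a toy illustration under stated assumptions, not X-Cell’s algorithm. The function `toy_predict` is a hypothetical stand-in for a trained network, and the expression values are made up.

```python
def iterative_refine(control, predict, n_steps):
    """Start from control values and, over n_steps passes, replace
    not-yet-updated positions with predicted values, most confident first."""
    state = list(control)
    replaced = [False] * len(state)
    per_step = -(-len(state) // n_steps)  # ceiling division
    for _ in range(n_steps):
        preds = predict(state)  # one (value, confidence) pair per gene
        candidates = [i for i in range(len(state)) if not replaced[i]]
        candidates.sort(key=lambda i: preds[i][1], reverse=True)
        for i in candidates[:per_step]:
            state[i] = preds[i][0]
            replaced[i] = True
    return state

# Made-up "perturbed" profile that the toy predictor always proposes.
TARGET = [5.0, 1.0, 0.0, 3.0]

def toy_predict(state):
    # Hypothetical stand-in for a trained model: proposes TARGET and uses
    # the distance from the current value as its confidence.
    return [(t, abs(t - s)) for s, t in zip(state, TARGET)]

control = [5.0, 4.0, 2.0, 3.0]
result = iterative_refine(control, toy_predict, n_steps=2)
print(result)
```

Unlike left-to-right generation, every position is revisited with the full current draft in view, so an early choice does not lock in what follows.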

“It’s more sophisticated and better at predictions,” Wang told GEN Edge when describing the diffusion-based approach. “Experts from the language-side even find it’s better at reasoning.” 

Engineering discipline 

“The best measure of success is to show that our models can be applied to make medicines and a difference for patients,” said Tessier-Lavigne in an interview with GEN Edge. 

He blames the trial-and-error nature of science for the long timelines and high attrition of drug discovery. The path from target identification to drug approval takes an average of thirteen years, and 90% of molecules fail in the clinic. Xaira’s mission is to transform discovery and development from an artisanal state to an engineering discipline by serving as a platform and product company for the field.

In addition to building the virtual cell, Xaira has a molecular design pillar based on protein design technology developed by Baker, who won the 2024 Nobel Prize in Chemistry. The team is developing novel antibodies that hit challenging targets, such as multi-pass membrane proteins with minimal accessible regions outside of the cell. These proteins are therapeutically validated but remain undruggable.

Last November, a Nature study led by Xaira co-founders and Baker lab graduates Nathaniel Bennett, PhD, and Joseph Watson, PhD, generated full-length antibodies from scratch that successfully bound user-specified epitopes with atomic precision. In the increasingly crowded de novo antibody space, corresponding advances were demonstrated by Nabla Bio, Chai Discovery, and Absci.

While Xaira has remained quieter on its molecular design efforts, Tessier-Lavigne asserts that the company has been “very focused on this from Day 1.” More public announcements are anticipated in the coming months. 

“We’re obsessed with cells” 

While virtual cell models that generalize to new contexts are a valuable advance toward understanding biology, predicting patient outcomes is still a step away.

Ron Alfa, MD, PhD, CEO of Noetik, says that building up layers of cell-level experiments to achieve a tissue-level, and ultimately human-level, representation is a challenging feat. He argues that developing models that tokenize tissues is a more convincing translational approach.

“We’re a little obsessed with cells,” said Alfa when speaking at NVIDIA GTC in San Jose last week. “If we’re training advanced AI models, we care about tokens that are produced by the underlying data.”

Noetik takes a human-first approach to predict cancer clinical outcomes using multimodal datasets from patient-derived tumors. This strategy has attracted GSK, which earlier this year entered a five-year licensing partnership to access Noetik’s non–small cell lung cancer and colorectal cancer foundation models. 

Tessier-Lavigne sees X-Cell as just the beginning of a long-term program that will be built “brick by brick.”

“For any company or lab, you balance creating the ultimate model with being pragmatic about gaining insights in the near term,” he said. Notably, investing in large-scale Perturb-seq datasets enables models to learn the underlying gene regulatory networks that drive biological function across the genome.

Wang says Xaira has started with the virtual cell but looks toward modeling more complex systems, including animals, organoids, and eventually patients. As generating patient-level data remains time-consuming and expensive, scalable cellular models can act as a bridge, generating hypotheses to be tested in patient AI models.

The Xaira team also plans to expand the work to more modalities, such as chemical perturbations that modulate signaling pathways and proteomic data. In this vein, Biohub, Arc Institute, and Tahoe Therapeutics announced a new team effort in January to build a massive chemical perturbation dataset to be released open source. The timeline for data release has not been disclosed. 

“The beauty of AI is creating a foundation where you can add more data and have transfer learning across different dimensions,” says Tessier-Lavigne. “That is our aspiration.”

As single-cell data capture only one slice of biological complexity, diverse data modalities are not competing, but complementary. Each advance adds a new layer of resolution, steadily moving the field toward a more complete and predictive model of biology.