Andrea Califano, PhD, professor of chemical and systems biology at Columbia University, advocates an artificial intelligence (AI) approach to cancer immunotherapy, in which AI models predict the genes that need to be toggled to train the immune system to target specific organs on demand.
Califano, who is also president of Chan Zuckerberg Biohub New York, has spent the past 20 years of his faculty career applying computational approaches to interrogate the mechanisms of biological regulation, signal transduction, and cell-to-cell communication that dictate cell behavior. He has directed his recent work toward developing single-cell RNA (scRNA) foundation models with applications in predicting cell response to interventions, such as treatment with a drug.
Many transcriptome AI models have taken inspiration from text-based large language models (LLMs), which are powerful tools for problems that require understanding positional relationships, such as those among words in a sentence. However, these transcriptome models fall short of capturing the complex causality and molecular logic of gene regulatory networks.
To address this gap, Califano and colleagues have now released GREmLN (Gene Regulatory Embedding-based Large Neural model), an AI model that leverages a graph-based architecture to capture long-range gene-gene relationships. The work is described in a new preprint posted on bioRxiv that has not yet been peer reviewed.
“If you take 10,000 genes, you will have about 100 million potential gene network interactions,” explained Califano in an interview with GEN. “Instead of trying to fit biology into a computer science model, we’re fitting the model to biology.”
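The scale Califano cites can be checked with simple arithmetic. The short Python sketch below is illustrative only and is not taken from the GREmLN preprint: it counts the possible gene pairs for 10,000 genes and, under a purely hypothetical assumption of about 20 regulators per gene (`assumed_regulators_per_gene` is a made-up figure, not a reported one), shows how small a fraction of those pairs a sparse regulatory graph would retain.

```python
# Illustrative arithmetic only; not code or data from the GREmLN preprint.
n_genes = 10_000

# Every gene paired with every gene (ordered pairs, self-pairs included).
ordered_pairs = n_genes * n_genes

# Distinct unordered pairs of two different genes.
unordered_pairs = n_genes * (n_genes - 1) // 2

# Hypothetical sparse regulatory graph: the number of regulators per gene
# is an assumption chosen for illustration, not a measured value.
assumed_regulators_per_gene = 20
graph_edges = n_genes * assumed_regulators_per_gene

print(f"ordered pairs:       {ordered_pairs:,}")
print(f"unordered pairs:     {unordered_pairs:,}")
print(f"assumed graph edges: {graph_edges:,} "
      f"({graph_edges / ordered_pairs:.2%} of all ordered pairs)")
```

The "about 100 million" figure in the quote corresponds to the ordered-pair count (10,000 × 10,000). Counting each pair of distinct genes once gives roughly half that number.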
GREmLN demonstrated strong performance on both cell type annotation and graph structure understanding tasks when benchmarked against three current state-of-the-art scRNA foundation models: Geneformer, scGPT, and scFoundation. Additionally, the graph-based approach offered the computational advantages of a more parameter-efficient architecture and faster training convergence.
Growing family
GREmLN joins the Chan Zuckerberg Initiative’s (CZI) growing family of virtual cell models, which currently includes TranscriptFormer, a generative tool that probes cellular biology across species. The model release is another step forward for CZI’s virtual cell program, one of four scientific grand challenges that the nonprofit set in April in its effort to transform human health at the intersection of AI and biology. The remaining challenges are developing imaging technologies to map complex biological systems, creating new tools for measuring inflammation in tissues in real time, and harnessing the immune system for early detection, prevention, and treatment of disease.
GREmLN is currently trained on more than 11 million scRNA-seq profiles from the Chan Zuckerberg Initiative’s (CZI) CZ CELLxGENE database, which is predominantly composed of observational data from healthy human donors. To improve the model’s ability to predict cell response to therapeutics, Califano said the team is pursuing experimental validation studies of GREmLN using a newly generated single-cell genetic perturbation dataset targeting druggable cancer genes.
“There is this feeling among computer scientists that you can just take all the data, throw it into a black box, and then out comes the solution,” Califano told GEN. “That’s not actually the case. Data needs to be generated in a very specific way.”
In this vein, CZI announced the Billion Cells Project in February, a collaboration with 10x Genomics and Ultima Genomics to generate an unprecedented one-billion-cell dataset that maps genetic perturbations across diverse cell types and tissues, with the aim of improving AI model performance in new biological contexts.
Califano said GREmLN is an early step toward the lofty goal of constructing a truly comprehensive virtual cell. More work needs to be done to introduce the additional layers of regulation that make the cell tick “before we can declare victory,” he said.
While GREmLN currently encompasses only transcriptional regulation, the researchers plan to expand the model’s capabilities to signal transduction, microRNAs, cell-to-cell interactions mediated by ligands, and more. GREmLN is available on CZI’s virtual cell platform, which is open and accessible to the global scientific community.