Every cell in a body contains the same genetic sequence, yet each cell expresses only a subset of those genes. These cell-specific gene expression patterns, which ensure that a brain cell is different from a skin cell, are partly determined by the three-dimensional (3D) structure of the genetic material, which controls the accessibility of each gene.
Massachusetts Institute of Technology (MIT) chemists have now developed a new way to determine those 3D genome structures, using generative artificial intelligence (AI). Their model, ChromoGen, can predict thousands of structures in just minutes, making it much speedier than existing experimental methods for structure analysis. Using this technique, scientists could more easily study how the 3D organization of the genome affects individual cells’ gene expression patterns and functions.
“Our goal was to try to predict the three-dimensional genome structure from the underlying DNA sequence,” said Bin Zhang, PhD, an associate professor of chemistry. “Now that we can do that, which puts this technique on par with the cutting-edge experimental techniques, it can really open up a lot of interesting opportunities.”
In their paper in Science Advances, “ChromoGen: Diffusion model predicts single-cell chromatin conformations,” senior author Zhang, together with co-first authors and MIT graduate students Greg Schuette and Zhuohan Lao, wrote, “…we introduce ChromoGen, a generative model based on state-of-the-art artificial intelligence techniques that efficiently predicts three-dimensional, single-cell chromatin conformations de novo with both region and cell type specificity.”
Inside the cell nucleus, DNA and proteins form a complex called chromatin, which has several levels of organization, allowing cells to cram two meters of DNA into a nucleus that is only one-hundredth of a millimeter in diameter. Long strands of DNA wind around proteins called histones, giving rise to a structure somewhat like beads on a string.
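To put that compaction in numbers, the short calculation below compares the two lengths quoted above. It is a back-of-the-envelope sketch only, not a model of how chromatin packs, and the variable names are illustrative.

```python
# Figures quoted in the text above.
dna_length_m = 2.0            # about two meters of DNA per nucleus
nucleus_diameter_m = 0.01e-3  # one-hundredth of a millimeter, in meters

# How many times longer the DNA is than the nucleus is wide.
length_ratio = dna_length_m / nucleus_diameter_m
print(f"DNA is about {length_ratio:,.0f} times longer than the nucleus is wide")
```

The DNA is roughly 200,000 times longer than the nucleus is wide, which is why several levels of folding are needed.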
Chemical tags known as epigenetic modifications can be attached to DNA at specific locations, and these tags, which vary by cell type, affect the folding of the chromatin and the accessibility of nearby genes. These differences in chromatin conformation help determine which genes are expressed in different cell types, or at different times within a given cell. “Chromatin structures play a pivotal role in dictating gene expression patterns and regulatory mechanisms,” the authors wrote. “Understanding the three-dimensional (3D) organization of the genome is paramount for unraveling its functional intricacies and role in gene regulation.”
Over the past 20 years, scientists have developed experimental techniques for determining chromatin structures. One widely used technique, known as Hi-C, works by linking together neighboring DNA strands in the cell’s nucleus. Researchers can then determine which segments are located near each other by shredding the DNA into many tiny pieces and sequencing it.
This method can be used on large populations of cells to calculate an average structure for a section of chromatin, or on single cells to determine structures within that specific cell. However, Hi-C and similar techniques are labor intensive, and it can take about a week to generate data from one cell. “Breakthroughs in high-throughput sequencing and microscopic imaging technologies have revealed that chromatin structures vary considerably between cells of the same type,” the team continued. “However, a thorough characterization of this heterogeneity remains elusive due to the labor-intensive and time-consuming nature of these experiments.”
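The difference between a single-cell measurement and a population average can be illustrated with a toy example. The sketch below is not a Hi-C pipeline: real Hi-C infers proximity from sequencing reads, whereas this code starts from made-up 3D coordinates (random walks) and marks two segments as "in contact" when they lie within a distance cutoff. The function names, the cutoff value, and the random-walk data are all assumptions chosen for illustration.

```python
import numpy as np

def contact_map(coords, cutoff):
    """Binary contact map for one conformation: 1.0 where two segments lie within `cutoff`."""
    diffs = coords[:, None, :] - coords[None, :, :]
    distances = np.linalg.norm(diffs, axis=-1)
    return (distances < cutoff).astype(float)

def population_contact_frequency(conformations, cutoff):
    """Average the single-cell contact maps to get a contact frequency per segment pair."""
    return np.mean([contact_map(c, cutoff) for c in conformations], axis=0)

# Toy data (an assumption, not experimental): 50 "cells", each a 20-segment random walk.
rng = np.random.default_rng(0)
n_cells, n_segments = 50, 20
conformations = np.cumsum(rng.normal(size=(n_cells, n_segments, 3)), axis=1)

single_cell = contact_map(conformations[0], cutoff=2.0)              # entries are 0 or 1
population = population_contact_frequency(conformations, cutoff=2.0)  # entries lie in [0, 1]
```

A single-cell map is all-or-nothing for each pair of segments, while the population map records how often each pair is in contact across cells, which hides the cell-to-cell variation the researchers describe.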
To overcome the limitations of existing methods, Zhang and his students developed a model that takes advantage of recent advances in generative AI to create a fast, accurate way to predict chromatin structures in single cells. The new AI model, ChromoGen (CHROMatin Organization GENerative model), can quickly analyze DNA sequences and predict the chromatin structures that those sequences might produce in a cell. “These generated conformations accurately reproduce experimental results at both the single-cell and population levels,” the researchers further explained. “Deep learning is really good at pattern recognition,” Zhang said. “It allows us to analyze very long DNA segments, thousands of base pairs, and figure out what is the important information encoded in those DNA base pairs.”
ChromoGen has two components. The first component, a deep learning model taught to “read” the genome, analyzes the information encoded in the underlying DNA sequence and chromatin accessibility data, the latter of which is widely available and cell type-specific.
The second component is a generative AI model that predicts physically accurate chromatin conformations, having been trained on more than 11 million chromatin conformations. These data were generated from experiments using Dip-C (a variant of Hi-C) on 16 cells from a line of human B lymphocytes.
When integrated, the first component informs the generative model how the cell type-specific environment influences the formation of different chromatin structures, and this scheme effectively captures sequence-structure relationships. For each sequence, the researchers use their model to generate many possible structures. That’s because DNA is a very disordered molecule, so a single DNA sequence can give rise to many different possible conformations.
“A major complicating factor of predicting the structure of the genome is that there isn’t a single solution that we’re aiming for,” Schuette said. “There’s a distribution of structures, no matter what portion of the genome you’re looking at. Predicting that very complicated, high-dimensional statistical distribution is something that is incredibly challenging to do.”
Once trained, the model can generate predictions on a much faster timescale than Hi-C or other experimental techniques. “Whereas you might spend six months running experiments to get a few dozen structures in a given cell type, you can generate a thousand structures in a particular region with our model in 20 minutes on just one GPU,” Schuette added.
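The scale of that difference can be roughed out from the figures in the quote. The calculation below is only a sketch: "a few dozen" and "six months" are not precise, so the values 36 structures and 182 days are assumptions made here for illustration, and the result changes with them.

```python
# Figures from the quote above.
model_structures = 1_000
model_minutes = 20

# ASSUMPTIONS for illustration only: "a few dozen" is taken as 36 structures,
# and "six months" as 182 days of continuous work.
experiment_structures = 36
experiment_minutes = 182 * 24 * 60

model_rate = model_structures / model_minutes                 # structures per minute
experiment_rate = experiment_structures / experiment_minutes  # structures per minute
speedup = model_rate / experiment_rate

print(f"Model: {model_rate:.0f} structures per minute on one GPU")
print(f"Approximate speedup under these assumptions: {speedup:,.0f}x")
```

Whatever exact numbers are plugged in for the experimental side, the gap spans several orders of magnitude.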
After training their model, the researchers used it to generate structure predictions for more than 2,000 DNA sequences, then compared them to the experimentally determined structures for those sequences. They found that the structures generated by the model were the same as or very similar to those seen in the experimental data. “We showed that ChromoGen produced conformations that reproduce a variety of structural features revealed in population Hi-C experiments and the heterogeneity observed in single-cell datasets,” the investigators wrote.
“We typically look at hundreds or thousands of conformations for each sequence, and that gives you a reasonable representation of the diversity of the structures that a particular region can have,” Zhang noted. “If you repeat your experiment multiple times, in different cells, you will very likely end up with a very different conformation. That’s what our model is trying to predict.”
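One way to see why many conformations are needed is to summarize an ensemble with a single number per structure and look at its spread. The sketch below uses the radius of gyration, a common measure of how compact a polymer is. It does not use ChromoGen or its output: the ensemble is a set of toy random walks, and the choice of metric, the function name, and the ensemble size are assumptions made for illustration rather than the authors' analysis.

```python
import numpy as np

def radius_of_gyration(coords):
    """Root-mean-square distance of the segments from their centroid."""
    centered = coords - coords.mean(axis=0)
    return float(np.sqrt((centered ** 2).sum(axis=1).mean()))

# Toy ensemble (an assumption): 1,000 random-walk conformations of 64 segments each,
# standing in for the many structures generated for one DNA sequence.
rng = np.random.default_rng(1)
n_conformations, n_segments = 1000, 64
ensemble = np.cumsum(rng.normal(size=(n_conformations, n_segments, 3)), axis=1)

rg_values = np.array([radius_of_gyration(c) for c in ensemble])
print(f"mean Rg = {rg_values.mean():.2f}, std = {rg_values.std():.2f}")
```

A single conformation yields one value, while the ensemble yields a distribution whose width reflects how much the structures differ from one another.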
The researchers also found that the model could make accurate predictions for data from cell types other than the one it was trained on. “ChromoGen successfully transfers to cell types excluded from the training data using just DNA sequence and widely available DNase-seq data, thus providing access to chromatin structures in myriad cell types,” the team pointed out.
This suggests that the model could be useful for analyzing how chromatin structures differ between cell types, and how those differences affect their function. The model could also be used to explore different chromatin states that can exist within a single cell, and how those changes affect gene expression. “In its current form, ChromoGen can be immediately applied to any cell type with available DNase-seq data, enabling a vast number of studies into the heterogeneity of genome organization both within and between cell types to proceed,” the team noted.
Another possible application would be to explore how mutations in a particular DNA sequence change the chromatin conformation, which could shed light on how such mutations may cause disease. “There are a lot of interesting questions that I think we can address with this type of model,” Zhang added. “These achievements come at a remarkably low computational cost,” the team further pointed out. “Therefore, ChromoGen enables the systematic investigation of single-cell chromatin organization, its heterogeneity, and its relationship to sequencing data, all while remaining economical.”
The researchers have made all of their data and the model available to others who wish to use it.