Combining Protein Shapes with Genomic Sequences Improves Reliability of Evolutionary Trees

New research led by scientists at the Centre for Genomic Regulation (CRG) has demonstrated how the three-dimensional shape of a protein can be used to resolve deep, ancient evolutionary relationships in the tree of life. For the first time, researchers combined data from protein shapes with data from genomic sequences to improve the reliability of evolutionary trees, which they describe as a critical resource used by the scientific community to understand the history of life, monitor the spread of pathogens, and create new treatments for disease.

Their strategy, which they’ve called “multistrap,” can even work with the predicted structures of proteins that have never been experimentally determined. It has implications for the massive amount of structural data being generated by tools like AlphaFold 2 and could help open new windows into the ancient history of life on Earth.

Research lead Cedric Notredame, PhD, and colleagues reported on their development in Nature Communications, in a paper titled “multistrap: boosting phylogenetic analyses with structural information.”

There are 210,000 experimentally determined protein structures but 250 million known protein sequences. Initiatives like the Earth BioGenome Project could generate billions more protein sequences in the next few years. The abundance of data opens the door to applying the approach on an unprecedented scale.

For many decades, biologists have been reconstructing evolution by tracing how species and genes diverge from common ancestors. These phylogenetic or evolutionary trees are traditionally built by comparing DNA or protein sequences and counting the similarities and differences to infer relationships. “In a phylogeny, trustworthy reliability branch support estimates are as important as the tree itself,” the team explained.

However, researchers face a significant hurdle—a problem known as saturation. Over vast timescales, genomic sequences can change so much that they no longer resemble their ancestral forms, erasing signals of shared heritage. “The issue of saturation dominates phylogeny and represents the main obstacle for the reconstruction of ancient relationships,” Notredame stated. “It’s like the erosion of an ancient text. The letters become indistinct, and the message is lost.”
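The effect of saturation can be illustrated with a small numerical sketch. It uses the Jukes-Cantor model, a standard textbook model of DNA substitution that is not mentioned in the article and is chosen here only for illustration. Under that model, the expected fraction of sites that differ between two sequences levels off near 0.75 no matter how many substitutions have actually occurred, so very old divergences become indistinguishable from one another. The function name is hypothetical.

```python
import math

def expected_observed_difference(substitutions_per_site):
    """Expected fraction of differing sites under the Jukes-Cantor model.

    Illustrative assumption: equal rates between all four nucleotides.
    The value approaches 0.75 as the true number of substitutions grows.
    """
    if substitutions_per_site < 0:
        raise ValueError("substitutions_per_site must be non-negative")
    return 0.75 * (1.0 - math.exp(-4.0 * substitutions_per_site / 3.0))

# Observed difference grows quickly at first, then flattens out.
for d in (0.1, 0.5, 1.0, 2.0, 5.0, 10.0):
    print(f"true substitutions/site = {d:5.1f}  "
          f"observed difference = {expected_observed_difference(d):.4f}")
```

Once the curve flattens, two pairs of sequences that diverged at very different times show almost the same observed difference, which is the loss of signal described above.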

To overcome this challenge, the research team turned to the physical structures of proteins. Proteins fold into complex shapes that determine their function within a cell. These shapes are more conserved over evolutionary time than the sequences themselves, meaning they change more slowly and retain ancestral features for longer. The authors further explained, “The notion that structural variation could inform phylogenetic inference has long been intriguing and sometimes divisive to the scientific community… The resilience of protein folds is well established and has routinely been used to infer homology across evolutionary timespans incompatible with sequence analysis. This observation has led to the speculation that the quantitative comparison of protein folds could be used as a metric to resolve deep nodes in phylogenetic trees.”

The shape of a protein is dictated by its amino acid sequence. While sequences may mutate, the overall structure often remains similar to preserve function. The researchers hypothesized they could gauge how much the structures diverge over time by measuring the distance between pairs of amino acids within a protein, also known as intra-molecular distances (IMDs). “Our approach relies on the systematic comparison of homologous intra-molecular structural distances,” they wrote. “In this study, we explore the potential of intramolecular distances to be treated as evolutionary characters and set out to ask if these characters could either help the reconstruction of phylogenetic trees or provide new ways of estimating branch reliability…rather than trying to quantify the relative merits of various alternative phylogeny reconstruction methods, we focus our efforts on measuring the agreement between structure-based and sequence-based phylogenies in a controlled environment.”
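The idea of comparing homologous intramolecular distances can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: it assumes each protein is given as a list of 3D coordinates (for example, one point per residue), that position k in one list is homologous to position k in the other, and that the mean absolute difference is an adequate summary. The function names and toy coordinates are hypothetical.

```python
import math
from itertools import combinations

def intramolecular_distances(coords):
    """Return {(i, j): distance} for every pair of positions in one protein."""
    return {
        (i, j): math.dist(coords[i], coords[j])
        for i, j in combinations(range(len(coords)), 2)
    }

def imd_dissimilarity(coords_a, coords_b):
    """Mean absolute difference between homologous intramolecular distances.

    Assumes coords_a[k] and coords_b[k] are aligned (homologous) positions.
    """
    if len(coords_a) != len(coords_b):
        raise ValueError("both proteins must have the same number of aligned positions")
    dist_a = intramolecular_distances(coords_a)
    dist_b = intramolecular_distances(coords_b)
    if not dist_a:
        raise ValueError("need at least two aligned positions")
    return sum(abs(dist_a[pair] - dist_b[pair]) for pair in dist_a) / len(dist_a)

# Toy example: a straight three-residue chain versus a bent one.
straight = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
bent = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0)]
print(round(imd_dissimilarity(straight, bent), 3))
```

Because only distances within each protein are compared, the score does not depend on how either structure is positioned or rotated in space, so no superposition step is needed in this sketch.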

The team compiled a massive dataset of proteins with known structures, covering a wide range of species. They calculated the IMDs for each protein and used these measurements to construct phylogenetic trees.

They found that trees built from structural data closely matched those derived from genetic sequences, but with a crucial advantage: the structural trees were less affected by saturation. This means they retained reliable signals even when genetic sequences had diverged significantly.

Recognizing that both sequences and structures offer valuable insights, the team developed a combined approach which not only improved the reliability of the tree branches but also helped distinguish between correct and incorrect relationships. “Our results show a significant level of congruence between sequence and structure-based phylogenetic reconstructions. We take advantage of this property to design a hybrid bootstrap support method named multistrap, which combines sequence and structural information,” they explained.
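A hybrid support value of this kind can be sketched as follows. The article does not spell out how multistrap combines the two sources of evidence, so the simple average used here is an assumption made for illustration, as are the function names and the toy replicate trees. Each tree is represented as a set of splits, where a split is the group of taxa on one side of an internal branch.

```python
def branch_support(reference_splits, replicate_trees):
    """Fraction of bootstrap replicate trees that contain each reference split."""
    if not replicate_trees:
        raise ValueError("need at least one replicate tree")
    total = len(replicate_trees)
    return {
        split: sum(split in tree for tree in replicate_trees) / total
        for split in reference_splits
    }

def combined_support(sequence_support, structure_support):
    """Average the two support values per branch (illustrative assumption)."""
    return {
        split: (sequence_support[split] + structure_support[split]) / 2
        for split in sequence_support
    }

# Toy data for five taxa A-E; each split is named by its smaller side.
AB, DE = frozenset("AB"), frozenset("DE")
CD, AC, CE = frozenset("CD"), frozenset("AC"), frozenset("CE")

reference = [AB, DE]
sequence_replicates = [{AB, DE}, {AB, CD}, {AB, DE}, {AC, DE}]
structure_replicates = [{AB, DE}, {AB, DE}, {AB, CE}, {AB, DE}]

seq = branch_support(reference, sequence_replicates)
struct = branch_support(reference, structure_replicates)
for split, value in combined_support(seq, struct).items():
    print(sorted(split), value)
```

In this toy case the branch grouping A with B is recovered in three of four sequence replicates and in all four structure replicates, so its combined support is higher than the sequence-only value.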

“It’s akin to having two witnesses describe an event from different angles,” explained Leila Mansouri, PhD, coauthor of the study. “Each provides unique details, but together they give a fuller, more accurate account.”

One practical example where the combined approach could make a significant impact is in understanding the relationships among kinases in the human genome. Kinases are proteins involved in many different important cellular functions.

“The genome of most mammals, including humans, contains about 500 protein kinases that regulate most aspects of our biology,” stated Notredame. “These kinases are major targets for cancer therapy, for example drugs like imatinib for humans or toceranib for dogs.”

Human kinases have arisen through duplications occurring over the last billion years. “Within the human genome, the most distantly related kinases are about a billion years apart,” Notredame added. “They duplicated in the common ancestor of the common ancestor of our common ancestor.”

The vast timescale involved makes it incredibly difficult to build accurate gene trees that show how all these kinases are related. “Yet, as imperfect as it may be, the kinase evolutionary tree is widely used to understand how it interacts with other drugs. Improving this tree, or improving trees of other important protein families, would be an important advance for human health,” Notredame pointed out.

The potential applications of the work go beyond cancer. Using the approach to create more accurate evolutionary trees could also improve our understanding of how diseases evolve more generally, aiding in the development of vaccines and treatments. Such trees could also help shed light on the origins of complex traits, guide the discovery of new enzymes for biotechnology, and even help trace the spread of species in response to climate change.