Recently, SandboxAQ launched what it claims is the largest publicly available dataset of protein-ligand pairs with annotated experimental binding potency data. According to the company, the Structurally Augmented IC50 Repository (SAIR) contains about 5.2 million synthetic three-dimensional molecular structures across more than a million protein-ligand systems.
Full details about the resource and its development are provided in a preprint titled “SAIR: Enabling deep learning for protein-ligand interactions with a synthetic structural dataset.” Besides leveraging SandboxAQ’s large quantitative model (LQM) capabilities, artificial intelligence models trained on quantitative and scientific data, SAIR’s developers also used Nvidia’s DGX Cloud, a development platform for AI model training and fine-tuning.
As with many companies in the drug discovery and development space, SandboxAQ is betting that AI can significantly shorten timelines and lower costs. Drug discovery is one of the few industries where a significant portion—upwards of 70%—of spending goes to research and development. “Everything is always moving [and] every drug is different,” said Adam Lewis, AI and quantum lead at SandboxAQ. “As we’ve pursued this space, we’ve encountered that ourselves.”
A key contributor to drug discovery costs is the experimentation required to determine whether candidate molecules bind effectively to proteins of interest. AI tools can be a big help in this regard, but the prohibitive costs of running experiments in the lab mean that the very datasets needed to train the models for the task are lacking.
“We’ve been wanting to make our own R&D investment in creating AI models that are very specific to the problems of drug discovery. One of the problems we’ve identified is just the lack of data,” Lewis told GEN in an interview. “The whole reason you want to make LQMs in drug discovery is that experiments are slow, costly, and, in some cases, even unsafe,” he said. With access to the right training datasets, “AI can do these experiments on information instead of something physical,” and this “opens up some new opportunities” that are not “possible experimentally.” But “because these experiments are slow, expensive, and potentially dangerous, there is only a finite amount of data available to train on.”
Helping AI make better predictions
Programs like AlphaFold, OpenFold, and Boltz effectively generate protein structures and data on protein-drug interactions, but they have limitations. Very few protein-ligand complexes have both a resolved 3D structure and a potency measurement, so most AI algorithms are trained on indirect data such as sequences or 2D chemical structures. Additionally, newer co-folding models may only be able to make predictions about proteins and ligands that are similar to those used in training and could have a harder time with novel proteins or chemically diverse compounds, Lewis noted.
One option to help AI algorithms make better predictions is to try to “produce more experimental structure data.” Another is to “find ways of exploiting different kinds of data,” and “this is what we did with SAIR,” he said.
To develop the SAIR dataset, the scientists combined publicly available resources like BindingDB and ChEMBL with a co-folding model, Boltz-1 in this case, to predict three-dimensional structures for sequence-affinity pairs. Crucially, they did not rely on a single prediction, choosing instead to generate multiple structures with various poses to best capture regions of uncertainty.
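To make this step concrete, here is a minimal sketch in Python. The `cofold_predict` helper is a hypothetical stand-in for a co-folding model call such as Boltz-1 (it returns dummy values so the example runs; the real model’s API differs), and the `Pose` container is illustrative, not SAIR’s actual schema.

```python
import random
from dataclasses import dataclass

@dataclass
class Pose:
    protein_seq: str    # target sequence, e.g., drawn from BindingDB or ChEMBL
    ligand_smiles: str  # 2D ligand representation (SMILES string)
    coords: list        # predicted 3D coordinates of the complex
    confidence: float   # the model's own confidence in this sample

def cofold_predict(protein_seq, ligand_smiles, seed):
    """Hypothetical stand-in for a co-folding model such as Boltz-1.
    Returns dummy coordinates and confidence so the sketch is runnable."""
    rng = random.Random(seed)
    coords = [(rng.random(), rng.random(), rng.random()) for _ in range(8)]
    return coords, rng.random()

def generate_candidate_poses(protein_seq, ligand_smiles, n_samples=5):
    """Sample several structures per sequence-affinity pair so the pose
    ensemble spans the model's region of uncertainty."""
    poses = []
    for seed in range(n_samples):
        coords, conf = cofold_predict(protein_seq, ligand_smiles, seed)
        poses.append(Pose(protein_seq, ligand_smiles, coords, conf))
    return poses
```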
“What you get [are] multiple different predictions within the range of competence of the model that are tied to these affinity pairs,” Lewis explained. Next, the team used affinity prediction algorithms to analyze the structures, select those that had the best agreement with the experimental affinity data, and then discard those that did not fit the bill. “In effect, this is a way of taking much cheaper data…and using it to improve on the structural data without having to directly create experimental crystal structures.”
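The filtering step Lewis describes could then look something like the sketch below, which continues the example above. `predict_affinity` is again a hypothetical placeholder, and the pIC50 agreement tolerance is an assumed cutoff for illustration, not SAIR’s published criterion.

```python
def predict_affinity(pose):
    """Hypothetical affinity predictor; derives a dummy pIC50-like value
    from the pose geometry so different poses score differently."""
    return random.Random(str(pose.coords)).uniform(4.0, 9.0)

def filter_poses(poses, experimental_pic50, tolerance=1.0):
    """Keep only poses whose predicted affinity agrees with the experimental
    measurement, ranked by agreement; discard the rest."""
    scored = [(pose, predict_affinity(pose)) for pose in poses]
    kept = [(p, a) for p, a in scored if abs(a - experimental_pic50) <= tolerance]
    kept.sort(key=lambda pair: abs(pair[1] - experimental_pic50))
    return [p for p, _ in kept]

# Usage: build an ensemble for one sequence-affinity pair, then keep
# only the poses consistent with the measured potency.
poses = generate_candidate_poses("MKTAYIAKQR", "CCO", n_samples=5)
consistent = filter_poses(poses, experimental_pic50=6.5)
```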
For Arman Zaribafiyan, head of product, AI & simulation platforms at SandboxAQ, SAIR bridges what he described as a longstanding gap between protein structures, binding affinity, and drug potency. He echoed Lewis’ sentiments about the prohibitive costs of generating training data at scale as well as the limitations of using two-dimensional chemical structures and sequence information for algorithm training. “SAIR’s launch proves that we have the know-how to run these simulations at scale to produce 3D structures for the data that exists out there and then connect them to the binding affinity.”
He highlighted the importance of Nvidia’s contributions to the project, noting that the SAIR team collaborated closely with the computing company to achieve a 2x improvement in GPU utilization for the project. Simply using more GPUs is not enough, Zaribafiyan noted. “You have to optimize your workflow,” and “ensure that you are increasing the utilization of the GPUs.”
SAIR’s data is most useful for benchmarking biofoundation models or for training or fine-tuning new AI models that predict binding affinity. The data is available free of charge for non-commercial use under a CC BY-NC-SA 4.0 license. “How we would imagine it could be used broadly would be either calibrating new affinity models or training affinity structural models, but of course, we’re open to the creativity of the scientific community,” Lewis said.
Meanwhile, commercial users can also use the data at no charge after submitting a form to SandboxAQ. Its developers believe that AI models trained on SAIR data will be able to deliver predictions at least 1,000 times faster than traditional physics-based methods.
The company is also thinking about how best to maintain the resource long term. To some extent, “that is going to depend on feedback in the community and our own development,” Lewis said. One option would be to simply expand the dataset. Another would be to “create new parallel datasets” that could cover more than just small molecules. “We have a vision of expanding into whole cell modeling, and we would see this as a building block in that direction,” he said.