Million Species Listing: Basecamp Research Unearths Trove of Sequence Data From Novel Species

Basecamp Research collecting samples in Costa Rica [Coldhouse Collective]

In the latest step to tackle biology’s data scaling law, Basecamp Research has announced BaseData, the largest database for sequence-based model training, composed of 9.8 billion new protein sequences and one million newly discovered species.

The company-generated dataset offers a 10-fold expansion of known protein diversity when compared to all public databases combined. The work, which has not yet been peer reviewed, was posted as a preprint on the company’s website.

“The rise of generative biology—using AI foundation models to design, generate, and annotate proteins, pathways, and therapeutics—creates unprecedented demand for large, diverse biological sequence databases,” said Glen Gowers, PhD, CEO and co-founder at Basecamp Research.

According to AI scaling laws, improving model performance involves scaling three things in parallel: model parameters, compute power, and data. Unlike other AI domains, such as natural language processing, which can tap into expanding sources of text and imagery, biological foundation models depend heavily on slow-growing public sequence databases that rely on clinical or laboratory settings. According to the Basecamp Research preprint, 70% of the public sequence data fueling today’s biological research is drawn from just 10 species.

“One of the reasons that growth rate for public data is so slow is because there’s no aligned incentive for global sampling,” said Gowers in an interview with GEN Edge. “With the right economic framework with countries around the world, we can massively accelerate the amount of data that’s being collected.”

BaseData uncovers over one million novel species not found in existing databases, such as OMG and the Genome Taxonomy Database (GTDB). [Basecamp Research]

Basecamp Research is a U.K.-based company founded in 2019 that has partnered with more than 125 communities in 26 countries to pioneer an economic partnership-based model that incentivizes the collection of samples across the planet’s most extreme and diverse environments.

In collaboration with biopharma companies and academic research institutions, the resulting purpose-built data is used to fuel AI models that design novel protein sequences and biological systems for broad applications across therapeutics, sustainability, chemical engineering, and more.

Among the novel findings shown in BaseData is a new species of Candidatus Eremiobacterota, a bacterium found in Antarctic soil that survives by generating its own water using hydrogen as an energy source. This discovery could inform novel gas-based drug delivery systems or therapeutic approaches.

On a World War II shipwreck, the team uncovered a new species of Burkholderia, a genus of bacteria known for its ability to remove heavy metals from the environment, which could improve pollution control and deepen understanding of antibiotic resistance.

In acidic hot springs near an active volcano, the team found a new member of the Sulfolobaceae family that possesses stress-response systems and remains stable near the boiling point, opening avenues for preserving biological materials under harsh conditions.

To ensure uniform data collection from diverse environments, the team developed a suite of mobile molecular biology tools and protocols that enable real-time, on-site DNA extraction and analysis without the need for large-scale laboratory infrastructure. The global sampling network is supported by more than 150 active commercial access permits and collaborations with national parks, private landowners, and regulatory authorities.

Gowers adds that the company aims to sample not only extreme environments but also a broad range of conditions that capture the natural world. Temperature and pH are among the hundreds of measured parameters that contribute to the breadth of BaseData’s sequences.

Give me context

Since its founding in 2019, Basecamp Research has grown to more than 40 people and raised a total of $85 million in funding. In January, the company established a Cambridge lab and office in Kendall Square to accelerate its pursuit of programmable genetic medicines, including a multi-year collaboration with genome editing pioneer David Liu, PhD, core institute member at the Broad Institute of MIT and Harvard, and a Howard Hughes Medical Institute investigator.

Concurrently, Basecamp Research appointed John Finn, PhD, as the company’s CSO. Finn was previously the CSO of Tome Biosciences, where he developed new methods for large gene insertion for therapeutic applications, before that company closed last year.

Gowers said the company aims to add metagenome models that possess an awareness of evolutionary context to the growing repertoire of biological foundation models.

“If we think about genome foundation models, you don’t want to just see one protein in one gene. You want to see the surrounding genome in the same way that ChatGPT reads a word in a sentence and gives it that meaning and context,” Gowers told GEN Edge.

Public datasets are limited by short contexts, often covering just one gene at a time. In contrast, BaseData offers context lengths of 10,000 base pairs or longer, even at a scale of 9 trillion nucleotides.

Gowers says the team is prioritizing positioning BaseData for community impact. The company is currently corresponding with a handful of top pharma companies about granting data access and notes that BaseData-fueled foundation models can be paired with smaller-scale datasets from individual companies for fine-tuned applications.

Additionally, Basecamp Research will offer early access to BaseData to researchers who express interest via the company’s website. The team is working on a follow-up study quantifying the impact of BaseData’s expanded diversity on the performance of biological models.