The 1000 Chinese Pangenome Transforming Genetics

A study built on the 1000 Chinese Pangenome (1KCP) dataset has catalogued genetic variation well beyond what traditional short-read sequencing can resolve. The resource covers a range of variant classes, from single nucleotide variants (SNVs) and indels to complex variants, including structural variants (SVs), tandem repeats (TRs), and nested variants, that previous datasets often overlooked. By integrating these variant types into unified analytical frameworks, the research takes a pan-variant approach that promises finer resolution and new insights into the genetic regulation of human traits.

The core of the work is the application of pan-variant analysis to gene expression data from a cohort of 1,101 individuals. Using RNA sequencing to quantify the expression of more than 21,000 genes, both protein-coding and non-coding, the researchers mapped how distinct variant types modulate gene activity. A key methodological step was the use of multicomponent genetic-relationship-matrix-based restricted maximum likelihood (GREML) models to dissect cis-heritability: the proportion of variance in gene expression attributable to genetic variation within a 1 Mb window flanking transcription start sites.
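To make the cis-heritability definition concrete, the sketch below selects the variants that fall in a gene's cis window and converts per-class variance components into heritability shares. It is an illustration under stated assumptions, not the study's pipeline: the `Variant` type, both function names, the reading of "1 Mb window" as 1 Mb on each side of the transcription start site, and all numbers are hypothetical. Estimating the variance components themselves requires a GREML fit, which is not shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    """A variant reduced to the fields needed for window selection."""
    variant_id: str
    chrom: str
    pos: int        # 1-based position of the variant's first base
    var_class: str  # e.g. "SNV", "indel", "SV", "TR", "nested"


def cis_variants(variants, chrom, tss, flank=1_000_000):
    """Return variants on `chrom` lying within `flank` bp of the TSS.

    Assumption: the window extends `flank` bp on EACH side of the TSS.
    """
    lo, hi = max(1, tss - flank), tss + flank
    return [v for v in variants if v.chrom == chrom and lo <= v.pos <= hi]


def partition_cis_heritability(components, residual):
    """Turn per-class variance components into heritability shares.

    `components` maps a variant class to its estimated variance component;
    `residual` is the residual variance. Returns (cis_h2, shares), where
    shares[c] is the fraction of cis-heritability assigned to class c.
    """
    genetic = sum(components.values())
    total = genetic + residual
    if genetic <= 0 or total <= 0:
        raise ValueError("variance components must sum to a positive value")
    cis_h2 = genetic / total
    shares = {c: v / genetic for c, v in components.items()}
    return cis_h2, shares


# Toy, made-up inputs purely to exercise the functions.
toy = [
    Variant("v1", "chr1", 1_200_000, "SNV"),
    Variant("v2", "chr1", 2_600_000, "SV"),   # outside the window
    Variant("v3", "chr2", 1_500_000, "TR"),   # wrong chromosome
    Variant("v4", "chr1", 1_999_999, "TR"),
]
selected = cis_variants(toy, "chr1", tss=1_500_000)
cis_h2, shares = partition_cis_heritability(
    {"SNV_indel": 0.21, "complex": 0.03}, residual=0.76
)
```

Under this reading, the share of cis-heritability attributed to a variant class is that class's variance component divided by the sum of all genetic components for the gene.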

This analysis shows that complex variants account for 12.6% of cis-heritability among genes with moderate to high heritability, underscoring their substantive role in transcriptional regulation. The finding challenges the prevailing SNV- and indel-centric view and highlights complex variants as important contributors to gene expression variation. Beyond variance components, the team performed cis-expression quantitative trait locus (cis-eQTL) mapping, identifying nearly 16,000 genes significantly associated with over 2.6 million unique eVariants at stringent false discovery rates. Fine-mapping refined these associations, uncovering thousands of credible sets that contain complex variants with high posterior inclusion probabilities, with tandem repeats showing notable enrichment.
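For readers unfamiliar with credible sets, the sketch below shows the usual construction from posterior inclusion probabilities (PIPs): rank variants by PIP and keep adding them until the cumulative probability reaches a target coverage. The 95% coverage, the single-signal assumption, the function name, and the toy PIPs are all assumptions made for illustration; the article does not state which fine-mapping tool or settings the authors used.

```python
def credible_set(pips, coverage=0.95):
    """Smallest set of variants whose PIPs sum to at least `coverage`.

    `pips` maps variant ID -> posterior inclusion probability for ONE
    association signal (assumed to sum to roughly 1). If the PIPs never
    reach `coverage`, every variant is returned.
    """
    if not 0 < coverage <= 1:
        raise ValueError("coverage must be in (0, 1]")
    # Highest-probability variants first.
    ranked = sorted(pips.items(), key=lambda kv: kv[1], reverse=True)
    chosen, total = [], 0.0
    for variant_id, pip in ranked:
        chosen.append(variant_id)
        total += pip
        if total >= coverage:
            break
    return chosen, total


# Made-up PIPs for one hypothetical signal with mixed variant classes.
toy_pips = {"snv_a": 0.05, "tr_len": 0.62, "sv_del": 0.30, "indel_b": 0.03}
members, mass = credible_set(toy_pips)
```

In this toy signal the tandem repeat carries the largest PIP, so it enters the set first; a credible set "harboring" a complex variant simply means such a variant is among the selected members.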

Looking more closely, the authors identified a subset of structural variants serving as lead eVariants, pointing to regulatory mechanisms embedded within genomic rearrangements. Numerous additional lead eVariants, including SNVs, indels, and nested variants, were located within structural variant sites, illustrating how alleles at these loci can jointly shape regulation. For example, one locus carrying an insertion of transposable elements alongside enhancers featured multiple nested variants that are invisible to conventional linear reference genomes yet strongly correlated with the expression of ADAM10 and MINDY2, genes implicated in diverse biological processes.

The discovery of nested variants embedded within SV sites offers evidence of their regulatory importance. Annotating these variants in their pangenomic context reveals a pronounced enrichment within genic regions such as untranslated regions and non-coding exons, as well as in promoters and enhancers, all hallmarks of gene regulatory architecture. Certain repeat elements, such as short interspersed nuclear elements (SINEs) and long terminal repeats (LTRs), also show moderate enrichment, suggesting regulatory roles that go beyond simple sequence repetition.

Tandem repeats stand out in this landscape: nearly 2,500 lead eVariants were TR variants, subdivided into length and motif categories. Motif variants capture genetic signals distinct from those of length variants, indicating multiple layers of sequence variation influencing gene expression. At a locus regulating MAD1L1 expression, both the copy number of the TR and the exact sequence of a 28-base-pair motif independently associate with gene activity, together explaining 81% of expression variance. Such findings underscore the multifaceted genetic architecture shaping transcription.
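The idea that two features of the same tandem repeat can jointly explain expression variance can be illustrated with an ordinary least-squares fit on two predictors. The sketch below is a toy example only: the data are synthetic, the function names are hypothetical, and plain linear regression stands in for whatever model the authors actually used. It compares the variance explained (R²) by repeat length alone, motif dosage alone, and both together.

```python
def _centered(xs):
    """Subtract the mean so the intercept drops out of the regression."""
    mean = sum(xs) / len(xs)
    return [x - mean for x in xs]


def _dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def r2_one(y, x):
    """R^2 of y regressed on an intercept and a single predictor."""
    yc, xc = _centered(y), _centered(x)
    return _dot(xc, yc) ** 2 / (_dot(xc, xc) * _dot(yc, yc))


def r2_two(y, x1, x2):
    """R^2 of y regressed on an intercept and two predictors.

    Solves the 2x2 centered normal equations directly.
    """
    yc, a, b = _centered(y), _centered(x1), _centered(x2)
    s11, s22, s12 = _dot(a, a), _dot(b, b), _dot(a, b)
    s1y, s2y = _dot(a, yc), _dot(b, yc)
    det = s11 * s22 - s12 * s12
    if det == 0:
        raise ValueError("predictors are perfectly collinear")
    b1 = (s1y * s22 - s2y * s12) / det
    b2 = (s2y * s11 - s1y * s12) / det
    explained = b1 * s1y + b2 * s2y
    return explained / _dot(yc, yc)


# Synthetic data: repeat copy number, motif-allele dosage, and expression
# generated from both plus a small fixed perturbation.
length = [8, 9, 10, 11, 12, 9, 10, 11]
motif = [0, 1, 0, 2, 1, 2, 0, 1]
noise = [0.1, -0.2, 0.05, 0.0, -0.1, 0.2, -0.05, 0.0]
expr = [1.0 + 0.5 * l + 0.8 * m + e for l, m, e in zip(length, motif, noise)]

r2_length = r2_one(expr, length)
r2_motif = r2_one(expr, motif)
r2_joint = r2_two(expr, length, motif)
```

When the two predictors carry distinct signals, the joint R² exceeds either single-predictor R², which is the pattern the MAD1L1 example describes.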

The work has direct implications for genome-wide association studies (GWAS), which have historically overlooked complex variants and so have had limited discovery power. By integrating eQTL data from the enriched 1KCP variant catalogue with GWAS signals across 215 traits from BioBank Japan, the study identified over 1,500 colocalization events indicating shared genetic etiology. More than one hundred of these involved complex variants as lead candidates, exemplified by an 18-kilobase deletion encompassing the GSTM1 gene. This deletion correlates with altered expression of GSTM1 and the neighboring GSTM2 gene and colocalizes with platelet count GWAS loci, implicating it in hematological traits.

Altogether, this integrative pan-variant eQTL study provides a transformative perspective on the complexity of genetic regulation. It highlights the previously concealed influence of structural and nested variants, as well as tandem repeats, in modulating gene expression and shaping phenotypic variation. By expanding the scope of genetic variation considered in association analyses, researchers can now access a richer, more nuanced understanding of the molecular underpinnings of complex traits and diseases.

This research not only adds to the medical genetics toolkit but also pushes population genomics toward a pangenomic view, encouraging the community to consider the full spectrum of genomic diversity. The insights gained pave the way for future work on the mechanistic bases of variant function and their translational potential, particularly in populations that have been historically underrepresented in genomic studies.

The 1KCP dataset shows what comprehensive sequencing and advanced computational methods can achieve, offering a blueprint for global initiatives that aim to characterize human genomic complexity. By including complex variant classes, this work opens new ground in linking genotype to phenotype, with broad implications for precision medicine and evolutionary biology.

As genomic research continues to evolve, combining high-resolution variant catalogs with functional genomics and sophisticated statistical models promises a more complete picture of gene regulation. The 1000 Chinese Pangenome and its pan-variant analytical framework are at the forefront of this effort, supporting discoveries that will enrich our understanding of human biology and disease susceptibility.

This study's integrative approach marks a significant stride toward holistic genomic medicine, emphasizing the need to account for the full complement of genomic variation, not only single nucleotide changes, in both research and clinical contexts. Future work will build on these foundational insights to explore how complex variants interact with environmental factors and other genetic elements to influence human health and disease.

Subject of Research: Genetic variation, gene expression regulation, and complex trait association through pan-variant analysis in the 1000 Chinese Pangenome.

Article Title: The 1000 Chinese Pangenome empowers medical and population genetics.

Article References:
Wang, Y., Duan, Z., Chen, D. et al. The 1000 Chinese Pangenome empowers medical and population genetics. Nature (2026). https://doi.org/10.1038/s41586-026-10315-y

DOI: https://doi.org/10.1038/s41586-026-10315-y

Tags: 1000 Chinese Pangenome dataset, cis-heritability genetics research, complex genetic variants analysis, gene expression regulation genetics, genetic regulation of human traits, genetic variation beyond short-read sequencing, GREML models in genetics, nested genetic variants study, pan-variant approach in genomics, RNA sequencing in gene expression, structural variants in human genetics, tandem repeats genetic impact