
Harnessing Diversity: The Human Pangenome and Its Impact on Genetic Research
By Haider Hassan, 2 min reading time
The Human Genome Project (HGP), a global collaboration involving twenty research groups, published the first draft of the human reference genome in 2001. This landmark release has served as the foundation for significant advances in human genomics research and healthcare. Over the years, the original HGP reference has been revised many times, leading to releases such as Genome Reference Consortium Human Build 38 patch release 7 (GRCh38.p7). More recently, the Telomere-to-Telomere (T2T) Consortium published T2T-CHM13, a complete assembly named after the CHM13 cell line it was derived from. The T2T-CHM13 assembly, primarily constructed using long-read sequencing technologies such as PacBio, resolved challenging genomic regions, including repetitive sequences, centromeres, and telomeres, providing a more comprehensive and accurate genome reference.
However, researchers recognized that reference genomes based on only a small number of individuals do not adequately represent global genetic diversity. Crucially, when sequencing data are aligned to a single reference genome alone, over two-thirds of structural variants (SVs), including insertions, deletions, duplications, and translocations, are missed. Addressing this issue is essential because SVs typically have a greater impact on gene function than single nucleotide polymorphisms (SNPs) or small indels.
In a significant development, Liao et al. (2023) published in Nature the first human pangenome reference constructed by the Human Pangenome Reference Consortium (HPRC). The HPRC pangenome integrates high-quality genomic assemblies from diverse populations, aiming to better capture human genomic variation globally.
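To make the idea of a pangenome reference more concrete, here is a minimal Python sketch of a sequence graph, the general kind of structure a graph-based pangenome uses. Everything in it is hypothetical: the node sequences, the path names (`reference`, `sample_A_hap1`, `sample_B_hap1`) and the helper functions `spell` and `nodes_off_reference` are invented for illustration and do not reflect real HPRC data or any tool's file format.

```python
# Toy sequence graph: each node holds a short DNA segment (hypothetical data).
nodes = {1: "ACGT", 2: "TTTGGG", 3: "CCA", 4: "G", 5: "A", 6: "TT"}

# Each haplotype is stored as a walk through the graph (a list of node ids).
paths = {
    "reference":     [1, 3, 4, 6],
    "sample_A_hap1": [1, 2, 3, 4, 6],  # carries an insertion (node 2)
    "sample_B_hap1": [1, 3, 5, 6],     # carries a SNP (node 5 replaces node 4)
}


def spell(walk):
    """Concatenate node sequences along a walk to recover the haplotype."""
    return "".join(nodes[node_id] for node_id in walk)


def nodes_off_reference(all_paths, ref_name="reference"):
    """For each non-reference path, list the nodes the reference never visits."""
    ref_nodes = set(all_paths[ref_name])
    return {
        name: [n for n in walk if n not in ref_nodes]
        for name, walk in all_paths.items()
        if name != ref_name
    }


for name, walk in paths.items():
    print(name, spell(walk))

print(nodes_off_reference(paths))
```

In this toy example the inserted segment in `sample_A_hap1` has no coordinates on the linear `reference` path, but it is an ordinary node in the graph, which is the intuition behind why a pangenome can represent variation that a single reference cannot.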
The initial human pangenome reference from HPRC comprises 47 genomic assemblies: 29 genomes sequenced directly by HPRC and 18 from external efforts. The 29 samples were sourced from 1000 Genomes (1KG) lymphoblastoid cell lines with normal karyotypes and sequenced using PacBio High Fidelity (HiFi), Oxford Nanopore long-read, and Illumina short-read platforms, with HiFi coverage averaging 39.7X. The HiFi reads had an N50 length of 19.6 kb, and the resulting assemblies were highly accurate, with about one error per 227,509 base pairs.
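For readers unfamiliar with the two quality metrics quoted above, the following Python sketch shows how an N50 value is computed from a list of sequence lengths and how "one error per N bases" converts to a Phred-scaled quality value (QV). The function names and the toy length list are assumptions made for illustration; only the 227,509 figure comes from the text.

```python
import math


def n50(lengths):
    """Return the N50: the length L such that sequences of length >= L
    together contain at least half of all bases."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("lengths must be non-empty")


def phred_qv(bases_per_error):
    """Convert 'one error per N bases' into a Phred-scaled quality value."""
    return 10 * math.log10(bases_per_error)


# Hypothetical read or contig lengths, not real HPRC data.
toy_lengths = [2, 2, 2, 3, 3, 4, 8, 8]
print(n50(toy_lengths))             # 8
print(round(phred_qv(227_509), 1))  # 53.6
```

On this scale, one error per 227,509 bases corresponds to a QV of roughly 53.6, where every 10 points means a tenfold drop in error rate.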
Each sample was assembled with Trio-Hifiasm into haplotype-resolved assemblies, one for each parental haplotype, and then annotated with a customized Ensembl pipeline to identify GENCODE genes and transcripts. Alignment of these assemblies to T2T-CHM13 validated their completeness and structural integrity, confirming over 99% coverage of known protein-coding genes.
Pangenome graphs were then built from the 47 genomes with three tools: Minigraph, Minigraph-Cactus (MC), and PanGenome Graph Builder (PGGB). The MC graph proved most effective, showing better alignment accuracy, precision, and recall for small genetic variants than previous reference genomes. The study demonstrated that using the HPRC pangenome improves variant detection over traditional mapping to a single linear reference.
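Precision and recall here compare a set of called variants against a trusted truth set. The sketch below is a simplified illustration, not the benchmarking procedure used in the study: it assumes variants can be matched exactly as `(chromosome, position, ref, alt)` tuples, whereas real benchmarks normalise variant representation first. The function name and both variant lists are hypothetical.

```python
def precision_recall(called, truth):
    """Compute precision, recall and F1 for called variants against a truth set.

    Variants are compared by exact match, which is a simplifying assumption.
    """
    called, truth = set(called), set(truth)
    true_positives = len(called & truth)
    precision = true_positives / len(called) if called else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical truth set and call set: (chromosome, position, ref, alt).
truth_variants = [
    ("chr1", 100, "A", "G"),
    ("chr1", 250, "C", "T"),
    ("chr2", 40, "G", "GA"),
    ("chr2", 900, "TC", "T"),
]
called_variants = [
    ("chr1", 100, "A", "G"),
    ("chr1", 250, "C", "T"),
    ("chr2", 40, "G", "GA"),
    ("chr1", 500, "T", "C"),   # false positive
    ("chr2", 700, "A", "AT"),  # false positive
]

precision, recall, f1 = precision_recall(called_variants, truth_variants)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In this toy case three of five calls are correct (precision 0.60) and three of four true variants are found (recall 0.75). A reference that lets more reads align correctly should raise both numbers.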
Overall, the HPRC pangenome marks a substantial advancement toward a more inclusive and representative human genome reference. Incorporating diverse genetic variations highlights population-specific differences and promises to enhance genetic analyses, ultimately enabling more precise and personalized medical applications.