Improved sequence mapping using a complete reference genome and lift-over

Chen, Nae-Chyun; Paulin, Luis F.; Sedlazeck, Fritz J.; Koren, Sergey; Phillippy, Adam M.; Langmead, Ben

doi:10.1038/s41592-023-02069-6

Article
Published: 30 November 2023

Nature Methods volume 21, pages 41–49 (2024)

Abstract

Complete, telomere-to-telomere (T2T) genome assemblies promise improved analyses and the discovery of new variants, but many essential genomic resources remain associated with older reference genomes. Thus, there is a need to translate genomic features and read alignments between references. Here we describe a method called levioSAM2 that performs fast and accurate lift-over between assemblies using a whole-genome map. In addition to enabling the use of several references, we demonstrate that aligning reads to a high-quality reference (for example, T2T-CHM13) and lifting to an older reference (for example, GRCh38 from the Genome Reference Consortium (GRC)) improves the accuracy of the resulting variant calls on the old reference. By leveraging the quality improvements of T2T-CHM13, levioSAM2 reduces small and structural variant calling errors compared with GRC-based mapping using real short- and long-read datasets. Performance is especially improved for a set of complex medically relevant genes, where the GRC references are lower quality.
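To make the idea of coordinate lift-over concrete, the following is a minimal sketch and not the levioSAM2 implementation. It assumes the whole-genome map has already been reduced to a sorted list of gapless, forward-strand blocks on a single source contig; the names `Block` and `lift` and the toy coordinates are hypothetical and chosen only for illustration. A position inside a block is shifted by that block's offset, and a position outside every block is reported as unliftable.

```python
from bisect import bisect_right
from typing import List, NamedTuple, Optional, Tuple


class Block(NamedTuple):
    """One gapless aligned block between two assemblies (hypothetical layout)."""
    src_start: int   # 0-based start on the source reference
    src_end: int     # exclusive end on the source reference
    dst_contig: str  # contig name on the target reference
    dst_start: int   # 0-based start on the target reference


def lift(blocks: List[Block], pos: int) -> Optional[Tuple[str, int]]:
    """Lift a 0-based source position to the target reference.

    `blocks` must be sorted by src_start and must not overlap.
    Returns (contig, position), or None if `pos` falls in no block.
    Assumption: forward-strand blocks only; inversions are not handled.
    """
    starts = [b.src_start for b in blocks]
    i = bisect_right(starts, pos) - 1   # last block starting at or before pos
    if i < 0:
        return None
    block = blocks[i]
    if pos >= block.src_end:            # pos lies in a gap between blocks
        return None
    return block.dst_contig, block.dst_start + (pos - block.src_start)


# Toy map: two blocks separated by a source-only gap at [100, 150).
TOY_BLOCKS = [
    Block(0, 100, "chr1", 50),
    Block(150, 300, "chr1", 150),
]

for query in (10, 120, 200):
    print(query, lift(TOY_BLOCKS, query))
```

Lifting a read alignment involves more than a single coordinate (for example, the alignment's edit operations and any mate information must also be updated), so this sketch covers only the simplest case of a point lookup.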

**Fig. 2: Small variant calling performance.**

**Fig. 3: Small variant calling performance in difficult regions.**

**Fig. 4: Small and SV calling using PacBio-HiFi reads from HG002.**

**Fig. 5: LevioSAM2 resolved large-scale mapping errors in medically relevant genes.**

**Fig. 6: Runtime of levioSAM2-lift and levioSAM2 workflows.**

Data availability

The Illumina data are from Baid et al.³⁷. The PacBio-HiFi data are from Jarvis et al.⁴¹. The HG002 ONT data were sequenced at the Human Genome Sequencing Center, Baylor College of Medicine, and are available at https://www.ncbi.nlm.nih.gov/sra/PRJNA930475. Source data are provided with this paper.

Code availability

The software is available at https://github.com/milkschen/leviosam2 under the MIT license⁶⁴. The experiments in this paper are documented at https://github.com/milkschen/levioSAM2-experiments, also under the MIT license⁶⁵.

References

1. Schneider, V. A. et al. Evaluation of GRCh38 and de novo haploid genome assemblies demonstrates the enduring quality of the reference assembly. Genome Res. 27, 849–864 (2017).
2. Nurk, S. et al. The complete sequence of a human genome. Science 376, 44–53 (2022).
3. Guo, Y. et al. Improvements and impacts of GRCh38 human reference on high throughput sequencing data analysis. Genomics 109, 83–90 (2017).
4. Aganezov, S. et al. A complete reference genome improves analysis of human genetic variation. Science 376, eabl3533 (2022).
5. 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature 526, 68–74 (2015).
6. Karczewski, K. J. et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581, 434–443 (2020).
7. Sudlow, C. et al. UK Biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Med. 12, e1001779 (2015).
8. Smigielski, E. M., Sirotkin, K., Ward, M. & Sherry, S. T. dbSNP: a database of single nucleotide polymorphisms. Nucleic Acids Res. 28, 352–355 (2000).
9. Mailman, M. D. et al. The NCBI dbGaP database of genotypes and phenotypes. Nat. Genet. 39, 1181–1186 (2007).
10. Taliun, D. et al. Sequencing of 53,831 diverse genomes from the NHLBI TOPMed program. Nature 590, 290–299 (2021).
11. GTEx Consortium. The GTEx Consortium atlas of genetic regulatory effects across human tissues. Science 369, 1318–1330 (2020).
12. Frankish, A. et al. GENCODE reference annotation for the human and mouse genomes. Nucleic Acids Res. 47, D766–D773 (2019).
13. Lowy-Gallego, E. et al. Variant calling on the GRCh38 assembly with the data from phase three of the 1000 genomes project. Wellcome Open Res. 4, 50 (2019).
14. Salzberg, S. L. Next-generation genome annotation: we still struggle to get it right. Genome Biol. 20, 92 (2019).
15. Gao, G. F. et al. Before and after: comparison of legacy and harmonized TCGA genomic data commons’ data. Cell Syst. 9, 24–34 (2019).
16. Lansdon, L. A. et al. Factors affecting migration to GRCh38 in laboratories performing clinical next-generation sequencing. J. Mol. Diagn. 23, 651–657 (2021).
17. Fujita, P. A. et al. The UCSC genome browser database: update 2011. Nucleic Acids Res. 39, D876–D882 (2010).
18. Zhao, H. et al. CrossMap: a versatile tool for coordinate conversion between genome assemblies. Bioinformatics 30, 1006–1007 (2014).
19. Picard toolkit. GitHub https://broadinstitute.github.io/picard/ (2019).
20. Mun, T., Chen, N.-C. & Langmead, B. LevioSAM: fast lift-over of variant-aware reference alignments. Bioinformatics 37, 4243–4245 (2021).
21. Pan, B. et al. Similarities and differences between variants called with human reference genome HG19 or HG38. BMC Bioinformatics 20, 17–29 (2019).
22. Ormond, C., Ryan, N. M., Corvin, A. & Heron, E. A. Converting single nucleotide variants between genome builds: from cautionary tale to solution. Brief. Bioinform. 22, bbab069 (2021).
23. Li, H. et al. Exome variant discrepancies due to reference genome differences. Am. J. Hum. Genet. 108, 1239–1250 (2021).
24. Lansdon, L. A. et al. Clinical validation of genome reference consortium human build 38 in a laboratory utilizing next-generation sequencing technologies. Clin. Chem. 68, 1177–1183 (2022).
25. Behera, S. et al. FixItFelix: improving genomic analysis by fixing reference errors. Genome Biol. 24, 31 (2023).
26. Shumate, A. & Salzberg, S. L. Liftoff: accurate mapping of gene annotations. Bioinformatics 37, 1639–1643 (2021).
27. Chen, N.-C., Solomon, B., Mun, T., Iyer, S. & Langmead, B. Reference flow: reducing reference bias using multiple population genomes. Genome Biol. 22, 8 (2021).
28. Poplin, R. et al. Scaling accurate genetic variant discovery to tens of thousands of samples. Preprint at bioRxiv https://doi.org/10.1101/201178 (2018).
29. Poplin, R. et al. A universal SNP and small-indel variant caller using deep neural networks. Nat. Biotechnol. 36, 983–987 (2018).
30. Wagner, J. et al. Benchmarking challenging small variants with linked and long reads. Cell Genom. 2, 100128 (2022).
31. Wagner, J. et al. Curated variation benchmarks for challenging medically relevant autosomal genes. Nat. Biotechnol. 40, 672–680 (2022).
32. Wenger, A. M. et al. Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome. Nat. Biotechnol. 37, 1155–1162 (2019).
33. Zook, J. M. et al. A robust benchmark for detection of germline large deletions and insertions. Nat. Biotechnol. 38, 1347–1355 (2020).
34. Holtgrewe, M. Mason: A Read Simulator for Second Generation Sequencing Data. Report No. TR-B-10-06 (Technical Reports of Institut für Mathematik und Informatik, Freie Universität Berlin, 2010).
35. Langmead, B. & Salzberg, S. L. Fast gapped-read alignment with Bowtie 2. Nat. Methods 9, 357–359 (2012).
36. Li, H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. Preprint at https://doi.org/10.48550/arXiv.1303.3997 (2013).
37. Baid, G. et al. An extensive sequence dataset of gold-standard samples for benchmarking and development. Preprint at bioRxiv https://doi.org/10.1101/2020.12.11.422022 (2020).
38. Jain, M. et al. Nanopore sequencing and assembly of a human genome with ultra-long reads. Nat. Biotechnol. 36, 338–345 (2018).
39. Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).
40. Jain, C., Rhie, A., Hansen, N. F., Koren, S. & Phillippy, A. M. Long-read mapping to repetitive reference sequences using Winnowmap2. Nat. Methods 19, 705–710 (2022).
41. Jarvis, E. D. et al. Semi-automated assembly of high-quality diploid human reference genomes. Nature 611, 519–531 (2022).
42. Smolka, M. et al. Comprehensive structural variant detection: from mosaic to population-level. Preprint at bioRxiv https://doi.org/10.1101/2022.04.04.487055 (2022).
43. English, A. C., Menon, V. K., Gibbs, R. A., Metcalf, G. A. & Sedlazeck, F. J. Truvari: refined structural variant comparison preserves allelic diversity. Genome Biol. 23, 271 (2022).
44. Cheng, H., Concepcion, G. T., Feng, X., Zhang, H. & Li, H. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat. Methods 18, 170–175 (2021).
45. Mandelker, D. et al. Navigating highly homologous genes in a molecular diagnostic setting: a resource for clinical next-generation sequencing. Genet. Med. 18, 1282–1289 (2016).
46. Thorvaldsdóttir, H., Robinson, J. T. & Mesirov, J. P. Integrative genomics viewer (IGV): high-performance genomics data visualization and exploration. Brief. Bioinform. 14, 178–192 (2013).
47. Talenti, A. & Prendergast, J. nf-LO: a scalable, containerized workflow for genome-to-genome lift over. Genome Biol. Evol. 13, evab183 (2021).
48. Garrison, E. & Guarracino, A. Unbiased pangenome graphs. Bioinformatics 39, btac743 (2023).
49. Ebert, P. et al. Haplotype-resolved diverse human genomes and integrated analysis of structural variation. Science 372, eabf7117 (2021).
50. Delcher, A. L. et al. Alignment of whole genomes. Nucleic Acids Res. 27, 2369–2376 (1999).
51. Chin, C.-S. et al. Multiscale analysis of pangenomes enables improved representation of genomic diversity for repetitive and clinically relevant genes. Nat. Methods 20, 1213–1221 (2023).
52. Gog, S., Beller, T., Moffat, A. & Petri, M. From theory to practice: plug and play with succinct data structures. In Proc. 13th International Symposium on Experimental Algorithms (eds. Gudmundsson, J. & Katajainen, J.) 326–337 (SEA, 2014).
53. Li, H. Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences. Bioinformatics 32, 2103–2110 (2016).
54. Rapid yaml. GitHub https://github.com/biojppm/rapidyaml (2022).
55. Bonfield, J. K. et al. HTSlib: C library for reading/writing high-throughput sequencing data. Gigascience 10, giab007 (2021).
56. Pockrandt, C., Alzamel, M., Iliopoulos, C. S. & Reinert, K. GenMap: ultra-fast computation of genome mappability. Bioinformatics 36, 3687–3692 (2020).
57. Leitner-Ankerl, M. Robin hood unordered map and set. GitHub https://github.com/martinus/robin-hood-hashing (2022).
58. Quinlan, A. R. & Hall, I. M. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26, 841–842 (2010).
59. Danecek, P. et al. Twelve years of SAMtools and BCFtools. Gigascience 10, giab008 (2021).
60. Martin, M. et al. WhatsHap: fast and accurate read-based phasing. Preprint at bioRxiv https://doi.org/10.1101/085050 (2016).
61. Cook, D., Kolesnikov, A., Chang, P.-C. & Carroll, A. Improving variant calling using haplotype information. DeepVariant Blog https://google.github.io/deepvariant/posts/2021-02-08-the-haplotype-channel/ (2021).
62. Krusche, P. et al. Best practices for benchmarking germline small-variant calls in human genomes. Nat. Biotechnol. 37, 555–560 (2019).
63. Gordon, A. GNU time. https://www.gnu.org/software/time/ (2018).
64. Chen, N.-C. leviosam2. Zenodo https://doi.org/10.5281/zenodo.8198490 (2023).
65. Chen, N.-C. levioSAM2-experiments v.0.1. Zenodo https://doi.org/10.5281/zenodo.8198541 (2023).

Acknowledgements

We thank T. Mun for his advice and contribution to the levioSAM2 programming infrastructure. We appreciate advice from H.-C. Chen on software deployment, C. Pockrandt on mappability resources, A. Shumate on gene lift-over and S. Zarate on T2T-CHM13 variant analysis. We also thank A. Carroll and P.-C. Chang for DeepVariant discussions, A. Rhie for T2T-CHM13 discussions and J. Zook for GIAB strata suggestions. N.-C.C. and B.L. were supported by National Institutes of Health (NIH) grants R01HG011392 and R35GM139602 to B.L. F.J.S. and L.F.P. were supported by NIH grants 1U01HG011758-01 and UM1HG008898. S.K. and A.M.P. were supported by the Intramural Research Program of the National Human Genome Research Institute (NHGRI), NIH. Part of this research project was conducted using computational resources at the Maryland Advanced Research Computing Center (MARCC). Prebuilt levioSAM2 resources for T2T-CHM13 to GRC references are made freely available on Amazon Web Services thanks to the AWS Public Dataset Program.

Author information

Authors and Affiliations

Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
Nae-Chyun Chen & Ben Langmead
Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA
Luis F. Paulin & Fritz J. Sedlazeck
Department of Computer Science, Rice University, Houston, TX, USA
Fritz J. Sedlazeck
Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
Sergey Koren & Adam M. Phillippy

Contributions

N.-C.C., S.K., A.M.P. and B.L. designed the method. N.-C.C. wrote the software. N.-C.C. and L.F.P. performed the experiments. N.-C.C., L.F.P., F.J.S., S.K., A.M.P. and B.L. performed the analysis and wrote the manuscript. All authors read and approved the final manuscript.

Corresponding authors

Correspondence to Nae-Chyun Chen or Ben Langmead.

Ethics declarations

Competing interests

N.-C.C. is an employee of Exai Bio. L.F.P. received funding from Genentech and travel funds to speak at events hosted by Oxford Nanopore Technologies (ONT). F.J.S. received research support from Genentech, Illumina, PacBio and ONT. S.K. has received travel funds to speak at events hosted by ONT. The remaining authors declare no competing interests.

Peer review

Peer review information

Nature Methods thanks Jan Korbel, Erik Garrison and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available. Primary Handling Editor: Lin Tang, in collaboration with the Nature Methods team.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Regions unique to T2T-CHM13 and variant calls within.

Regions unique to T2T-CHM13 compared to GRCh38 (blue) and high-quality HG002 variant calls from DeepVariant in these regions (red).

Extended Data Fig. 2 Mapping accuracy comparison using simulated data.

Mapping accuracy using simulated reads that carry GRCh38-based HG001 genotypes.

Extended Data Fig. 3 Peak memory usage comparison.

Peak memory usage of levioSAM2 and direct-to-GRC pipelines using a real 30× WGS dataset from HG002. The alignment steps used BWA-MEM. The lift-over tasks (‘CHM13-to-GRCh38’ and ‘CHM13-to-GRCh37’) excluded the cost of the initial mapping step to T2T-CHM13 (‘CHM13’).

Extended Data Fig. 4 Small-variant calling performance using DeepVariant.

Small variant calling performance in difficult regions using DeepVariant. A. Small variant calling accuracy in major difficult genomic regions for HG002. B. GIAB stratified regions with top small variant calling error reduction densities by levioSAM2.

Extended Data Fig. 5 An SV call on which the GIAB Tier 1 SV callset and personalized assemblies disagree.

IGV visualization near chr5:21,543,010 for the HG002 PacBio-HiFi dataset. The reads were grouped using the allele at chr5:21,543,010. A 174-bp DEL was called when using direct-to-GRCh37, matching the GIAB Tier 1 SV callset. However, personalized whole-genome assemblies showed mappings of non-GRCh37 haplotypes in this region (the ‘2’ alignment in ‘HG002 Hap1’ and ‘HG002 Hap2’ tracks), suggesting collapsed mapping. The CHM13-to-GRCh37 mappings showed better concordance with the personalized HG002 assemblies.

Extended Data Fig. 6 An example of improved mapping using ONT data.

IGV visualization near chr7:125,400,000 (located in the KMT2C gene) for the HG002 ONT dataset. Four FP SV calls were made when aligning reads directly to GRCh38 because of large-scale mapping collapse. The levioSAM2 workflow (‘CHM13-to-GRCh38’) generated improved alignments and did not result in the FP SV calls.

Supplementary information

Supplementary Information

Supplementary Notes 1 and 2, Figs. 1–9 and Tables 1–19.

Reporting Summary

Peer Review File

Source data

Source Data Fig. 2

Variant calling summary reports generated using hap.py for HG001, HG002 and HG005 Illumina data.

Source Data Fig. 3

Stratified variant calling (GATK) summary generated using hap.py for HG002 Illumina data.

Source Data Fig. 4

Variant calling summary reports of HG002 PacBio-HiFi data.

Source Data Fig. 6

Computational efficiency reports.

Source Data Extended Data Fig. 4

Stratified variant calling (DeepVariant) summary generated using hap.py for HG002 Illumina data.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Chen, N.-C., Paulin, L.F., Sedlazeck, F.J. et al. Improved sequence mapping using a complete reference genome and lift-over. Nat Methods 21, 41–49 (2024). https://doi.org/10.1038/s41592-023-02069-6

Received: 27 April 2022
Accepted: 09 October 2023
Published: 30 November 2023
Issue Date: January 2024
DOI: https://doi.org/10.1038/s41592-023-02069-6
