Improved sequence mapping using a complete reference genome and lift-over

Abstract

Complete, telomere-to-telomere (T2T) genome assemblies promise improved analyses and the discovery of new variants, but many essential genomic resources remain associated with older reference genomes. Thus, there is a need to translate genomic features and read alignments between references. Here we describe a method called levioSAM2 that performs fast and accurate lift-over between assemblies using a whole-genome map. In addition to enabling the use of several references, we demonstrate that aligning reads to a high-quality reference (for example, T2T-CHM13) and lifting to an older reference (for example, GRCh38 from the Genome Reference Consortium (GRC)) improves the accuracy of the resulting variant calls on the old reference. By leveraging the quality improvements of T2T-CHM13, levioSAM2 reduces small and structural variant calling errors compared with GRC-based mapping using real short- and long-read datasets. Performance is especially improved for a set of complex medically relevant genes, where the GRC references are of lower quality.
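
As a rough illustration of what coordinate lift-over with a whole-genome map involves, the following Python sketch translates positions from a source reference to a target reference using a sorted list of aligned blocks. It is a toy under stated assumptions, not levioSAM2's implementation: the `Block` and `LiftMap` names, the block layout and the example coordinates are all hypothetical, and the sketch ignores strand, alignment (CIGAR) updates and mapping-quality handling.

```python
from bisect import bisect_right
from typing import Dict, List, NamedTuple, Optional, Tuple


class Block(NamedTuple):
    """One gap-free aligned block between two references (hypothetical layout)."""
    src_start: int   # 0-based start on the source reference
    src_end: int     # exclusive end on the source reference
    dst_contig: str  # contig name on the target reference
    dst_start: int   # start of the matching block on the target reference


class LiftMap:
    """Toy whole-genome map: per-contig blocks sorted by source start."""

    def __init__(self, blocks: Dict[str, List[Block]]) -> None:
        self.blocks = {c: sorted(bl) for c, bl in blocks.items()}
        self.starts = {c: [b.src_start for b in bl]
                       for c, bl in self.blocks.items()}

    def lift(self, contig: str, pos: int) -> Optional[Tuple[str, int]]:
        """Return (target contig, target position), or None if unliftable."""
        starts = self.starts.get(contig)
        if not starts:
            return None
        # Binary search for the last block starting at or before pos.
        i = bisect_right(starts, pos) - 1
        if i < 0:
            return None
        b = self.blocks[contig][i]
        if pos >= b.src_end:
            return None  # pos falls in a gap between aligned blocks
        return (b.dst_contig, b.dst_start + (pos - b.src_start))


# Made-up example: the target carries a 50-bp insertion after position 100,
# and source positions 200-249 have no counterpart on the target.
toy = LiftMap({
    "chr1": [
        Block(0, 100, "chr1", 0),
        Block(100, 200, "chr1", 150),
        Block(250, 300, "chr1", 300),
    ]
})
```

With this toy map, `toy.lift("chr1", 100)` yields a position shifted by the 50-bp insertion, while `toy.lift("chr1", 220)` returns `None` because the position lies in an unaligned gap. A read whose position cannot be lifted would need separate handling, for example re-alignment to the target reference.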

Fig. 1: Overview of levioSAM2.
Fig. 2: Small variant calling performance.
Fig. 3: Small variant calling performance in difficult regions.
Fig. 4: Small and SV calling using PacBio-HiFi reads from HG002.
Fig. 5: LevioSAM2 resolved large-scale mapping errors in medically relevant genes.
Fig. 6: Runtime of levioSAM2-lift and levioSAM2 workflows.

Data availability

The Illumina data are from Baid et al. (ref. 37). The PacBio-HiFi data are from Jarvis et al. (ref. 41). The HG002 ONT data were sequenced at the Human Genome Sequencing Center, Baylor College of Medicine, and are available at https://www.ncbi.nlm.nih.gov/sra/PRJNA930475. Source data are provided with this paper.

Code availability

The software is available at https://github.com/milkschen/leviosam2 under the MIT license (ref. 64). The experiments described in this paper are documented at https://github.com/milkschen/levioSAM2-experiments, also under the MIT license (ref. 65).

References

  1. Schneider, V. A. et al. Evaluation of GRCh38 and de novo haploid genome assemblies demonstrates the enduring quality of the reference assembly. Genome Res. 27, 849–864 (2017).

  2. Nurk, S. et al. The complete sequence of a human genome. Science 376, 44–53 (2022).

  3. Guo, Y. et al. Improvements and impacts of GRCh38 human reference on high throughput sequencing data analysis. Genomics 109, 83–90 (2017).

  4. Aganezov, S. et al. A complete reference genome improves analysis of human genetic variation. Science 376, eabl3533 (2022).

  5. 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature 526, 68–74 (2015).

  6. Karczewski, K. J. et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581, 434–443 (2020).

  7. Sudlow, C. et al. UK Biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Med. 12, e1001779 (2015).

  8. Smigielski, E. M., Sirotkin, K., Ward, M. & Sherry, S. T. dbSNP: a database of single nucleotide polymorphisms. Nucleic Acids Res. 28, 352–355 (2000).

  9. Mailman, M. D. et al. The NCBI dbGaP database of genotypes and phenotypes. Nat. Genet. 39, 1181–1186 (2007).

  10. Taliun, D. et al. Sequencing of 53,831 diverse genomes from the NHLBI TOPMed program. Nature 590, 290–299 (2021).

  11. GTEx Consortium. The GTEx Consortium atlas of genetic regulatory effects across human tissues. Science 369, 1318–1330 (2020).

  12. Frankish, A. et al. GENCODE reference annotation for the human and mouse genomes. Nucleic Acids Res. 47, D766–D773 (2019).

  13. Lowy-Gallego, E. et al. Variant calling on the GRCh38 assembly with the data from phase three of the 1000 genomes project. Wellcome Open Res. 4, 50 (2019).

  14. Salzberg, S. L. Next-generation genome annotation: we still struggle to get it right. Genome Biol. 20, 92 (2019).

  15. Gao, G. F. et al. Before and after: comparison of legacy and harmonized TCGA genomic data commons’ data. Cell Syst. 9, 24–34 (2019).

  16. Lansdon, L. A. et al. Factors affecting migration to GRCh38 in laboratories performing clinical next-generation sequencing. J. Mol. Diagn. 23, 651–657 (2021).

  17. Fujita, P. A. et al. The UCSC genome browser database: update 2011. Nucleic Acids Res. 39, D876–D882 (2010).

  18. Zhao, H. et al. CrossMap: a versatile tool for coordinate conversion between genome assemblies. Bioinformatics 30, 1006–1007 (2014).

  19. Picard toolkit. GitHub https://broadinstitute.github.io/picard/ (2019).

  20. Mun, T., Chen, N.-C. & Langmead, B. LevioSAM: fast lift-over of variant-aware reference alignments. Bioinformatics 37, 4243–4245 (2021).

  21. Pan, B. et al. Similarities and differences between variants called with human reference genome hg19 or hg38. BMC Bioinformatics 20, 17–29 (2019).

  22. Ormond, C., Ryan, N. M., Corvin, A. & Heron, E. A. Converting single nucleotide variants between genome builds: from cautionary tale to solution. Brief. Bioinform. 22, bbab069 (2021).

  23. Li, H. et al. Exome variant discrepancies due to reference genome differences. Am. J. Hum. Genet. 108, 1239–1250 (2021).

  24. Lansdon, L. A. et al. Clinical validation of genome reference consortium human build 38 in a laboratory utilizing next-generation sequencing technologies. Clin. Chem. 68, 1177–1183 (2022).

  25. Behera, S. et al. FixItFelix: improving genomic analysis by fixing reference errors. Genome Biol. 24, 31 (2023).

  26. Shumate, A. & Salzberg, S. L. Liftoff: accurate mapping of gene annotations. Bioinformatics 37, 1639–1643 (2021).

  27. Chen, N.-C., Solomon, B., Mun, T., Iyer, S. & Langmead, B. Reference flow: reducing reference bias using multiple population genomes. Genome Biol. 22, 8 (2021).

  28. Poplin, R. et al. Scaling accurate genetic variant discovery to tens of thousands of samples. Preprint at bioRxiv https://doi.org/10.1101/201178 (2018).

  29. Poplin, R. et al. A universal SNP and small-indel variant caller using deep neural networks. Nat. Biotechnol. 36, 983–987 (2018).

  30. Wagner, J. et al. Benchmarking challenging small variants with linked and long reads. Cell Genom. 2, 100128 (2022).

  31. Wagner, J. et al. Curated variation benchmarks for challenging medically relevant autosomal genes. Nat. Biotechnol. 40, 672–680 (2022).

  32. Wenger, A. M. et al. Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome. Nat. Biotechnol. 37, 1155–1162 (2019).

  33. Zook, J. M. et al. A robust benchmark for detection of germline large deletions and insertions. Nat. Biotechnol. 38, 1347–1355 (2020).

  34. Holtgrewe, M. Mason: A Read Simulator for Second Generation Sequencing Data. Report No. TR-B-10-06 (Technical Reports of Institut für Mathematik und Informatik, Freie Universität Berlin, 2010).

  35. Langmead, B. & Salzberg, S. L. Fast gapped-read alignment with Bowtie 2. Nat. Methods 9, 357–359 (2012).

  36. Li, H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. Preprint at https://doi.org/10.48550/arXiv.1303.3997 (2013).

  37. Baid, G. et al. An extensive sequence dataset of gold-standard samples for benchmarking and development. Preprint at bioRxiv https://doi.org/10.1101/2020.12.11.422022 (2020).

  38. Jain, M. et al. Nanopore sequencing and assembly of a human genome with ultra-long reads. Nat. Biotechnol. 36, 338–345 (2018).

  39. Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).

  40. Jain, C., Rhie, A., Hansen, N. F., Koren, S. & Phillippy, A. M. Long-read mapping to repetitive reference sequences using Winnowmap2. Nat. Methods 19, 705–710 (2022).

  41. Jarvis, E. D. et al. Semi-automated assembly of high-quality diploid human reference genomes. Nature 611, 519–531 (2022).

  42. Smolka, M. et al. Comprehensive structural variant detection: from mosaic to population-level. Preprint at bioRxiv https://doi.org/10.1101/2022.04.04.487055 (2022).

  43. English, A. C., Menon, V. K., Gibbs, R. A., Metcalf, G. A. & Sedlazeck, F. J. Truvari: refined structural variant comparison preserves allelic diversity. Genome Biol. 23, 271 (2022).

  44. Cheng, H., Concepcion, G. T., Feng, X., Zhang, H. & Li, H. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat. Methods 18, 170–175 (2021).

  45. Mandelker, D. et al. Navigating highly homologous genes in a molecular diagnostic setting: a resource for clinical next-generation sequencing. Genet. Med. 18, 1282–1289 (2016).

  46. Thorvaldsdóttir, H., Robinson, J. T. & Mesirov, J. P. Integrative genomics viewer (IGV): high-performance genomics data visualization and exploration. Brief. Bioinform. 14, 178–192 (2013).

  47. Talenti, A. & Prendergast, J. nf-LO: a scalable, containerized workflow for genome-to-genome lift over. Genome Biol. Evol. 13, evab183 (2021).

  48. Garrison, E. & Guarracino, A. Unbiased pangenome graphs. Bioinformatics 39, btac743 (2023).

  49. Ebert, P. et al. Haplotype-resolved diverse human genomes and integrated analysis of structural variation. Science 372, eabf7117 (2021).

  50. Delcher, A. L. et al. Alignment of whole genomes. Nucleic Acids Res. 27, 2369–2376 (1999).

  51. Chin, C.-S. et al. Multiscale analysis of pangenomes enables improved representation of genomic diversity for repetitive and clinically relevant genes. Nat. Methods 20, 1213–1221 (2023).

  52. Gog, S., Beller, T., Moffat, A. & Petri, M. From theory to practice: plug and play with succinct data structures. In Proc. 13th International Symposium on Experimental Algorithms (eds. Gudmundsson, J. & Katajainen, J.) 326–337 (SEA, 2014).

  53. Li, H. Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences. Bioinformatics 32, 2103–2110 (2016).

  54. Rapid yaml. GitHub https://github.com/biojppm/rapidyaml (2022).

  55. Bonfield, J. K. et al. HTSlib: C library for reading/writing high-throughput sequencing data. Gigascience 10, giab007 (2021).

  56. Pockrandt, C., Alzamel, M., Iliopoulos, C. S. & Reinert, K. GenMap: ultra-fast computation of genome mappability. Bioinformatics 36, 3687–3692 (2020).

  57. Leitner-Ankerl, M. Robin hood unordered map and set. GitHub https://github.com/martinus/robin-hood-hashing (2022).

  58. Quinlan, A. R. & Hall, I. M. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26, 841–842 (2010).

  59. Danecek, P. et al. Twelve years of SAMtools and BCFtools. Gigascience 10, giab008 (2021).

  60. Martin, M. et al. WhatsHap: fast and accurate read-based phasing. Preprint at bioRxiv https://doi.org/10.1101/085050 (2016).

  61. Cook, D., Kolesnikov, A., Chang, P.-C. & Carroll, A. Improving variant calling using haplotype information. DeepVariant Blog https://google.github.io/deepvariant/posts/2021-02-08-the-haplotype-channel/ (2021).

  62. Krusche, P. et al. Best practices for benchmarking germline small-variant calls in human genomes. Nat. Biotechnol. 37, 555–560 (2019).

  63. Gordon, A. GNU time. https://www.gnu.org/software/time/ (2018).

  64. Chen, N.-C. leviosam2. Zenodo https://doi.org/10.5281/zenodo.8198490 (2023).

  65. Chen, N.-C. levioSAM2-experiments v.0.1. Zenodo https://doi.org/10.5281/zenodo.8198541 (2023).

Acknowledgements

We thank T. Mun for his advice and contribution to the levioSAM2 programming infrastructure. We appreciate advice from H.-C. Chen on software deployment, C. Pockrandt on mappability resources, A. Shumate on gene lift-over and S. Zarate on T2T-CHM13 variant analysis. We also thank A. Carroll and P.-C. Chang for DeepVariant discussions, A. Rhie for T2T-CHM13 discussions and J. Zook for GIAB strata suggestions. N.-C.C. and B.L. were supported by National Institutes of Health (NIH) grants R01HG011392 and R35GM139602 to B.L. F.J.S. and L.F.P. were supported by NIH grants 1U01HG011758-01 and UM1HG008898. S.K. and A.M.P. were supported by the Intramural Research Program of the National Human Genome Research Institute (NHGRI), NIH. Part of this research project was conducted using computational resources at the Maryland Advanced Research Computing Center (MARCC). Prebuilt levioSAM2 resources for T2T-CHM13 to GRC references are made freely available on Amazon Web Services thanks to the AWS Public Dataset Program.

Author information

Contributions

N.-C.C., S.K., A.M.P. and B.L. designed the method. N.-C.C. wrote the software. N.-C.C. and L.F.P. performed the experiments. N.-C.C., L.F.P., F.J.S., S.K., A.M.P. and B.L. performed analysis and wrote the manuscript. All authors read and approved the final manuscript.

Corresponding authors

Correspondence to Nae-Chyun Chen or Ben Langmead.

Ethics declarations

Competing interests

N.-C.C. is an employee of Exai Bio. L.F.P. received funding from Genentech and travel funds to speak at events hosted by Oxford Nanopore Technologies (ONT). F.J.S. received research support from Genentech, Illumina, PacBio and ONT. S.K. has received travel funds to speak at events hosted by ONT. The remaining authors declare no competing interests.

Peer review

Peer review information

Nature Methods thanks Jan Korbel, Erik Garrison and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available. Primary Handling Editor: Lin Tang, in collaboration with the Nature Methods team.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Regions unique to T2T-CHM13 and variant calls within.

Regions unique to T2T-CHM13 compared to GRCh38 (blue) and high-quality HG002 variant calls from DeepVariant in these regions (red).

Extended Data Fig. 2 Mapping accuracy comparison using simulated data.

Mapping accuracy using simulated reads that carry GRCh38-based HG001 genotypes.

Extended Data Fig. 3 Peak memory usage comparison.

Peak memory usage of levioSAM2 and direct-to-GRC pipelines using a real 30 × WGS dataset from HG002. The alignment steps used BWA-MEM. The lift-over tasks (‘CHM13-to-GRCh38’ and ‘CHM13-to-GRCh37’) excluded the cost of the initial mapping step to T2T-CHM13 (‘CHM13’).

Extended Data Fig. 4 Small-variant calling performance using DeepVariant.

Small variant calling performance in difficult regions using DeepVariant. A. Small variant calling accuracy in major difficult genomic regions for HG002. B. GIAB stratified regions with top small variant calling error reduction densities by levioSAM2.

Extended Data Fig. 5 A disagreed SV call between the GIAB Tier 1 SV callset and personalized assemblies.

IGV visualization near chr5:21,543,010 for the HG002 PacBio-HiFi dataset. The reads were grouped using the allele at chr5:21,543,010. A 174-bp DEL was called when using direct-to-GRCh37, matching the GIAB Tier 1 SV callset. However, personalized whole-genome assemblies showed mappings of non-GRCh37 haplotypes in this region (the ‘2’ alignment in ‘HG002 Hap1’ and ‘HG002 Hap2’ tracks), suggesting collapsed mapping. The CHM13-to-GRCh37 mappings showed better concordance with the personalized HG002 assemblies.

Extended Data Fig. 6 An example of improved mapping using ONT data.

IGV visualization near chr7:125,400,000 (located in the KMT2C gene) for the HG002 ONT dataset. Four FP SV calls were made when aligning reads directly to GRCh38 because of large-scale mapping collapse. The levioSAM2 workflow (‘CHM13-to-GRCh38’) generated improved alignments and did not result in the FP SV calls.

Supplementary information

Supplementary Information

Supplementary Notes 1 and 2, Figs. 1–9 and Tables 1–19.

Reporting Summary

Peer Review File

Source data

Source Data Fig. 2

Variant calling summary reports generated using hap.py for HG001, HG002 and HG005 Illumina data.

Source Data Fig. 3

Stratified variant calling (GATK) summary generated using hap.py for HG002 Illumina data.

Source Data Fig. 4

Variant calling summary reports of HG002 PacBio-HiFi data.

Source Data Fig. 6

Computational efficiency reports.

Source Data Extended Data Fig. 4

Stratified variant calling (DeepVariant) summary generated using hap.py for HG002 Illumina data.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Chen, NC., Paulin, L.F., Sedlazeck, F.J. et al. Improved sequence mapping using a complete reference genome and lift-over. Nat Methods 21, 41–49 (2024). https://doi.org/10.1038/s41592-023-02069-6

  • DOI: https://doi.org/10.1038/s41592-023-02069-6
