Constructing telomere-to-telomere diploid genome by polishing haploid nanopore-based assembly

Darian, Joshua Casey; Kundu, Ritu; Rajaby, Ramesh; Sung, Wing-Kin

doi:10.1038/s41592-023-02141-1

Article
Published: 08 March 2024

Nature Methods volume 21, pages 574–583 (2024)

Subjects

Genome assembly algorithms

Abstract

Draft genomes generated from Oxford Nanopore Technologies (ONT) long reads are known to have a relatively high error rate. Although existing genome polishers can improve their quality, the remaining error rate (including mismatches, indels and switching errors between paternal and maternal haplotypes) can still be significant. Here, we develop two polishers, hypo-short and hypo-hybrid, to address this issue. Hypo-short uses Illumina short reads to polish an ONT-based draft assembly, resulting in a high-quality assembly with low error rates and few switching errors. Expanding on this, hypo-hybrid incorporates ONT long reads to further refine the assembly into a diploid representation. Building on hypo-hybrid, we have created a diploid genome assembly pipeline called hypo-assembler. Hypo-assembler automates the generation of highly accurate, contiguous and nearly complete diploid assemblies using ONT long reads, Illumina short reads and, optionally, Hi-C reads. Notably, our solution also allows telomere-to-telomere diploid genomes to be produced with additional manual steps. As a proof of concept, we assembled a fully phased telomere-to-telomere diploid genome of HG00733, achieving a quality value exceeding 50.
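
To put the reported quality value in perspective, the following minimal sketch converts between a Phred-scaled quality value (QV) and a per-base error rate, assuming the standard definition QV = -10 log10(error rate). The function names are illustrative and are not taken from hypo-assembler or YAK; how the paper's QV estimates are actually computed is not shown in this preview.

```python
import math


def qv_to_error_rate(qv: float) -> float:
    """Per-base error probability implied by a Phred-scaled quality value."""
    return 10 ** (-qv / 10)


def error_rate_to_qv(error_rate: float) -> float:
    """Phred-scaled quality value implied by a per-base error probability."""
    if error_rate <= 0:
        raise ValueError("error_rate must be positive")
    return -10 * math.log10(error_rate)


def errors_per_100kbp(qv: float) -> float:
    """Expected number of erroneous bases in 100,000 bases at the given QV."""
    return qv_to_error_rate(qv) * 100_000


if __name__ == "__main__":
    # Under this definition, QV 50 corresponds to about 1 error per 100 kbp.
    for qv in (30, 40, 50):
        print(qv, errors_per_100kbp(qv))
```

Under this definition, each increase of 10 in QV corresponds to a tenfold reduction in the expected number of errors.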

**Fig. 2: The pipeline of hypo-assembler.**

**Fig. 3: Various statistics of our HG00733 assembly.**

Data availability

Code availability

Hypo-assembler, related pipelines, and evaluation scripts are available on GitHub: https://github.com/kensung-lab/hypo-assembler.

Acknowledgements

The authors received no specific funding for this work.

Author information

Authors and Affiliations

School of Computing, National University of Singapore, Singapore, Singapore
Joshua Casey Darian, Ritu Kundu & Wing-Kin Sung
Genome Institute of Singapore, Singapore, Singapore
Ramesh Rajaby & Wing-Kin Sung
Department of Chemical Pathology, The Chinese University of Hong Kong, Hong Kong, China
Wing-Kin Sung
JC STEM Laboratory of Computational Genomics, Li Ka Shing Institute of Health Sciences, The Chinese University of Hong Kong, Hong Kong, China
Wing-Kin Sung
Hong Kong Genome Institute, Hong Kong, China
Wing-Kin Sung

Authors

Joshua Casey Darian
Ritu Kundu
Ramesh Rajaby
Wing-Kin Sung

Contributions

J.C. performed the computations and experiments. R.R. developed variant callers that are crucial to the results. R.K. developed the early versions of the hypo polisher that became the starting point of the work. W.-K.S. supervised the work and encouraged the idea of solid k-mers. All authors discussed the results and contributed to the final manuscript.

Corresponding author

Correspondence to Wing-Kin Sung.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Methods thanks Kai Wang and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Primary Handling Editors: Lei Tang and Lin Tang, in collaboration with the Nature Methods team.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 CHM13 assembly benchmark.

Several quality measures for assemblies of CHM13. (a) The number of errors per 100 kbp and (b) YAK’s estimated quality value for various assemblies of CHM13.

Extended Data Fig. 2 HG002 assembly benchmark.

Several quality measures for diploid assemblies of HG002. (a) The number of errors per 100 kbp. (b) YAK’s estimated quality value. (c) k-mer mixture rate. (d) Switch error rate.
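
As a rough illustration of the switch error rate in panel (d), the sketch below uses a common textbook definition: given the parental origin of an assembled haplotype at consecutive heterozygous sites, a switch is counted whenever the origin changes between adjacent sites. The input encoding ("P" for paternal, "M" for maternal) and the function name are assumptions made for this example; the evaluation scripts used in the paper may define and compute the metric differently.

```python
from typing import Sequence


def switch_error_rate(origins: Sequence[str]) -> float:
    """Fraction of adjacent heterozygous-site pairs whose parental origin differs.

    `origins` lists, in genomic order, the parental origin ("P" or "M")
    of one assembled haplotype at each heterozygous site (hypothetical encoding).
    """
    if len(origins) < 2:
        return 0.0  # no adjacent pairs, so no switches can be counted
    # Count positions where the origin changes from one site to the next.
    switches = sum(1 for a, b in zip(origins, origins[1:]) if a != b)
    return switches / (len(origins) - 1)


if __name__ == "__main__":
    # Two switches (P->M and M->P) across five adjacent pairs.
    print(switch_error_rate("PPPMMP"))
```

A perfectly phased haplotype carries a single parental origin throughout and therefore scores 0 under this definition.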

Extended Data Table 1 CHM13 assembly statistics

Extended Data Table 2 CHM13 centromere evaluations

Extended Data Table 3 CHM13 segmental duplication evaluations

Extended Data Table 4 CHM13 BAC evaluation

Extended Data Table 5 HG002 segmental duplication evaluations

Extended Data Table 6 HG00733 centromere evaluations

Supplementary information

Supplementary Information

Supplementary Sections A–O, Figs. 1–29 and Tables 1–30.

Reporting Summary

Supplementary Dataset

Data used to generate the box plots in Supplementary Figs. 16 and 17.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Darian, J.C., Kundu, R., Rajaby, R. et al. Constructing telomere-to-telomere diploid genome by polishing haploid nanopore-based assembly. Nat Methods 21, 574–583 (2024). https://doi.org/10.1038/s41592-023-02141-1

Received: 08 September 2022
Accepted: 30 November 2023
Published: 08 March 2024
Issue Date: April 2024
DOI: https://doi.org/10.1038/s41592-023-02141-1
