  • Brief Communication

Haplotype-resolved assembly of diploid genomes without parental data

Abstract

Routine haplotype-resolved genome assembly from single samples remains an unresolved problem. Here we describe an algorithm that combines PacBio HiFi reads and Hi-C chromatin interaction data to produce a haplotype-resolved assembly without the sequencing of parents. Applied to human and other vertebrate samples, our algorithm consistently outperforms existing single-sample assembly pipelines and generates assemblies of similar quality to the best pedigree-based assemblies.

Fig. 1: Haplotype-resolved assembly using Hi-C data.

Data availability

Human reference genome: GRCh38; CHM13 genome: GCA_009914755.3; HG002 HiFi reads: SRR10382244, SRR10382245, SRR10382248 and SRR10382249; HG002 Hi-C reads: ‘HG002.HiC_1*.fastq.gz’ from https://github.com/human-pangenomics/HG002_Data_Freeze_v1.0; HG002 parental short reads: from the same HG002 data freeze; HG00733 HiFi reads: ERX3831682; HG00733 Hi-C reads: SRR11347815; HG00733 parental short reads: ERR3241754 for HG00731 (father) and ERR3241755 for HG00732 (mother); European badger: PRJEB46293; sterlet: PRJEB19273; South Island takahe: https://vgp.github.io/genomeark/Porphyrio_hochstetteri/; and black rhinoceros: https://vgp.github.io/genomeark/Diceros_bicornis/. All evaluated assemblies are available at https://zenodo.org/record/5948487 and https://zenodo.org/record/5953248.

Code availability

Hifiasm is available at https://github.com/chhylp123/hifiasm.
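
As a minimal sketch of how an assembly combining HiFi and Hi-C reads might be launched, the Python helper below builds a hifiasm command line. The `-o`, `-t`, `--h1` and `--h2` options are those documented in the hifiasm repository; the helper name `hifiasm_hic_command` and all file names are hypothetical placeholders, not files from the data freeze.

```python
import shlex


def hifiasm_hic_command(prefix, hifi_reads, hic_r1, hic_r2, threads=32):
    """Build (but do not run) a hifiasm command for HiFi + Hi-C assembly.

    prefix     -- output prefix passed to -o
    hifi_reads -- list of HiFi read files (positional arguments)
    hic_r1/r2  -- paired-end Hi-C read files passed to --h1 / --h2
    """
    return [
        "hifiasm",
        "-o", prefix,
        "-t", str(threads),
        "--h1", hic_r1,
        "--h2", hic_r2,
        *hifi_reads,
    ]


# Hypothetical file names, for illustration only.
cmd = hifiasm_hic_command(
    "HG002.asm",
    ["HG002.hifi.fq.gz"],
    "HG002.hic_R1.fq.gz",
    "HG002.hic_R2.fq.gz",
)
print(shlex.join(cmd))
```

With hifiasm installed and real input files in place, the list could be passed to `subprocess.run(cmd, check=True)`; building it as a list avoids shell quoting problems with unusual file names.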


Acknowledgements

This study was supported by the US National Institutes of Health (grants R01HG010040, U01HG010961, U01HG010971 and U41HG010972 to H.L.) and Howard Hughes Medical Institute funds to E.D.J. We thank members of the Vertebrate Genome Lab at The Rockefeller University and the Sanger genome team at the Sanger Institute for help with producing data for the non-human vertebrate species. Presentation and analyses of the completed reference genome assemblies will be reported on separately. We also thank the Human Pangenome Reference Consortium for making the HiFi and Hi-C data of HG002 and HG00733 publicly available. K.-P.K. thanks the International Rhino Foundation for providing funding to generate the black rhinoceros assembly (grant no. R-2018-1). The South Island takahe genome was funded by Revive and Restore and the University of Otago. The South Island takahe reference genome was created in direct collaboration with the Takahē Recovery Team (Department of Conservation, New Zealand) and Ngāi Tahu, the Māori kaitiaki (‘guardians’) of this taonga (‘treasured’) species. Sequencing of the takahe genome was funded by Revive and Restore and the University of Otago. L.U. was supported by a Feodor Lynen Fellowship of the Alexander von Humboldt Foundation, the Revive and Restore Catalyst Science Fund and the University of Otago.

Author information

Contributions

H.C. and H.L. designed the algorithm, implemented hifiasm and drafted the manuscript. H.C. benchmarked hifiasm and other assemblers. E.D.J. and O.F. coordinated generation of the non-human vertebrate species data as part of the Vertebrate Genomes Project. K.-P.K. sponsored the black rhinoceros genome. L.U. obtained the South Island takahe samples, all necessary permits and funding for the South Island takahe reference genome. L.U. and N.G. sponsored the South Island takahe genome.

Corresponding author

Correspondence to Heng Li.

Ethics declarations

Competing interests

H.L. is a consultant of Integrated DNA Technologies and is on the Scientific Advisory Boards of Sentieon and Innozeen. The remaining authors declare no competing interests.

Peer review

Peer review information

Nature Biotechnology thanks Rayan Chikhi, David Rank, Riccardo Vicedomini and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Chromosome-level phasing results for hifiasm (Hi-C) human assemblies.

All contigs were aligned to the T2T CHM13 reference and the Y chromosome of GRCh38, and then the corresponding regions of contigs on the reference were determined based on the alignment results. For each chromosome, the top track and the bottom track indicate haplotype 1 contigs and haplotype 2 contigs, respectively. The phase density of contigs was calculated from the parental short reads. Gray bars indicate centromeric regions. (a) Chromosome-level phasing results for HG002 with 30X HiFi and 30X Hi-C. (b) Chromosome-level phasing results for HG00733 with 30X HiFi and 30X Hi-C.
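
The caption does not define phase density. As an illustration only, one plausible way to score a contig from parental data is to count parent-specific markers (for example, k-mers found in only one parent's short reads) that occur in the contig and take their signed balance. The function name `phase_score`, the contig names and the counts below are all hypothetical, and this is an assumed scoring scheme rather than the authors' definition.

```python
def phase_score(paternal_hits, maternal_hits):
    """Signed marker balance in [-1, 1] for one contig (assumed scheme).

    +1 means only paternal-specific markers were found, -1 only
    maternal-specific ones, and values near 0 suggest mixed phase.
    Returns None when the contig carries no informative markers.
    """
    total = paternal_hits + maternal_hits
    if total == 0:
        return None
    return (paternal_hits - maternal_hits) / total


# Hypothetical per-contig counts: (paternal-specific, maternal-specific).
contig_hits = {
    "hap1_contig_a": (980, 20),
    "hap2_contig_a": (15, 985),
    "hap1_contig_b": (0, 0),
}

scores = {name: phase_score(p, m) for name, (p, m) in contig_hits.items()}
for name, score in scores.items():
    print(name, score)
```

Plotting such scores along the reference coordinates of each contig would give a per-chromosome view comparable in spirit to the two tracks described in the caption.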

Supplementary information

Supplementary Information

Supplementary Section 1, Tables 1–3 and Fig. 1.

Reporting Summary.

About this article

Cite this article

Cheng, H., Jarvis, E.D., Fedrigo, O. et al. Haplotype-resolved assembly of diploid genomes without parental data. Nat Biotechnol 40, 1332–1335 (2022). https://doi.org/10.1038/s41587-022-01261-x

  • DOI: https://doi.org/10.1038/s41587-022-01261-x
