Chromosome assembly of large and complex genomes using multiple references
- Mikhail Kolmogorov (1)
- Joel Armstrong (2)
- Brian J. Raney (2)
- Ian Streeter (3)
- Matthew Dunn (4)
- Fengtang Yang (4)
- Duncan Odom (4, 5)
- Paul Flicek (3, 4)
- Thomas M. Keane (3, 4, 6)
- David Thybert (3, 7)
- Benedict Paten (2)
- Son Pham (8)
- (1) Department of Computer Science and Engineering, University of California, San Diego, California 92093, USA
- (2) Center for Biomolecular Science and Engineering, University of California, Santa Cruz, California 95064, USA
- (3) European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton CB10 1SD, United Kingdom
- (4) Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton CB10 1SA, United Kingdom
- (5) Cancer Research UK Cambridge Institute, University of Cambridge, CB2 0RE Cambridge, United Kingdom
- (6) School of Life Sciences, University of Nottingham, Nottingham NG7 2NR, United Kingdom
- (7) Earlham Institute, Norwich Research Park, Norwich NR4 7UG, United Kingdom
- (8) BioTuring Incorporated, San Diego, California 92121, USA
Abstract
Despite the rapid development of sequencing technologies, the assembly of mammalian-scale genomes into complete chromosomes remains one of the most challenging problems in bioinformatics. To help address this difficulty, we developed Ragout 2, a reference-assisted assembly tool that works for large and complex genomes. Taking one or more target assemblies (generated by an NGS assembler) and one or more related reference genomes, Ragout 2 infers the evolutionary relationships between the genomes and builds the final assemblies using a genome rearrangement approach. Using Ragout 2, we transformed NGS assemblies of 16 laboratory mouse strains into sets of complete chromosomes, leaving <5% of sequence unlocalized per set. Various benchmarks, including PCR testing and realignment of long Pacific Biosciences (PacBio) reads, suggest only a small number of structural errors in the final assemblies, comparable with direct assembly approaches. We also applied Ragout 2 to the Mus caroli and Mus pahari genomes, which exhibit karyotype-scale variations compared with other genomes from the Muridae family. Chromosome painting maps confirmed most of the large-scale rearrangements that Ragout 2 detected. Finally, we applied Ragout 2 to improve recently published draft sequences of three ape genomes. Ragout 2 transformed three sets of contigs (generated using PacBio reads only) into chromosome-scale assemblies with accuracy comparable to that of the chromosome assemblies generated in the original study using BioNano maps, Hi-C, BAC clones, and FISH.
Footnotes
- [Supplemental material is available for this article.]
- Article published online before print. Article, supplemental material, and publication date are at http://www.genome.org/cgi/doi/10.1101/gr.236273.118.
- Freely available online through the Genome Research Open Access option.
- Received March 6, 2018.
- Accepted September 24, 2018.
This article, published in Genome Research, is available under a Creative Commons License (Attribution 4.0 International), as described at http://creativecommons.org/licenses/by/4.0/.