Binning metagenomic contigs by coverage and composition

Alneberg, Johannes; Bjarnason, Brynjar Smári; de Bruijn, Ino; Schirmer, Melanie; Quick, Joshua; Ijaz, Umer Z; Lahti, Leo; Loman, Nicholas J; Andersson, Anders F; Quince, Christopher

doi:10.1038/nmeth.3103

Brief Communication
Published: 14 September 2014

Johannes Alneberg^1,na1,
Brynjar Smári Bjarnason^1,na1,
Ino de Bruijn^1,2,
Melanie Schirmer^3,
Joshua Quick^4,5,
Umer Z Ijaz^3,
Leo Lahti^6,7 (ORCID: orcid.org/0000-0001-5537-637X),
Nicholas J Loman^4,
Anders F Andersson^1,na2 &
Christopher Quince^3,na2

Nature Methods volume 11, pages 1144–1146 (2014)


Abstract

Shotgun sequencing enables the reconstruction of genomes from complex microbial communities, but because assembly does not reconstruct entire genomes, it is necessary to bin genome fragments. Here we present CONCOCT, a new algorithm that combines sequence composition and coverage across multiple samples, to automatically cluster contigs into genomes. We demonstrate high recall and precision on artificial as well as real human gut metagenome data sets.
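As an illustration of what "combining sequence composition and coverage across multiple samples" can mean in practice, the following Python sketch builds one joint feature vector per contig. It is not the CONCOCT implementation: this preview does not describe the feature construction, so the tetramer length, the pseudocounts, the log transform and all function names are assumptions chosen for illustration. The clustering step itself is not shown.

```python
from collections import Counter
from itertools import product
from math import log

K = 4  # k-mer length; tetramers are an illustrative assumption
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]


def composition_vector(sequence):
    """Return tetramer frequencies (pseudocount of 1) for one contig."""
    sequence = sequence.upper()
    counts = Counter(sequence[i:i + K] for i in range(len(sequence) - K + 1))
    # The pseudocount keeps every frequency positive so logs are defined.
    # K-mers containing other letters (e.g. N) are simply ignored.
    raw = [counts.get(kmer, 0) + 1 for kmer in KMERS]
    total = sum(raw)
    return [c / total for c in raw]


def coverage_vector(mean_coverage_per_sample, pseudocount=1e-3):
    """Normalise one contig's per-sample coverage to a profile summing to 1."""
    raw = [c + pseudocount for c in mean_coverage_per_sample]
    total = sum(raw)
    return [c / total for c in raw]


def combined_profile(sequence, mean_coverage_per_sample):
    """Join log-transformed coverage and composition into one feature vector."""
    cov = coverage_vector(mean_coverage_per_sample)
    comp = composition_vector(sequence)
    return [log(x) for x in cov] + [log(x) for x in comp]


# Hypothetical contig observed in three samples.
profile = combined_profile("ACGTACGTTTGACCA", [12.0, 0.0, 3.5])
print(len(profile))  # 259: 3 samples + 256 tetramers
```

Contigs from the same genome are expected to share both a composition signature and a coverage pattern across samples, so vectors like these could be passed to any clustering method. Reverse-complement k-mers are not merged here, which a real pipeline would probably want to do.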


**Figure 1: Performance of CONCOCT and other unsupervised binning methods on a synthetic community.**

**Figure 2: Analysis of the 2011 STEC *Escherichia coli* O104:H4 outbreak.**

References

Tyson, G.W. et al. Nature 428, 37–43 (2004).
Herlemann, D.P. et al. MBio 4, e00569–e00512 (2013).
Sharon, I. & Banfield, J.F. Science 342, 1057–1058 (2013).
Kislyuk, A., Bhatnagar, S., Dushoff, J. & Weitz, J.S. BMC Bioinformatics 10, 316 (2009).
Strous, M., Kraft, B., Bisdorf, R. & Tegetmeyer, H. Front. Microbiol. 3, 410 (2012).
Chatterji, S., Yamazaki, I., Bai, Z. & Eisen, J.A. in Res. Comput. Mol. Biol. (eds. Vingron, M. & Wong, L.) 17–28 (Springer, 2008).
Kelley, D.R. & Salzberg, S.L. BMC Bioinformatics 11, 544 (2010).
Sharon, I. et al. Genome Res. 23, 111–120 (2013).
Albertsen, M. et al. Nat. Biotechnol. 31, 533–538 (2013).
Corduneanu, A. & Bishop, C.M. in Artif. Intell. Stat. 2001 (eds. Jaakkola, T. & Richardson, T.) 27–34 (Morgan Kaufmann, 2001).
Human Microbiome Project Consortium. Nature 486, 207–214 (2012).
Sandberg, R. et al. Genome Res. 11, 1404–1409 (2001).
Dick, G.J. et al. Genome Biol. 10, R85 (2009).
Loman, N.J. et al. J. Am. Med. Assoc. 309, 1502–1510 (2013).
Asahara, T. et al. Infect. Immun. 72, 2240–2247 (2004).
Pell, J. et al. Proc. Natl. Acad. Sci. USA 109, 13272–13277 (2012).
Boisvert, S., Raymond, F., Godzaridis, E., Laviolette, F. & Corbeil, J. Genome Biol. 13, R122 (2012).
Tatusov, R.L., Koonin, E.V. & Lipman, D.J. Science 278, 631–637 (1997).
Ciccarelli, F.D. et al. Science 311, 1283–1287 (2006).


Acknowledgements

This research arose out of a workshop funded through the COST project ES1103 and hosted by P. Fernandes at the Instituto Gulbenkian de Ciência. This work was funded by grants (to A.F.A.) from the Swedish Research Councils VR (grant 2011-5689), FORMAS (grant 2009-1174) and EC BONUS project BLUEPRINT. C.Q. is funded by an EPSRC Career Acceleration Fellowship—EP/H003851/1. M.S. is supported by Unilever R&D Port Sunlight, Bebington, UK. L.L. is supported by the Academy of Finland (grant 256950), N.L. by a UK Medical Research Council Special Training Fellowship in Biomedical Informatics and J.Q. by the UK National Institute for Health Research (NIHR) Centre for Surgical Reconstruction and Microbiology. This paper presents independent research funded by the NIHR Surgical Reconstruction and Microbiology Research Centre (partnership between University Hospitals Birmingham National Health Service (NHS) Foundation Trust, the University of Birmingham and the Royal Centre for Defence Medicine). The views expressed are those of the author(s) and not necessarily those of the NHS, the NIHR or the Department of Health.

Author information

Johannes Alneberg and Brynjar Smári Bjarnason: These authors contributed equally to this work.
Anders F Andersson and Christopher Quince: These authors jointly directed this work.

Authors and Affiliations

Division of Gene Technology, KTH Royal Institute of Technology, Science for Life Laboratory, School of Biotechnology, Stockholm, Sweden
Johannes Alneberg, Brynjar Smári Bjarnason, Ino de Bruijn & Anders F Andersson
Bioinformatics Infrastructure for Life Sciences (BILS), Stockholm, Sweden
Ino de Bruijn
School of Engineering, University of Glasgow, Glasgow, UK
Melanie Schirmer, Umer Z Ijaz & Christopher Quince
Institute of Microbiology and Infection, University of Birmingham, Birmingham, UK
Joshua Quick & Nicholas J Loman
National Institute for Health Research (NIHR) Surgical Reconstruction and Microbiology Research Centre, University of Birmingham, UK
Joshua Quick
Department of Veterinary Biosciences, University of Helsinki, Helsinki, Finland
Leo Lahti
Laboratory of Microbiology, Wageningen University, Wageningen, the Netherlands
Leo Lahti

Contributions

C.Q. developed the core algorithm and cluster validation metrics and performed analyses. A.F.A. assisted with the analyses, developed the SCG validation and contributed to algorithm development. J.A. and B.S.B. developed the CONCOCT software pipeline and contributed to algorithm development. I.d.B. performed assemblies and mappings. M.S. generated simulation data. J.Q. performed E. coli mappings. U.Z.I. assisted with SCG validation and production of graphics. L.L. helped with graphics and algorithm design. N.J.L. performed E. coli analysis and contributed to algorithm development. All authors contributed to the writing of the manuscript.

Corresponding authors

Correspondence to Anders F Andersson or Christopher Quince.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Integrated supplementary information

Supplementary Figure 1 Relative abundances in the synthetic mock communities.

(A) Heat map of the relative abundances of the 101 genomes in the synthetic species mock community distributed across the 96 HMP samples (see Online Methods). (B) Heat map of the relative abundances of the 20 genomes in the synthetic strain mock community distributed across the 64 HMP samples (see Online Methods). The samples have been positioned according to similarity. The relative abundances have been square root transformed to emphasise rare species and the inset scale should be interpreted accordingly.
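A small Python sketch of the square root transform mentioned in this caption, using made-up abundances, shows why it emphasises rare species and how a value read off the scale maps back to a relative abundance. The function names and numbers are hypothetical.

```python
from math import sqrt


def sqrt_transform(abundances):
    """Square-root transform a matrix (list of rows) of relative abundances."""
    return [[sqrt(value) for value in row] for row in abundances]


def scale_to_abundance(scale_value):
    """Convert a value read from the transformed scale back to an abundance."""
    return scale_value ** 2


# Hypothetical relative abundances of three genomes in one sample.
sample = [[0.81, 0.18, 0.01]]
transformed = sqrt_transform(sample)

# Ratio between the most and least abundant genome before and after.
ratio_before = sample[0][0] / sample[0][2]
ratio_after = transformed[0][0] / transformed[0][2]
```

In this toy example the ratio between the most and least abundant genome shrinks from roughly 81 to roughly 9, so the rare genome occupies a visible part of the colour range.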

Supplementary Figure 2 Two-dimensional visualisation of the synthetic species mock contigs labeled by species.

The 37,564 labelled synthetic species mock contig fragments of length > 1000bp plotted in the first two PCA dimensions with the 101 different species discriminated.

Supplementary Figure 3 Two-dimensional visualisation of the synthetic strain mock contigs labeled by genome.

The 9,109 labelled synthetic strain community contig fragments of length > 1000bp plotted in the first two PCA dimensions with the 20 different genomes discriminated.

Supplementary Figure 4 Two-dimensional visualisation of the synthetic species mock contigs labeled by cluster.

The 37,627 synthetic species mock contig fragments of length > 1000bp plotted in the first two PCA dimensions with the 101 different clusters discriminated.

Supplementary Figure 5 Confusion matrix for the synthetic species mock contigs.

A heatmap visualisation of the confusion matrix comparing the CONCOCT contig clusterings for the optimal 101 cluster solution with the species assignments for the synthetic species mock contig fragments. Each column is a cluster named Dk where k is the cluster index. The rows correspond to the species and the intensities reflect the proportion of each cluster deriving from each species. The intensities are weighted by contig length.
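The length-weighted proportions described in this caption can be sketched in Python as follows. This is an illustrative reconstruction from the caption only, not the authors' code; the input layout (cluster, species, length in bp) and the function name are assumptions.

```python
from collections import defaultdict


def weighted_confusion(contigs):
    """Return {cluster: {species: proportion of the cluster's base pairs}}.

    contigs is an iterable of (cluster, species, length_bp) tuples.
    """
    totals = defaultdict(lambda: defaultdict(int))
    for cluster, species, length in contigs:
        # Weight each contig by its length rather than counting it once.
        totals[cluster][species] += length

    proportions = {}
    for cluster, by_species in totals.items():
        cluster_bp = sum(by_species.values())
        proportions[cluster] = {
            species: bp / cluster_bp for species, bp in by_species.items()
        }
    return proportions


# Hypothetical labelled contigs.
contigs = [("D1", "A", 3000), ("D1", "B", 1000), ("D2", "B", 5000)]
result = weighted_confusion(contigs)
print(result["D1"]["A"])  # 0.75
```

Each inner dictionary corresponds to one column of the heatmap and sums to 1, so a cluster drawn from a single species appears as one saturated cell.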

Supplementary Figure 6 Impact of contig length on error probability.

Predicted error probabilities from a logistic regression of contig misassignment as a function of contig fragment length (p-value < 2.0e-16) for the synthetic species community. Each cluster was assigned to the species from which the majority of contigs, weighted by length, derived. A contig was defined as misassigned if it derived from a different species to its cluster. The solid line is the logistic regression predicted error probability, with standard errors shown as grey shaded regions. The points are average error rates calculated over length bins of size 1000bp; their size is proportional to the log of the number of contigs in that bin.
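The assignment rule and the binned error rates behind the plotted points can be sketched in Python as below. The logistic regression fit itself is not shown, and the data layout, function names and example contigs are hypothetical.

```python
from collections import defaultdict


def majority_species(contigs):
    """Map each cluster to the species contributing the most base pairs."""
    bp = defaultdict(lambda: defaultdict(int))
    for cluster, species, length in contigs:
        bp[cluster][species] += length
    return {cluster: max(by_s, key=by_s.get) for cluster, by_s in bp.items()}


def binned_error_rates(contigs, bin_size=1000):
    """Return {bin_start: (error_rate, n_contigs)} for each length bin."""
    assigned = majority_species(contigs)
    errors = defaultdict(int)
    counts = defaultdict(int)
    for cluster, species, length in contigs:
        start = (length // bin_size) * bin_size
        counts[start] += 1
        # A contig is misassigned if its species differs from its cluster's.
        if species != assigned[cluster]:
            errors[start] += 1
    return {s: (errors[s] / counts[s], counts[s]) for s in sorted(counts)}


# Hypothetical (cluster, species, length_bp) records.
contigs = [
    ("D1", "A", 1200), ("D1", "A", 5400), ("D1", "B", 1500),
    ("D2", "B", 1800), ("D2", "B", 5100),
]
rates = binned_error_rates(contigs)
```

In this toy data set the only misassigned contig is the 1500bp fragment from species B in cluster D1, so the 1000 to 1999bp bin has an error rate of one in three and the 5000 to 5999bp bin has none.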

Supplementary Figure 7 Frequency of single-copy core genes in the synthetic species mock clusters.

A heatmap visualisation of the number of single-copy core genes in each cluster for the optimal 101 cluster solution generated by CONCOCT applied to the synthetic species mock community.
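A table like the one behind this heatmap can be tallied with a few lines of Python. The sketch below assumes each gene hit has already been assigned to a cluster; the gene identifiers, the input format and the function names are placeholders rather than anything taken from the CONCOCT software.

```python
from collections import Counter, defaultdict


def scg_counts(hits):
    """Tally single-copy core gene hits per cluster.

    hits is an iterable of (cluster, gene_id) pairs, one per gene hit.
    """
    table = defaultdict(Counter)
    for cluster, gene in hits:
        table[cluster][gene] += 1
    return table


def summarise(table, all_genes):
    """Per cluster, count genes seen exactly once and genes seen repeatedly."""
    summary = {}
    for cluster, counts in table.items():
        single = sum(1 for gene in all_genes if counts[gene] == 1)
        multiple = sum(1 for gene in all_genes if counts[gene] > 1)
        summary[cluster] = (single, multiple)
    return summary


# Hypothetical gene identifiers and hits.
all_genes = ["scgA", "scgB", "scgC"]
hits = [
    ("D1", "scgA"), ("D1", "scgB"), ("D1", "scgC"),
    ("D2", "scgA"), ("D2", "scgA"), ("D2", "scgB"),
]
summary = summarise(scg_counts(hits), all_genes)
print(summary["D1"], summary["D2"])  # (3, 0) (1, 1)
```

A cluster in which every core gene appears once, like D1 here, looks like a single complete genome, whereas repeated or missing genes, as in D2, hint at merged or incomplete bins.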

Supplementary Figure 8 Two-dimensional visualisation of the synthetic strain mock contigs labeled by cluster.

The 9,411 synthetic strain mock contig fragments of length > 1000bp plotted in the first two PCA dimensions with the 21 different clusters discriminated.

Supplementary Figure 9 Validation of the synthetic strain mock contig clusterings.

Top panel: a heatmap visualisation of the number of single-copy core genes in each cluster for the optimal 21 cluster solution generated by CONCOCT applied to the synthetic strain mock community. Bottom panel: a heatmap visualisation of the confusion matrix comparing the CONCOCT contig clusterings with the genome assignments for the synthetic strain mock contig fragments. Each row is a cluster and each column a species. Intensities reflect the proportion of each cluster, weighted by contig length, deriving from each species.

Supplementary Figure 10 Two-dimensional visualisation of the Sharon (2013) contigs labeled by cluster.

The 5,571 Sharon (2013) contig fragments of length > 1000bp plotted in the first two PCA dimensions with the 34 different clusters discriminated.

Supplementary Figure 11 Validation of the Sharon (2013) contig clusterings.

Top panel: A heatmap visualisation of the number of single-copy core genes in each cluster for the optimal 34 cluster solution generated by CONCOCT applied to the Sharon (2013) data. Only clusters with at least one SCG are shown. Bottom panel: A heatmap visualisation of the confusion matrix comparing the CONCOCT contig clusterings with the species assignments from TAXAassign. Each row is a cluster and the intensities reflect the proportion of each cluster deriving from each species. The intensities are weighted by contig length. Most clusters probably derive from the majority species assignment, except cluster 33, which is probably a novel Peptoniphilus species as determined by direct BLAST searches against the NCBI database, and cluster 11, an unknown Staphylococcus species.

Supplementary Figure 12 SCG plots for four different composition based clustering algorithms applied to the Sharon (2013) data.

These were A) CompostBin⁴, B) LikelyBin⁵, C) MetaWatt⁶, and D) SCIMM⁷.

Supplementary Figure 13 Confusion matrix for the Escherichia coli (STEC) O104:H4 outbreak contigs.

A heatmap visualisation of the confusion matrix comparing the CONCOCT contig clusterings for the optimal 297 cluster solution of the Shiga-toxigenic Escherichia coli (STEC) O104:H4 outbreak. Each column is a cluster named Dk where k is the cluster index. The rows correspond to the species assignments, and the intensities reflect the proportion of each cluster deriving from each species. Only clusters and taxa with greater than 50kb of labeled representatives are shown. The intensities are weighted by contig length.

Supplementary Figure 14 Frequency of single-copy core genes in the Escherichia coli (STEC) O104:H4 outbreak clusters.

A heatmap visualisation of the number of single-copy core genes in each cluster for the optimal 297 cluster solution of CONCOCT applied to the Shiga-toxigenic Escherichia coli (STEC) O104:H4 outbreak. Only clusters with a total average coverage, summed across samples, of greater than 50.0 are shown.

Supplementary Figure 15 Mapping of contigs to the Escherichia coli (STEC) O104:H4 outbreak genome.

The mapping of contig fragments to the known Escherichia coli (STEC) O104:H4 outbreak genome with cluster discriminated by colour and total coverage across all samples shown on the y-axis.

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–15, Supplementary Tables 1–8 and Supplementary Note (PDF 12624 kb)

Supplementary Software

CONCOCT version 0.3.3 software (ZIP 5301 kb)

Source data

Source data to Fig. 1

Source data to Fig. 2

About this article

Cite this article

Alneberg, J., Bjarnason, B., de Bruijn, I. et al. Binning metagenomic contigs by coverage and composition. Nat Methods 11, 1144–1146 (2014). https://doi.org/10.1038/nmeth.3103


Received: 02 December 2013
Accepted: 18 August 2014
Published: 14 September 2014
Issue Date: November 2014
DOI: https://doi.org/10.1038/nmeth.3103
