  • Brief Communication

Binning metagenomic contigs by coverage and composition

Abstract

Shotgun sequencing enables the reconstruction of genomes from complex microbial communities, but because assembly does not reconstruct entire genomes, it is necessary to bin genome fragments. Here we present CONCOCT, a new algorithm that combines sequence composition and coverage across multiple samples, to automatically cluster contigs into genomes. We demonstrate high recall and precision on artificial as well as real human gut metagenome data sets.
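As a rough illustration of what "combining sequence composition and coverage" can mean in practice, the sketch below builds one feature vector per contig by concatenating k-mer frequencies with per-sample coverage values. The function names, the use of k-mers as the composition measure, and the choice of k are assumptions made for this example and are not taken from the abstract; the clustering step itself is not shown.

```python
from itertools import product


def kmer_frequencies(sequence, k=4):
    """Return normalised k-mer frequencies over the ACGT alphabet.

    Using k-mers (and k=4 by default) is an assumption of this sketch.
    """
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    seq = sequence.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:  # skip windows containing N or other ambiguity codes
            counts[kmer] += 1
    total = sum(counts.values())
    return [counts[m] / total if total else 0.0 for m in kmers]


def contig_features(sequence, coverage_per_sample, k=4):
    """Concatenate composition and per-sample coverage into one feature vector."""
    return kmer_frequencies(sequence, k) + list(coverage_per_sample)


# Hypothetical contig with coverage measured in three samples.
features = contig_features("ACGTACGTAC", [12.0, 0.5, 3.2], k=2)
```

Contigs from the same genome would be expected to have similar composition and to rise and fall in coverage together across samples, which is why both kinds of feature sit in one vector here.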

Figure 1: Performance of CONCOCT and other unsupervised binning methods on a synthetic community.
Figure 2: Analysis of the 2011 STEC Escherichia coli O104:H4 outbreak.

References

  1. Tyson, G.W. et al. Nature 428, 37–43 (2004).

  2. Herlemann, D.P. et al. MBio 4, e00569–e00512 (2013).

  3. Sharon, I. & Banfield, J.F. Science 342, 1057–1058 (2013).

  4. Kislyuk, A., Bhatnagar, S., Dushoff, J. & Weitz, J.S. BMC Bioinformatics 10, 316 (2009).

  5. Strous, M., Kraft, B., Bisdorf, R. & Tegetmeyer, H. Front. Microbiol. 3, 410 (2012).

  6. Chatterji, S., Yamazaki, I., Bai, Z. & Eisen, J.A. in Res. Comput. Mol. Biol. (eds. Vingron, M. & Wong, L.) 17–28 (Springer, 2008).

  7. Kelley, D.R. & Salzberg, S.L. BMC Bioinformatics 11, 544 (2010).

  8. Sharon, I. et al. Genome Res. 23, 111–120 (2013).

  9. Albertsen, M. et al. Nat. Biotechnol. 31, 533–538 (2013).

  10. Corduneanu, A. & Bishop, C.M. in Artif. Intell. Stat. 2001 (eds. Jaakkola, T. & Richardson, T.) 27–34 (Morgan Kaufmann, 2001).

  11. Human Microbiome Project Consortium. Nature 486, 207–214 (2012).

  12. Sandberg, R. et al. Genome Res. 11, 1404–1409 (2001).

  13. Dick, G.J. et al. Genome Biol. 10, R85 (2009).

  14. Loman, N.J. et al. J. Am. Med. Assoc. 309, 1502–1510 (2013).

  15. Asahara, T. et al. Infect. Immun. 72, 2240–2247 (2004).

  16. Pell, J. et al. Proc. Natl. Acad. Sci. USA 109, 13272–13277 (2012).

  17. Boisvert, S., Raymond, F., Godzaridis, E., Laviolette, F. & Corbeil, J. Genome Biol. 13, R122 (2012).

  18. Tatusov, R.L., Koonin, E.V. & Lipman, D.J. Science 278, 631–637 (1997).

  19. Ciccarelli, F.D. et al. Science 311, 1283–1287 (2006).

Acknowledgements

This research arose out of a workshop funded through the COST project ES1103 and hosted by P. Fernandes at the Instituto Gulbenkian de Ciência. This work was funded by grants (to A.F.A.) from the Swedish Research Councils VR (grant 2011-5689), FORMAS (grant 2009-1174) and EC BONUS project BLUEPRINT. C.Q. is funded by an EPSRC Career Acceleration Fellowship—EP/H003851/1. M.S. is supported by Unilever R&D Port Sunlight, Bebington, UK. L.L. is supported by the Academy of Finland (grant 256950), N.L. by a UK Medical Research Council Special Training Fellowship in Biomedical Informatics and J.Q. by the UK National Institute for Health Research (NIHR) Centre for Surgical Reconstruction and Microbiology. This paper presents independent research funded by the NIHR Surgical Reconstruction and Microbiology Research Centre (partnership between University Hospitals Birmingham National Health Service (NHS) Foundation Trust, the University of Birmingham and the Royal Centre for Defence Medicine). The views expressed are those of the author(s) and not necessarily those of the NHS, the NIHR or the Department of Health.

Author information

Contributions

C.Q. developed the core algorithm and cluster validation metrics and performed analyses. A.F.A. assisted with the analyses, developed the SCG validation and contributed to algorithm development. J.A. and B.S.B. developed the CONCOCT software pipeline and contributed to algorithm development. I.d.B. performed assemblies and mappings. M.S. generated simulation data. J.Q. performed E. coli mappings. U.Z.I. assisted with SCG validation and production of graphics. L.L. helped with graphics and algorithm design. N.J.L. performed E. coli analysis and contributed to algorithm development. All authors contributed to the writing of the manuscript.

Corresponding authors

Correspondence to Anders F Andersson or Christopher Quince.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Integrated supplementary information

Supplementary Figure 1 Relative abundances in the synthetic mock communities.

A) Heat map of the relative abundances of the 101 genomes in the synthetic species mock community distributed across the 96 HMP samples (see Online Methods). B) Heat map of the relative abundances of the 20 genomes in the synthetic strain mock community distributed across the 64 HMP samples (see Online Methods). The samples have been positioned according to similarity. The relative abundances have been square-root transformed to emphasise rare species, and the inset scale should be interpreted accordingly.

Supplementary Figure 2 Two-dimensional visualisation of the synthetic species mock contigs labeled by species.

The 37,564 labelled synthetic species mock contig fragments of length > 1000bp plotted in the first two PCA dimensions with the 101 different species discriminated.

Supplementary Figure 3 Two-dimensional visualisation of the synthetic strain mock contigs labeled by genome.

The 9,109 labelled synthetic strain community contig fragments of length > 1000bp plotted in the first two PCA dimensions with the 20 different genomes discriminated.

Supplementary Figure 4 Two-dimensional visualisation of the synthetic species mock contigs labeled by cluster.

The 37,627 synthetic species mock contig fragments of length > 1000bp plotted in the first two PCA dimensions with the 101 different clusters discriminated.

Supplementary Figure 5 Confusion matrix for the synthetic species mock contigs.

A heatmap visualisation of the confusion matrix comparing the CONCOCT contig clusterings for the optimal 101 cluster solution with the species assignments for the synthetic species mock contig fragments. Each column is a cluster named Dk where k is the cluster index. The rows correspond to the species and the intensities reflect the proportion of each cluster deriving from each species. The intensities are weighted by contig length.
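The length-weighted proportions described in this caption can be sketched as follows. This is a minimal illustration under assumed inputs, not the code used to produce the figure: the function name and the `(cluster, species, length)` tuple layout are hypothetical.

```python
from collections import defaultdict


def weighted_confusion(contigs):
    """Compute length-weighted cluster-by-species proportions.

    contigs: iterable of (cluster, species, length_bp) tuples (assumed layout).
    Returns {cluster: {species: share of the cluster's total length}}.
    """
    totals = defaultdict(lambda: defaultdict(int))
    for cluster, species, length in contigs:
        totals[cluster][species] += length

    proportions = {}
    for cluster, by_species in totals.items():
        cluster_length = sum(by_species.values())
        proportions[cluster] = {
            species: length / cluster_length
            for species, length in by_species.items()
        }
    return proportions


# Hypothetical contigs: cluster D1 is mostly sp_A, cluster D2 is purely sp_B.
example = [("D1", "sp_A", 3000), ("D1", "sp_B", 1000), ("D2", "sp_B", 2000)]
matrix = weighted_confusion(example)
```

Here `matrix["D1"]` holds the shares 0.75 for `sp_A` and 0.25 for `sp_B`, so each cluster's entries sum to one, matching a heatmap in which intensity is the proportion of a cluster deriving from each species.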

Supplementary Figure 6 Impact of contig length on error probability.

Predicted error probabilities from a logistic regression of contig misassignment as a function of contig fragment length (p-value < 2.0e-16) for the synthetic species community. Each cluster was assigned to the species from which the majority of contigs, weighted by length, derived. A contig was defined as misassigned if it derived from a different species to its cluster. The solid line is the logistic regression predicted error probability, together with standard errors as grey shaded portions. The points are average error rates calculated over length bins of size 1000bp; their size is proportional to the log of the number of contigs in that bin.
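The cluster labelling and per-bin error rates described in this caption can be sketched as below. The function names and input layout are hypothetical, ties between species are broken arbitrarily (an assumption of this sketch), and the logistic regression itself is not reproduced.

```python
from collections import defaultdict


def majority_species(contigs):
    """Assign each cluster to the species with the largest length-weighted share.

    contigs: iterable of (cluster, species, length_bp) tuples (assumed layout).
    """
    weight = defaultdict(lambda: defaultdict(int))
    for cluster, species, length in contigs:
        weight[cluster][species] += length
    return {c: max(by_sp, key=by_sp.get) for c, by_sp in weight.items()}


def binned_error_rates(contigs, bin_size=1000):
    """Average misassignment rate per length bin, keyed by the bin's lower edge."""
    assigned = majority_species(contigs)
    flags = defaultdict(list)
    for cluster, species, length in contigs:
        # A contig is misassigned if its species differs from its cluster's species.
        flags[length // bin_size].append(species != assigned[cluster])
    return {b * bin_size: sum(f) / len(f) for b, f in sorted(flags.items())}


# Hypothetical contigs: one short sp_B fragment falls into the sp_A cluster D1.
example = [
    ("D1", "sp_A", 5200),
    ("D1", "sp_B", 1400),
    ("D1", "sp_A", 1800),
    ("D2", "sp_B", 1100),
]
labels = majority_species(example)
rates = binned_error_rates(example)
```

In this toy input the only misassigned contig is the 1,400 bp fragment, so the error rate is non-zero only in the 1,000 to 1,999 bp bin.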

Supplementary Figure 7 Frequency of single-copy core genes in the synthetic species mock clusters.

A heatmap visualisation of the number of single-copy core genes in each cluster for the optimal 101 cluster solution generated by CONCOCT applied to the synthetic species mock community.
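The per-cluster counts behind such a heatmap can be tallied as in the sketch below. The input layout, the function name, and the gene identifiers are placeholders chosen for this example, not the study's actual gene set or pipeline.

```python
from collections import Counter, defaultdict


def scg_counts(contig_cluster, scg_hits):
    """Count single-copy core gene (SCG) hits per cluster.

    contig_cluster: {contig_id: cluster_id} (assumed layout).
    scg_hits: iterable of (contig_id, gene_id) pairs, one per detected gene.
    Returns {cluster_id: Counter({gene_id: number of copies})}.
    """
    counts = defaultdict(Counter)
    for contig, gene in scg_hits:
        cluster = contig_cluster.get(contig)
        if cluster is not None:  # ignore hits on contigs that were not clustered
            counts[cluster][gene] += 1
    return counts


# Hypothetical data: gene_1 appears twice in cluster D1 and once in D2.
contig_cluster = {"c1": "D1", "c2": "D1", "c3": "D2"}
scg_hits = [("c1", "gene_1"), ("c2", "gene_1"), ("c2", "gene_2"), ("c3", "gene_1")]
counts = scg_counts(contig_cluster, scg_hits)
```

Because these genes are expected once per genome, a cluster where most genes have a count of one is consistent with a single genome, while counts above one (as for `gene_1` in `D1` here) may point to a merged cluster.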

Supplementary Figure 8 Two-dimensional visualisation of the synthetic strain mock contigs labeled by cluster.

The 9,411 synthetic strain mock contig fragments of length > 1000bp plotted in the first two PCA dimensions with the 21 different clusters discriminated.

Supplementary Figure 9 Validation of the synthetic strain mock contig clusterings.

Top panel: a heatmap visualisation of the number of single-copy core genes in each cluster for the optimal 21 cluster solution generated by CONCOCT applied to the synthetic strain mock community. Bottom panel: a heatmap visualisation of the confusion matrix comparing the CONCOCT contig clusterings with the genome assignments for the synthetic strain mock contig fragments. Each row is a cluster and each column a species. Intensities reflect the proportion of each cluster, weighted by contig length, deriving from each species.

Supplementary Figure 10 Two-dimensional visualisation of the Sharon (2013) contigs labeled by cluster.

The 5,571 Sharon (2013) contig fragments of length > 1000bp plotted in the first two PCA dimensions with the 34 different clusters discriminated.

Supplementary Figure 11 Validation of the Sharon (2013) contig clusterings.

Top panel: A heatmap visualisation of the number of single-copy core genes in each cluster for the optimal 34 cluster solution generated by CONCOCT applied to the Sharon (2013) data. Only clusters with at least one SCG are shown. Bottom panel: A heatmap visualisation of the confusion matrix comparing the CONCOCT contig clusterings with the species assignments from TAXAassign. Each row is a cluster and the intensities reflect the proportion of each cluster deriving from each species. The intensities are weighted by contig length. Most clusters probably derive from the majority species assignment. The exceptions are cluster 33, which is probably a novel Peptoniphilus species as determined by direct BLAST searches against NCBI, and cluster 11, which is an unknown Staphylococcus species.

Supplementary Figure 12 SCG plots for four different composition based clustering algorithms applied to the Sharon (2013) data.

These were A) CompostBin (ref. 4), B) LikelyBin (ref. 5), C) MetaWatt (ref. 6), and D) SCIMM (ref. 7).

Supplementary Figure 13 Confusion matrix for the Escherichia coli (STEC) O104:H4 outbreak contigs.

A heatmap visualisation of the confusion matrix comparing the CONCOCT contig clusterings for the optimal 297 cluster solution of the shiga-toxigenic Escherichia coli (STEC) O104:H4 outbreak. Each column is a cluster named Dk where k is the cluster index. The rows correspond to the species assignments, and the intensities reflect the proportion of each cluster deriving from each species. Only clusters and taxa with greater than 50kb of labeled representatives are shown. The intensities are weighted by contig length.

Supplementary Figure 14 Frequency of single-copy core genes in the Escherichia coli (STEC) O104:H4 outbreak clusters.

A heatmap visualisation of the number of single-copy core genes in each cluster for the optimal 297 cluster solution of CONCOCT applied to the shiga-toxigenic Escherichia coli (STEC) O104:H4 outbreak. Only clusters with a total average coverage summed across samples of greater than 50.0 are shown.
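The display filter in this caption, a cluster's average coverage summed across samples exceeding 50.0, can be sketched as below. The function name and the dictionary layout are assumptions made for illustration.

```python
def clusters_above_coverage(mean_coverage, threshold=50.0):
    """Return clusters whose coverage summed across samples exceeds the threshold.

    mean_coverage: {cluster_id: [average coverage in each sample]} (assumed layout).
    """
    return sorted(
        cluster
        for cluster, per_sample in mean_coverage.items()
        if sum(per_sample) > threshold
    )


# Hypothetical coverage profiles over three samples.
example = {
    "D1": [30.0, 25.5, 4.0],   # total 59.5, kept
    "D2": [10.0, 12.0, 8.0],   # total 30.0, dropped
    "D3": [50.0, 0.0, 0.0],    # total exactly 50.0, dropped (strictly greater than)
}
shown = clusters_above_coverage(example)
```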

Supplementary Figure 15 Mapping of contigs to the Escherichia coli (STEC) O104:H4 outbreak genome.

The mapping of contig fragments to the known Escherichia coli (STEC) O104:H4 outbreak genome with cluster discriminated by colour and total coverage across all samples shown on the y-axis.

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–15, Supplementary Tables 1–8 and Supplementary Note (PDF 12624 kb)

Supplementary Software

CONCOCT version 0.3.3 software (ZIP 5301 kb)

About this article

Cite this article

Alneberg, J., Bjarnason, B., de Bruijn, I. et al. Binning metagenomic contigs by coverage and composition. Nat Methods 11, 1144–1146 (2014). https://doi.org/10.1038/nmeth.3103

  • DOI: https://doi.org/10.1038/nmeth.3103
