CheckM2: a rapid, scalable and accurate tool for assessing microbial genome quality using machine learning

An Author Correction to this article was published on 21 March 2024

This article has been updated

Abstract

Advances in sequencing technologies and bioinformatics tools have dramatically increased the recovery rate of microbial genomes from metagenomic data. Assessing the quality of metagenome-assembled genomes (MAGs) is a critical step before downstream analysis. Here, we present CheckM2, an improved method of predicting genome quality of MAGs using machine learning. Using synthetic and experimental data, we demonstrate that CheckM2 outperforms existing tools in both accuracy and computational speed. In addition, CheckM2’s database can be rapidly updated with new high-quality reference genomes, including taxa represented only by a single genome. We also show that CheckM2 accurately predicts genome quality for MAGs from novel lineages, even for those with reduced genome size (for example, Patescibacteria and the DPANN superphylum). CheckM2 provides accurate genome quality predictions across bacterial and archaeal lineages, giving increased confidence when inferring biological conclusions from MAGs.

Fig. 1: Overview of CheckM2 development, benchmarking and validation.
Fig. 2: Benchmarking ML models on synthetic genomes of varying taxonomic novelty.
Fig. 3: Comparison of tools on RefSeq r202 genomes.
Fig. 4: Comparison of tools on novel lineages.
Fig. 5: Comparison of tools on non-self contamination.
Fig. 6: CheckM1 versus CheckM2 predictions across GTDB.

Data availability

Additional analyses supporting the conclusions of this study have been supplied as Supplementary Information. Supplementary scripts required to generate all synthetic genomes used in training and testing can be accessed from Zenodo (https://doi.org/10.5281/zenodo.6861629). Benchmarking data can be accessed from Zenodo (https://doi.org/10.5281/zenodo.8024307). A full list of feature vectors used by CheckM2 and their order can be accessed on GitHub (https://github.com/chklovski/checkm2_supplementary). The annotation vectors of all synthetic genomes used to train CheckM2, as well as completeness/contamination labels, are available as part of this repository in sparse vector format, formatted for both the NN and gradient boost models. Source data are provided with this paper.

Code availability

CheckM2 is available on GitHub (https://github.com/chklovski/CheckM2) and is released under the GNU General Public License v.3. The script required to update CheckM2 with new high-quality genomes is also available on GitHub (https://github.com/chklovski/checkm2_supplementary), although this will be carried out centrally by the CheckM2 team.

Acknowledgements

We thank E. McMaster (Queensland University of Technology, Translational Research Institute, Woolloongabba, Queensland, Australia) for her help in refining the figures. This work was supported by the National Science Foundation Biology Integration Institute – EMERGE (GRT00059410). A.C. is supported by Australian Government Research Training Program Scholarships. G.W.T. is supported by Australian Research Council (ARC) (grant no. FT170100070). B.J.W. is supported by ARC Discovery Early Career Research (grant no. DE160100248).

Author information

Contributions

A.C. and G.W.T. designed the overall workflow and planned the key steps. A.C. generated the synthetic genomes, trained the ML models, performed the benchmarking and wrote the final code base of CheckM2. B.J.W. and D.H.P. guided code improvements and optimizations. G.W.T., D.H.P. and B.J.W. helped interpret the results and suggested future directions and improvements. A.C. and G.W.T. drafted and wrote the manuscript. All authors edited the manuscript before submission.

Corresponding author

Correspondence to Gene W. Tyson.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Methods thanks Stephen Nayfach, Mads Albertsen and C. Titus Brown for their contribution to the peer review of this work. Primary Handling Editors: Lei Tang and Lin Tang, in collaboration with the Nature Methods team.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Supplementary Notes 1–10 and Figs. 1–3.

Reporting Summary

Supplementary Tables

Supplementary Tables 1–12 all referenced in the main paper.

Source data

Source Data Fig. 2

Statistical source data for Fig. 2.

Source Data Fig. 3

Statistical source data for Fig. 3.

Source Data Fig. 4

Statistical source data for Fig. 4.

Source Data Fig. 5

Statistical source data for Fig. 5.

Source Data Fig. 6

Statistical source data for Fig. 6.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Chklovski, A., Parks, D.H., Woodcroft, B.J. et al. CheckM2: a rapid, scalable and accurate tool for assessing microbial genome quality using machine learning. Nat Methods 20, 1203–1212 (2023). https://doi.org/10.1038/s41592-023-01940-w

  • DOI: https://doi.org/10.1038/s41592-023-01940-w
