CheckM2: a rapid, scalable and accurate tool for assessing microbial genome quality using machine learning

Chklovski, Alex; Parks, Donovan H.; Woodcroft, Ben J.; Tyson, Gene W.

doi:10.1038/s41592-023-01940-w

Article
Published: 27 July 2023

Nature Methods volume 20, pages 1203–1212 (2023)

An Author Correction to this article was published on 21 March 2024

This article has been updated

Abstract

Advances in sequencing technologies and bioinformatics tools have dramatically increased the recovery rate of microbial genomes from metagenomic data. Assessing the quality of metagenome-assembled genomes (MAGs) is a critical step before downstream analysis. Here, we present CheckM2, an improved method of predicting genome quality of MAGs using machine learning. Using synthetic and experimental data, we demonstrate that CheckM2 outperforms existing tools in both accuracy and computational speed. In addition, CheckM2’s database can be rapidly updated with new high-quality reference genomes, including taxa represented only by a single genome. We also show that CheckM2 accurately predicts genome quality for MAGs from novel lineages, even for those with reduced genome size (for example, Patescibacteria and the DPANN superphylum). CheckM2 provides accurate genome quality predictions across bacterial and archaeal lineages, giving increased confidence when inferring biological conclusions from MAGs.

**Fig. 1: Overview of CheckM2 development, benchmarking and validation.**

**Fig. 2: Benchmarking ML models on synthetic genomes of varying taxonomic novelty.**

**Fig. 3: Comparison of tools on RefSeq r202 genomes.**

**Fig. 4: Comparison of tools on novel lineages.**

**Fig. 5: Comparison of tools on non-self contamination.**

**Fig. 6: CheckM1 versus CheckM2 predictions across GTDB.**

Data availability

Additional analyses supporting the conclusions of this study have been supplied as Supplementary Information. Supplementary scripts required to generate all synthetic genomes used in training and testing can be accessed from Zenodo (https://doi.org/10.5281/zenodo.6861629). Benchmarking data can be accessed from Zenodo (https://doi.org/10.5281/zenodo.8024307). A full list of feature vectors used by CheckM2 and their order can be accessed on GitHub (https://github.com/chklovski/checkm2_supplementary). The annotation vectors of all synthetic genomes used to train CheckM2, as well as completeness/contamination labels, are available as part of this repository in sparse vector format, formatted for both the neural network (NN) and gradient boost models. Source data are provided with this paper.

Code availability

CheckM2 is available on GitHub (https://github.com/chklovski/CheckM2) and is released under the GNU General Public License v.3. The script required to update CheckM2 with new high-quality genomes is also available on GitHub (https://github.com/chklovski/checkm2_supplementary), although this will be carried out centrally by the CheckM2 team.

Change history

21 March 2024
A Correction to this paper has been published: https://doi.org/10.1038/s41592-024-02248-z

Acknowledgements

We thank E. McMaster (Queensland University of Technology, Translational Research Institute, Woolloongabba, Queensland, Australia) for her help in refining the figures. This work was supported by the National Science Foundation Biology Integration Institute – EMERGE (GRT00059410). A.C. is supported by Australian Government Research Training Program Scholarships. G.W.T. is supported by Australian Research Council (ARC) (grant no. FT170100070). B.J.W. is supported by ARC Discovery Early Career Research (grant no. DE160100248).

Author information

Authors and Affiliations

Centre for Microbiome Research, School of Biomedical Sciences, Queensland University of Technology, Translational Research Institute, Woolloongabba, Queensland, Australia
Alex Chklovski, Ben J. Woodcroft & Gene W. Tyson
Donovan Parks, Bioinformatic Consultant, Castlegar, British Columbia, Canada
Donovan H. Parks

Contributions

A.C. and G.W.T. designed the overall workflow and planned the key steps. A.C. generated the synthetic genomes, trained ML models, performed benchmarking and wrote the final code base of CheckM2. B.J.W. and D.H.P. guided code improvements and optimizations. G.W.T., D.H.P. and B.J.W. helped interpret the results and made further suggestions for future directions and improvements. A.C. and G.W.T. drafted and wrote the manuscript. All authors edited the manuscript before submission.

Corresponding author

Correspondence to Gene W. Tyson.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Methods thanks Stephen Nayfach, Mads Albertsen and C. Titus Brown for their contribution to the peer review of this work. Primary Handling Editor: Lei Tang and Lin Tang, in collaboration with the Nature Methods team.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Supplementary Notes 1–10 and Figs. 1–3.

Reporting Summary

Supplementary Tables

Supplementary Tables 1–12, all referenced in the main paper.

Source data

Source Data Fig. 2

Statistical source data for Fig. 2.

Source Data Fig. 3

Statistical source data for Fig. 3.

Source Data Fig. 4

Statistical source data for Fig. 4.

Source Data Fig. 5

Statistical source data for Fig. 5.

Source Data Fig. 6

Statistical source data for Fig. 6.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Chklovski, A., Parks, D.H., Woodcroft, B.J. et al. CheckM2: a rapid, scalable and accurate tool for assessing microbial genome quality using machine learning. Nat Methods 20, 1203–1212 (2023). https://doi.org/10.1038/s41592-023-01940-w

Received: 06 July 2022
Accepted: 14 June 2023
Published: 27 July 2023
Issue Date: August 2023
DOI: https://doi.org/10.1038/s41592-023-01940-w
