Abstract
Advances in sequencing technologies and bioinformatics tools have dramatically increased the recovery rate of microbial genomes from metagenomic data. Assessing the quality of metagenome-assembled genomes (MAGs) is a critical step before downstream analysis. Here, we present CheckM2, an improved method of predicting genome quality of MAGs using machine learning. Using synthetic and experimental data, we demonstrate that CheckM2 outperforms existing tools in both accuracy and computational speed. In addition, CheckM2’s database can be rapidly updated with new high-quality reference genomes, including taxa represented only by a single genome. We also show that CheckM2 accurately predicts genome quality for MAGs from novel lineages, even for those with reduced genome size (for example, Patescibacteria and the DPANN superphylum). CheckM2 provides accurate genome quality predictions across bacterial and archaeal lineages, giving increased confidence when inferring biological conclusions from MAGs.
This is a preview of subscription content, access via your institution
Access options
Access Nature and 54 other Nature Portfolio journals
Get Nature+, our best-value online-access subscription
$29.99 / 30 days
cancel any time
Subscribe to this journal
Receive 12 print issues and online access
$259.00 per year
only $21.58 per issue
Buy this article
- Purchase on Springer Link
- Instant access to full article PDF
Prices may be subject to local taxes which are calculated during checkout
Similar content being viewed by others
Data availability
Additional analyses supporting the conclusions of this study have been supplied as Supplementary Information. Supplementary scripts required to generate all synthetic genomes used in training and testing can be accessed from Zenodo (https://doi.org/10.5281/zenodo.6861629). Benchmarking data can be accessed from Zenodo (https://doi.org/10.5281/zenodo.8024307). A full list of feature vectors used by CheckM2 and their order can be accessed on GitHub (https://github.com/chklovski/checkm2_supplementary). The annotation vectors of all synthetic genomes used to train CheckM2, as well as completeness/contamination labels, are available as part of this repository in sparse vector format, formatted for both the NN and gradient boost models. Source data are provided with this paper.
Code availability
CheckM2 is available on GitHub (https://github.com/chklovski/CheckM2) and is released under the GNU General Public License v.3. The script required to update CheckM2 with new high-quality genomes is also available on GitHub (https://github.com/chklovski/checkm2_supplementary), although this will be carried out centrally by the CheckM2 team.
Change history
21 March 2024
A Correction to this paper has been published: https://doi.org/10.1038/s41592-024-02248-z
References
Woodcroft, B. J. et al. Genome-centric view of carbon processing in thawing permafrost. Nature 560, 49–54 (2018).
Anantharaman, K. et al. Thousands of microbial genomes shed light on interconnected biogeochemical processes in an aquifer system. Nat. Commun. 7, 13219 (2016).
Pasolli, E. et al. Extensive unexplored human microbiome diversity revealed by over 150,000 genomes from metagenomes spanning age, geography, and lifestyle. Cell 176, 649–662 (2019).
Parks, D. H., Imelfort, M., Skennerton, C. T., Hugenholtz, P. & Tyson, G. W. CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Res. 25, 1043–1055 (2015).
AlQuraishi, M. AlphaFold at CASP13. Bioinformatics 35, 4862–4865 (2019).
Brown, C. T. et al. Unusual biology across a group comprising more than 15% of domain Bacteria. Nature 523, 208–211 (2015).
Nissen, J. N. et al. Improved metagenome binning and assembly using deep variational autoencoders. Nat. Biotechnol. 39, 555–560 (2021).
Haft, D. H. et al. RefSeq: an update on prokaryotic genome annotation and curation. Nucleic Acids Res. 46, 851–860 (2017).
Parks, D. H. et al. A standardized bacterial taxonomy based on genome phylogeny substantially revises the tree of life. Nat. Biotechnol. 36, 996–1004 (2018).
Kanehisa, M. & Goto, S. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res. 28, 27–30 (2000).
Bowers, R. M. et al. Minimum information about a single amplified genome (MISAG) and a metagenome-assembled genome (MIMAG) of bacteria and archaea. Nat. Biotechnol. 35, 725–731 (2017).
Abadi, M. et al. Tensorflow: a system for large-scale machine learning. In Proc. 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16) 265–283 (2016).
Ke, G. et al. Lightgbm: A highly efficient gradient boosting decision tree. Adv. Neural Inf. Process. Syst. 30, 3146–3154 (2017).
Parks, D. H. et al. Recovery of nearly 8,000 metagenome-assembled genomes substantially expands the tree of life. Nat. Microbiol. 2, 1533–1542 (2017).
Castelle, C. J. & Banfield, J. F. Major new microbial groups expand diversity and alter our understanding of the tree of life. Cell 172, 1181–1197 (2018).
Castelle, C. J. et al. Biosynthetic capacity, metabolic variety and unusual biology in the CPR and DPANN radiations. Nat. Rev. Microbiol. 16, 629–645 (2018).
Méheust, R., Burstein, D., Castelle, C. J. & Banfield, J. F. The distinction of CPR bacteria from other bacteria based on protein family content. Nat. Commun. 10, 4173 (2019).
Singleton, C. M. et al. Connecting structure to function with the recovery of over 1000 high-quality metagenome-assembled genomes from activated sludge using long-read sequencing. Nat. Commun. 12, 2009 (2021).
Lui, L. M., Nielsen, T. N. & Arkin, A. P. A method for achieving complete microbial genomes and improving bins from metagenomics data. PLoS Comput. Biol. 17, e1008972 (2021).
Orakov, A. et al. GUNC: detection of chimerism and contamination in prokaryotic genomes. Genome Biol. 22, 178 (2021).
Yeoh, Y. K., Sekiguchi, Y., Parks, D. H. & Hugenholtz, P. Comparative genomics of candidate phylum TM6 suggests that parasitism is widespread and ancestral in this lineage. Mol. Biol. Evol. 33, 915–927 (2016).
Bowerman, K. L. et al. Disease-associated gut microbiome and metabolome changes in patients with chronic obstructive pulmonary disease. Nat. Commun. 11, 5886 (2020).
Neuenschwander, S. M., Ghai, R., Pernthaler, J. & Salcher, M. M. Microdiversification in genome-streamlined ubiquitous freshwater Actinobacteria. ISME J. 12, 185–198 (2018).
Rinke, C. et al. A phylogenomic and ecological analysis of the globally abundant Marine Group II archaea (Ca. Poseidoniales ord. nov.). ISME J. 13, 663–675 (2019).
Nelson, W. C., Tully, B. J. & Mobberley, J. M. Biases in genome reconstruction from metagenomic data. PeerJ 8, e10119 (2020).
Jarett, J. K. et al. Single-cell genomics of co-sorted Nanoarchaeota suggests novel putative host associations and diversification of proteins involved in symbiosis. Microbiome 6, 161 (2018).
Lundberg, S. M., Allen, P. G. & Lee, S.-I. A unified approach to interpreting model predictions. Adv. Neural Info. Proc. Syst. 30, 4765–4774 (2017).
Von Mering, C. et al. STRING 7—recent developments in the integration and prediction of protein interactions. Nucleic Acids Res. 35, D358–D362 (2007).
Jensen, L. J. et al. eggNOG: automated construction and annotation of orthologous groups of genes. Nucleic Acids Res. 36, D250–D254 (2007).
Shaffer, M. et al. DRAM for distilling microbial metabolism to automate the curation of microbiome function. Nucleic Acids Res. 48, 8883–8900 (2020).
Woodcroft, B. J. Galah. GitHub https://github.com/wwood/galah (2020).
Hyatt, D. et al. Prodigal: Prokaryotic gene recognition and translation initiation site identification. BMC Bioinforma. 11, 119 (2010).
Bushnell, B. BBMap: a fast, accurate, splice-aware aligner (OSTI, US DoE, 2014).
Kanehisa, M., Sato, Y., Kawashima, M., Furumichi, M. & Tanabe, M. KEGG as a reference resource for gene and protein annotation. Nucleic Acids Res. 44, D457–D462 (2016).
Benson, D. A. et al. GenBank. Nucleic Acids Res. 46, D41 (2018).
Seppey, M., Manni, M. & Zdobnov, E. M. BUSCO: assessing genome assembly and annotation completeness. Methods Mol. Biol. 1962, 227–245 (2019).
Waskom, M. L. Seaborn: statistical data visualization. J. Open Source Softw. 6, 3021 (2021).
Acknowledgements
We thank E. McMaster (Queensland University of Technology, Translational Research Institute, Woolloongabba, Queensland, Australia) for her help in refining the figures. This work was supported by the National Science Foundation Biology Integration Institute – EMERGE (GRT00059410). A.C. is supported by Australian Government Research Training Program Scholarships. G.W.T. is supported by Australian Research Council (ARC) (grant no. FT170100070). B.J.W. is supported by ARC Discovery Early Career Research (grant no. DE160100248).
Author information
Authors and Affiliations
Contributions
A.C. and G.W.T. designed the overall workflow and planned the key steps. A.C. generated the synthetic genomes, trained ML models, performed benchmarking and wrote the final code base of CheckM2. B.J.W. and D.H.P. guided code improvements and optimizations. G.W.T., D.H.P and B.J.W. helped interpret the result and made further suggestions for future directions and improvements. A.C. and G.W.T. drafted and wrote the manuscript. All authors edited the manuscript before submission.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Methods thanks Stephen Nayfach, Mads Albertsen and C. Titus Brown for their contribution to the peer review of this work. Primary Handling Editor: Lei Tang and Lin Tang, in collaboration with the Nature Methods team.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Supplementary Information
Supplementary Notes 1–10 and Figs. 1–3.
Supplementary Tables
Supplementary Tables 1–12 all referenced in the main paper.
Source data
Source Data Fig. 2
Statistical source data for Fig. 2.
Source Data Fig. 3
Statistical source data for Fig. 3.
Source Data Fig. 4
Statistical source data for Fig. 4.
Source Data Fig. 5
Statistical source data for Fig. 5.
Source Data Fig. 6
Statistical source data for Fig. 6.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Chklovski, A., Parks, D.H., Woodcroft, B.J. et al. CheckM2: a rapid, scalable and accurate tool for assessing microbial genome quality using machine learning. Nat Methods 20, 1203–1212 (2023). https://doi.org/10.1038/s41592-023-01940-w
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1038/s41592-023-01940-w
This article is cited by
-
Autotrophic biofilms sustained by deeply sourced groundwater host diverse bacteria implicated in sulfur and hydrogen metabolism
Microbiome (2024)
-
Many purported pseudogenes in bacterial genomes are bona fide genes
BMC Genomics (2024)
-
Physiological versatility of ANME-1 and Bathyarchaeotoa-8 archaea evidenced by inverse stable isotope labeling
Microbiome (2024)
-
Effective binning of metagenomic contigs using contrastive multi-view representation learning
Nature Communications (2024)
-
An abundant bacterial phylum with nitrite-oxidizing potential in oligotrophic marine sediments
Communications Biology (2024)