Recovery of nearly 8,000 metagenome-assembled genomes substantially expands the tree of life

Challenges in cultivating microorganisms have limited the phylogenetic diversity of currently available microbial genomes. This is being addressed by advances in sequencing throughput and computational techniques that allow for the cultivation-independent recovery of genomes from metagenomes. Here, we report the reconstruction of 7,903 bacterial and archaeal genomes from >1,500 public metagenomes. All genomes are estimated to be ≥50% complete and nearly half are ≥90% complete with ≤5% contamination. These genomes increase the phylogenetic diversity of bacterial and archaeal genome trees by >30% and provide the first representatives of 17 bacterial and three archaeal candidate phyla. We also recovered 245 genomes from the Patescibacteria superphylum (also known as the Candidate Phyla Radiation) and find that the relative diversity of this group varies substantially with different protein marker sets. The scale and quality of this data set demonstrate that recovering genomes from metagenomes provides an expedient path forward to exploring microbial dark matter. The recovery of 7,903 bacterial and archaeal metagenome-assembled genomes increases the phylogenetic diversity represented by public genome repositories and provides the first representatives from 20 candidate phyla.

Sequencing of microbial genomes has accelerated with reductions in sequencing costs, and public repositories now contain nearly 70,000 bacterial and archaeal genomes. The majority of these genomes have been obtained from axenic cultures 1,2 and disproportionately reflect microorganisms of medical importance 3 . Consequently, current genome repositories are not representative of the microbial diversity known from 16S rRNA gene surveys 4 . Concerted efforts are being made to address this limitation by targeting phylogenetically distinct microorganisms for cultivation [5][6][7] and single-cell sequencing 4,8 . Although these approaches continue to provide valuable reference genomes, the former is restricted to microorganisms amenable to cultivation and the latter is hampered by technical challenges and the need for specialised equipment 9 . Obtaining genomes from metagenomes is an emerging approach with the potential for large-scale recovery of near-complete genomes [10][11][12] .
Until recently, recovering genomes from metagenomic data was restricted to samples with low microbial diversity 13 , but improved sequencing throughput and advances in computational techniques now allow metagenome-assembled genomes (MAGs) to be recovered from high diversity environments 14,15 . MAGs are obtained by grouping or 'binning' together assembled contigs with similar sequence composition, depth of coverage across one or more related samples and taxonomic affiliations 16,17 . Several tools have been developed that exploit these sources of information to produce genomes from metagenomic data [18][19][20][21] and there are ongoing efforts to evaluate the effectiveness of different approaches 22 . Although closed genomes have been obtained using metagenomic binning methods 10,23 , MAGs are typically incomplete and may contain contigs from multiple strains or species due to challenges in distinguishing between related community members both in the assembly and binning processes 19,24 . This has spurred the development of methods for assessing the quality of recovered MAGs in order to allow biological inferences to be made with regards to their estimated completeness and contamination 25,26 .
Significant insights have recently been made based on the MAGs of uncultivated microorganisms. These include elucidation of several phyla previously lacking genomic representatives [27][28][29] , including the Patescibacteria superphylum 4 , which has subsequently been referred to as the 'Candidate Phyla Radiation' (CPR) as it may consist of upwards of 35 candidate phyla 10,30 . Notable evolutionary and metabolic insights include the discovery of eukaryotic-like cytoskeleton genes in the archaeon Lokiarchaeota 31,32 and the identification of putative methane-metabolizing genes in the Bathyarchaeota and Verstraetearchaeota phyla 33,34 . These initial studies demonstrate the need for additional genomic representatives across the tree of life in order to more fully appreciate microbial evolution and metabolism.
Here, we present the first large-scale initiative to recover MAGs from publicly available metagenomes. Nearly 8,000 draft-quality genomes were recovered from over 1,500 metagenomes, more than a threefold increase over large initiatives to genomically populate the tree of life such as the Genomic Encyclopedia of Bacteria and Archaea 35 (~2,000 genomes), the Human Microbiome Project 3 (~2,000) and the largest previous MAG study 11 (~2,500). We refer to our set of MAGs as the Uncultivated Bacteria and Archaea (UBA) data set. Genome-based phylogenetic analysis indicates that the UBA genomes provide the first representatives of several major bacterial and archaeal lineages and substantially expand genomic representation across the tree of life.

Results
Genomes are readily recovered from metagenomic data. MAGs were recovered from 1,550 metagenomes submitted to the Sequence Read Archive (SRA) before 31 December 2015 (Supplementary Fig. 1). We predominantly considered environmental and nonhuman gastrointestinal samples in order to focus on metagenomes likely to contain microbial populations from under-sampled lineages (Supplementary Table 1). The completeness and contamination of each MAG were estimated from the presence and absence of lineage-specific genes expected to be ubiquitous and single copy 25 , and these estimates, along with assembly statistics, were used to identify genomes suitable for further study. A total of 64,295 MAGs were obtained, of which 7,903 (7,280 bacterial and 623 archaeal) form the UBA data set as they met our filtering criteria of having an estimated quality ≥ 50 (defined as the estimated completeness of a genome minus five times its estimated contamination) and consisting of ≤ 500 scaffolds with an N50 ≥ 10 kb (Fig. 1 and Supplementary Table 2). Over 93% of the 7,903 UBA genomes have an average coverage of ≥ 10× (5th percentile, 9.2×; 95th percentile, 268×) and 95.8% have > 5× coverage over 90% of bases, providing assurance of high-quality base-calling across the genomes 3,36 . Among the UBA genomes is a subset of 3,438 near-complete genomes (3,225 bacterial and 213 archaeal) estimated to be ≥ 90% complete with ≤ 5% contamination (Fig. 1a). These genomes consist of ≤ 100 scaffolds in 70.2% of cases (≤ 200 scaffolds in 92.0% of genomes) and have an average N50 of 136 kb. Comparison of near-complete UBA genomes that are conspecific strains of complete isolate genomes also suggests that the recovered MAGs have no systematic loss of genomic content, with the exception of extrachromosomal elements such as plasmids (Supplementary Note 1).
The UBA data set was also assessed relative to the criteria used by the Human Microbiome Project (HMP) for defining high-quality draft genomes 3,37 . Of the 3,438 UBA genomes we have defined as near complete, 3,201 (93.1%) pass all of the HMP criteria, with the only substantial exception being 4.8% of the genomes having scaffolds with an N50 of < 20 kb (Supplementary Table 3). Nearly half of the remaining 4,465 UBA genomes also pass the HMP criteria for being high quality except that they are estimated to be < 90% complete.
The presence of tRNAs for the standard 20 amino acids was examined as a secondary measure of genome quality (Fig. 1c). The 3,438 near-complete UBA genomes have tRNAs that encode for an average of 17.3 ± 2.2 of the 20 amino acids and ≥ 15 amino acids in 90.3% of the genomes. The correlation between estimated genome completeness and identified tRNAs was positive but weak ( Supplementary Fig. 2) as tRNAs are regularly present in multiple copies and often collocated in a genome, making them poor markers for robustly estimating completeness 25,38 .
UBA genomes were represented in the majority of bacterial (47 of 59, Fig. 2) and archaeal (11 of 18, Fig. 3) phyla, as defined in the NCBI taxonomy 39 . In addition, they comprise the first genomic representatives of 17 bacterial and 3 archaeal phyla (see section 'UBA genomes are the first representatives of several phyla'). To provide an objective taxonomic analysis of the UBA data set, we used the phylogenetic criterion of mean branch length to extant taxa 30 as existing taxonomic classifications are not phylogenetically uniform 1 . The results were highly consistent across all trees and, as expected, named groups at each taxonomic rank vary substantially in their mean branch length to extant taxa (Fig. 4 and Supplementary Figs. 3 and 4). Based on the range of mean branch length values for established taxa, the bacterial UBA genomes are exclusive representatives of 20-30% of all genus- to order-level lineages, 15-30% of class-level lineages and 5-15% of phylum-level lineages (Fig. 4a). Similarly, the archaeal UBA genomes are the only representatives within 20-30% of genus- to order-level lineages, 15-30% of class-level lineages and around 10% of archaeal phyla (Fig. 4b). We also tabulated the number of UBA-exclusive lineages at the 50th, 90th and 95th percentiles of the mean branch length distribution of each taxonomic rank (Supplementary Table 9). At the conservative 90th percentile, the bacterial UBA genomes are the first genomic representatives of 766 genera (34.6%), 226 families (28.6%), 61 orders (21.6%) and  Table 10) and are unaffiliated with existing phyla 40 (Supplementary Table 11). The 10 UBP/UAP phyla with 16S rRNA genes ≥ 600 bp have inter-phyla percent identity values between 76% and 86% (Supplementary Table 12), in agreement with established phyla 41 (Supplementary Fig. 5).
However, these 16S rRNA results should be treated with some caution as the percent identity of incomplete 16S rRNA sequences correlates poorly with values for full-length sequences 42 . Because 16S rRNA genes often fail to assemble 43 and are missing from half of the UBP/UAP lineages, we used average amino-acid identity (mean of 46.2%) and shared gene content (mean of 24.4%) calibrated against established phyla to further support the classification of the UBP/UAP as phyla (Supplementary Fig. 5 and Supplementary Table 13).
To further resolve the taxonomic identity of the UBP and UAP genomes, 16S rRNA genes from these genomes were placed into a tree containing genomic and environmental 16S rRNA sequences.
Only UBP9, UAP2 and UAP3 could be further taxonomically resolved as the other candidate phyla either lack genomes with a 16S rRNA gene, were placed sister to named phyla, or had incongruent placements across the protein and 16S rRNA trees (Supplementary  Tables 11 and 14). UBP9 genomes are the first genomic representatives of the Terrabacteria candidate phylum SHA-109 (Fig. 2) and were recovered from baboon faeces (five genomes), palm oil effluent (one genome), a toluene degrading community (one genome) and a dechlorination bioreactor (one genome, Supplementary Table 14). UAP2 contains the first representatives of the Marine Hydrothermal Vent Group (MHVG) and consists of three genomes recovered from the Tara Oceans Expedition along with a single genome from the Beebe hydrothermal vent (Supplementary Table 14 and Fig. 3). UAP3 is represented by a single genome recovered from a Costa Rican marine sediment metagenome (Supplementary Table 14) and is the first representative of the Ancient Archaeal Group (AAG), a group adjacent to the Lokiarchaeota (Fig. 3).
Genomic representation of several bacterial lineages was greatly expanded by the UBA genomes (Fig. 5). The bacterial phyla with the largest increase in phylogenetic diversity (PD) were the underrepresented Aminicenantes, Gemmatimonadetes, Lentisphaerae and Omnitrophica lineages (phylogenetic gain (PG) of > 75%). Over 75% of the bacterial UBA genomes belong to the Actinobacteria, Bacteroidetes, Firmicutes and Proteobacteria. These genomes expand the PD of these phyla by 14-47%, despite > 50% of existing genomic representatives belonging to these four phyla (Fig. 5 and Supplementary Table 15). Such high levels of increased phylogenetic diversity are the norm, with 56 of 77 phyla and 73 of 143 classes being expanded by > 20% (Supplementary Table 16). The UBA genomes have no representatives in only 10 bacterial phyla, which we attribute to the narrow ecological range and/or low relative abundance of microorganisms belonging to these lineages (for example, NC10 and Aerophobetes). Within the Archaea, the PD of 10 of 21 phyla and 7 of 17 classes increased by > 20% (Fig. 5 and Supplementary Tables 15 and 16). This includes well-established archaeal groups such as the Euryarchaeota (PG of 34.8%) and Thaumarchaeota (PG of 40.8%) and poorly sampled groups such as the Micrarchaeota, Pacearchaeota and Woesearchaeota, which all had a PG of > 35%.
Improved genomic representatives within several lineages. There are 12 bacterial phyla where the UBA genomes are estimated to be the highest-quality representatives (Supplementary Table 17). Among these is the Aminicenantes, where the number of available genomes increased from 37 to 47 with the addition of the UBA genomes, and the highest-quality genome improved from 86.9% to 91.9% complete, with the five highest-quality genomes all being UBA genomes.
There are currently seven Hydrogenedentes genomes (five NCBI and two UBA), with the two highest-quality representatives being UBA genomes and appreciably improving upon the best previously available representative (88.3% complete, 4.3% contaminated to 98.9% complete, 1.1% contaminated). The most substantial improvement was in the Latescibacteria, where all previous representatives were derived from single cells and the UBA genomes improve the best-quality representative from 57.6% to 95.6% complete.
An alternative view of the CPR. The CPR has recently been proposed as a major collection of candidate phyla in the bacterial domain 30 . Under the bac120 tree and mean branch length to extant taxa criterion, the addition of the UBA genomes slightly reduces the percentage of phylum-level lineages represented by the CPR from a maximum of 29.4% to 26.3%, despite UBA genomes being the sole representatives within a number of genus- to class-level CPR lineages (Fig. 6a,b). Interestingly, the percentage of phylum-level lineages within the CPR increases substantially when considering the rp1 and rp2 trees where the maximum percentages are 38.7% and 38.3%, respectively (Fig. 6c,d). Consequently, under the bac120 tree and mean branch length to extant taxa criterion the CPR contains approximately the same percentage of phylum-level lineages as the Firmicutes and Actinobacteria combined, whereas under the rp1 tree the CPR is far more prominent (Fig. 6e,f). Interestingly, under the bac120 tree the CPR shows a pronounced increase in the relative percentage of lineages attributed to being of phylum-level diversity and at more specific taxonomic ranks contains approximately the same or fewer lineages than the Firmicutes (Fig. 6f).
A recent genome-based tree of life depicted the CPR as representing ~50% of bacterial lineages when aiming to recapitulate named phyla under the mean branch length to extant taxa criterion 30 . This analysis was conducted using the same 16 ribosomal proteins comprising the rp1 marker set and resulted in 39 of 76 (51%) phylum-level lineages belonging to the CPR. Here, we provide an alternative view with lineages collapsed under the same constraints on the bac120 tree. This results in the CPR being represented by only 20 of 76 (26.3%) phylum-level lineages (Supplementary Fig. 7), which occurs at the mean branch length threshold (namely, 0.85 substitutions per site) resulting in the maximum percentage of phylum-level CPR lineages (Fig. 6b).

Discussion
Despite considerable progress, many lineages known from 16S rRNA surveys still lack genomic representation 4 . Here, we expand the phylogenetic diversity of bacterial and archaeal genome trees by > 30% through the addition of 7,280 bacterial and 623 archaeal genomes obtained from over 1,500 public metagenomes (Fig. 5). These MAGs span the majority of recognized bacterial and archaeal phyla and include the first genomic representatives of 17 bacterial and three archaeal phyla (Figs. 2 and 3). The 7,903 genomes reported in this study range in quality from 50% complete to meeting the HMP criteria for high-quality draft genomes 3,37 (Fig. 1). They are more complete than those typically derived using single-cell genomics 4 and are of similar quality to those reported in other studies considering MAGs 10,11,44 . We have focused on these genomes, which represent only ~12% of the 64,295 recovered bins, as they are of sufficient quality to inform analyses such as resolving phylogenetic relationships 4,30 and comparing inter- and intralineage genomic features [45][46][47] . Importantly, these results demonstrate that a large amount of microbial diversity remains to be genomically described across the tree of life, even within existing metagenomic samples, and that this diversity is readily recovered using current tools and methodologies.
MAGs often lack 16S rRNA genes due to their conserved and repetitive nature impeding assembly 1,10,43 . The UBA genomes are no exception, with only 17.3% of bacterial UBA genomes containing a partial 16S rRNA gene and 10.2% having a fragment of ≥ 600 bp. Recovery was more successful in the archaeal UBA genomes, with 32.7% containing a 16S rRNA gene fragment ≥ 600 bp. We attribute this discrepancy to the higher average 16S rRNA copy number in Bacteria relative to Archaea 48 (bacterial mean = 4.12, archaeal mean = 1.63). Challenges in assembling and binning 16S rRNA genes motivated the use of protein-coding genes for the phylogenetic analyses presented in this and previous MAG studies 45,46 .
Recently, the diversity of the CPR was explored in the context of a genome tree inferred from 16 ribosomal proteins where it was divided into 36 named phyla and shown to represent approximately 50% of bacterial lineages of equal phylum-level evolutionary distance 30 . Our analyses using 120 concatenated proteins contrast with this view, as the CPR is shown to comprise ~25% of phylum-level lineages under the same criterion (Fig. 6b and Supplementary Fig. 7). This suggests that ribosomal proteins within CPR organisms may be evolving atypically relative to other proteins, perhaps as a result of their unusual ribosome composition and the presence of self-splicing introns and proteins being encoded within their rRNA genes 10 . These contrasting views of the diversity of the CPR are equally valid and probably reflect the unique biology of the organisms within this group.
While the SRA represents a large set of publicly available metagenomic data, many additional metagenomes exist in other repositories such as the Integrated Microbial Genomes and Metagenomes 49 (IMG/M) database and Metagenomics Rapid Annotation Server 50 (MG-RAST). We expect that processing these metagenomes will add tens of thousands of additional genomes to the tree of life. Furthermore, methods for assembling and binning metagenomic data are continually improving, which makes it likely that systematic reprocessing of metagenomic data will result in the recovery of new genomes and improved versions of previously obtained genomes.
The number and diversity of genomes presented in this study, and the many similar studies we anticipate will follow, move us closer to a comprehensive genomic representation of the microbial world. Detailed examination of such genomes will further our understanding of microbial evolution and metabolic diversity, and provide important insights into the role of microorganisms in both natural and industrial processes. We anticipate that as metagenomic assembly and binning methods mature we will be presented with the challenge and great opportunity to be able to study microbial communities with complete, or near complete, genomic representation in the context of a comprehensive tree of life.
Note added in proof: During finalization of this manuscript, a new standard specifying the minimum information about a metagenome-assembled genome (MIMAG) was proposed 51 . The medium-quality and partial UBA MAGs meet the medium-quality criteria of the MIMAG standard. However, most of the near-complete UBA MAGs do not meet the stringent rRNA and tRNA requirements for high-quality draft MAGs under this standard, and we therefore deliberately refer to these MAGs as 'near complete'.
of bins with an estimated completeness > 70% and contamination < 5% were considered for further refinement and validation.

Recovery of cultivation-independent genomes.
Merging of compatible bins. Automated binning methods can produce multiple bins from the same microbial population. The 'merge' method of CheckM v.1.0.6 was used to identify pairs of bins where the completeness increased by ≥ 10% and the contamination increased by ≤ 1% when merged into a single bin. Bins meeting these criteria were grouped into a single bin if the mean GC of the bins were within 3%, the mean coverage of the bins had an absolute percentage difference ≤ 25%, and the bins had identical taxonomic classifications as determined by their placement in the reference genome tree used by CheckM. This set of criteria was used to avoid producing chimaeric bins.
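The merging rule described above can be sketched as a single predicate. This is an illustrative sketch only, not the CheckM implementation: the `Bin` record, the function names, and the choice to express the coverage difference relative to the mean of the two bins are assumptions, as the text does not specify the denominator.

```python
from dataclasses import dataclass


@dataclass
class Bin:
    """Hypothetical summary of a bin; field names are illustrative."""
    gc: float        # mean GC content, in percent
    coverage: float  # mean depth of coverage
    taxonomy: str    # classification from the reference genome tree


def abs_pct_diff(a: float, b: float) -> float:
    """Absolute percentage difference, relative to the mean (an assumption)."""
    return abs(a - b) / ((a + b) / 2.0) * 100.0


def should_merge(a: Bin, b: Bin,
                 delta_completeness: float,
                 delta_contamination: float) -> bool:
    """Apply the stated thresholds to a candidate pair of bins.

    delta_* are the changes in estimated completeness and contamination
    (percentage points) obtained when the two bins are combined.
    """
    return (delta_completeness >= 10.0
            and delta_contamination <= 1.0
            and abs(a.gc - b.gc) <= 3.0
            and abs_pct_diff(a.coverage, b.coverage) <= 25.0
            and a.taxonomy == b.taxonomy)
```

Requiring all five conditions at once reflects the stated aim of avoiding chimaeric bins: a gain in completeness alone is not sufficient to justify a merge.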

Nature Microbiology
distributions of these genomic features, as determined empirically over a set of 5,656 trusted reference genomes 25,33 . Scaffolds were also removed if their mean coverage had an absolute percentage difference ≥ 50% when compared to the mean coverage of the bin.
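The coverage criterion above can be illustrated with a short sketch. The function name and the input format (a mapping from scaffold identifier to mean coverage, plus a precomputed bin mean) are assumptions made for illustration; how the bin mean is weighted is not specified in the text.

```python
from typing import Dict, List


def coverage_outliers(scaffold_coverage: Dict[str, float],
                      bin_mean_coverage: float,
                      max_pct_diff: float = 50.0) -> List[str]:
    """Return scaffolds whose coverage deviates from the bin mean.

    A scaffold is flagged when its absolute percentage difference to the
    bin mean is >= max_pct_diff, following the threshold stated in the text.
    """
    flagged = []
    for scaffold_id, cov in scaffold_coverage.items():
        pct_diff = abs(cov - bin_mean_coverage) / bin_mean_coverage * 100.0
        if pct_diff >= max_pct_diff:
            flagged.append(scaffold_id)
    return flagged
```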
Filtering scaffolds with incongruent taxonomic classification. Each gene within a bin was assigned a taxonomic classification through homology search using BLASTP 55 v.2.2.30+ against a custom database of 12,321 genomes from RefSeq/GenBank 56 release 75. This database was constructed from RefSeq and GenBank genomes consisting of ≤ 300 contigs, having an N50 ≥ 20 kb and containing ≤ 10 kb of ambiguous base pairs. A genome was only included in the database if it was estimated to be ≥ 90% complete, ≤ 10% contaminated and had an overall quality ≥ 50 (defined as completeness − 5 × contamination). Quality estimates were determined with CheckM using the lineage-specific workflow and default parameters. Genomes meeting this set of requirements were dereplicated to remove genomes from the same named species with an amino-acid identity (AAI) ≥ 99.5%. AAI values were calculated with CompareM v.0.0.13 (https://github.com/dparks1134/CompareM) and dereplication performed in a greedy fashion with a preference towards type strains and genomes annotated as complete at NCBI. Genes were assigned the taxonomic classification of their 'top' hit or designated as unclassified if the gene had no identified homologue with an E-value ≤ 1e−2, a percent sequence identity ≥ 30% and a percent alignment length ≥ 50%. Scaffolds with incongruent taxonomic classifications were removed from each bin. The consensus classification of a bin at each taxonomic rank was determined by identifying the taxon that occurred at the highest frequency across all classified genes or designated as unclassified if no taxon was represented by ≥ 50% of the classified genes. Scaffolds where ≥ 50% of the classified genes at each rank agreed with the consensus classification of the bin were designated as 'trusted', and a taxon was considered to be 'common' if it comprised ≥ 5% of the classified genes across the set of trusted scaffolds.
A scaffold was considered to be taxonomically incongruent and removed from a bin if the following three conditions were met: (1) it contained ≥ 5 classified genes and ≥ 25% of all genes on the scaffold were classified; (2) ≤ 10% of the classified genes were contained in the set of common taxa at each classified rank; and (3) > 50% of classified genes were assigned to the same taxon at each classified rank. Taxonomic classification of genes and identification of scaffolds with divergent taxonomic classifications were performed with the 'taxon_profile' and 'taxon_filter' methods of RefineM v.0.0.14 (https://github.com/dparks1134/RefineM), respectively.
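The three conditions can be expressed compactly for a single taxonomic rank. The sketch below is not the RefineM code; the function name and input format are assumptions, and a caller would need to evaluate it at every classified rank, as the text requires the conditions to hold at each one.

```python
from collections import Counter
from typing import Optional, Sequence, Set


def is_incongruent_at_rank(gene_taxa: Sequence[Optional[str]],
                           common_taxa: Set[str]) -> bool:
    """Evaluate the three stated conditions for one scaffold at one rank.

    gene_taxa holds one entry per gene on the scaffold: the assigned taxon,
    or None if the gene is unclassified. common_taxa is the set of 'common'
    taxa derived from the trusted scaffolds of the bin.
    """
    classified = [t for t in gene_taxa if t is not None]
    # (1) at least 5 classified genes, covering >= 25% of all genes
    if len(classified) < 5 or len(classified) < 0.25 * len(gene_taxa):
        return False
    # (2) <= 10% of classified genes fall within the common taxa
    in_common = sum(1 for t in classified if t in common_taxa)
    if in_common > 0.10 * len(classified):
        return False
    # (3) > 50% of classified genes agree on a single taxon
    top_count = Counter(classified).most_common(1)[0][1]
    return top_count > 0.50 * len(classified)
```

Condition (1) guards against removing scaffolds on thin evidence, while condition (3) ensures the scaffold carries a coherent alternative signal and is not merely poorly classified.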
Filtering scaffolds with incongruent 16S rRNA genes. Scaffolds were removed from a bin if they contained a complete or partial 16S rRNA gene ≥ 600 bp with a taxonomic classification incongruent with the taxonomic identity of the bin. BLASTN 55 was used to assign each 16S rRNA gene the taxonomy of its closest homologue within a database comprising the 10,769 16S genes identified within the 12,321 reference genomes discussed in the previous section. The sequence identity to the closest homologue was used to determine the set of ranks that should be examined for congruency. Specifically, previously reported median percent identity values were used to establish conservative thresholds for the taxonomic ranks to consider 41 : genus ≥ 98.7%, family ≥ 96.4%, order ≥ 92.25%, class ≥ 89.2%, phylum ≥ 86.35% and domain ≥ 83.68%. The taxon at each rank was then compared to the taxonomic classification of the genes across all scaffolds in the bin and designated as incongruent if the taxon was assigned to ≤ 10% of classified genes. This methodology is implemented in RefineM v.0.0.14.
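The mapping from 16S rRNA percent identity to the ranks examined for congruency amounts to a threshold lookup. The sketch below uses the thresholds listed above; the constant and function names are illustrative and are not taken from RefineM.

```python
from typing import List

# Thresholds as listed in the text, ordered from most to least specific rank.
RANK_THRESHOLDS = [
    ("genus", 98.7),
    ("family", 96.4),
    ("order", 92.25),
    ("class", 89.2),
    ("phylum", 86.35),
    ("domain", 83.68),
]


def ranks_to_check(percent_identity: float) -> List[str]:
    """Return the ranks at which the closest homologue's taxonomy is examined."""
    return [rank for rank, threshold in RANK_THRESHOLDS
            if percent_identity >= threshold]
```

For example, a 16S rRNA gene that is 93% identical to its closest homologue would be checked from order to domain, but not at family or genus.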
Selection of refined genomes. Of the 64,295 bins produced by MetaBAT, only the 7,903 genomes with an estimated quality ≥ 50 (defined as completeness − 5 × contamination), scaffolds resulting in an N50 of ≥ 10 kb, containing < 100 kb ambiguous bases and consisting of < 1,000 contigs and < 500 scaffolds were considered to be of sufficient quality for further exploration and deposition in public repositories. We adopted the quality criterion of completeness − 5 × contamination as it provides a good signal (completeness) to noise (contamination) ratio, where higher levels of contamination are only permissible when the genome is largely complete. These genomes have been deposited as assemblies in NCBI's TPA:Assembly database along with alignment files indicating the mapping of SRA reads to UBA genomes.
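The selection criteria can be summarised in a few lines. The sketch below is illustrative: the function names are hypothetical, N50 is computed with its standard definition, and the thresholds are those stated in this section.

```python
from typing import Sequence


def n50(lengths: Sequence[int]) -> int:
    """Length L such that sequences of length >= L cover half the assembly."""
    half_total = sum(lengths) / 2.0
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half_total:
            return length
    return 0  # empty assembly


def quality_score(completeness: float, contamination: float) -> float:
    """Estimated quality: completeness minus five times contamination."""
    return completeness - 5.0 * contamination


def passes_uba_filter(completeness: float, contamination: float,
                      scaffold_n50: int, ambiguous_bases: int,
                      n_contigs: int, n_scaffolds: int) -> bool:
    """Apply the selection thresholds stated in the text."""
    return (quality_score(completeness, contamination) >= 50.0
            and scaffold_n50 >= 10_000
            and ambiguous_bases < 100_000
            and n_contigs < 1_000
            and n_scaffolds < 500)
```

Under this score a genome that is 70% complete with 5% contamination (quality 45) is rejected, whereas a genome that is 90% complete with the same contamination (quality 65) is retained, which illustrates why contamination is tolerated only in largely complete genomes.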
Comparison of UBA genomes to complete conspecific strains. The 3,438 near-complete (≥ 90% complete; ≤ 5% contamination) UBA genomes were compared to complete isolate genomes in RefSeq release 76. Of these, 207 of the UBA genomes were determined to be conspecific strains of complete isolate genomes based on an average nucleotide identity (ANI) and alignment fraction (AF) above 96.5% and 60%, respectively 57 . ANI and AF values were determined using ANI Calculator 57 v.1. The genome size of each UBA genome was adjusted to account for its estimated completeness and contamination: adjusted genome size = (genome size)/(completeness + contamination). Homologues between UBA genomes and their conspecific counterparts were determined by inferring genes with Prodigal 58 v.2.6.3 and establishing sequence similarity with BLASTP v.2.2.30+. A UBA protein was considered homologous to an isolate protein if it was the top hit among all isolate proteins, had an E-value of ≤ 1e−10, a percent identity of ≥ 70% and an alignment length spanning ≥ 70% of the isolate protein.
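The size adjustment and the homology criteria can be sketched as follows. The function names are hypothetical, and completeness and contamination are assumed to be expressed as fractions (for example, 0.90 for 90%), which the formula as written implies but does not state.

```python
def adjusted_genome_size(genome_size: float,
                         completeness: float,
                         contamination: float) -> float:
    """Adjusted size = genome size / (completeness + contamination).

    completeness and contamination are fractions in this sketch (assumption).
    """
    return genome_size / (completeness + contamination)


def is_homologue(is_top_hit: bool, evalue: float, percent_identity: float,
                 alignment_length: int, isolate_protein_length: int) -> bool:
    """Apply the stated criteria linking a UBA protein to an isolate protein."""
    return (is_top_hit
            and evalue <= 1e-10
            and percent_identity >= 70.0
            and alignment_length >= 0.70 * isolate_protein_length)
```

As an example, a 4.5 Mb MAG estimated to be 90% complete with no contamination would be assigned an adjusted size of approximately 5.0 Mb.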
Proteins used to infer genome trees. Bacterial and archaeal genome trees were inferred from the concatenation of 120 (Supplementary Table 6) and 122 (Supplementary Table 7) phylogenetically informative proteins, respectively. These proteins were identified as being present in ≥ 90% of bacterial or archaeal genomes and, when present, single-copy in ≥ 95% of genomes. Protein-coding regions were identified using Prodigal v.2.6.3 (with default parameters, but with Ns treated as masked sequences), translation tables determined using a coding density heuristic 25 , and the ubiquity of genes determined across genomes from NCBI's RefSeq release 73 annotated with the Pfam 59 v.27 and TIGRFAMs 60 v.15.0 databases. Only genomes composed of ≤ 200 contigs, with an N50 of ≥ 20 kb and with CheckM completeness and contamination estimates of ≥ 95% and ≤ 5%, respectively, were considered. Phylogenetically informative proteins were determined by filtering ubiquitous proteins whose gene trees had poor congruence with a set of subsampled concatenated genome trees. Specifically, the initial set of 188 bacterial (187 archaeal) proteins was randomly subsampled to 132 genes (~70%) and concatenated to infer a subsampled genome tree. Gene subsampling was independently performed 100 times to establish well-supported splits, which we define as any split occurring in > 80% of the subsampled trees and with ≥ 1% of taxa contained in both bipartitions induced by the split. The congruence between a gene tree and the subsampled genome tree was measured as the fraction of well-supported split lengths compatible with the gene tree, a measure we call the 'normalized compatible split length'. Genes with a normalized compatible split length of ≤ 50% were removed, as this poor congruence may indicate the presence of lateral gene transfer events. Proteins were aligned to Pfam and TIGRFAMs HMMs using HMMER 61 v.3.1b1 with default parameters and trees were inferred with FastTree 62 v.2.1.7 under the WAG+GAMMA models.
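The ubiquity and single-copy criteria used to shortlist candidate marker proteins can be illustrated as below. This sketch covers only that initial screen, not the subsequent congruence filtering; the function name and the per-genome copy-count input are assumptions.

```python
from typing import Sequence


def is_candidate_marker(copy_counts: Sequence[int],
                        min_ubiquity: float = 0.90,
                        min_single_copy: float = 0.95) -> bool:
    """Screen a protein family using its copy number in each genome.

    copy_counts holds the number of copies found in each reference genome.
    The family must be present in >= 90% of genomes and, where present,
    single copy in >= 95% of those genomes.
    """
    present = [c for c in copy_counts if c > 0]
    if not present:
        return False
    ubiquity = len(present) / len(copy_counts)
    single_copy = sum(1 for c in present if c == 1) / len(present)
    return ubiquity >= min_ubiquity and single_copy >= min_single_copy
```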
Trees were also inferred from two ribosomal protein sets: (1) 16 ribosomal proteins (Supplementary Table 4) that form a syntenic block 10,30 and (2) 23 ribosomal proteins (Supplementary Table 5) previously used for tree inference and tested for lateral gene transfer 4 .
Inference of genome trees. Genome trees were inferred across a dereplicated set of UBA and RefSeq/GenBank release 76 (May 2016; includes 727 single cell and 1,811 MAGs) genomes. All 5,192 RefSeq genomes annotated at NCBI as 'reference' or 'representative' were retained, except for a low-quality subset of 294 genomes that did not meet our 'trusted' genome criteria: composed of ≤ 300 contigs, N50 ≥ 20 kb, CheckM completeness and contamination estimates of ≥ 90% and ≤ 10%, respectively. This set of 4,898 genomes was augmented with an additional 3,324 RefSeq genomes to retain at least two genomes per species where possible. Preference was given to genomes annotated at NCBI as being a type strain and/or 'complete' and restricted to genomes meeting the 'trusted' genome filtering criteria. An additional 551 RefSeq genomes currently without a species designation at NCBI, but passing the genome quality filtering, were also added to this initial set of seed genomes.
UBA, GenBank and remaining RefSeq genomes meeting the 'trusted' genome criteria were compared to these 8,773 seed genomes. Genomes with an AAI of ≥ 99.5% to a seed genome, as calculated over the 120 bacterial or 122 archaeal marker genes used for phylogenetic inference, were clustered with the seed genome and do not appear as separate genomes in the genome trees. This cutoff correlated with the proposed 96.5% ANI threshold for defining bacterial and archaeal species 57 (Supplementary Fig. 6). Trusted genomes with an AAI < 99.5% were added to the seed set. All remaining genomes, regardless of quality, were compared to this final seed set using the same AAI clustering criteria of 99.5%.
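The clustering of genomes to seeds can be sketched as a greedy procedure. This is a simplified illustration rather than the procedure actually used: the function name is hypothetical, AAI values are supplied by a caller-provided function, and the sketch assumes that an unclustered genome becomes a seed immediately, so that the result depends on input order.

```python
from typing import Callable, Dict, Iterable, List


def cluster_to_seeds(genomes: Iterable[str],
                     seeds: Iterable[str],
                     aai: Callable[[str, str], float],
                     threshold: float = 99.5) -> Dict[str, List[str]]:
    """Greedily assign each genome to its most similar seed.

    A genome with AAI >= threshold to a seed joins that seed's cluster;
    otherwise it is added to the seed set (assumption of this sketch).
    """
    seed_list = list(seeds)
    clusters: Dict[str, List[str]] = {s: [] for s in seed_list}
    for genome in genomes:
        best = max(seed_list, key=lambda s: aai(genome, s), default=None)
        if best is not None and aai(genome, best) >= threshold:
            clusters[best].append(genome)
        else:
            seed_list.append(genome)
            clusters[genome] = []
    return clusters
```

In the workflow described above, only genomes meeting the 'trusted' criteria are promoted to seeds; a caller could enforce this by running the function first over trusted genomes and then assigning the remainder against the fixed final seed set.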
Seed and unclustered genomes with an estimated genome quality ≥ 50 (defined as completeness − 5 × contamination) were used to create an initial multiple sequence alignment, with the exception of the 797 CPR genomes 10 , which were retained regardless of their estimated quality. Proteins were identified and aligned using HMMER v.3.1b1 and the resulting alignment trimmed to remove columns represented by < 50% of taxa or without a common amino acid in ≥ 25% of taxa. Genomes with amino acids in < 40% of aligned columns (20% for the lenient archaeal trees) were removed from consideration. The 120 concatenated bacterial protein set consisted of 34,796 aligned columns after trimming and was inferred over 19,198 genomes. The 122 concatenated archaeal protein set contained 28,025 trimmed columns and spanned 1,012 genomes when using standard filtering criteria and 27,942 columns spanning 1,070 genomes when using lenient filtering. Trees were inferred with FastTree v.2.1.7 under the WAG+GAMMA models and support values determined using 100 non-parametric bootstrap replicates.
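The column trimming and genome filtering steps can be sketched for a small alignment. The function names are hypothetical, gaps are assumed to be encoded as '-', and both percentages are taken relative to the total number of taxa, which is an interpretation of the text.

```python
from collections import Counter
from typing import List, Sequence


def trim_columns(alignment: Sequence[str],
                 min_taxa: float = 0.50,
                 min_consensus: float = 0.25) -> List[str]:
    """Drop columns that are poorly represented or lack a common residue.

    alignment is a non-empty list of equal-length aligned sequences.
    """
    n_taxa = len(alignment)
    keep = []
    for col in range(len(alignment[0])):
        residues = [seq[col] for seq in alignment if seq[col] != "-"]
        if len(residues) < min_taxa * n_taxa:
            continue  # column represented by too few taxa
        most_common_count = Counter(residues).most_common(1)[0][1]
        if most_common_count < min_consensus * n_taxa:
            continue  # no amino acid shared by enough taxa
        keep.append(col)
    return ["".join(seq[c] for c in keep) for seq in alignment]


def has_enough_data(sequence: str, min_fraction: float = 0.40) -> bool:
    """True if amino acids occupy at least min_fraction of aligned columns."""
    return sum(1 for c in sequence if c != "-") >= min_fraction * len(sequence)
```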
Taxonomic annotation of genome trees. Genome trees were annotated using taxonomic information from the NCBI Taxonomy Database 39 . Only the canonical seven taxonomic ranks (species to domain) were considered for each genome, and this information was used to annotate internal lineages using tax2tree 63 . Manual curation was performed to add in phylum information currently missing at NCBI and to resolve polyphyletic groups. Polyphyletic groups that could not be confidently resolved were identified using an underscore and numerical identifier (for example, Deltaproteobacteria_1).
Inference of 16S rRNA trees. Bacterial and archaeal trees were inferred from 16S rRNA genes > 600 bp and > 1,200 bp within UBA and RefSeq/GenBank release 76 genomes, respectively. The 16S rRNA genes were identified using HMMER and domain-specific SSU/LSU HMM models as implemented in the 'ssu-finder' method of CheckM. These genes were aligned with ssu-align 64 v.0.1 and trailing or leading columns represented by ≤ 70% of taxa were trimmed, which resulted in bacterial and archaeal alignments of 1,421 and 1,378 bp, respectively. Trees were inferred with FastTree v.2.1.7 under the GTR+GAMMA model and support values were determined using 100 non-parametric bootstrap replicates.
Similarity of 16S rRNA genes. The percent identity between 16S rRNA genes was calculated from the multiple sequence alignments used to infer the domain-specific 16S rRNA gene trees. The 'dist.seqs' command of mothur 65 v.1.30.2 was used to calculate percent identity. Default parameters were used, except that gaps at the end of sequences were ignored (countends = F) in order to accommodate partial 16S rRNA sequences. Inter-phylum (inter-class) 16S rRNA percent identity values were determined by identifying the most similar sequence to each sequence within a phylum across all sequences from different phyla (classes).
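The effect of ignoring terminal gaps can be illustrated with a simplified percent identity calculation over two aligned sequences. This sketch is not a reimplementation of mothur: it assumes that each internal gap position counts as a mismatch and that positions gapped in both sequences are skipped, which may differ from mothur's own gap handling. The function `percent_identity` is hypothetical.

```python
def percent_identity(a: str, b: str, gaps: str = "-.") -> float:
    """Percent identity between two aligned sequences, ignoring terminal gaps.

    Only the region where both sequences have started and neither has ended
    is compared, so partial sequences are not penalised for missing ends.
    """
    if len(a) != len(b):
        raise ValueError("sequences must come from the same alignment")
    a, b = a.upper(), b.upper()

    def span(s: str):
        idx = [i for i, c in enumerate(s) if c not in gaps]
        return (idx[0], idx[-1]) if idx else None

    span_a, span_b = span(a), span(b)
    if span_a is None or span_b is None:
        return 0.0
    start = max(span_a[0], span_b[0])
    end = min(span_a[1], span_b[1])

    matches = compared = 0
    for i in range(start, end + 1):
        x, y = a[i], b[i]
        if x in gaps and y in gaps:
            continue  # column gapped in both sequences carries no information
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0
```

For a partial sequence `--ACGTAC--` aligned against `TTACGAACGG`, only the six overlapping columns are compared, so the four flanking positions of the longer sequence do not lower the identity.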
Assessment of phylogenetic and taxonomic diversity. Phylogenetic diversity (total branch length spanned by a set of taxa) and gain (additional branch length contributed by a set of taxa) were calculated using GenomeTreeTk v.0.0.23 (https://github.com/dparks1134/GenomeTreeTk) and verified with ARB 66 v.6.0.2. Taxonomic diversity and the percentage of lineages of equal evolutionary distance unique to the UBA genomes were determined using the mean branch length to extant taxa criterion 30 . Lineages of equal evolutionary distance were related to the distribution of NCBI taxa 39 as defined on 19 May 2016 and used to construct the phylum-level lineage view (Supplementary Fig. 7) by evaluating the number of groups formed at mean branch length values of 0.5 to 1.1 with a step size of 0.025. A value of 0.85 was selected as it most closely matched the number of bacterial phyla when excluding the CPR. In agreement with previous analyses 30 , we used this criterion to explore the taxonomic structure of phylogenetic trees and not to explicitly establish taxonomic status.
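Phylogenetic diversity and gain, as defined above, can be sketched on a rooted tree stored as a child-to-parent map. This is a minimal illustration and not the GenomeTreeTk implementation; it assumes that the branch length spanned by a set of taxa includes the path to the root, and the tree representation and function names are hypothetical.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical tree encoding: node -> (parent node, length of branch to parent).
Tree = Dict[str, Tuple[str, float]]


def phylogenetic_diversity(tree: Tree, taxa: Iterable[str]) -> float:
    """Total branch length on the union of root-to-tip paths for the taxa."""
    visited = set()
    total = 0.0
    for taxon in taxa:
        node = taxon
        # Walk towards the root, stopping at branches that were already counted.
        while node in tree and node not in visited:
            visited.add(node)
            parent, length = tree[node]
            total += length
            node = parent
    return total


def phylogenetic_gain(tree: Tree, reference: Iterable[str], added: Iterable[str]) -> float:
    """Additional branch length contributed by `added` beyond `reference`."""
    reference = set(reference)
    combined = reference | set(added)
    return phylogenetic_diversity(tree, combined) - phylogenetic_diversity(tree, reference)
```

Dividing the gain by the phylogenetic diversity of the combined set gives the fraction of total diversity attributable to the added genomes.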
Genomic similarity. The AAI and shared gene content between genomes were determined with CompareM v.0.0.21 using default parameters (homologues defined by an E-value ≤ 0.001, a percent identity ≥ 30% and an alignment length ≥ 70%). CompareM reports shared gene content relative to the genome with the fewest identified genes in order to accommodate incomplete genomes. Inter-phylum and inter-class AAI and shared gene content values were determined by sampling up to 50 near-complete (completeness ≥ 90%, contamination ≤ 5%, N50 > 20 kb, total contigs ≤ 200) RefSeq release 76 genomes from each named lineage, taking care to sample evenly between named species. The AAI score, defined as the sum of the AAI and shared gene content, was used to determine the most similar genome to each query genome.
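The AAI score can be expressed in a few lines. The sketch below assumes that AAI and shared gene content are both expressed as percentages and that the homologue count between two genomes has already been computed; the function names are hypothetical and do not correspond to CompareM commands.

```python
from typing import Dict, Tuple


def shared_gene_content(n_homologues: int, n_genes_a: int, n_genes_b: int) -> float:
    """Shared gene content (%) relative to the genome with the fewest genes."""
    return 100.0 * n_homologues / min(n_genes_a, n_genes_b)


def aai_score(aai: float, shared: float) -> float:
    """AAI score as defined in the text: AAI plus shared gene content."""
    return aai + shared


def most_similar(hits: Dict[str, Tuple[float, float]]) -> str:
    """Return the reference with the highest AAI score.

    `hits` maps a reference genome name to (AAI, shared gene content).
    """
    return max(hits, key=lambda ref: aai_score(*hits[ref]))
```

Because the score sums both quantities, a reference with a slightly lower AAI can still be selected when it shares substantially more genes with the query.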
Genomic and assembly properties. Genomic and assembly properties (for example, GC, N50, coverage) were determined using CheckM. Transfer RNAs were identified with tRNAscan-SE 67 v.1.3.1 using either the bacterial or archaeal tRNA model and default parameters.
Data availability. The UBA genomes have been deposited under NCBI BioProject PRJNA348753. Individual genomes have been deposited at DDBJ/ENA/GenBank and accession numbers are provided in Supplementary Table 2. The initial versions of these genomes are described in this paper.