Quantitatively Partitioning Microbial Genomic Traits among Taxonomic Ranks across the Microbial Tree of Life

Recently, there has been great progress in defining a complete taxonomy of bacteria and archaea, which has been enabled by improvements in DNA sequencing technology and new bioinformatic techniques. A new, algorithmically defined microbial tree of life describes those linkages, relying solely on genetic data, which raises the issue of how microbial traits relate to taxonomy. Here, we adopted cluster of orthologous group functional categories as a scheme to describe the genomic contents of microbes, a method that can be applied to any microbial lineage for which genomes are available. This simple approach allows quantitative comparisons between microbial genomes with different gene compositions from across the microbial tree of life. Our observations demonstrate statistically significant patterns in cluster of orthologous group functional categories at taxonomic levels that span the range from domain to genus.

vided the ability to fortify those phenotype-based taxonomies with quantitative determinations of differences between marker genes, but canonical taxonomies such as the NCBI taxonomy continue to "reflect the current consensus in the systematic literature," which ultimately derives from trait-based taxonomies (4). Recently, Parks et al. (5) formalized the Genome Taxonomy Database (GTDB), a phylogeny in which taxonomic ranks are defined by "relative evolutionary divergence" in order to create taxonomic ranks that have uniform evolutionary meaning across the microbial tree of life (5). This approach removes phenotype or traits entirely from taxonomic assignment, as evolutionary distance is calculated from the alignment of 120 and 122 concatenated, universal proteins found in all bacterial and archaeal lineages, respectively. An investigation of the relationship between traits and phylogeny was possible until the recent publication of a microbial tree of life that is based solely on evolutionary distance. Thus, we ask the following question: to what extent does GTDB phylogeny predict microbial traits?
Comparing phenotypic characteristics of microorganisms across the tree of life is not currently possible, because most organisms and lineages currently lack cultured representatives (6,7). We therefore used the abundance of different clusters of orthologous groups (COGs) in microbial genomes, a proxy for phenotype which is available for all microorganisms for which genomes are available. The clusters of orthologous groups (COGs) method represents a classification scheme that defines protein domains based on groups of proteins sharing high sequence homology (8). More than ϳ5,700 COGs have been identified to date. COGs are placed into 1 of 25 metabolic functional categories (COG-FCs), with each representing a generalized metabolic function (e.g., "lipid transport and metabolism" or "chromatin structure and dynamics"). Our analyses quantify the degree to which taxonomic rank (genus through domain) predicts the COG-FC content of genomes and illustrate which lineages are relatively enriched or depleted in specific COG-FCs. These analyses constitute a step toward better understanding of how evolutionary processes influence the distribution of metabolic traits across taxonomy as well as a step toward being able to probabilistically predict the metabolic or functional similarity of microbes given their taxonomic classification.

RESULTS
The genomes analyzed in this work were compiled from a variety of different sources, including RefSeq v92, the JGI Integrated Microbial Genomes and Microbiomes (IMG/M) database, and GenBank, in order to include genomes created using diverse sequencing and assembly techniques. The integration of the RefSeq v92, JGI IMG/M, and GenBank databases resulted in a total of 119,852 genomes within the customcurated database. Raw data, GTDB taxonomy, and associated accession numbers are provided in Data Set S1 (available at https://zenodo.org/record/3361565). Among these genomes, we included only those that satisfied a set of criteria designed to ensure that each genus contained enough genomes to allow statistically robust analysis (see Materials and Methods). This resulted in a set of 13,735 lineages, representing 22 bacterial phyla and 4 archaeal phyla, 67% of which have been grown in culture (Table 1).
Most predicted open reading frames for most lineages could be assigned to a COG-FC. Across all phyla, an average of 84.3% Ϯ 7.8% of open reading frames were assigned to a COG-FC (see Fig. S1 in the supplemental material). Genomes of the same phylum tended to group together in an initial principal-component analysis (PCA) of raw COG-FC abundance (data not shown). Since this analysis was based on the absolute abundances of COG-FCs in genomes, rather than on the relative abundances, we hypothesized that the relationship between COG-FC abundance and phylum was largely a consequence of genome size, which is phylogenetically conserved (9). Consistent with this possibility, position on PC 1 correlated closely with genome size (R 2 ϭ 0.88; Fig. 1A). We therefore normalized each COG-FC abundance value, for each genome, to a prediction of COG-FC abundance as a function of genome size derived from a generalized additive model (GAM) (Fig. S2; summary statistics are listed in Table S1 in the supplemental material). The data associated with each GAM model were statistically significant (P Ͻ 0.001), and all but five COG-FCs had deviance-explained values (analogous to adjusted R 2 values) of more than 50%. We interpret analyses of these genome size-normalized data sets as reflecting the enrichment or depletion of COG-FC abundance, relative to that expected for a given genome size, and thus, the data are defined as COG-FC relative abundances. PCA of these COG-FC relative abundances showed that species-level lineages still tended to group by phylum, even though the interphylum gradients in genome size were no longer apparent ( Fig. 1B and C). Note that attempts were made to normalize by genome size alone; however, these attempts failed to properly remove the influence of genome size. We hypothesize that this was due to the nonlinear response in COG-FC abundances as a function of genome size.
To quantify the degree to which taxonomic rank explains the distribution of COG-FC relative abundances among individual genomes, we performed permutation multivariate analysis of variance (PERMANOVA) using the taxonomic ranks of domain, phylum, class, order, family, and genus, as well as culture status (cultured versus uncultured lineage). The rank of species was excluded from the analysis as every lineage was unique, and thus, species would explain 100% of the data. Every rank significantly influenced the distribution of COG-FC relative abundances (P Ͻ 0.001), but the fractions of variance that the various ranks explained differed substantially: phylum explained the most variance (14.6%), followed by order (9.2%), genus (5.5%), family (4.8%), and class (4.1%). Domain explained only 3.1% of the variances in COG-FC relative abundances, the least of any taxonomic rank. Culture status was a significant correlate of COG-FC abundance (P Ͻ 0.001) but had virtually no explanatory power, accounting for only Ͻ0.001% of the variance. This observation is consistent with the idea that no particular COG-FC relative abundance level was systematically higher or lower in uncultured microbes relative to cultured microbes. The variability in COG-FC relative abundances across different phyla was explored in addition to mean COG-FC composition for individual phyla (see Fig. 3). The evolutionary distance in COG-FC content was measured for all lineages in respect to the phylum COG-FC centroid (see Fig. 3A). The variations in calculated distances for all lineages within a given phylum were compared across the entire phylum (see Fig. 3A). Among all phyla, the Crenarchaeota differed the most from the phylum centroid, indicating the largest amount of genomic variation in terms of COG-FC content, followed by Patescibacteria and Cyanobacterota. The least variable phyla were the Synergistota, Marinisomatota, and Fibrobacterota phyla, respectively (see Fig. 3A). We explored the possibility that the amount of variance in lineages from the phylum centroid was a function of the number of lineages in the phylum. In other words, did the COG-FC content of some genomes seem less variable simply because they had been undersampled? A plot of the average distance of lineages from their phylum's centroid (i.e., the center of the mass of all genomes in the trait space) versus the number of lineages in the phylum reveals that increased sampling caused an apparent increase in the variability of traits within a phylum. This increase in variability across the phylum began to resemble an asymptotic line after approximately 100 genomes were sampled (see Fig. 3B). We modeled the data using both a saturating model (equation 1) and a linear model to test this observation. The saturating model described the relationship substantially better than a linear regression, as determined by the Akaike information criterion (AIC; ΔAIC ϭ 10.5). Coefficient A of the saturating model, which represents the value of the asymptote, was estimated to be 0.75 Ϯ 0.15 (P Ͻ 0.001). Coefficient B, which represents how quickly the function approaches the asymptote, was 0.43 Ϯ 0.30 (P ϭ 0.17). Coefficient C, representing an offset employed to address the fact that all the log-transformed distances have negative . Individual data points are colored by genome size in panels A and B. The data presented in panel A were not normalized by genome size, while the data in panels B and C were normalized by genome size. Black contours on panels B and C correspond to density plots for all genomes shown in panel B. Colored contours in panel C correspond to the respective lineage labels. For panel A, PC1 explained 71% and PC2 explained 7.0% of the variance. For panels B and C, PC1 explained 21% and PC2 explained 16% of the variance. Panel C corresponds to only the top 10 most abundant phyla analyzed as described for Table 1, while the remaining contours are shown in Fig. S3. values, was Ϫ1.63 Ϯ 0.14 (P Ͻ 0.001). This means that observing approximately 100 lineages in a phylum is sufficient to assess the variance in trait space representing half of all potential variance for that phylum (0.13). Note that this represents the effect of incorporation of the shift parameter, coefficient C.
We sought a qualitative determination of how the distribution of COG-FC relative abundances related to phylogeny. To achieve this, we quantified the average COG-FC relative abundances for all COG-FCs in each genus These values were then visualized on a genus-level phylogenetic tree (see Fig. 4) utilizing concatenated ribosomal protein sequences published previously by Parks et al. (5). Data underlying the genus-level phylogenetic tree (see Fig. 4) are presented in Table S2. Several notable features appear in COG-FC relative abundances at the phylum level. For example, among the four archaeal phyla represented here, Thermoplasmatota appears unique, with high COG-FC relative abundances in cell motility and depletion in every other category. In general, the COG-FC content of bacterial lineages appeared more variable than that of the archaeal lineages at all taxonomic resolutions. The clade consisting of Bacteroidota, Spirochaetota, and Verrucomicrobiota was notably depleted in the less variable COG-FCs, including energy production and conversion, amino acid transport and metabolism, and carbohydrate transport and metabolism, among others. Another prominent feature is the nearly ubiquitous elevation in COG-FC relative abundances of cell motility; secondary metabolite biosynthesis, transport, and catabolism; lipid transport and metabolism; and intracellular trafficking, secretion, and vesicular transport COGs in Proteobacteria. A notable dichotomy in the COG-FC relative abundances of RNA processing and modification within the Proteobacteria mirrors the division of the two largest clades within the Proteobacteria. Overall, the relative abundance data appear qualitatively consistent with phylogenetic relationships, albeit they occur on different taxonomic levels.
The relationship between individual COG-FC relative abundances and taxonomic ranks appeared largely variable (see Fig. 4). For instance, most of the variation in the RNA processing and modification relative abundances occurred at higher taxonomic ranks such as phylum and class whereas most of the variation in the secondary metabolite biosynthesis, transport, and catabolism relative abundances occurred at lower ranks such as order. To quantify this relationship, we applied a variance component model to determine the proportions of the variance explained by different taxonomic ranks (see Fig. 5). Domain and culture status were excluded from this analysis, as determinations of the level of variance explained become imprecise when a factor includes fewer than 5 groups (10). Consistent with the PERMANOVA results (Fig. 2), COG-FC relative abundances were best explained by the taxonomic rank of phylum. In contrast to the PERMANOVA results, the class taxonomic rank appeared to have reasonable explanatory power for a select set of COG-FCs. In general, the overall explanatory power for taxonomic rank appeared to decrease at the lower taxonomic ranks.
Last, to gain a sense of "notable" COG-FCs associated with different phyla, we calculated the mean COG-FC across all lineages in a given phyla and compared these values against the 85th and 15th percentiles for all lineages in our custom-curated database. All COG-FCs whose values were significantly (P Ͻ 0.05; based on a 10 5 iteration bootstrap analysis) greater or lesser than those calculated for the 85th or 15th percentile, respectively, are shown in Table 2 (see also Table 3). Each archaeal phylum was enriched or depleted in three to nine COG-FCs, whereas most bacterial phyla were enriched or depleted in three to four COG-FCs. A few exceptions arose, including Fibrobacterota (depleted in eight COG-FCs), Nitrospirota A (enriched in four and depleted in five), and Proteobacteria (the only phylum not heavily enriched or depleted in any COG-FCs). Relative abundance data, along with associated GTDB taxonomic assignments used for generating data presented here (see Fig. 4), are provided in Table S2.

DISCUSSION
We observed that the abundance of COG-FCs within individual lineages tentatively grouped according to phylum after variable reduction was performed via PCA (data not shown). Furthermore, PCA scores along PC1 correlated strongly with genome size (R 2 ϭ 0.88; Fig. 1A). The conserved nature of genome size across phylogeny (9) implies that phylogenic groupings may be an artifact of genome size. Thus, normalization of COG-FC abundances by genome size was performed to properly characterize the relationship between COG-FC and phylogeny. We performed the normalization using the slope from a GAM regression which modeled COG-FC abundance as a function of genome size. The COG-FC normalization removed the influence of genome size (R 2 ϭ 0;   Table 3 for COG-FC numbering key. The PERMANOVA (Fig. 2) and analysis of diversity of genomic composition within phyla (Fig. 3) showed that the microbial lineages exhibited characteristic relative abundances of COG-FC and that the extent of variation varied among taxonomic ranks. Among all the taxonomic ranks, phylum was the most powerful predictor of COG-FC relative abundances, which is consistent with observations that phylum can be informative of microbial function (see, e.g., references [11][12][13]. Lower taxonomic ranks such as genus and family had approximately half the explanatory power shown by the phylum taxonomic rank. Many studies have focused on metabolic coherence of individual traits and have regularly found traits conserved on the family level (2,3,14). The discrepancy between previous observations and our observation likely relates to how we characterized patterns in metabolic potential. These studies characterized trait function based on phenotype observation, protein structures, and pathway components. Such characterizations are effective metrics for characterizing finer units of taxonomy, such as genus, but do not scale to coarser units of taxonomy, such as

Lineages in Phylum
Average log 10 -Distance to Phylum Centriod phylum. In contrast, COG-FCs provide a coarse metabolic description which scales with the coarser units of taxonomy (35). The trade-off represented by the approach used here is that, by analyzing COG-FCs, we lose information about specific genes or potential metabolic functions but gain the ability to apply a consistent analysis across an entire genome and across the entire microbial tree of life. Thus, the extent to which observed patterns (Fig. 1) reflect phenotypically expressed differences among lineages is unknown. Nonetheless, the statistical robustness of analyses based on the relationship between all taxonomic ranks and COG-FC patterns suggests that evolutionary processes (e.g., horizontal gene transfer, vertical gene transfer, duplications, deletions, etc.) control the preponderance of different COG-FCs across lineages.
The roles that individual evolutionary processes play in influencing COG-FC relative abundances at a given taxonomic rank likely differ. For instance, horizontal gene transfer is more common among the more closely related lineages (15) and thus likely promotes increased levels of similarity at the lower taxonomic ranks. At higher taxonomic ranks, vertical processes may be more important. The asymptote in the mean log 10 distance from the centroid as a function of lineages in a phylum suggests that identifying more lineages for larger numbers of poorly represented lineages should expand the diversity of COG-FCs that are found, whereas phyla that were adequately sampled (with at least ϳ1,000 lineages) exhibited comparable levels of variability in COG-FC distributions (Fig. 3B). Since many more than 1,000 distinct lineages of each phylum are likely to exist (16), we propose that the taxonomic rank of phylum implies a fairly consistent degree of diversity in COG-FC distribution. To the extent that phenotype matches genotype at the level of COG-FC distributions, we therefore expect that typical phyla exhibit similar levels of phenotypic diversity. A notable exception is the phylum Crenarchaeota, whose members were far more diverse than would be expected based on the number of lineages sampled. The Crenarchaeota, as defined in the GTDB, collapsed members of several phyla that had been designated separately under previous taxonomies, including lineages that had previously been assigned as Crenarchaeota, Thaumarchaeota, Euryarchaeota, Verstraetearchaeota, Korarchaeota, and Bathyarchaeota (5). It is possible that the relationship between the marker genes used in the GTDB and those in the rest of the genome is unusual for this clade, compared to other phyla, or that the GTDB classification of Crenarchaeota is lacking in some other way.
Although the genus and family ranks explained relatively little of the variance in COG-FC distribution, examples of consistent colored blocks, indicating that higher or lower relative abundances of specific COG-FCs were conserved across each taxonomic rank in some parts of the phylogenetic tree, are evident in Fig. 4 at every taxonomic resolution. This is explained by the finding of "distantly" related clades (i.e., non-sister clades) occupying similar COG-FC trait space. Our variance component model accounted for the hierarchical nature of taxonomic lineage by partitioning the levels of explanatory power that individual taxonomic ranks had for individual COG-FC relative abundances (Fig. 5). Consistent with Fig. 4, different COG-FCs appeared most controlled at different taxonomic ranks. For instance, the coenzyme transport and metabolism COG-FC was almost entirely explained by the taxonomic rank of phylum. This observation is consistent with previous assessments suggesting that enzyme cofactors are deeply conserved at the phylum level (17,18). Similarly, the carbohydrate transport and metabolism COG-FC was best explained the taxonomic ranks genus and family, which is consistent with previous observations revealing that large amounts of variability exist for hydrolase traits at the lower taxonomic ranks (1-3). Ultimately, the variability in explanatory power for COG-FCs represented by the different taxonomic ranks supports the notion that evolutionary processes operate on microbial metabolisms at different timescales depending on which component of the metabolism is in question.
The coherence in metabolic potential at higher taxonomic ranks may help explain the distribution of microbial clades across ecological niches. Analyses of habitat associations (1,9,19) found phylum-level patterns in lineages occupying niches, which supports the idea that there is a relationship between higher taxonomic ranks, metabolism, and niche. Our analysis provided quantitative evidence supporting this idea by demonstrating coherence in metabolic potential with broad-scale patterns in genomic data ( Fig. 1; see also Fig. 5). The following question remains: how well do the observed COG-FC relative abundances reflect expressed functional traits (i.e., phenotypes) across these lineages? It is difficult to address this question systematically, but some of the relative abundances and depletions reported here appear consistent with known physiologies of clades. For instance, Rickettsiales were depleted in nucleotide metabolism and transport, consistent with a previously observed lack of a metabolic pathway for purine synthesis among five example Rickettsiales species (20). Another example is the depletion in the COG-FCs of energy production and conversion, amino acid transport and metabolism, and carbohydrate transport and metabolism within the Bacteroidetes, Spirochaetes, and Chlamydiales clade (NCBI taxonomy to be consistent with citing literature). This clade is known to contain many host-dependent pathogens and symbionts (21-23) which are often depleted in these COG-FCs (24). The GTDB classification is the first fully algorithmic and quantitatively self-consistent microbial taxonomy that can be applied across the tree of life (5). By standardizing the meaning of taxonomic ranks, it creates an objective basis on which to compare microbial functionality to phylogeny. The analyses presented here demonstrate that compositional patterns exist for genomic traits which can be explained by different taxonomic ranks. Furthermore, the proportion of variance explained for individual COG-FCs was partitioned as a function of taxonomic ranks. These quantitative relationships allude to the idea that evolutionary processes operate on different timescales for different components of microbial metabolisms and support previous suggestions proposing that a relationship exists between higher taxonomic ranks, metabolism, and ecological niches.

MATERIALS AND METHODS
Genome database curation. All bacterial and archaeal genomes from the RefSeq database v92 (25), all uncultured bacterial and archaeal (UBA) metagenome-assembled genomes (MAGs) reported in Parks et al. (5,26), all bacterial and archaeal MAGs from the Integrated Microbial Genomes and Microbiomes (IMG/M) database, and all bacterial and archaeal single amplified genomes (SAGs) from IMG/M were curated into a single database. All genomic content within the curated database is referred to as representing a "genome(s)" for simplicity. Genomes were assigned taxonomy consistent with the Genome Taxonomy Database (GTDB) using the GTDB toolkit (GTDB-Tk) v0.2.1 (5). The GTDB-Tk taxonomic assignments were consistent with reference package GTDB r86. Lineages which, due to the absence of a reference lineage, did not receive a genus classification were excluded from analyses. In total, 6.1% of the total number of genomes from the initial database met this condition. Due to bias resulting from the superabundance of strains in specific clades (e.g., Escherichia coli), the lowest taxonomic rank considered during our analysis was species. The COG-FC relative abundances (see below) were averaged for all strains within a given species. An exception was made for lineages which shared a genus classification but lacked a species classification. In this scenario, each genome was treated as an independent lineage. In total, 10.9% of the total number of genomes analyzed (i.e., of those that had a genus assignment) met this condition. Last, only genomes belonging to genera with at least 10 unique species in the database were retained. This criterion ensured the availability of enough data to generate meaningful statistics during our PERMANOVA. The final database is summarized in Table 1. The genus-level phylogenetic tree was generated from concatenated protein sequence alignments published in Parks et al. (5).
COG functional category identification, enumeration, and normalization. Genes were predicted from individual genomes and translated into protein sequences using Prodigal v.2.6.3 (27). The resulting protein sequences were analyzed for COGs (8). COG position-specific scoring matrices (PSSMs) were downloaded from NCBI's Conserved Domain Database (27 March 2017 definitions). COG PSSMs were BLASTed against protein sequences with the reverse-position-specific BLAST (RPS-BLAST) algorithm (28). Following a previously reported protocol (28), we used an E value cutoff of 0.01 to assign COGs with RPS-BLAST. The retrieved COGs were assigned to their respective COG functional categories (COG-FCs; 25 in total), and the abundances of the functional categories were tabulated for each genome by the use of cdd2cog (29). The abundance values determined for the individual COG-FCs were normalized by the respective COG-FC standard deviations across all lineages. For the COG-FCs extracellular structures and nuclear structures, the standard deviation was 0. Consequently, data could not be normalized; thus, these two categories were discarded from all analyses.
COG-FC abundances were normalized by their respective regression slopes of COG-FC abundance for a given genome as a function of genome size. COG-FC abundances were modeled as a function of genome size for individual categories using a generalized additive model (GAM) with a smoothing term employed due to the pairwise response to genome size (see Fig. S1 in the supplemental material). We used the gam function from the R package mgcv (30). In some instances, regression fits were visibly skewed by high-leverage data points. High-leverage data were filtered using the influence.gam function in the mgcv package. Data in the 99.5% percentile for influence were excluded in regression analysis but were included in all downstream analyses. All regressions were significant, with P values of Ͻ0.001.
was set as the "centroid" (mean). The distance matrix used for both the adonis and betadisper analyses was generated by calculating Euclidean distances for the normalized COG-FC relative abundances.
The mean log 10 distance from the phylum centroid was determined for each phylum and modeled with the following equation, which represents a hyperbola shifted on the x axis to ensure that the mean distance value is zero when n ϭ 1: where A, B, and C represent fit coefficients and n represents the total number of lineages in the given phylum. The Akaike information criterion value was calculated with the fit from equation 1 using the AIC function from the R package stats (32). A variance component model was performed using the lme function from the R package nlme (34). The proportion of variance explained by the taxonomic ranks, phylum, class, order, family, and genus, was determined for each individual COG-FC. Domain and culture status were not evaluated due to imprecise results generated from factors that only have 2 groups (10). Lineage was treated as a random intercept, where individual taxonomic ranks were nested within one another in a hierarchical manner (R notation: ϳ1|phylum/class/order/family/genus). Confidence intervals were determined by performing a 500-iteration bootstrap analysis with the variance component model. During the bootstrap analysis, genomes were randomly sampled with replacement.
Data availability. The genomes analyzed for the current study are available in the NCBI RefSeq database (ftp://ftp.ncbi.nlm.nih.gov/refseq/release/). UBA MAGs used for the current study are available under NCBI's BioProject PRJNA417962 and PRJNA348753. Publicly available JGI IMG/M genomes can be downloaded from the Genome Portal (https://img.jgi.doe.gov/), while the private genomes were acquired from Chad Burdyshaw. Associated genome accession numbers for genomes in the described data sets are available in Data set S1 at https://zenodo.org/record/3361565 (https://doi.org/10.5281/zenodo.3361565).

ACKNOWLEDGMENTS
We thank Chad Burdyshaw of JICS for help obtaining the genomes used in this project.
Funding for this project was provided by a C-DEBI graduate fellowship to T.M.R. and by a kind grant of resources from the University of Tennessee/Oak Ridge National Lab Joint Institute for Computational Sciences (JICS) to A.D.S. This material is based on work supported by (i) the University of Tennessee, Knoxville College of Arts and Sciences; (ii) Tickle College of Engineering; (iii) the Joint Institute for Computational Sciences; and (iv) Intel Corporation through an Intel Parallel Computing Center award to support development of HPC-BLAST.
Any opinions, findings, conclusions, or recommendations expressed in this material are ours and do not necessarily reflect the views of the University of Tennessee or Intel Corporation. This is C-DEBI contribution 487.