From Genomes to Phenotypes: Traitar, the Microbial Trait Analyzer

Bacteria are ubiquitous in our ecosystem and have a major impact on human health, e.g., by supporting digestion in the human gut. Bacterial communities can also aid in biotechnological processes such as wastewater treatment or decontamination of polluted soils. Diverse bacteria contribute with their unique capabilities to the functioning of such ecosystems, but lab experiments to investigate those capabilities are labor-intensive. Major advances in sequencing techniques open up the opportunity to study bacteria by their genome sequences. For this purpose, we have developed Traitar, software that predicts traits of bacteria on the basis of their genomes. It is applicable to studies with tens or hundreds of bacterial genomes. Traitar may help researchers in microbiology to pinpoint the traits of interest, reducing the amount of wet lab work required.

in microbial community composition (2). Microbial community members with various metabolic capabilities can aid in wastewater treatment, bioremediation of soils, and promotion of plant growth (3)(4)(5); in the cow rumen microbiota, bacterial cellulose degraders influence the ability to process plant biomass material (6). In the tammar wallaby foregut microbiome, the dominant bacterial species is implicated in the lower methane emissions produced by wallabies than by ruminants (7).
In addition to the exponential growth of available sequenced microbial genome isolates, metagenome and single-cell genome sequencing further contributes to the increasing number of available genomes. For the recovery of genomes from metagenomes (GFMs), computational methods based on, e.g., differential read coverage and k-mer usage were developed (8)(9)(10)(11)(12)(13) that allow the recovery of genomes without the need to obtain microbial isolates in pure culture (6,14). In addition, single-cell genomics provides another culture-independent analysis technique and also allows genome recovery, although often fragmented, for less abundant taxa in microbial communities (15,16). Together, these developments profoundly shift the analytical bottleneck from data generation to interpretation.
The genotype-phenotype relationships for some microbial traits have been well studied. For instance, bacterial motility is attributed to the proteins of the flagellar apparatus (17). We have recently shown that delineating such relationships from microbial genomes and accompanying phenotype information with statistical learning methods enables the accurate prediction of the plant biomass degradation phenotype and the de novo discovery of both known and novel protein families that are relevant for the realization of the plant biomass degradation phenotype (18,19). However, a fully automated software framework for prediction of a broad range of traits from only the genome sequence is currently missing. Additionally, horizontal gene transfer, a common phenomenon across bacterial genomes, has not been utilized to improve trait prediction so far. Traits with their causative genes may be transferred from one bacterium to another (20,21) (e.g., for antibiotic resistances [22]), and the vertically transferred part of a bacterial genome might be unrelated to the traits under investigation (2,23,24).
Here we present Traitar, the microbial trait analyzer, an easy-to-use, fully automated software framework for the accurate prediction of currently 67 phenotypes directly from a genome sequence (Fig. 1). We used phenotype data from the microbiology section of the Global Infectious Disease and Epidemiology Online Network (GIDEON)-a resource dedicated to the diagnosis, treatment, and teaching of infectious diseases and microbiology (25)-for training phenotype classification models on the protein family annotation of a large number of sequenced genomes of microbial isolates (predominantly bacterial pathogens). We investigated the effect of incorporating ancestral protein family gains and losses into the model inference on classification performance to allow consideration of horizontal gene transfer events in the inference of phenotype-related protein families and phenotype classification. We rigorously tested the performance of our software in cross-validation experiments, on further test data sets and for different taxonomic ranks. To test Traitar's applicability beyond the bacteria represented in GIDEON, we subsequently applied it to several hundred bacteria described in Bergey's Manual of Systematic Bacteriology (1). We used Traitar to phenotype bacterial single amplified genomes (SAGs) and simulated incomplete genomes to investigate its potential for the phenotyping of microbial samples with incomplete genome sequences. We characterized two novel Clostridiales species of a biogas reactor community with Traitar on the basis of their genomes recovered with metagenomics. This verified and complemented a manual metabolic reconstruction. As Traitar furthermore suggests protein families associated with the presence of a particular phenotype, we discuss the protein families Traitar identified for several phenotypes, namely, for motility, nitrate-to-nitrite conversion, and L-arabinose fermentation.

RESULTS
The Traitar software. We begin with a description of the Traitar software and phenotype classifiers. Traitar predicts the presence or absence of a phenotype, i.e., assigns a phenotype label, for 67 microbial traits to every input sequence sample (Table 1; see  Table S1 in the supplemental material). For each of these traits, Traitar furthermore suggests candidate protein families associated with its realization, which can be subjects of experimental follow-up studies.
For phenotype prediction, Traitar uses one of two different classification models. We trained the first classifier-the phypat classifier-on the protein and phenotype presence and absence labels from 234 bacterial species (see phenotype models in Materials and Methods). The 2nd classifier-the phypatϩPGL classifier-was trained by using the same data and additionally information on evolutionary protein family and phenotype gains and losses. The latter were determined by using maximum-likelihood inference of their ancestral character states on the species phylogeny (see ancestral protein family and phenotype gains and losses in Materials and Methods).
The input to Traitar is either a nucleotide sequence FASTA file for every sample, which is run through gene prediction software, or a protein sequence FASTA file. Traitar then annotates the proteins with protein families. Subsequently, it predicts the presence or absence of each of the 67 traits for every input sequence. Note that Traitar does not require a phylogenetic tree for the input samples. Finally, it associates the predicted phenotypes with the protein families that contributed to these predictions (Fig. 2). A parallel execution of Traitar is supported by GNU parallel (26). The Traitar annotation procedure and the training of the phenotype models are described in more detail below (see Traitar software in Materials and Methods).

Evaluation.
We evaluated the two Traitar classifiers by using 10-fold nested cross-validation of 234 bacterial species found in GIDEON (GIDEON I). The macroaccuracy (the accuracy balanced over all phenotypes) determined for the 67 GIDEON  phenotypes was 82.6% for the phypat classifier and 85.5% for the phypatϩPGL classifier; the accuracy (fraction of correct assignments averaged over all of the samples tested) for phypat was 88.1%, in comparison to 89.8% for phypatϩPGL (see evaluation metrics in Materials and Methods; Table 2). Notably, Traitar classified 53 phenotypes with Ͼ80% macroaccuracy and 26 phenotypes with at least 90% macroaccuracy with one of the two classifiers ( Fig. 3; see Table S2 in the supplemental material). Phenotypes that could be predicted with very high confidence included the outcome of a methyl red test, spore formation, oxygen requirement (i.e., anaerobe and aerobe), and growth on MacConkey agar or catalase. Some phenotypes proved to be difficult to predict (60 to 70% macroaccuracy), which included DNase, myo-inositol, yellow pigment, and tartrate utilization, regardless of which classifier was used. This might be caused by the relatively small number (Ͻ20) of positive (phenotype present) examples that were available.
For an independent assessment of Traitar's classification performance, we next tested Traitar with 42 bacterial species that had phenotype information available in GIDEON (GIDEON II) but were not used for learning the phenotype models (see annotation in the Traitar software). For calculation of macroaccuracy, we considered only phenotypes represented by at least five phenotype-positive and five phenotypenegative bacteria. On these data, Traitar predicted the phenotypes with a macroaccuracy of 85.3% with the phypat classifier and 86.7% with the phypatϩPGL classifier and accuracies of 87.5% and 87.9%, respectively ( Table 2). To investigate the performance of Traitar for bacterial genomes from a different data source, we next determined from two volumes of Bergey's Manual of Systematic Bacteriology, namely, the Proteobacteria and the Firmicutes, the phenotypes of further sequenced bacteria that were not in our GIDEON I and II data sets (see Tables S1 and S4 in the supplemental material). In total, we thus identified phenotypes for another 296 sequenced bacterial species (see annotation in the Traitar software). Also for these bacteria, Traitar performed well but was less reliable than before, with accuracies of 72.9% for the phypat classifier and  72.1% for the phypatϩPGL classifier ( Table 2). This is likely due to the taxonomic differences among the bacteria listed in GIDEON and Bergey's Manual of Systematic Bacteriology and also because most of the bacteria in Bergey's Manual of Systematic Bacteriology have only draft genomes available for phenotyping.
When combining the predictions of the phypat and phypatϩPGL classifiers into a consensus vote, Traitar assigns phenotypes more reliably, while predicting fewer phenotype labels than the individual classifiers ( Table 2). Depending on the use case, Traitar can be used with performance characterized by different tradeoffs between the recall of the phenotype-positive and phenotype-negative classes.
Performance per taxon at different ranks of the taxonomy. We investigated the performance of Traitar across the part of the bacterial tree of life represented in our data set. For this purpose, we evaluated the nested cross-validation performance of the phypat and phypatϩPGL classifiers at different ranks of the National Center for Biotechnology Information (NCBI) taxonomy. For a given GIDEON taxon, we pooled all of the bacterial species that are descendants of this taxon. Figure 4 shows the accuracy estimates projected on the NCBI taxonomy from the domain level down to individual families. Notably, the accuracy of the phypatϩPGL (phypat) classifier for the phyla covered by at least five bacterial species showed low variance and was high across all of the phyla, i.e., 84% (81%) for Actinobacteria, Ͼ90% (89%) for Bacteroidetes, 89% (90%) for Proteobacteria, 91% (90%) for Firmicutes, and 91% (86%) for Tenericutes.
Phenotyping of incomplete genomes. GFMs or SAGs are often incomplete, and thus we analyzed the effect of missing genome assembly parts on the performance of Traitar. Rinke et al. used a single-cell sequencing approach to analyze poorly characterized parts of the bacterial and archaeal tree of life, the so-called microbial dark matter (16). They pooled 20 SAGs from the "Candidatus Cloacimonetes" phylum, formerly known as WWE1, to generate joint-more complete-genome assemblies that had at least a genome-wide average nucleotide identity of 97% and belonged to a single 16S rRNA gene-based operational taxonomic unit, namely, "Candidatus Cloacamonas acidaminovorans" (27).
According to our predictions based on the joint assembly of the single-cell genomes, "Candidatus Cloacamonas acidaminovorans" is Gram negative and is adapted to Only the overall accuracy is reported, as insufficient phenotype labels (fewer than five with negative and positive labels, respectively) were available for several phenotypes, to enable a comparable macroaccuracy calculation to the other data sets (see Table S1 in the supplemental material).
an anaerobic lifestyle, which agrees with the description of Rinke et al. (Fig. 5). Traitar further predicted arginine dihydrolase activity, which is in line with the characterization of the species as an amino acid degrader (16). Remarkably, the prediction of a bacillus or coccobacillus shape agrees with the results of Limam et al. (28), who used a  Table 1; see Table S1 in the supplemental material).
WWE1-specific probe and characterized the samples by fluorescence in situ hybridization. They furthermore reported that members of the "Candidatus Cloacimonetes" phylum are implicated in the anaerobic digestion of cellulose primarily in early hydrolysis, which is in line with the very limited carbohydrate degradation spectrum found by Traitar. Subsequently, we compared the predicted phenotypes for the SAGs to the predictions for the joint assembly. The phypat classifier recalled more of the phenotype predictions of the joint assembly based on the SAGs than the phypatϩPGL classifier. However, the phypatϩPGL classifier made fewer false-positive predictions (Fig. 6a).
In the next experiment, we inferred phenotypes based on simulated GFMs by subsampling from the coding sequences of each of the 42 bacterial genomes (GIDEON II). Starting with the complete set of coding sequences, we randomly deleted genes from the genomes. For the draft genomes obtained with different degrees of completeness, we reran the Traitar classification and computed the accuracy measures as before. We observed that the average fraction of phenotypes identified (macrorecall for the positive class) of the phypatϩPGL classifier dropped more quickly with more missing coding sequences than that of the phypat classifier (Fig. 6b). However, at the same time, the recall of the negative class of the phypatϩPGL classifier improved with a decreasing number of coding sequences, meaning that fewer but more reliable predictions were made.
Overall, the tradeoffs in the recall of the phenotype-positive and phenotypenegative classes of the two classifiers resulted in a similar overall macroaccuracy across the range of tested genome completeness. Thus, depending on the intended use, a particular classifier can be chosen. We expect that the reliable predictions inferred with the phypatϩPGL classifier and the more abundant but less reliable predictions made with the phypat classifier will complement one another in different use cases for partial genomes recovered from metagenomic data.
By analyzing the protein families with assigned weights and the bias terms of the two classifiers, we found the phypatϩPGL classifier to base its predictions primarily on the presence of protein families that were typical for the phenotypes. In contrast, the phypat classifier also took typically absent protein families from phenotype-positive genomes into account in its decision. More technically, the positive weights in models of the phypat classifier are balanced out by negative weights, whereas for the phypatϩPGL classifier, they are balanced out by the bias term. By downweighting the bias term for the phypatϩPGL classifier by the protein content completeness, we could show that the accuracy of the phypat classifier could be exceeded by the phypatϩPGL classifier, regardless of the protein content completeness (data not shown). However, this requires knowledge of the protein content completeness for each genomic sample, which could be indirectly estimated by using methods such as checkM (29).
Traitar as a resource for gene target discovery. In addition to phenotype assignment, Traitar suggests the protein families relevant for the assignment of a phenotype (see majority feature selection in Materials and Methods, Table 3). Here, as an example, we demonstrate this capability for three phenotypes that are already well studied, namely, motile, nitrate-to-nitrite conversion, and L-arabinose metabolism. These phenotypes each represent one of the phenotype categories morphology, enzymatic activity, and growth on sugar. In general, we observed that the protein families important for classification can be seen to be gained and lost jointly with the respective phenotypes within the microbial phylogeny (Fig. 7).
Among the selected Pfam families that are important for classifying the motility phenotype were proteins of the flagellar apparatus and chemotaxis-related proteins ( Table 3). Motility allows bacteria to colonize their preferred environmental niches. Genetically, it is attributed mainly to the flagellum, which is a molecular motor, and is closely related to chemotaxis, a process that lets bacteria sense chemicals in their surroundings. Motility also plays a role in bacterial pathogenicity, as it enables bacteria to establish and maintain an infection. For example, pathogens can use flagella to adhere to their host and have been reported to be less virulent if they lack flagella (30). Of the 48 flagellar proteins described in reference 31, 4 (FliS, MotB, FlgD, and FliJ) were sufficient for accurate classification of the motility phenotype and were selected by our classifier, as well as FlaE, which was not included in this collection. FliS (accession no. PF02561) is a known export chaperone that inhibits early polymerization of the flagellar filament FliC in the cytosol (32). MotB (PF13677), part of the membrane proton-channel complex, acts as the stator of the bacterial flagellar motor (33). Traitar also identified further protein families related to chemotaxis, such as CZB (PF13682), a family of chemoreceptor zinc-binding domains found in many bacterial signal transduction proteins involved in chemotaxis and motility (34), and the P2 response regulatorbinding domain (PF07194). The latter is connected to the chemotaxis kinase CheA and is thought to enhance the phosphorylation signal of the signaling complex (35).
Nitrogen reduction in nitrate-to-nitrite conversion is an important step of the nitrogen cycle and has a major impact on agriculture and public health. Two types of nitrate reductases are found in bacteria, the membrane-bound Nar and periplasmic Nap nitrate reductases (36), both of which we found to be relevant for the classification of  the phenotype. We identified all of the subunits of the Nar complex as being relevant for the nitrate-to-nitrite conversion phenotype (i.e., the gamma and delta subunits [PF02665, PF02613]), as well as Fer4_11 (PF13247), which is in the iron-sulfur center of the beta subunit of Nar. The delta subunit is involved in the assembly of the Nar complex and is essential for its stability but probably is not directly part of it (37). Traitar also identified the molybdopterin oxidoreductase Fe 4 S 4 domain (PF04879), which is bound to the alpha subunit of the nitrate reductase complex (37). Traitor furthermore suggested as relevant NapB (PF03892), which is a subunit of the periplasmic Nap protein, and NapD (PF03927), which is an uncharacterized protein implicated in Nap formation (36). L-Arabinose is major constituent of plant polysaccharides that is located, for instance, in pectin side chains and is an important microbial carbon source (38). Traitar identified the L-arabinose isomerase C-terminal domain (PF11762), which catalyzes the first step in L-arabinose metabolism-the conversion of L-arabinose into L-ribulose (39), as being important for realizing L-arabinose metabolism (Fig. 7). It furthermore suggested the C-terminal domain of ␣-L-arabinofuranosidase (PF06964), which cleaves nonreducing terminal ␣-L-arabinofuranosidic linkages in L-arabinose-containing polysaccharides (40) and is also part of the well-studied L-arabinose operon of Escherichia coli (39).

Phenotyping of biogas reactor population genomes.
We used Traitar to phenotype two novel Clostridiales species (unClos_1, unFirm_1) on the basis of their genomic information reconstructed from metagenome samples. These were taken from a commercial biogas reactor operating with municipal waste (41). The genomes of unClos_1 and unFirm_1 were estimated to be 91 and 60% complete, respectively, on the basis of contigs of Ն5 kb. Traitar predicted unClos_1 to utilize a broader spectrum of carbohydrates than unFirm_1 (Table 4). We cross-referenced our predictions with a metabolic reconstruction conducted by Frank et al. (64). We considered all phenotype predictions that Traitar inferred with either the phypat or the phypatϩPGL classifier. The manual reconstruction and predictions inferred with Traitar agreed to a great extent (Table 4). Traitar recalled 87.5% (6/7) of the phenotypes inferred via the metabolic reconstruction and also agreed to 81.8% (9/11) on the absent phenotypes. Notable exceptions were that Traitar found only a weak signal for D-xylose utilization. A weak signal means that only a minority of the classifiers in the voting committee assigned these samples to the phenotype-positive class (see phenotype models in Materials and Methods). However, the metabolic reconstruction was also inconclusive with respect to xylose fermentation. Furthermore, Traitar found only a weak signal for glucose fermentation by unFirm_1. While genomic analysis of unFirm_1 revealed the Embden-Meyerhof-Parnas (EMP) pathway, which would suggest glucose fermentation, gene-centric and metaproteomic analyses of this phylotype indicated that the EMP pathway was probably employed in an anabolic direction (gluconeogenesis); therefore, unFirm_1 is also unlikely to ferment D-mannose. This suggests that unFirm_1 is unlikely to ferment sugars and instead metabolizes acetate (also predicted by Traitar; Table 4) via a syntrophic interaction with hydrogen-utilizing methanogens.
Traitar predicted further phenotypes for both species that were not targeted by the manual reconstruction. One of these predictions was an anaerobic lifestyle, which is likely to be accurate, as the genomes were isolated from an anaerobic bioreactor environment. It also predicted them to be Gram positive, which is probably correct, as the Gram-positive sortase protein family can be found in both genomes. This is a Gram positivity biomarker (42). Furthermore, all Firmicutes known so far are Gram positive (1). Additionally, Traitar assigned motile and spore formation to unFirm_1 on the basis of the presence of several flagellar proteins (i.e., FlaE, FliM, MotB, FliS, and FliJ) and the sporulation proteins CoatF and YunB.

DISCUSSION
We have developed Traitar, a software framework for predicting phenotypes from the protein family profiles of bacterial genomes. Traitar provides a quick and fully automated way of assigning 67 different phenotypes to bacteria on the basis of the protein family contents of their genomes.
Microbial trait prediction from phyletic patterns has been proposed in previous studies for a limited number of phenotypes (18,19,(43)(44)(45)(46). To our knowledge, the only currently available software for microbial genotype-phenotype inference is PICA, which is based on learning associations of clusters of orthologous genes (47) with traits (45). Recently, PICA was extended by Feldbauer et al. for predicting 11 traits overall, optimized for large data sets, and tested on incomplete genomes (46). Of the 67 phenotypes that Traitar predicts, 60 are entirely novel. It furthermore includes different prediction modes, one based on phyletic patterns, one additionally including a statistical model of protein family evolution for its predictions. An initial prototype of the Traitar methodology was originally developed for prediction of the plant biomass phenotype, with excellent classification performance observed and providing suggestions of candidate domains for experimental verification (18). The methodology has since been adapted to the use of GIDEON and inclusion of phylogenetic signals, which is why the plant biomass predictor is not included in the Traitar release. This shows that, principally, given suitable training data, also very complex phenotypes can be learned and predicted with this methodology. Traitar also suggests associations between phenotypes and protein families. For three traits, we showed that several of these associations are to known key families of establishment of a particular trait. Furthermore, candidate families were suggested that might be relevant for particular traits and serve as targets for experimental studies. Some of the phenotypes annotated in GIDEON are specific for the human habitat (such as coagulase production or growth on ordinary blood agar), and the genetic underpinnings learned by Traitar could be interesting to study for infection disease research.
In cross-validation experiments with phenotype data from the GIDEON database, we showed that the Traitar phypat classifier has high accuracy in phenotyping bacterial samples. Consideration of ancestral protein family gains and losses in the classification, which is implemented in the Traitar phypatϩPGL classifier, improves the accuracy compared to prediction from phyletic patterns only, both for individual phenotypes and overall. Barker et al. were the first to note the phylogenetic dependence of genomic samples and how this can lead to biased conclusions (24). MacDonald et al. selected protein families on the basis of correlations with a phenotype and corrected for the taxonomy (45). Here we accounted for the evolutionary history of the phenotype and the protein families in the classifier training itself to automatically improve phenotype assignment. We additionally demonstrated the reliability of the performance estimates by phenotyping, with similar accuracy, an independent test data set with bacteria described in GIDEON that we did not use in the cross-validation. Traitar also reliably phenotyped a large and heterogenic collection of bacteria that we extracted from Bergey's Manual of Systematic Bacteriology-mostly with only draft genomes available. We did not observe any bias toward specific taxa in GIDEON, but some of the phenotypes might be realized with different protein families in taxa that are less well represented, as indicated by the around 15 to 20% less reliable phenotyping results for bacteria described in Bergey's Manual of Systematic Bacteriology. We expect that the accuracy of the phenotype classification models already available in Traitar will further improve as more data become available and can be incorporated into its training.
We found that Traitar can provide reliable insights into the metabolic capabilities of microbial community members even from partial genomes, which are very common for genomes recovered from single cells or metagenomes. One obvious limitation being for incomplete genomes, the absence of a phenotype prediction may be due to the absence of the relevant protein families from the input genomes. The analysis of both the SAGs and simulated genomes led us to the same conclusions, i.e., that the phypat classifier is more suitable for exploratory analysis, as it assigned more phenotypes to incomplete genomes at the price of more false-positive predictions. In contrast, the phypatϩPGL classifier assigned fewer phenotypes but also made fewer false assignments. At the moment, genotype-phenotype inference with Traitar only takes into account the presence and absence of protein families of the bacteria analyzed. This information can be readily computed from the genomic and metagenomic data. Future research could focus also on the integration of other omics data to allow even more accurate phenotype assignments. Additionally, expert knowledge of the biochemical pathways that are used in manual metabolic reconstructions, for example, could be integrated as prior knowledge into the model in future studies.
For the phenotyping of novel microbial species, generating a detailed (manual) metabolic reconstruction such as the one by Frank et al. (64) is time-intensive. Furthermore, such reconstructions are usually focused on specific pathways and are dependent on the research question. This is not an option for studies with tens or hundreds of genomes, which are becoming more and more common in microbiology (6,14,16). Traitar thus is likely to be particularly helpful for multigenome studies. It furthermore may pick up on things outside the original research focus and could serve as a seed or a first-pass method for a detailed metabolic reconstruction in future studies.

MATERIALS AND METHODS
The Traitar software. In this section, we first describe the Traitar annotation procedure. We proceed with the genome and phenotype data used for the training of Traitar phenotype models; afterward, we explain the training and illustrate how we considered ancestral protein family gains and losses in the models. Finally, we specify the requirements for running the Traitar software.
Annotation. In the case of nucleotide DNA sequence input, Traitar uses Prodigal (48) for gene prediction prior to Pfam family annotation. The amino acid sequences are then annotated in Traitar with protein families (Pfams) from the Pfam database (version 27.0) (49) by using the hmmsearch command of HMMER 3.0 (50).
Each Pfam family has a hand-curated threshold for the bit score, which is set in such a way that no false positive is included (51). A fixed threshold of 25 is then applied to the bit score (the log-odds score), and all Pfam domain hits with an E value above 10 Ϫ2 are discarded. The resulting Pfam family counts (phyletic patterns) are turned into presence or absence values, as we found this representation to yield favorable classification performance (18).
Genome and phenotype data. We obtained our phenotype data from the GIDEON database (25). In GIDEON, a bacterium is labeled either as phenotype positive, phenotype negative, or strain specific. In the latter case, we discarded this phenotype label. The GIDEON traits can be grouped into the categories such as the use of various substrates as sources of carbon and energy for growth, oxygen requirement, morphology, antibiotic susceptibility, and enzymatic activity ( Table 1; see Table S1 in the supplemental material). We considered only phenotypes that were available in GIDEON for at least 20 bacteria, with a minimum of 10 bacteria annotated as positive (phenotype presence) and 10 as negative (phenotype absence) for a given phenotype to enable a robust and reliable analysis of the respective phenotypes. Furthermore, for inclusion in the analysis, we required each bacterial sample to have (i) at least one annotated phenotype, (ii) at least one sequenced strain, and (iii) a representative in the sequenced tree of life (sTOL).
In total, we extracted 234 species-level bacterial samples with 67 phenotypes with sufficient total, positive, and negative labels from GIDEON (GIDEON I). GIDEON associates these bacteria with 9,305 individual phenotype labels, 2,971 being positive and 6,334 negative (see Tables S1 and S3 in the supplemental material). GIDEON species that had at least one sequenced strain available but were not part of the sTOL were set aside for a later independent assessment of classification accuracy. In total, this additional data set comprised a further 42 unique species with 58 corresponding sequenced bacterial strains (GIDEON II; see Tables S1 and S4). We obtained 1,836 additional phenotype labels for these bacteria, consisting of 574 positive and 1,262 negative ones. We searched the Firmicutes and Proteobacteria volumes of Bergey's Manual of Systematic Bacteriology specifically for further bacteria not represented so far in the GIDEON data sets (1). In total, we obtained phenotype data from Bergey's Manual of Systematic Bacteriology for 206 Firmicutes and 90 Proteobacteria with a total of 1,152 positive labels and 1,376 negative labels (see Tables S1 and S5). As in GIDEON, in Bergey's Manual of Systematic Bacteriology, the phenotype information is usually given on the species level.
We downloaded the coding sequences of all of the complete bacterial genomes that were available via the NCBI FTP server at ftp://ftp.ncbi.nlm.nih.gov/genomes/ as of 11 May 2014 and genomes available from the PATRIC database as of September 2015 (52). These were annotated with Traitar. For bacteria with more than one sequenced strain available, we chose the union of the Pfam family annotation of the single genomes to represent the pangenome Pfam family annotation, as in reference 53.
Phenotype models. We represented each phenotype from the set of GIDEON phenotypes across all genomes as a vector yp and solved a binary classification problem by using the matrix of Pfam phyletic patterns XP across all genomes as input features and yp as the binary target variable (see Fig. S1 in the supplemental material). For classification, we relied on support vector machines (SVMs), which are a well-established machine learning method (54). Specifically, we used a linear L1-regularized L2-loss SVM for classification as implemented in the LIBLINEAR library (55). For many data sets, linear SVMs achieve accuracy comparable to that of SVMs with a nonlinear kernel but allow faster training. The weight vector of the separating hyperplane provides a direct link to the Pfam families that are relevant for the classification. L1 regularization enables feature selection, which is useful when applied to highly correlated and high-dimensional data sets such as those used in this study (56). We used the interface to LIBLINEAR implemented in scikit-learn (57). For classification of unseen data points-genomes without available phenotype labels supplied by the user-Traitar uses a voting committee of five SVMs with the best single cross-validation accuracy (see cross-validation below). Traitar then assigns each unseen data point to the majority class (phenotype presence or absence class) of the voting committee.
Ancestral protein family and phenotype gains and losses. We constructed an extended classification problem by including ancestral protein family gains and losses, as well as the ancestral phenotype gains and losses in our analysis, as implemented in GLOOME (58). Barker et al. report that common methods for inferring functional links between genes that do not take the phylogeny into account suffer from high rates of false positives (24). Here, we jointly derived the classification models from the observable phyletic patterns and phenotype labels, and from phylogenetically unbiased ancestral protein family and phenotype gains and losses, which we inferred via a maximum-likelihood approach from the observable phyletic patterns on a phylogenetic tree, showing the relationships among the samples (see Fig. S1 in the supplemental material). Ancestral character state evolution in GLOOME is modeled via a continuous-time Markov process with exponential waiting times. The gain and loss rates are sampled from two independent gamma distributions (59). GLOOME needs a binary phylogenetic tree with branch lengths as the input. The taxonomy of the NCBI and other taxonomies are not suitable because they provide no branch length information. We used the sTOL (60), which is bifurcating and was inferred by a maximum-likelihood approach based on unbiased sampling of structural protein domains from whole genomes of all sequenced organisms (61). We employed GLOOME with standard settings to infer posterior probabilities for the phenotype and Pfam family gains and losses from the Pfam phyletic patterns of all of the NCBI bacteria represented in the sTOL and the GIDEON phenotypes. Each GIDEON phenotype p is available for a varying number of bacteria. Therefore, for each phenotype, we pruned the sTOL to those bacteria that were present in the NCBI database and had a label for the respective phenotype in GIDEON. The posterior probabilities of ancestral Pfam gains and losses were then mapped onto this GIDEON phenotype-specific tree (Gps-sTOL; see Fig. S2 in the supplemental material).
Let B be the set of all branches in the sTOL and P be the set of all Pfam families. We then denote the posterior probability g ij of an event a for a Pfam family pf to be a gain event on branch b in the sTOL computed with GLOOME as g ij ϭ P͑a ϭ gain | i ϭ b, j ϭ pf͒ ∀ i ʦ B, ∀ j ʦ P and the posterior probability of a to be a loss event for a Pfam family p on branch b as We established a mapping f:B=¡B between the branches of the sTOL B and the set of branches B= of the Gps-sTOL (see Fig. S2 in the supplemental material). This was achieved by traversing the tree from the leaves to the root.
There are two different scenarios for branch b= in B= to map to the branches in B.
(i) Branch b= in the Gps-sTOL derives from single branch b in the sTOL as follows: f(b=) ϭ {b}. The posterior probability of a Pfam gain inferred in the Gps-sTOL on branch b= consequently is the same as that on branch b in the sTOL: g b'j ϭ g bj ∀j⑀P. ( Inferring the Gps-sTOL Pfam posterior loss probabilities (l= ij ) from the sTOL posterior Pfam loss probabilities is analogous to deriving the gain probabilities. The posterior probability for a phenotype (p) to be gained (g= ip ) or lost (l= ip ) can be directly defined for the Gps-sTOL in the same way as for the Pfam gain and loss probabilities. For classification, we did not distinguish between phenotype or Pfam gains or losses, assuming that the same set of protein families gained with a phenotype will also be lost with the phenotype. This assumption simplified the classification problem. Specifically, we proceeded in the following way.
(i) We computed the joint probability x= ij of a Pfam family gain or loss on branch b= and the joint probability y j of a phenotype gain or loss on branch b=: (ii) Let x i be a vector representing the probabilities x= ij for all Pfam families jʦP on branch b i . We discarded any samples (x i , y i ) that had a probability for a phenotype gain or loss (y i ) above the reporting threshold of GLOOME but below a threshold (t). We set the threshold t to 0.5.
This defines the matrix X and the vector y as follows: By this means, we avoided presenting the classifier with samples corresponding to uncertain phenotype gain or loss events and used only confident labels in the subsequent classifier training instead.
(iii) We inferred discrete phenotype labels y= by applying this threshold t to the joint probability y i for a phenotype gain or loss to set up a well-defined classification problem with a binary target variable. Whenever the probability for a phenotype to be gained or lost on a specific branch was larger than t, the event was considered to have happened as follows: (iv) Finally, we formulated a joint binary classification problem for each target phenotype yp and the corresponding gain and loss events y= the phyletic patterns XP, and the Pfam gain and loss events X, which we solved again with a linear L1-regularized L2-loss SVM. We applied this procedure for all of the GIDEON phenotypes under investigation.
Software requirements. Traitar can be run on a standard laptop with Linux/Unix. The run time (wall clock time) for annotating and phenotyping a typical microbial genome with 3 Mbp is 9 min (3 min/Mbp) on an Intel Core i5-2410M dual-core processor with 2.30 GHz, requiring only a few megabytes of memory.
Cross-validation. We employed cross-validation to assess the performance of the classifiers individually for each phenotype. For a given phenotype, we divided the bacterial samples that were annotated with that phenotype into 10 folds. Each fold was selected once for testing the model, which was trained on the remaining folds. The optimal regularization parameter C needed to be determined independently in each step of the cross-validation; therefore, we employed a further inner cross-validation by using the following range of values for the parameter C: 10 Ϫ3 , 10 Ϫ2 · 0.7, 10 Ϫ2 · 0.5, 10 Ϫ2 · 0.2, 10 Ϫ2 · 0.1,......,1. In other words, for each fold kept out for testing in the outer cross-validation, we determined the value of the parameter C that gave the best accuracy in an additional 10-fold cross-validation on the remaining folds. This value was then used to train the SVM model in the current outer cross-validation step. Whenever we proceeded to a new cross-validation fold, we recomputed the ancestral character state reconstruction of the phenotype with only the training samples included (see ancestral protein family and phenotype gains and losses above). This procedure is known as nested cross-validation (62).
The bacterial samples in the training folds imply a Gps-sTOL in each step of the inner and outer cross-validation without the samples in the test fold. We used the same procedure as before to map the Pfam gains and losses inferred previously on the Gps-sTOL onto the tree defined by the current cross-validation training folds. Importantly, the test error is only estimated on the observed phenotype labels rather than on the inferred phenotype gains and losses.
Evaluation metrics. We used evaluation metrics from multilabel classification theory for performance evaluation (63). We determined the performance for the individual phenotype-positive and the phenotype-negative classes based on the confusion matrix of true-positive (TP), true-negative (TN), false-negative (FN), and false-positive (FP) samples from their binary classification equivalents by averaging over all n phenotypes. We utilized two different accuracy measures to assess multiclass classification performance (i.e., the accuracy pooled over all classification decisions and the macroaccuracy). Macroaccuracy represents an average over the accuracy of the individual binary classification problems, and we computed this from the macrorecall of the phenotype-positive and phenotype-negative classes as follows: However, if there are only few available labels for some phenotypes, the variance of the macroaccuracy will be high and this measure cannot be reliably computed anymore; it cannot be computed at all if no labels are available. The accuracy only assesses the overall classification performance without consideration of the information about specific phenotypes. Large classes dominate small classes (63). Majority feature selection. The weights in linear SVMs can be directly linked to features that are relevant for the classification. We identified the most important protein families used as features from the voting committee of SVMs consisting of the five most accurate models, which were also used to classify new samples. If the majority, which is at least three predictors, included a positive value for a given protein family, we added this feature to the list of important features. We further ranked these protein family features by their correlation with the phenotype by using Pearson's correlation coefficient (see Table S6 in the supplemental material).
Loers for reviewing the Traitar software; and Gary Robertson for helping to set up the Traitar web service.

FUNDING INFORMATION
This work, including the efforts of Jeremy Frank and Phillip B. Pope, was funded by European Research Council (336355-MicroDE).