Eliciting the Functional Taxonomy from protein annotations and taxa

Advances in omics technologies have triggered the production of an enormous volume of data from thousands of species. Meanwhile, joint international efforts such as the Gene Ontology (GO) consortium have worked to provide functional information for a vast number of proteins. With these data available, we have developed FunTaxIS, a tool that is the first attempt to infer a functional taxonomy (i.e. how functions are distributed over taxa) by combining functional and taxonomic information. FunTaxIS defines a taxon-specific functional space by exploiting annotation frequencies to establish whether a function can or cannot be used to annotate a certain species. The tool generates constraints between GO terms and taxa and then propagates these relations over the taxonomic tree and the GO graph. Since these constraints cover nearly the whole taxonomy, it is possible to obtain the mapping of a function over the taxonomy. FunTaxIS can be used to perform functional comparative analyses among taxa, to detect improper associations between taxa and functions, and to discover how functional knowledge is either distributed or missing. A benchmark test set based on six different model species has been devised to gain insight into the generated taxonomic rules.


S1. Taxonomy partitioning
In Figure S1.1, the partitioning of the taxonomic tree is shown. The tree has been divided into 7 groups, represented by the petals: within each group, the general taxa with the highest number of unique GO terms are shown in bold (with the most characterized species inside parentheses). In Table S1.1 all the highly annotated general taxa (robust general taxa) are reported.

S2. Fuzzy Logic and statistical tests
Relative probabilities have been defined to decide whether a GO term g is sufficiently studied in a general taxon t, not sufficiently represented, or uncertain. Two thresholds, set to 0.1 and 1 respectively, seem to be sufficiently discriminative; however, to reduce the dependence on the precise empirical values, a smoothing around those thresholds based on a trapezoidal preference function has been added. Besides the two thresholds, two uncertainty intervals and two confidence intervals h_1 and h_2 around the estimated thresholds have been introduced. The resulting preference function is illustrated in Figure S2.1.
Figure S2.1: fuzzy threshold, in blue, and its generating trapezoid.
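The smoothing idea can be illustrated with a minimal sketch. It assumes that each crisp threshold is replaced by a linear ramp of half-width delta centred on the threshold; the names theta1, theta2, delta and the value delta = 0.05 are hypothetical choices made for illustration, not the parameters used by FunTaxIS, and the confidence levels are fixed to 1.

```python
def ramp(x, theta, delta):
    """Linear ramp rising from 0 to 1 over [theta - delta, theta + delta]."""
    if delta <= 0:
        return 1.0 if x >= theta else 0.0  # degenerate case: crisp threshold
    t = (x - (theta - delta)) / (2 * delta)
    return min(1.0, max(0.0, t))


def memberships(x, theta1=0.1, theta2=1.0, delta=0.05):
    """Fuzzy degrees for a relative probability x (illustrative only).

    theta1 and theta2 mirror the two thresholds quoted in the text;
    delta is a hypothetical uncertainty half-width.
    """
    if theta1 + delta > theta2 - delta:
        raise ValueError("the two ramps must not overlap")
    not_enough = 1.0 - ramp(x, theta1, delta)  # below the lower threshold
    enough = ramp(x, theta2, delta)            # above the upper threshold
    uncertain = 1.0 - not_enough - enough      # trapezoid between the two
    return {"not_enough": not_enough, "uncertain": uncertain, "enough": enough}
```

For example, memberships(0.1) splits the degree evenly between "not_enough" and "uncertain", because the value sits exactly on the lower threshold.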
Fuzzy thresholds have been designed to obtain a more robust management of the uncertainty inherent in the relative probabilities; moreover, by "abstracting" over the precise meanings of the relative probability thresholds, their precise semantics can be relaxed. Fuzzy thresholds likely improve the performance of the tool or, better, the overall trade-off between performance and robustness. The current fuzzy thresholds are fairly stable: we ran a Nelder-Mead optimization over the three preference function parameters (with h_1 = h_2 = 1), that is the two thresholds and the two uncertainty intervals, the latter constrained to be equal; a 25-fold cross-validation has been performed over 63,965 initial GOC constraints, fitting vectors of 2,558 elements and obtaining the following optimized parameters:
• first threshold = 0.11 +/- 0.002 (standard deviation)
In order to assess the statistical significance of the taxonomic constraints and assign a p-value to each of them, a bootstrap approach has been used, since the data cannot be properly modeled with any known distribution. A one-sided t-test has been adopted: 'in taxon' constraints have been tested against the distribution of the 'never in taxon' constraints and vice versa. The test has been repeated for each constraint against 10,000 resampled distributions, and the resulting t scores have been used to calculate the p-value at a significance level of 5% (alpha <= 0.05). Finally, the p-values have been adjusted for multiple testing using the Bonferroni correction.
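The resampling and correction steps can be sketched as follows. This is a simplified illustration and not the procedure used for FunTaxIS: it swaps the t-score computation for a plain empirical one-sided p-value obtained by resampling with replacement, and the function names, the seed and the toy scores are all assumptions. The Bonferroni step is the standard one (each p-value multiplied by the number of tests and capped at 1).

```python
import random


def empirical_p_value(observed, null_scores, n_resamples=10_000, seed=0):
    """One-sided empirical p-value of `observed` against resampled null scores.

    Draws n_resamples values with replacement from null_scores and counts
    how many are at least as large as the observed score. The +1 terms
    avoid reporting a p-value of exactly zero.
    """
    rng = random.Random(seed)
    draws = rng.choices(null_scores, k=n_resamples)
    exceed = sum(1 for value in draws if value >= observed)
    return (exceed + 1) / (n_resamples + 1)


def bonferroni(p_values):
    """Bonferroni adjustment: multiply by the number of tests, cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Toy usage with made-up scores for the opposite constraint class.
null = [0.1, 0.2, 0.3]
raw = [empirical_p_value(score, null) for score in (5.0, 0.25, 0.0)]
adjusted = bonferroni(raw)
```

In this toy run the first score lies above every null value, so its raw p-value is the smallest one the estimator can return, 1 / (n_resamples + 1).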

S3.1 Rules application
The frequency distribution of the taxonomic propagation rules is reported in Figure S3.1. The majority of constraints come from propagations from children, which is expected since the algorithm follows a bottom-up strategy. Bottom-up propagations are almost equally shared among positive, negative and dubious children.
The rarest are the dubious constraints generated from a conflicting parent and a conflicting child.

S3.2 Arbitrariness and Robustness
To gauge how arbitrary our taxonomic propagation rules are, two alternative propagation criteria have been designed: one based on an open world hypothesis and the other based on a closed world hypothesis (Figure S3.3). These rule sets have then been used to propagate the initial constraints over the taxonomy tree, and the results have been compared with the manual rules proposed by the GO Consortium (GOC in the following); results are shown in Figure S3.4 (the lower the better).
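To make the difference between the two hypotheses concrete, the following toy sketch propagates constraints bottom-up over a small tree. The rule set is hypothetical and is not the one of Figure S3.3: it assumes that a taxon is positive if any child is positive, negative only if all children are negative, and that a leaf without a constraint is unknown in the open world and negative in the closed world. All names are illustrative.

```python
POS, NEG, UNK = "in_taxon", "never_in_taxon", "unknown"


def propagate_up(tree, root, initial, closed_world):
    """Bottom-up propagation of one GO term's constraints (toy rule set).

    tree: dict mapping a taxon to the list of its child taxa.
    initial: dict mapping some taxa to POS or NEG (the starting constraints).
    closed_world: if True, unconstrained leaves are treated as NEG.
    """
    result = {}

    def visit(node):
        child_states = [visit(child) for child in tree.get(node, [])]
        if node in initial:
            state = initial[node]              # explicit constraint wins
        elif not child_states:
            state = NEG if closed_world else UNK
        elif POS in child_states:
            state = POS                        # present in at least one child
        elif all(s == NEG for s in child_states):
            state = NEG                        # absent from every child
        else:
            state = UNK                        # mixed negative and unknown
        result[node] = state
        return state

    visit(root)
    return result


tree = {"root": ["A", "B"], "A": ["a1", "a2"], "B": ["b1"]}
open_result = propagate_up(tree, "root", {"a1": NEG, "b1": NEG}, False)
closed_result = propagate_up(tree, "root", {"a1": NEG, "b1": NEG}, True)
```

With these toy inputs the root stays unknown under the open world reading and becomes negative under the closed world one, because the unconstrained leaf a2 is interpreted differently.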
Another interesting point concerns the robustness of the generated rules; robustness can be defined as the resilience of the system when the input data, that is the protein annotations, are perturbed. Since a robust system far from optimality is not useful, it has first been established that the steady-state deviation ss has to be measured with respect to the fraction of the constraints common to the FunTaxIS set and the GOC set whose resulting discretized polarity p_set is not the same in the two sets, expressed through the characteristic function of a set. A polarity has likewise been defined for each GOC constraint g. The simulation is performed by applying a perturbation function to increasingly wide random subsets of annotations, adding or subtracting a fixed amount of annotations according to a uniform Boolean random variable with values False and True. Four random replicates have been produced in order to also obtain an approximate standard deviation.
Figure S3.5: the relative variation of wrong constraints with respect to the percentage of wrong constraints estimated from the "true" GO Consortium dataset; this plot has been built by perturbing an increasing percentage of annotations (% perturb.) by a fixed increasing amount of noise annotations (delta). The grey grid over the surface gives an idea of the standard deviation coming from four random replicates. For "delta" equal to 0, increasing "% perturb." obviously produces no difference ("% diff constr." is 0), and the same happens for "% perturb." equal to 0 and increasing "delta", as shown in the plot.
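A minimal sketch of such a perturbation experiment is given below. It assumes that annotations are summarized as per-item counts, that the Boolean variable selects between adding and subtracting delta, and that counts are clipped at zero; the function names and these modelling choices are illustrative assumptions, not the original implementation.

```python
import random


def perturb(counts, fraction, delta, rng):
    """Add or subtract `delta` on a random subset of annotation counts.

    counts: dict mapping an item (e.g. a term/taxon pair) to its count.
    fraction: share of items to perturb, between 0 and 1.
    A fair coin decides the sign; counts are clipped at 0 (an assumption).
    """
    keys = list(counts)
    chosen = rng.sample(keys, round(fraction * len(keys)))
    perturbed = dict(counts)
    for key in chosen:
        sign = 1 if rng.random() < 0.5 else -1
        perturbed[key] = max(0, perturbed[key] + sign * delta)
    return perturbed


def differing_fraction(polarity_a, polarity_b):
    """Fraction of common constraints whose discretized polarity differs."""
    common = polarity_a.keys() & polarity_b.keys()
    if not common:
        return 0.0
    return sum(polarity_a[k] != polarity_b[k] for k in common) / len(common)
```

As in the plot, a delta of 0 or a perturbed fraction of 0 leaves the counts unchanged, so any downstream polarity comparison reports no difference.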

S3.3 Rules application and world hypotheses
The open or closed world hypotheses must also be considered when the constraints are applied to a set of GO terms. More precisely, a concrete decision on the neutrality of terms must be taken, i.e. uncertain/neutral terms must be either accepted or rejected for the taxon. To simplify the reasoning, a sort of ternary logic setting can be adopted, where in addition to the canonical true, ⊤, and false, ⊥, symbols there is a novel neutral symbol ⊣. The differences between the open and closed world hypotheses reduce to the interpretation of the neutral symbol: it tends toward true in the open world and toward false in the closed world. In dealing with two possibly contradictory information sources, a fusion criterion has to be established. By giving a more authoritative role to the GOC source (G) with respect to the FunTaxIS source (F), two slightly different truth tables (Table S3.1 and Table S3.2) can be devised for the open and closed world hypotheses respectively.
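One possible fusion criterion of this kind is sketched below. It is an assumption made for illustration and does not reproduce Table S3.1 or Table S3.2: the GOC decision is used whenever it is not neutral, the FunTaxIS decision is used otherwise, and a doubly neutral term is accepted in the open world and rejected in the closed world. The names Tri and fuse are hypothetical.

```python
from enum import Enum


class Tri(Enum):
    """Ternary truth values: true, false and neutral."""
    TRUE = "true"
    FALSE = "false"
    NEUTRAL = "neutral"


def fuse(goc, funtaxis, open_world):
    """Return True if the term is accepted for the taxon (toy criterion).

    goc, funtaxis: Tri decisions from the two sources; GOC has priority.
    open_world: how to resolve the case where both sources are neutral.
    """
    for decision in (goc, funtaxis):
        if decision is not Tri.NEUTRAL:
            return decision is Tri.TRUE
    return open_world


# GOC rejects and FunTaxIS accepts: the more authoritative source prevails.
accepted = fuse(Tri.FALSE, Tri.TRUE, open_world=True)
```

Under this reading the two hypotheses differ only in the row where both sources are neutral.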

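Table S3.1 and Table S3.2: truth tables for the open and closed world hypotheses (columns include the GOC decision (G) and the final decision).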
From those truth tables it is possible to synthesize the following formulas for the two world hypotheses:

S4.1 Taxon constraints comparison between FunTaxIS and GOC
The Venn diagram in Figure S4.1 shows that the taxon constraints provided by GOC largely overlap those generated by FunTaxIS; numerical details are presented in Table S4.1. Only non-neutral constraints are represented; an additional 10,887 constraints (0.1% of the total) are discordant between the two methods and are not reported in the chart.

S4.2 FunTaxIS vs CroGO
In a paper published in 2013 (1), the authors presented a tool able to estimate the similarity of GO terms from different ontologies. Together with the tool, they provided two species-specific lists (one for S. cerevisiae and one for H. sapiens) of paired GO terms from the Molecular Function and the Biological Process ontologies characterized by high similarity. Since these terms are functions present in yeast and/or human, we exploited these data to perform an independent benchmark to assess the correctness of the constraints generated by FunTaxIS.

S6.1 Functional taxonomic consistency vs. alignment significance
To investigate whether the probability of a GO term being discarded by a taxon constraint depends on the significance of the BLAST hit from which it is derived, we plotted the number of GO terms surviving or not surviving the taxon constraint filtering against their corresponding e-values. The histograms (Figure S6.1) show, for the yeast proteome, the numbers of GO terms coming from BLAST hits that survive (kept) and do not survive (discarded) the filtering step based on taxon constraints. They are divided into bins of log-transformed e-values, which represent the significance of the hit.
The bin "<= 300" contains e-values equal to zero. The results show that there is no difference in the e-value distributions of filtered vs. not filtered GO terms, suggesting that the significance of pairwise alignments is not a reliable indicator of taxon compatibility of annotations.