Computational Prediction of Protein Function Based on Weighted Mapping of Domains and GO Terms

In this paper, we propose a novel method, SeekFun, to predict protein function based on weighted mapping of domains and GO terms. Firstly, a weighted mapping of domains and GO terms is constructed according to GO annotations and domain composition of the proteins. The association strength between domain and GO term is weighted by symmetrical conditional probability. Secondly, the mapping is extended along the true paths of the terms based on GO hierarchy. Finally, the terms associated with resident domains are transferred to host protein and real annotations of the host protein are determined by association strengths. Our careful comparisons demonstrate that SeekFun outperforms the concerned methods on most occasions. SeekFun provides a flexible and effective way for protein function prediction. It benefits from the well-constructed mapping of domains and GO terms, as well as the reasonable strategy for inferring annotations of protein from those of its domains.


Introduction
Advances in sequencing technology have made more and more protein sequences available, but the biological roles and functions of most of these proteins remain unknown. As reported by [1], fewer than one percent of proteins have been functionally characterized by experiments. In other words, proteins are being sequenced faster than they are being annotated.
In the past few years, several sequence-based methods [2][3][4][5][6][7][8][9] have been proposed to infer protein functions. These methods annotate a protein with the representative annotations of its homologues and are therefore also called homology-based methods. Usually, homology-based methods include two stages: searching for homologues through BLAST or PSI-BLAST and selecting representative Gene Ontology (GO) terms from the annotations of the homologues of the unannotated protein. More specifically, Goblet [2] determined the homologues by a predefined threshold on the BLAST e-value and annotated the unannotated protein with the GO terms of its homologues. GoFigure [3], OntoBlast [4], and Gotcha [5] weighted the GO terms by the BLAST e-values and chose GO terms by their weights. PFP [6,7] made use of both strongly and weakly similar sequences of the query sequence to increase the coverage of functional annotation. ESG [8] exploited cascading homologues of the unannotated protein iteratively to improve the precision of prediction. ConFunc [9] split the homologues into subgroups according to their annotations and then inferred annotations of the unannotated protein from these subgroups. These methods have had a positive impact on protein function prediction. However, homology-based methods may not work when the unannotated protein has low sequence similarity to annotated sequences or when none of its homologues is annotated. Furthermore, it has been reported that transferring annotations among homologues may easily produce erroneous results [27].
As is known, a domain is a conserved unit of sequence and structure in protein evolution that acts as a stable and independent functional block of proteins [28]. Besides the detailed sequence, a domain also carries important structural information, that is, the active site, which is closely related to biological function [21]. Thus, domains may be suitable clues for discovering the function of proteins. Statistics on the UniProt database (released in May 2013) show that more than sixty percent of proteins have domains. Moreover, domain databases and tools for efficient domain recognition have been developed, including Pfam [29], SCOP [30], RPS-BLAST [31], and HMMER [32]. These databases and tools accelerate the analysis of domains in proteins. In general, inferring the functions of a protein from its resident domains seems feasible and reasonable.

Related Works
So far, many efforts have been made to discover the functional signals carried by domains. Schug et al. [33] generated rules for function-domain associations based on the intersection of functions assigned to gene products which contain domains at varying levels of sequence similarity. Hayete and Bienkowska [34] designed an automated predictor based on decision trees to assign functions to domains. Mulder et al. [35] mapped a GO term to a domain if no protein containing the domain belongs to the set of proteins lacking the GO term. Song et al. [36] transferred functions based on the alignment of domain content. In analogy with [35], Forslund and Sonnhammer [37] assigned a GO term to a domain set if and only if all proteins containing the domain set are also annotated with the GO term. Rentzsch and Orengo [38] transferred annotations within single profile-based sequence clusters. These methods are easy to understand and implement, but spurious and missing protein annotations readily mislead them into erroneous predictions. Even a single protein missing a valid GO term is enough to mislead the functional inference about its domains.
In addition, Zhao et al. [39] utilized the protein-domain features, domain-domain interaction, and domain coexisting features to predict domain function. Their work extended the coverage of domain annotation effectively and provided solid foundation for predicting protein function. However, their work mainly paid attention to domain function rather than how the annotation of domain affects protein function. In our work, we focus on how to predict protein function based on domain annotation.
Recently, probabilistic models have become increasingly popular for their remarkable performance on inference under uncertainty. Forslund and Sonnhammer [37] utilized a Naïve Bayesian (NB) model for assigning terms to domain sets. Nevertheless, the Naïve Bayesian model requires that domain sets occur independently, which does not hold in practice. Thus, Forslund et al. attempted to reduce the dependencies between domain subsets using an averaged contribution from each domain subset. However, the conditional independence assumption may still not hold. Subsequently, Messih et al. [40] designed two models based on NB: one is DRDO, in which an averaged contribution from each subset of sequentially neighboring domains is used to address the dependency problem; the other is DRDO-NB, which takes the recurrence and order of domains into consideration. Although the computational complexity of DRDO is lower than that of NB, it may still not satisfy the conditional independence assumption. Moreover, all of these methods prune the GO terms of resident domains before they assign GO terms to the host protein. Thus, some weak functional signals which may be amplified by dependencies between domains are likely to be neglected.
Fang and Gough [41] developed the dcGO predictor for inferring GO terms associated with individual domains and supradomains based on protein-level GO annotation (GOA) and protein families. dcGO exploited the P value to evaluate the association strength (referred to as relevance in the following sections for simplicity) between domain and GO term. Since the P value only represents the probability of error involved in rejecting the null hypothesis, it may not be reasonable to estimate the relevance between domain and GO term by the P value. In other words, the P value can be used to determine, from a statistical perspective, which GO term is related to a given domain, but it is not enough to measure the degree of their relatedness. Thus, an appropriate metric is needed for weighting the relevance between GO term and domain objectively.
In this paper, we design a method to seek functions for proteins (SeekFun) effectively. Under this method, a mapping of GO terms and domains is constructed based on protein-level GOA and domain compositions of proteins. The relevance between domain and GO term is measured by symmetrical conditional probability. Based on the relevance of resident domains and terms, the relevance between host protein and GO terms is computed. Finally, the GO terms with relevance above a predefined threshold are used to annotate the host protein. The performance of SeekFun is validated by a series of experiments. The results suggest that our method is effective and reliable for protein function prediction.

3.1. Step 1: Construct and Weight Mapping of Domains and GO Terms. It is assumed that the resident domains of a protein may be associated with the GO terms of the host protein. This is a rough assumption about the relationship between domain and GO term and may result in a large number of false associations. To differentiate the true associations from the false ones, the relevance between domain and GO term needs to be measured. Judged by this measure, the true associations should have higher relevance while the false ones should have lower relevance.
As mentioned earlier, the P value can be used to determine whether a domain is related to a GO term. When the P value of a domain and a GO term is smaller than the given significance threshold, the domain is considered to be annotated with the GO term, and vice versa. However, a smaller P value does not mean a tighter relationship between domain and GO term. In simple words, the P value may not be suitable for measuring the relevance between domain and GO term. Suppose that $V_j$ represents the event that a protein contains domain $d_j$ and $G_i$ denotes the event that a protein performs the function described by GO term $go_i$. The conditional probability $\mathrm{pr}(G_i \mid V_j)$ is the probability that a protein containing $d_j$ is annotated by $go_i$; it reflects the dependence of $go_i$ on $d_j$. Likewise, $\mathrm{pr}(V_j \mid G_i)$ is the probability that a protein annotated by $go_i$ contains the domain $d_j$; it reflects the dependence of $d_j$ on $go_i$. Thus, a single conditional probability reflects the relevance between domain and GO term only partly. As in (1), symmetrical conditional probability may be appropriate to measure the relevance between GO term $go_i$ and domain $d_j$, $\mathrm{DR}(go_i, d_j)$:

$$\mathrm{DR}(go_i, d_j) = \mathrm{pr}(G_i \mid V_j) \cdot \mathrm{pr}(V_j \mid G_i). \tag{1}$$

Equation (1) means that the relevance between $go_i$ and $d_j$ is determined jointly by the two conditional probabilities between $V_j$ and $G_i$. The larger the probabilities are, the stronger the relevance is. The relevance ranges from 0 to 1. A higher relevance means that the domain is more likely to be annotated with the term.
Suppose that $\#\mathrm{prot}(go_i)$ is the number of proteins annotated with $go_i$, $\#\mathrm{prot}(d_j)$ is the number of proteins which contain $d_j$, and $\#\mathrm{prot}(go_i, d_j)$ is the number of proteins which both contain $d_j$ and are annotated with $go_i$. Accordingly, (1) can be transformed into (2):

$$\mathrm{DR}(go_i, d_j) = \frac{\#\mathrm{prot}(go_i, d_j)^2}{\#\mathrm{prot}(go_i) \cdot \#\mathrm{prot}(d_j)}. \tag{2}$$
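As an illustration of the count-based weighting, the following minimal Python sketch computes a relevance for every co-occurring (GO term, domain) pair. It assumes that symmetrical conditional probability is the product of the two conditional probabilities described above, so that the relevance equals the squared co-occurrence count divided by the product of the two individual counts. The function name, the input layout, and the toy identifiers (`dA`, `goX`, and so on) are hypothetical and are not taken from the SeekFun implementation.

```python
from collections import Counter
from itertools import product


def domain_term_relevance(proteins):
    """Relevance of each (GO term, domain) pair from protein-level data.

    proteins: dict mapping a protein id to (set of domains, set of GO terms).
    Assumption: relevance = pr(term | domain) * pr(domain | term).
    """
    n_term = Counter()    # number of proteins annotated with each term
    n_domain = Counter()  # number of proteins containing each domain
    n_both = Counter()    # number of proteins with both term and domain
    for domains, terms in proteins.values():
        n_domain.update(domains)
        n_term.update(terms)
        n_both.update(product(terms, domains))
    return {
        (go, d): c * c / (n_term[go] * n_domain[d])
        for (go, d), c in n_both.items()
    }


# Toy data with hypothetical identifiers.
toy = {
    "P1": ({"dA"}, {"goX"}),
    "P2": ({"dA"}, {"goX"}),
    "P3": ({"dA", "dB"}, {"goX", "goY"}),
    "P4": ({"dB"}, {"goY"}),
}
relevance = domain_term_relevance(toy)
```

In this toy set, `dA` and `goX` always occur together and receive relevance 1, whereas `dB` and `goX` co-occur only in `P3` and receive 1/6, which shows how incidental co-occurrence is down-weighted.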

3.2. Step 2: Transfer GO Terms of Resident Domains to the Host Protein. As is known, GO terms are organized as a directed acyclic graph and may be related to each other. Thus, predicting the functions of proteins should take the relationships between GO terms into consideration. GO has a rule called the "true path rule", which states that if a protein is annotated with a given term, it must also be annotated with all terms along the paths from the given term to the root term.
A path upward from the given term to the root term in the GO hierarchy is regarded as a true path of the term. Considering the true path rule, the mapping of GO terms and domains is extended along the true paths of the GO terms in our method. Traditionally, if a domain is associated with a GO term, it is also associated with all ancestral terms of the GO term with equal relevance. However, it has been reported that the semantics of GO terms differ even when they are in a parent-child relationship. Thus, the relevance between the domain and each ancestor of the GO term may be different, and the semantic differences between GO terms should be considered.
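The set-level part of this extension can be sketched in Python as follows. The sketch only closes a set of terms under the true path rule; it does not assign any relevance to the ancestors. The `parents` dictionary, the function names, and the toy term labels are hypothetical stand-ins for a parsed GO hierarchy.

```python
def ancestors(term, parents):
    """All terms reachable by following parent links upward from `term`."""
    seen = set()
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def extend_along_true_paths(terms, parents):
    """Add every ancestor of every term, as the true path rule requires."""
    extended = set(terms)
    for term in terms:
        extended |= ancestors(term, parents)
    return extended


# Toy DAG (child -> parents); "c" has two parents, so "d" has two true paths.
toy_parents = {
    "a": {"root"},
    "b": {"root"},
    "c": {"a", "b"},
    "d": {"c"},
}
closure = extend_along_true_paths({"d"}, toy_parents)
```

Because a GO term may have several parents, the traversal keeps a `seen` set so that shared ancestors are visited only once.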
In fact, the organization of GO terms can be regarded as a split-flow semantic system (SFSS). In SFSS, the root term is the source of semantics and describes general functions, while the other terms represent semantic branches of the root term and describe specific functions. So the terms along the true path of a given term have different capabilities to describe functions. Generally, for a given function, an ancestral term is more likely to describe the function than its descendants because its semantics is more general and covers more functions. This can be explained by the semantic coverage of a GO term, which can be roughly estimated by the number of its descendants [42].
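A rough estimate of semantic coverage, taken here simply as the number of descendants of each term, might be computed as in the following Python sketch. The `parents` layout, the function name, and the toy labels are hypothetical; how the counts enter the relevance of ancestral terms is not shown here.

```python
from collections import defaultdict


def descendant_counts(parents):
    """Number of descendants of every term in a DAG given as child -> parents."""
    children = defaultdict(set)
    terms = set(parents)
    for child, term_parents in parents.items():
        for parent in term_parents:
            children[parent].add(child)
            terms.add(parent)

    def descendants(term):
        seen, stack = set(), [term]
        while stack:
            for child in children.get(stack.pop(), ()):
                if child not in seen:
                    seen.add(child)
                    stack.append(child)
        return seen

    return {term: len(descendants(term)) for term in terms}


# Toy DAG (child -> parents) with hypothetical labels.
toy_dag = {
    "a": {"root"},
    "b": {"root"},
    "c": {"a", "b"},
    "d": {"c"},
}
coverage = descendant_counts(toy_dag)
```

In the toy graph the root covers four descendants and the leaf `d` covers none, matching the intuition that terms nearer the root are more general.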
Based on these analyses, we propose a novel strategy, namely RSC, to measure the relevance between a domain and an ancestral term based on semantic coverage. That is, given a term $go_i$ which is related to the domain $d_j$ with relevance $\mathrm{DR}(go_i, d_j)$, the relevance between the domain and an ancestral term $go_a$ of $go_i$ can be calculated by (3). In (3), the descendant set of each term is used, and $\mathrm{Anc}(go_i)$ consists of the ancestors of the term $go_i$. Naturally, along the true path, a term nearer to the root has a larger relevance with the given domain than the others and is more likely to annotate the host protein.
It is supposed that a protein is associated with all GO terms which are related to its resident domains. The relevance between a protein and a GO term can be derived from the relevance between the term and the resident domains of the protein.
For example, if a protein $p$ contains a set of domains $D = \{d_1, d_2, \ldots, d_n\}$ and $\mathrm{DR}(go_i, d_j)$ denotes the relevance between $go_i$ and $d_j$, then the relevance between $p$ and $go_i$, $\mathrm{PR}(go_i, p)$, can be computed by (4). After the extension, each protein is associated with a group of GO terms with strong or weak relevance. To facilitate comparison, the relevance of proteins and terms needs to be normalized. Each GO category should be analyzed separately, as the categories have different biological meanings. For each protein, the relevance between the protein and the root of the subontology (GO:0003674 for molecular function, GO:0008150 for biological process, and GO:0005575 for cellular component), $\mathrm{PR}(go_{root}, p)$, is used as the baseline because the real annotations of proteins must branch from the root in the GO hierarchy. The normalized relevance of $go_i$ and $p$, $\mathrm{NPR}(go_i, p)$, can be measured by (5). The relevance is thereby scaled to the range from 0 to 1. A higher relevance means that the protein is more likely to be annotated with the term.
Through the above steps, the relevance of proteins and GO terms has been measured. To select real annotations from the candidate annotations, a threshold of relevance needs to be defined. If the relevance between protein and term is above the predefined threshold $t$, the term is assigned to the protein, and vice versa. In our study, the threshold $t$ is about 0.6∼0.7, as the proposed model performs well on the given datasets in this range.

Evaluation Metrics. Consistent with the Critical Assessment of Functional Annotations (CAFA) experiments [42], precision, recall, and f-measure are used to judge the performance of the methods in our experiments. Given a target protein $p$ and $T(p)$, the set of known (true) annotations of $p$, the precision of a method at threshold $t \in [0, 1]$, $\mathrm{pr}(t)$, can be calculated as

$$\mathrm{pr}(t) = \frac{1}{m(t)} \sum_{p \in S,\; P_t(p) \neq \emptyset} \frac{|P_t(p) \cap T(p)|}{|P_t(p)|}. \tag{6}$$

In (6), $P_t(p)$ is the set of predicted annotations whose relevance with $p$ is above $t$, $S$ is the target set for testing, and $m(t)$ is the number of proteins which have at least one predicted GO term at the given $t$. Similarly, the recall of a method at threshold $t$, $\mathrm{rc}(t)$, can be computed by

$$\mathrm{rc}(t) = \frac{1}{|S|} \sum_{p \in S} \frac{|P_t(p) \cap T(p)|}{|T(p)|}. \tag{7}$$

The f-measure (the harmonic mean of precision and recall) gives an intuitive number for comparing the concerned methods. For each method, the maximal value of the f-measure over all thresholds of relevance, $f_{\max}$, is calculated as

$$f_{\max} = \max_{t} \left\{ \frac{2 \cdot \mathrm{pr}(t) \cdot \mathrm{rc}(t)}{\mathrm{pr}(t) + \mathrm{rc}(t)} \right\}. \tag{8}$$

Considering the relationships between GO terms, the comparisons are guided by the true path rule. That is, $P_t(p)$ and $T(p)$ are extended by adding all ancestors of their members to them before comparing.
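The evaluation can be sketched in Python as below, following the CAFA-style definitions: precision is averaged over proteins with at least one prediction at the threshold, recall is averaged over all target proteins, and the maximal f-measure is taken over a grid of thresholds. The function names and data layout are hypothetical. The sketch assumes that predicted and true term sets have already been extended along true paths, that every target protein has at least one true term, and that a term counts as predicted when its score is at least the threshold.

```python
def precision_recall(predictions, truth, t):
    """Averaged precision and recall at threshold t.

    predictions: dict protein -> {term: relevance score}
    truth:       dict protein -> non-empty set of true terms
    """
    precision_sum, recall_sum, n_predicted = 0.0, 0.0, 0
    for protein, true_terms in truth.items():
        scores = predictions.get(protein, {})
        predicted = {term for term, s in scores.items() if s >= t}
        hits = len(predicted & true_terms)
        if predicted:  # precision only counts proteins with a prediction
            n_predicted += 1
            precision_sum += hits / len(predicted)
        recall_sum += hits / len(true_terms)
    precision = precision_sum / n_predicted if n_predicted else 0.0
    recall = recall_sum / len(truth)
    return precision, recall


def f_max(predictions, truth, thresholds):
    """Maximal harmonic mean of precision and recall over the thresholds."""
    best = 0.0
    for t in thresholds:
        pr, rc = precision_recall(predictions, truth, t)
        if pr + rc > 0:
            best = max(best, 2 * pr * rc / (pr + rc))
    return best


# Toy example with hypothetical proteins and terms.
toy_truth = {"p1": {"a", "b"}, "p2": {"c"}}
toy_predictions = {
    "p1": {"a": 0.9, "b": 0.4, "x": 0.7},
    "p2": {"c": 0.5},
}
pr_06, rc_06 = precision_recall(toy_predictions, toy_truth, 0.6)
best_f = f_max(toy_predictions, toy_truth, [0.4, 0.6])
```

At threshold 0.6 only `p1` receives predictions, so precision is computed over one protein while recall is still averaged over both, which is why a strict threshold can keep precision high while recall drops.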

Comparisons of Relevance Computed by Different Strategies.
To illustrate the rationality of the weighting strategies, the relevance weighted by symmetrical conditional probability (SCP) is compared with the relevance measured by P value (PV) and by traditional conditional probability (dSCP). In fact, it is hard to evaluate the relevance between domain and GO term because a gold standard is lacking. To determine appropriate strategies for weighting relevance, some properties of relevance are analysed. A little random noise may make a difference between observed and real datasets, and the relevance should be robust on such similar datasets. To simulate similar datasets, a series of subsets of Uniref50, SwissProt, and TrEMBL is constructed by taking nine of their ten equal-size partitions at a time. The relevance is calculated by the different strategies on these subdatasets. How the distributions of relevance vary across the datasets provides evidence for which strategy is more suitable for weighting relevance.
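The construction of similar datasets can be sketched in Python as follows: the items are shuffled, split into ten near-equal partitions, and each subset leaves out exactly one partition. The function name, the fixed seed, and the use of integer ids in place of protein records are illustrative assumptions.

```python
import random


def leave_one_partition_out(items, k=10, seed=0):
    """Return k subsets, each made of k - 1 of the k random partitions."""
    items = list(items)
    random.Random(seed).shuffle(items)
    # Striding by k gives partitions whose sizes differ by at most one.
    partitions = [items[i::k] for i in range(k)]
    return [
        [x for j, part in enumerate(partitions) if j != i for x in part]
        for i in range(k)
    ]


# Toy usage: 100 hypothetical protein ids yield ten subsets of 90 ids each.
subsets = leave_one_partition_out(range(100))
```

Any two subsets built this way share eight of the ten partitions, so they differ only slightly, which is the property the robustness comparison relies on.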
The distributions of relevance derived from the different strategies are displayed in Figure 1. To facilitate comparison without loss of meaning, a logarithmic transformation and a Z-score transformation are performed on PV, and the result is represented by log PV in Figure 1. From the figure, it can be seen that dSCP varies the most, while the distribution curves of SCP and log PV have similar trends. This suggests that, in terms of robustness on slightly different datasets, SCP and PV are more suitable than dSCP. What is more, the curves of SCP and PV appear to be clearly monotonic, which is beneficial for assigning GO terms to domains.
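One way to carry out such a transformation is sketched below in Python. The base of the logarithm, the negative sign (so that smaller P values map to larger scores), the floor that guards against zero P values, and the use of the population standard deviation are all assumptions made for illustration; the text does not specify them.

```python
import math
from statistics import mean, pstdev


def log_zscore(p_values, floor=1e-300):
    """Negative log10 of each P value, then standardized to Z-scores."""
    logs = [-math.log10(max(p, floor)) for p in p_values]
    mu, sigma = mean(logs), pstdev(logs)
    if sigma == 0:  # all values identical: no spread to standardize
        return [0.0] * len(logs)
    return [(x - mu) / sigma for x in logs]


# Toy usage with hypothetical P values.
z_scores = log_zscore([1e-10, 1e-5, 1e-2, 0.5])
```

The transformation preserves the ranking of associations while putting the values on a scale that can be plotted alongside relevance values bounded between 0 and 1.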
Meanwhile, the curves of PV are steeper than those of SCP on each dataset, which implies that the resolution of PV is lower than that of SCP. In this paper, the resolution describes how sensitively the relevance distinguishes true positive associations between domain and GO term from negative ones. The resolution of relevance is inversely proportional to the average density of relevance in its range, which is indicated by the steepness of the curves in the figures. In simple words, the larger the average density of relevance in its range is, the harder it is to determine the true associations between domain and GO term.
On the other hand, the relevance derived from two significantly different datasets may vary more dramatically than that derived from similar datasets. Statistically, SwissProt and TrEMBL have no intersection, while they share 5031 and 6929 proteins with Uniref50, about 30% and 36% of their proteins, respectively. Consequently, the difference between the curves of relevance on SwissProt and TrEMBL should be larger than the differences between the others. Observing the distributions of relevance on these datasets, as displayed in Figure 2, it can be found that SCP and log PV vary as expected but log PV still suffers from low resolution. Generally speaking, it can be concluded that SCP is a more suitable measure of relevance between domain and GO term.

Figure 1: Comparison of the distributions of relevance on similar datasets. dSCP, SCP, and log PV represent the relevance computed by conditional probability, symmetrical conditional probability, and P value, respectively. Each SwissProt subset is constructed by taking nine of the ten equal-size partitions of SwissProt at a time, which gives ten subsets; the subsets of Uniref50 and TrEMBL are constructed likewise. The curves display the distributions of relevance on similar subsets of the experimental datasets.

The Impact of SCP on Protein Function Prediction.
For validating its impact on protein function prediction, SCP is tested on experimental datasets: Uniref50, SwissProt, and TrEMBL, respectively. The comparison is performed on the three subontologies of GO: molecular function (MF), biological process (BP), and cellular component (CC) separately. The comparison includes two steps: constructing mapping of domains and GO terms and annotating proteins based on the mapping. In our experiment, the mapping of Pfam domains and GO terms (pfam2go) is downloaded from the Gene Ontology website in May, 2013. Based on this reliable mapping, all annotations which are associated with the resident domains are assigned to the host protein. This method is named Pred pfam2go in this paper. Meanwhile, the mapping of Pfam domains and GO terms which is weighted by SCP is also used for prediction, namely, Pred weighted . In the comparisons, Pred pfam2go and Pred weighted are validated by performing the same task in the same framework on the basis of different mappings of domains and GO terms. To avoid the influence of domain coverage, the weighted mapping with SCP just includes the domains in pfam2go when it is applied. Here, to compare the influence of the strategy SCP and RSC, the method which is the combination of them is also used to perform the same task and marked with Pred combine . Their performances are illustrated in Table 2.
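The unweighted baseline described above, in which every term mapped to a resident domain is transferred to the host protein, can be sketched in Python as follows. The `domain2go` dictionary stands in for a parsed domain-to-GO mapping, and the function name and all identifiers are hypothetical; parsing of the pfam2go file itself is not shown.

```python
def predict_from_mapping(resident_domains, domain2go):
    """Union of the GO terms mapped to any resident domain of a protein."""
    terms = set()
    for domain in resident_domains:
        # Domains absent from the mapping contribute no terms.
        terms |= domain2go.get(domain, set())
    return terms


# Toy mapping with hypothetical domain and term identifiers.
toy_domain2go = {
    "dA": {"goX", "goY"},
    "dB": {"goY", "goZ"},
}
predicted_terms = predict_from_mapping({"dA", "dB", "dC"}, toy_domain2go)
```

Because this baseline has no weights, every mapped term is either transferred or not; a weighted mapping additionally allows candidate terms to be ranked and filtered by a threshold.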
As displayed in Table 2, Pred weighted has higher recall than Pred pfam2go while the latter achieves better precision than the former. These results suggest that the Pred weighted could improve the specificity of annotations but it is at the cost of precision.
It can also be seen from Table 2 that Pred combine is superior to the others in general. Compared with Pred pfam2go , Pred combine performs better on both precision and recall. In contrast to Pred weighted , Pred combine significantly improves the precision while doing as well as Pred weighted on recall. Thus, it can be concluded that SCP tends to select specific terms for the proteins and RSC balances this bias by propagating relevance in the GO hierarchy. This may be the reason that Pred combine shows higher performance.

The Impact of RSC on Protein Function Prediction.
In order to validate the effectiveness of RSC, it is compared with the traditional strategy, which sets the relevance between a domain and all terms along a true path as equal (RPE). The two strategies are applied to predict protein functions based on the mapping of domains and GO terms weighted by SCP. Their best performances are listed in Table 3. As displayed, RPE gives a better recall while RSC has higher precision and a higher maximal f-measure. In general, RSC may be more beneficial to protein function prediction than RPE. This may be because the resolution of SCP is effectively improved by the different relevance between the protein and each term along a true path. On the contrary, RPE considers that a protein has equal relatedness to every term along the true path, which makes it harder to determine the true positive associations between terms and the host protein. Even if the threshold of RPE is 1, its precision is still lower than that of RSC and its recall goes down. This confirms that the differences between GO terms have a significant influence on their relevance with the protein.

Comparison of the Concerned Methods.
To assess the performance of SeekFun, it is compared with NB, DRDO, DRDO-NB, and dcGO on the three benchmark datasets. The performances of the concerned methods on the different datasets are shown in Table 4. To provide a single number for comparison between methods, the averages of the metrics over the datasets are also listed.
In terms of precision, SeekFun is superior to the others, while NB, DRDO, and DRDO-NB follow in turn. The precision of dcGO is significantly lower than that of the others. As mentioned above, dcGO measures the relevance between domain and GO term by P value, while the other methods calculate it based on conditional probability. These results may indicate again that the relevance estimated by P value is not sensitive enough to determine the true positive associations between domain and GO term. In other words, PV has low resolution for distinguishing the real annotations of a protein. By contrast, conditional probability is more suitable for estimating relevance. As for recall, SeekFun performs better than the others, while dcGO follows. It can also be seen that the recall of NB, DRDO, and DRDO-NB is not as good as that of the other methods. Looking at the details, NB, DRDO, and DRDO-NB infer the functions of a protein from the annotations of domain combinations, which enhances the precision of function prediction. However, in the process of discovering domain combinations, some slightly weak associations between domain and GO term may be neglected. The resident domains of the host protein may interact in different combinations to perform different functions. Nevertheless, these methods accept a domain combination only if the members of the combination exist in the protein and the value of their combination is above a predefined threshold. This may miss information carried by the potential domain combinations and by the domains themselves. We speculate that this may be the reason that these methods show lower recall.
Overall, SeekFun performs better than the other methods. This can be attributed to the weighted mapping of domains and GO terms and to the strategy for transferring annotations of resident domains to the host proteins. The weighted mapping reflects the relationship between domain and GO term properly. The transferring strategy takes both the differences and the connections between terms into consideration, which greatly improves its capability of distinguishing real associations of domains and terms from false ones.

Conclusions
In this paper, SeekFun is developed for protein function prediction. Instead of using amino acid sequence of protein directly, SeekFun takes the resident domains of proteins and protein-level GOA as clues to annotate proteins. We tested the overall performance of SeekFun and the results suggest that SeekFun is superior to the concerned methods: NB, DRDO, DRDO-NB, and dcGO on precision and recall generally.
Meanwhile, the effects of the relevance computed by symmetrical conditional probability (SCP) and of the strategy for inferring the annotations of a protein from the annotations of its resident domains (RSC) are validated separately. The results of these experiments confirm that both of them are effective and can improve the performance of protein function prediction. In the proposed method, SCP tends to discover specific functions of a protein but cannot ensure precision, and RSC is used to compensate for this weakness of SCP. So the combination of them achieves high performance. The main idea of SeekFun could easily be used to acquire knowledge from other functional ontologies based on different domain resources. SeekFun will facilitate the discovery of protein functions and provide insights into the biological roles of proteins.