Using structural motif descriptors for sequence-based binding site prediction

Background Many protein sequences are still poorly annotated. Functional characterization of a protein is often improved by the identification of its interaction partners. Here, we aim to predict protein-protein interactions (PPI) and protein-ligand interactions (PLI) on sequence level using 3D information. To this end, we use machine learning to compile sequential segments that constitute structural features of an interaction site into one profile Hidden Markov Model descriptor. The resulting collection of descriptors can be used to screen sequence databases in order to predict functional sites. Results We generate descriptors for 740 classified types of protein-protein binding sites and for more than 3,000 protein-ligand binding sites. Cross validation reveals that two thirds of the PPI descriptors are sufficiently conserved and significant enough to be used for binding site recognition. We further validate 230 PPIs that were extracted from the literature, where we additionally identify the interface residues. Finally we test ligand-binding descriptors for the case of ATP. From sequences with Swiss-Prot annotation "ATP-binding", we achieve a recall of 25% with a precision of 89%, whereas Prosite's P-loop motif recognizes an equal amount of hits at the expense of a much higher number of false positives (precision: 57%). Our method yields 771 hits with a precision of 96% that were not previously picked up by any Prosite-pattern. Conclusion The automatically generated descriptors are a useful complement to known Prosite/InterPro motifs. They serve to predict protein-protein as well as protein-ligand interactions along with their binding site residues for proteins where merely sequence information is available.


Background
Exhaustive knowledge about protein interactions is a prerequisite to understanding the molecular machinery of the cell. While comprehensive protein sequence databases are available, the number of known PPIs is still small. In addition, experimentally proven PPIs often do not reveal the binding sites involved. The implications of the discovery of binding sites are manifold: the discovery of patterns in amino acid arrangements is of general importance in the study of protein-protein interactions. Furthermore, docking algorithms benefit greatly from the correct prediction of binding sites. Finally, interaction prediction is the key to mapping global interaction networks and signalling pathways, and may help elucidate the functions of individual proteins. Complementary to experimental techniques are computational approaches that analyze and predict protein-protein interactions. Sequence-based methods include gene context conservation [1], synthetic lethality [2], phylogenetic profiling [3,4] or co-evolution of gene expression [5]. Various databases of binding sites and interfaces between proteins and their domains exist [6][7][8]. An extensive list of prediction methods can be found in [9].
Functional characterization of novel genes and their proteins remains an important and challenging task. It often is improved by the identification of novel interaction partners. It has been observed that structural and functional features of proteins like catalytic sites are often well conserved [10]. In contrast, the rest of the surface is often more variable (see Figure 1A), which impedes sequence similarity searches for functionally equivalent or similar proteins. Descriptors previously used for conserved domains and interface motifs include regular expressions, weight matrices, and profile Hidden Markov Models (HMMs). These descriptors involve either sequentially consecutive stretches [11][12][13] or full length domains [14]. In particular, HMMs were successfully employed in many sequence similarity search tools [14][15][16].
As pointed out by Bailey and Gribskov [17], the signal-tonoise ratio in homology searches can be improved by using sets of motifs that characterize a family. In this study, we aim to create descriptors for all relevant sequence parts of structurally known protein-protein and protein-ligand binding sites. These binding sites are often well-conserved [18], but their segmented nature on sequence level has to be taken into account for sequence similarity searches ( Figure 1B). In accordance with previous work [19,20], we define an interface between two proteins to consist of two faces. Similar faces can be clustered geometrically into face types, and similar interfaces can be clustered into interface types [20].
Many approaches for interaction prediction and function annotation require structural knowledge about the protein of interest [21][22][23]. By waiving this requirement, interaction prediction is applicable to a much wider range of sequences but becomes a substantially harder problem. It has been addressed previously (see e.g. [24,25]). Most notably, Li and Li [13] discover stable and significant interface motifs and represent them with regular expressions, while Espadaler et al. [12] prove the usefulness of HMMs for this endeavor. Both approaches use single structural templates as seeds for generalization with further sequences, coming from either similarity search or random generation. However, several structures for a particular kind of domain-domain interaction are available, each providing new insights into the sequential variability of the actual interacting residues. Novel to our method is the incorporation of as many structures as possible for each binding site descriptor. The benefits are intriguing as protein-protein interactions from complex structures are considered to be the most reliable source of interaction data.

Face descriptors
We compiled a library that comprises profile HMM descriptors for 740 protein-protein and 3000 protein-ligand binding sites in the Protein Data Bank PDB [26]. Each descriptor describes one face. These descriptors, totalling more than 3,740, characterize an interaction/ligand binding site on sequence level. Hence, given a query sequence of interest, it is possible to compare it to each interface descriptor, thus identifying binding sites to possible interaction partners including ligands. Gene Ontology (GO) [27] annotations are linked to each descriptor from the original PDB entries that were used for its construction. The complete list of profile HMM descriptors is directly usable with the HMMer package [28] and is freely available for academics upon request.

PPI descriptor construction
Based on the family level of the Structural Classification of Proteins, SCOP [29], we can extract and classify all domain-domain interactions found in the PDB. This classification is available in the SCOPPI database [30]. As pointed out by Kim et al., even homologous domain pairs can associate in geometrically different ways by employing different sets of residues to form interfaces [19,20]. Consequently, the corresponding interface profiles would differ substantially and combining the information about interacting residues to a profile would be meaningless. However, often a number of domain-domain interactions expose striking similarities and it is desirable to collect all instances of one interface type for the calculation of the respective interface profile. We therefore compose descriptors for all interface types in SCOPPI by combining all instances of that interface type. When data for interface types is sparse, we utilize sequence data provided by HSSP [31]. Often several sequentially remote segments contribute to a binding site. To accommodate for this phenomenon, we adopt the multiple-motif approach from PRINTS [32], MAST [17] and Meta-Meme [33] to represent binding sites as a collection of small HMMs for one local binding motif. Thus we describe only the important sequence parts that form a structural feature. To represent the full sequence space of a whole family with a weight matrix or a profile-HMM, a large number of sequences is required, in particular for families of strong sequence variability. However, considerably fewer sequences are needed for short, conserved motifs.

PLI descriptor construction
As large protein complexes are often difficult to crystallize, knowledge about protein-protein interactions can be drawn from the more abundant protein-peptide interactions. The descriptor construction method can be extended to these cases in particular and to PLI in general. We construct HMM profiles for faces that bind to small molecules and peptides. To this end we scanned PDB for most frequently occuring co-crystallized ligands. We identify the residues surrounding the ligand incl. possible cofactors. A profile is solely built from one structural template and aligned sequences utilizing HSSP [31].

Evaluation
In order to assess the significance of HMM scores, a number of comparisons to shuffled databases were conducted. Figure 2B demonstrates the expressive power of low E-values achieved with interface type HMMs against Swiss-Prot vs. shuffled Swiss-Prot. Random hits generally only occur with E-values above 1, while hits in Swiss-Prot below 1 can be considered significant.

Assessing the performance of PLI descriptors
Benchmarking HMMer E-values Expectation values provided by the HMMer software are a means for assessing the significance of HMM hits. As demonstrated by Li et al. [13], the statistical evaluation to randomness can be used to establish a Z-score to distinguish significant from random hits. Here, we use comparisons to shuffled databases to gain further information about the significance: by calculating the ratio of best E-values of hits from shuffled and not shuffled sequence databases. For database shuffling, we generated a random permutation for every single sequence in the database. In the sequel, we demonstrate the use of shuffled databases for one particular ligand -adenosine triphosphate (ATP).

Case study: ATP-binding sites
We evaluate ligand-binding descriptors for the case of ATP. The descriptors for ADP-binding and ATP-binding were run against 205,000 Swiss-Prot-annotated sequences and could extract 10,349 hits whereof 9,255 were true positives and 1,094 false positives. Given that 36,774 sequences were annotated as ATP-binding, this amounts to 25.2% recall and a precision of 89.4%. Prosite's P-loop motif recognizes 24.87% but in contrast produces 6792 false positives (precision: 57.4%). Figure 4 shows an overview of these numbers.
The precision-recall curve ( Figure 2A) for our descriptors was generated from true/false positive rates at different Evalue thresholds (Table 1). Scores for regular expressions Constructing a set of sequence profiles to represent a conserved structural feature Figure 1 Constructing a set of sequence profiles to represent a conserved structural feature. Caspase's active site is highly conserved (1ICE, conservation levels are calculated using the von-Neumann entropy and displayed in a color gradient from blue (variable) to red (conserved)). Conserved residues in close vicinity of the tetrapeptide inhibitor largely define the catalytic site environment. Caspase residues within 5 Å of the inhibitor are underlined. Segments are patched and those with low conservation are discarded to avoid insignificant hits. We add the amino acid distribution from HSSP data for each site of the remaining segments. It is thus possible to construct HMMs and visualize the profiles as sequence logos [40].

A B
Assessing accuracy and significance of ATP-binding descriptors Figure 2 Assessing accuracy and significance of ATP-binding descriptors. A. Precision-recall curve for ATP-binding descriptors derived from protein structures with bound ATP or ADP tested against Swiss-Prot, shown as curve with red circles. Each circled point corresponds to a different E-value cutoff. The Prosite patterns for "ATP-binding" and "ADP-binding" are included as well (green crosses). Overall, Prosite achieves a recall of 31% with a precision of 62% (blue cross). For all E-values, our method performs better than Prosite. B. Distribution of E-values for the ATP-binding descriptors. To assess the significance of hits, the descriptors were tested both against Swiss-Prot (black line) and a shuffled Swiss-Prot version (red line). The cumulative number of hits below a certain E-value threshold is shown. The inlet shows a magnification of the lower right corner. Below an E-value of 1 (dotted vertical line), ~53,000 hits are found in Swiss-Prot whereas only ~1,200 hits are found in the shuffled Swiss-Prot.
from Prosite are generally below this curve. Prosite improves on this by adding highly specific full length sequence profiles with high precision but very low recall.
Some descriptors exhibited great similarity to the P-loop motif, e.g. the descriptor derived from PDB entry 2BEK, a chromosome partitioning ATPase ( Figure 5): the central Figure 3 Correlation of length and quality of HMM descriptors. ATP-binding descriptors as well as face type descriptors for protein-protein interactions were run against original and shuffled versions of Swiss-Prot and uncharacterized NCBI sequences. We define the length of a profile Hidden Markov Model descriptor as its number of states. Quality is measured as difference between log E-values of best hit against original sequences and shuffled sequences. For Swiss-Prot, longer descriptors have better quality and therefore produce more significant hits. For uncharacterized sequences, this does not hold. One explanation could be that these sequences are depleted of significant matches by similarity searches.

Correlation of length and quality of HMM descriptors
binding motif conserves Glycine and Lysine that build hydrogen bonds to ATP's phosphate tail. The Prosite pattern only disagrees on its first position (requires Alanine or Glycine, mostly Glutamine found) and does not give any specification to the following four residues. These however, are found to be well conserved in a wide variety of ATP-binding homologs. We thus have automatically developed a variation of the P-loop motif which is more precise for a range of orthologous chromosome partitioning ATPases that are picked by HSSP in other species.
Figures 3A-D illustrate the use of E-value ratios of unshuffled vs shuffled sequence databases. In case of the ATPbinding descriptors searched against Swiss-Prot, the comparison of the two E-value distributions (shuffled/not shuffled) allows the identification of a significance threshold ( Figure 2B).
Note that small motifs like e.g. the poly-proline (PxxP) motif occur frequently in sequence where only 5% are functional. Nevertheless, hits of small motifs are helpful to identify candidate sets that should undergo manual postprocessing.

Assessing the performance of PPI descriptors Cross validation with structure data sets
The initial data set comprised 740 interface descriptors of protein-protein interactions, each having at least three non-redundant instances (not more than 98% sequence identity). In order to check the ability of face descriptors to recognize faces from structures in the PDB, we performed 2-, 4-and 8-fold cross validations on interface types with at least 8 non-redundant instances. This yields a set of 45 interface types, i.e. 90 face types. This set was run against domains from PDB that are classified by SCOP. For 61 face types (67%), a reasonable recall of 70%-100% was achieved (additional file 1). Face types with low recall often have small interfaces with short, dispersed segments producing insignificant hits or have low face conservation. Another source of errors is misalignments of sequences of an interface type. In five cases, the recall could be improved by adding homologous sequences from HSSP [31]. The recall for interaction prediction by requiring both face types to be present is upper bound by the minimal recall of both faces. Hence, the average recall for interface detection dropped to 39%. This problem is most eminent for predicting interactions of promiscuous faces. Using the Structural Classification of Protein-Protein Interactions (SCOPPI [33]), it is then still possible to provide candidate interaction partners. In particular, for dedicated faces, i.e. those with just one opposite binding face, recognition of one face type suffices.

Literature protein-protein interaction benchmark
To investigate how well our method is suited to detect protein-protein interactions, we benchmark it against a set of high quality literature-curated interactions. To this end, we use a subset of NetPro, an expert curated and annotated database containing ~15,000 protein-protein interactions [34]. These were extracted from PubMed abstracts by a semi-automated method and then cross-checked by human experts.
Using 740 multiple motif descriptor pairs, we search the NetPro benchmark set for matches where a descriptor pair matches two interacting proteins. If we maximally relax our E-value cutoff criterion, we are able to predict ~80% of the literature interactions. At a stricter E-value of 0.001 we still validate 230 interactions (See details in Table 2).

Using the descriptors to annotate uncharacterized sequences
We obtained a corpus of 32,000 uncharacterized proteins from the NCBI's non-redundant protein sequence database. Face descriptors were run against these sequences and a shuffled version. The result is shown in Figure 3D.

ZP_00820831 --YSCRAGICGSCRITlVD
This suggests that uncharacterized Fe-S protein could be part of the Complex II of Yersinia bercovieri interacting with succinate dehydrogenase flavoprotein subunit. The latter was found to be present in the available Yersinia pestis genome via sequence similarity search.
In order to estimate the quality of descriptors, E-value ratios for all descriptors in shuffled and original databases were analyzed in dependence of descriptor length ( Figure  3). Not surprisingly, most descriptors perform well against SwissProt, in particular long descriptors. Significant hits are rarer in uncharacterized sequences, which had less or no influence in the generation of descriptors ( Figure 3C,D). Interestingly, the best results were achieved by short and medium size descriptors. This suggests that long descriptors are less likely to discover binding behaviour in unknown sequences.

Applications
The match of a descriptor pair allows identification of the putative residues responsible for the interaction. This information can be used to guide site-directed mutagenesis experiments that aim at disrupting the interaction.
The proposed method for binding site prediction can be applied by itself or in combination with other methods. A common technique for protein interaction prediction is by identifying interacting homologues. This concept of socalled interologues was first noted by Walhout et al. [35]  ATP binding motifs Figure 5 ATP binding motifs. The ATP-binding descriptor derived from PDB entry 2BEK and HSSP is compared to the the Ploop pattern (below). A single conflict occurs in the first position, and more specific detail is given about the second to the fifth residue. The descriptor therefore correctly detects other chromosome partitioning ATPases.

[AG]-x(4)-GK-[ST]
and was applied in PPI prediction in e.g. [36,37]. The usage of descriptors can serve here as a refinement: the assumption that the residues responsible for the interaction are present can be easily confirmed by descriptor matches of the according families. Since a match provides residue correspondences to the original structures used to create the descriptor, the match alignment serves as an initial setup for homology modeling of the interface region.

Limitations
For some interface types only few or no structures are available. This implies that protein-protein interaction descriptors are inaccurately or not at all represented by the descriptor library. On the other side, a pair of homologous sequences from HSSP does not necessarily preserve the interaction of the structural template and thus does not belong to a certain interface type. These sequences "contaminate" the alignments of interface descriptors. It is therefore essential to assure that recruited sequence pairs not only maintain an interaction but also agree in interface type, i.e. have similar sets of interface residues.
A related problem is that of interaction specificity. An interface descriptor with N and M hits for each face, respectively, induces N × M candidate interaction pairs. Aytuna et al. [38] argue analogously for structural face descriptor pairs and Deane et al. [37] verify interactions between pairs of yeast proteins by known paralogous interactions. Although the latter report only 1% false positives, results should be -as with any computational method -ideally supported by further evidence from experimental results.
Our technique to construct interface descriptors is inadequate for short, strongly dispersed or highly variable interfaces (e.g. loops in immunoglobulins). However, it is possible to create a descriptor with our method that spans over the surrounding secondary structures, which are often well conserved.

Conclusion
We provide a library of Hidden Markov Model based descriptors that capture important structural features such as protein interfaces, ligand binding sites and active sites of enzymes. The implications for predicting binding sites and binding partners of proteins are many-fold. It provides insights into the biological processes the matched protein might be involved in. Furthermore the method can pinpoint interacting residues. It thus bears the potential for functional annotation and for assisting in discovering new drug targets.
Cross validation with the available structural data for a number of interface types reveals that two thirds of the face descriptors have a recall between 70% and 100%. Interaction prediction by recognizing both faces is intrinsically harder than just one-sided binding site detection. The cross validation results reflect this fact, as the recall for predicted interactions drops to 39%.
To demonstrate the biological significance of our descriptors, we compare the descriptors to NetPro, a PPI database with literature evidence. This way we could validate the predicted interactions and moreover provide insights about the critical interacting residues.
We created a benchmark for ATP-binding site detection. From a database of Swiss-Prot annotated sequences, our descriptors successfully recognized 30% while producing much less false positives than Prosite's regular expression for the P-loop motif.
Finally, an example for a significant hit for a binding site in an uncharacterized protein from Yersinia bercovieri is presented, which suggests a possible function as Complex II (succinate dehydrogenase) subunit for this protein.

Methods
The workflow of our method is illustrated in Figure 6. The method uses protein structural data that describes a bind- ing site to generate Hidden Markov Models. These are then used to search Swiss-Prot and uncharacterized sequences.

Protein-protein interaction (PPI) descriptors
For PPI descriptors, data is taken from the SCOPPI database [30]. SCOPPI is a collection of domain interactions and their interfaces of proteins in the Protein Data Bank (PDB). An interface in SCOPPI consists of two faces. Similar faces are clustered into face types, and similar interfaces are clustered into interface types [20]. For this study, we use a non-redundant set of domain sequences for every interface type at 98% sequence identity. SCOPPI provides a multiple sequence alignment (MSA) constructed with MUSCLE [39] for domains of the same SCOP [29] family. Figure 7 shows such an alignment for one face type. Interacting residues are depicted in uppercase. Columns of the MSA are marked as interface columns if 50% of the sequences contain an interacting residue at that position. Interface columns are extended by first adding adjacent columns (flanking) and then filling gaps of length 1 (padding). The results are segments of continuous interface columns. For every segment, a Hidden Markov Model (HMM) is generated with the HMMer software [28]. Segment HMMs are merged into one HMM by linking them with insert states. The resulting HMM serves as face type descriptor for a PPI. The linking of HMMs is adopted from Meta-Meme [32], with the difference that insert and delete states remain unchanged in our approach. The probability for a self loop in the segment linking insert states is set to l/(l+1), where l is the average length of the linker region between two interacting segments. Finally, the merged HMMs are calibrated using HMMer's calibration with 5000 random sequences. The resulting HMMs are directly usable with the HMMer package [28] which provides sound E-value calculation without assuming segment scores to occur independently.
Despite the fact that homodimers account for a large set of interface types, we omit them because it is often unclear whether a homodimer interface is genuine or an artifact from crystallization.

Protein-ligand interaction (PLI) descriptors
Due to the current lack of a comprehensive classification for geometrical associations of protein-ligand interactions, we generate PLI descriptors from single structural templates: 0. For each structure in the PDB containing a protein and a ligand:

Select a ligand
Work flow Figure 6 Work flow. a) All instances for interactions between family A and family B with identical geometric interface classification are retrieved from SCOPPI. Interface residues are indicated in the accompanying multiple sequence alignments. b) Interface columns are defined by columns with more than 50% interface residues. Conservation of residues is calculated by using the von-Neumann-Entropy in combination with the substitution matrix BLOSUM62 (details are given in [18]).
We evaluate the descriptors' accuracy in terms of standard precision and recall, where precision is defined as TP/ (TP+FP) and recall is defined as TP/(TP+FN). TP, FP, and FN denote the numbers of true positives, false positives and false negatives, respectively.