The FunFOLD2 server for the prediction of protein–ligand interactions

The FunFOLD2 server is a new independent server that integrates our novel protein–ligand binding site and quality assessment protocols for the prediction of protein function (FN) from sequence via structure. Our guiding principles were, first, to provide a simple unified resource to make our function prediction software easily accessible to all via a simple web interface and, second, to produce integrated output for predictions that can be easily interpreted. The server provides a clean web interface so that results can be viewed on a single page and interpreted by non-experts at a glance. The output for the prediction is an image of the top predicted tertiary structure annotated to indicate putative ligand-binding site residues. The results page also includes a list of the most likely binding site residues and the types of predicted ligands and their frequencies in similar structures. The protein–ligand interactions can also be interactively visualized in 3D using the Jmol plug-in. Raw machine-readable data, which comply with the Critical Assessment of Techniques for Protein Structure Prediction data standards for FN predictions, are provided for developers. The FunFOLD2 webserver is freely available to all at the following web site: http://www.reading.ac.uk/bioinf/FunFOLD/FunFOLD_form_2_0.html.


INTRODUCTION
Proteins have an essential cellular role in all living organisms; thus, they are crucial in the maintenance of cellular and organism homeostasis. The ubiquitous role of proteins in cellular systems makes the determination of protein function (FN), ligand-binding site residues and potential binding partners essential for gaining a more in-depth knowledge of cellular functionality (1)(2)(3). The predicted structure of a protein can aid in the determination of its cellular function, and hence bioinformatics tools such as the FunFOLD2 server have been developed to predict protein-ligand binding via the use of 3D models (1,2).
The FunFOLD2 server integrates our cutting-edge function prediction algorithms to predict protein-ligand binding sites from a single sequence via the production of 3D structures using the IntFOLD2-TS protocol (4). The server is intended for use by expert and non-expert users alike. Non-expert users can use the ligand-binding site predictions as a guide to the likely binding sites and potential ligands, whereas expert users can look more closely into the data provided. For submission of a query sequence, an easy-to-use web interface is available, which allows the non-expert user to obtain a variety of protein function predictions, including ligand-binding site residues for the top predicted binding site, putative binding site ligands, 3D models of the likely protein-ligand interactions, protein-ligand binding site feature scores (1) used to predict the overall quality of the prediction [predicted Matthews Correlation Coefficient (MCC) (5) and Binding-site Distance Test (BDT) (6) scores] and the probability that each proposed binding site residue binds to particular ligand types (ions, organic ligands, peptides and nucleotides). In addition, users have the option of downloading the 3D structure, which includes the superposed predicted ligands within the predicted binding site. A comprehensive help page is included for the server, containing details on the required input and output from the server and an example results page.
The original FunFOLD server (2) has been operational since late January 2011, and the outputs have been extensively used by researchers from within the UK, France [external groups at Reading (7,8) and Genoscope] and international groups during the CASP10 prediction season, which ran from April to August 2012. This article describes a novel server implementation of the FunFOLD (2) protocol that now includes improved model ranking using local binding site residue quality scores [FunFOLDQA (1)], along with the addition of new output scores. The FunFOLD2 predictions were independently validated by the CASP10 assessors using numerous performance benchmarks. The server is also being continuously evaluated as part of CAMEO (9) (Continuous Automated Model EvaluatiOn), in the ligand binding prediction category (http://www.cameo3d.org). Although other freely available servers exist for the prediction of function and ligand-binding site residues (10)(11)(12)(13)(14)(15)(16)(17), to our knowledge, the FunFOLD2 server is the first server to directly integrate a ligand-binding site quality assessment method for use in protein function prediction.
Figure 1 shows the implementation of the FunFOLD2 server, which emphasizes the interdependency between the FunFOLD (2) and the FunFOLDQA (1) algorithms. The first key step is the generation of IntFOLD2-TS models (4), ranked according to ModFOLDclust2 (18) global model quality scores. The ranked models and a list of non-redundant parent templates are subsequently processed by the FunFOLD algorithm. For each model, the FunFOLD algorithm produces a list of residues from the target sequence that are most likely to bind a ligand [in Critical Assessment of Techniques for Protein Structure Prediction (CASP) FN format], along with a list of putative binding ligands.
The FunFOLD results for each model, along with the model used and parent template list, are then fed into the FunFOLDQA algorithm, which assesses both global and per-residue ligand-binding site prediction quality. The FunFOLDQA algorithm outputs the predicted BDT (6) and predicted MCC (5) scores (and the component feature scores from the neural network inputs). In addition, the propensity of each predicted ligand-binding site residue to be in contact with each of the four ligand types is assessed [Ions (I), Organic ligands (O), Nucleotides (N) or Peptides (P)], as specified in the CAMEO (9) Ligand Binding (LB) category. Finally, predicted MCC and BDT scores are used to rank the FunFOLD predictions, and the top-ranked prediction is output to the web server as the best prediction. The use of FunFOLDQA to rank FunFOLD predictions, on the top 10 models, has been shown to result in significantly improved predictions.
The FunFOLD method (2) is based on one key concept: proteins containing ligands within the PDB, with a similar fold to the 3D model of the target protein being studied, are likely to have similar binding sites (2). A ligand is defined as a biologically relevant molecule, which binds to a structurally elucidated enzyme in the PDB. The FunFOLD algorithm uses the TM-align (19) method to superpose templates containing biologically relevant ligands with the predicted 3D structure from IntFOLD2-TS. Each model-to-template superposition is subsequently used if the TM-score is ≥0.4 [TM-scores from 0.4 to 0.6 have previously been shown to mark the transition from unrelated to significantly related folds (20)]. The superpositions are then combined and reoriented using a PyMOL script (http://www.pymol.org), to determine putative ligand clusters. An agglomerative hierarchical clustering algorithm is subsequently used to identify each continuous mass of contacting ligands, thus suggesting potential ligand-binding pockets.
The criterion for determining a contact between ligands is a distance of less than or equal to the van der Waals radius of an atom plus 0.5 Å. The cluster with the largest number of ligands is then selected as the location of the ligand-binding site pocket. To determine the ligand-binding site residues in the selected binding pocket, a novel 'residue voting' algorithm is used. Residues are determined to be in contact with the ligand cluster if the residue has at least one contact with at least 2 ligands and at least 25% of the ligands within the cluster. The criterion for determining if a residue is in contact with a ligand is a distance of less than or equal to the van der Waals radius of an atom plus 0.5 Å. The output from the FunFOLD algorithm is a list of putative ligand-binding site residues plus a list of the ligands within the binding site cluster (2).
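The ligand clustering and residue voting steps described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the FunFOLD implementation. It makes several assumptions that are not stated in the text: atoms are represented as hypothetical `(x, y, z, vdw_radius)` tuples, the contact cut-off is taken as the sum of the two van der Waals radii plus 0.5 Å, and the "continuous mass of contacting ligands" is found with a union-find pass, which gives the same groups as single-linkage agglomerative clustering cut at the contact distance. All function names are invented for illustration.

```python
from math import dist

CONTACT_MARGIN = 0.5  # Å, tolerance added to the van der Waals radii (from the text)


def in_contact(atoms_a, atoms_b, margin=CONTACT_MARGIN):
    """True if any atom pair is within the contact distance.

    Atoms are (x, y, z, vdw_radius) tuples. Using the SUM of both radii
    is an assumption made for this sketch.
    """
    return any(
        dist(a[:3], b[:3]) <= a[3] + b[3] + margin
        for a in atoms_a
        for b in atoms_b
    )


def cluster_ligands(ligands):
    """Group ligand indices into connected masses of contacting ligands.

    Returns a list of clusters (lists of indices), largest first.
    """
    parent = list(range(len(ligands)))

    def find(i):
        # Follow parent links to the cluster root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(ligands)):
        for j in range(i + 1, len(ligands)):
            if in_contact(ligands[i], ligands[j]):
                parent[find(i)] = find(j)  # merge the two clusters

    clusters = {}
    for i in range(len(ligands)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values(), key=len, reverse=True)


def vote_residues(residues, ligands, cluster, min_ligands=2, min_fraction=0.25):
    """Residue voting: keep residues that contact at least `min_ligands`
    ligands and at least `min_fraction` of the ligands in the cluster.

    `residues` maps a residue number to its list of atoms.
    """
    selected = []
    for res_id, atoms in residues.items():
        n_contacts = sum(in_contact(atoms, ligands[i]) for i in cluster)
        if n_contacts >= min_ligands and n_contacts >= min_fraction * len(cluster):
            selected.append(res_id)
    return sorted(selected)
```

In this sketch the predicted site is obtained by taking the first (largest) cluster from `cluster_ligands` and passing it to `vote_residues`; the pairwise loops are quadratic in the number of ligands, which is acceptable for illustration but would need a spatial index for large superposed sets.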

IMPLEMENTATION
The FunFOLD algorithm has been extensively benchmarked on both the CASP8 and the CASP9 data sets (2). An early implementation of the algorithm was used in the CASP9 competition (2010), where it ranked amongst the top 10 methods (21). For a more in-depth description of the FunFOLD algorithm and methods benchmarking, see Roche et al. (2). Furthermore, the FunFOLD algorithm has recently been used in two large-scale genomic studies (7,22).

FunFOLDQA
The FunFOLDQA algorithm (1) was developed to help determine the reliability of our FunFOLD predictions (2) by the assignment of quality assessment scores. The final binding site quality scores that the method produces are based on the MCC (5) and BDT (6) metrics, which are used for the assessment of ligand-binding site residue predictions compared with crystal structures. The FunFOLDQA method combines four binding site-dependent protein feature scores and one structure-dependent feature score, using a neural network, trained on either the MCC or BDT metrics, to produce local ligand-binding site quality predictions. The five feature scores are called: 'BDTalign, Identity, Rescaled BLOSUM62, Equivalent Residue Ligand Distance and Model Quality'. The 'BDTalign' score establishes the distance between residues that are equivalent between the model binding site and each template binding site. The 'Identity' score compares binding site residues that are 'equivalent' in 3D space between the model and template binding sites, according to their amino acid sequence. The 'Rescaled BLOSUM62' score is similar to the 'Identity' score, but it scores equivalent residues between the model and template binding sites using the BLOSUM62 scoring matrix. The 'Equivalent Residue Ligand Distance' score scores the equivalent residues between the model and the template in relation to their distance from the bound ligand. The 'Model Quality' score is the global quality score for the starting model, calculated using ModFOLDclust2 (18). For a detailed description of the scoring metrics and their associated algorithms, see Roche et al. (1).
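For readers unfamiliar with the MCC metric on which one of the FunFOLDQA networks is trained, the following minimal sketch computes the observed MCC for a binding site prediction from the standard formula MCC = (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)). It is not part of FunFOLDQA, which predicts this score without knowledge of the observed site. The function name, the set-based representation of residues and the convention of returning 0.0 when the denominator is zero are assumptions made for illustration.

```python
from math import sqrt


def binding_site_mcc(predicted, observed, sequence_length):
    """Matthews Correlation Coefficient for a binding site prediction.

    `predicted` and `observed` are collections of residue numbers and
    `sequence_length` is the total number of residues in the target.
    """
    predicted, observed = set(predicted), set(observed)
    tp = len(predicted & observed)        # correctly predicted site residues
    fp = len(predicted - observed)        # predicted but not observed
    fn = len(observed - predicted)        # observed but missed
    tn = sequence_length - tp - fp - fn   # everything else
    denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # assumed convention when the score is undefined
    return (tp * tn - fp * fn) / denominator
```

A prediction that matches the observed site exactly scores 1, whereas each additional false-positive residue lowers the score; the size of that penalty depends on the sequence length, which must be supplied by the caller.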
In addition to combining the strengths of the FunFOLD and FunFOLDQA algorithms, the FunFOLD2 server also integrates a new metric for both the global and per-residue scoring of functional propensity (see Supplementary Methods). The function is determined as the propensity of binding to specific ligands: Ions (I), Organic ligands (O), Nucleotides (N) or Peptides (P), in accordance with CAMEO LB category requirements. All quality scores are between zero and one, with scores close to one signifying a high confidence prediction and scores close to zero signifying a low confidence prediction.

INPUTS AND OUTPUTS
The FunFOLD2 server provides an easy-to-use web interface for submission of jobs: the only input required is a protein sequence in single-letter amino acid code. Optionally, users may also provide a name for the protein sequence and an email address. On submission of the sequence to the server, a unique URL is generated for the output, which can be bookmarked. Additionally, if a user has provided an email address, an email will be sent containing a reminder of the results URL once the job has been completed. The time for job completion is in line with similar ligand-binding site prediction servers, which can take >24 h to return results, although typically users should expect to receive their results within the same day. Several factors influence server response time, including the server load, the size of the protein sequence under analysis and the number of templates available.
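Before submitting a sequence, it can be useful to check locally that it contains only single-letter amino acid codes. The sketch below is one possible pre-submission check and does not reflect the server's own validation, which is not described here. The function name and the restriction to the 20 standard residues are assumptions.

```python
import re

STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")  # 20 standard one-letter codes


def clean_sequence(raw):
    """Strip whitespace, upper-case the sequence and reject unknown letters.

    Raises ValueError for an empty sequence or for any character outside
    the 20 standard one-letter codes (an assumption of this sketch).
    """
    sequence = re.sub(r"\s+", "", raw).upper()
    if not sequence:
        raise ValueError("empty sequence")
    invalid = sorted(set(sequence) - STANDARD_AMINO_ACIDS)
    if invalid:
        raise ValueError("unexpected characters: " + ", ".join(invalid))
    return sequence
```

The cleaned string can then be pasted into the submission form; ambiguous codes such as X or B are rejected by this sketch, which may be stricter than necessary.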
The server results page contains a graphical representation of the ligand-binding site, with predicted ligands and binding site residues highlighted (Figure 2), rendered using PyMOL (http://www.pymol.org). Additionally, a list of ligand-binding site residues, predicted ligands and binding site residue propensities is provided in CAMEO format. Furthermore, an interactive model with predicted binding site residues and ligands can be visualized using the Jmol plug-in (http://jmol.sourceforge.net/). A link to download a PDB file of the top model with the putative ligands is also provided. If the user provides an email address on submission of their

CASE STUDY-AMINOPEPTIDASE N FAMILY PROTEIN (PDB ID 4fgm AND CASP10 T0726)
The aminopeptidase N family protein Q5QTY1 from Idiomarina loihiensis (PDB ID 4fgm and CASP10 target T0726) provides an example of output from the FunFOLD2 server (Figure 2), where using FunFOLDQA to rank predictions results in improved prediction quality. Figure 2 (1,2) and in two recently completed genomic-scale studies (7,22).

LIMITATIONS
Predicting ligand-binding site data is a difficult task, and there are several limitations to current prediction methods.
The following is a list of the most common limitations specific to the current implementation of the FunFOLD2 server: (i) If the server is unable to build a starting model for the target sequence, then it cannot predict any ligand-binding sites, although, fortunately, for the majority of protein targets, a reasonable 3D model can be obtained. (ii) If no structural similarity can be found between the target model and structurally elucidated proteins with bound biologically relevant ligands, then a prediction is not made. (iii) Only one ligand-binding site is predicted per target sequence: the site with the largest identified ligand cluster. However, the server does also provide the data showing all putative ligand clusters, and these clusters can be made visible to users on the results page using the Jmol plug-in. (iv) The FunFOLD2 server currently outputs predictions based on the best predicted IntFOLD2-TS model, and the top predicted model may not always be the best model.

CONCLUSIONS
The FunFOLD2 server provides biologists with an intuitive interface for the prediction of protein-ligand interactions from amino acid sequences. Graphical output and plug-ins are provided to facilitate interactive visualization of predicted interactions in 3D, and machine-readable files are provided for developers. The algorithms within the FunFOLD2 server have been independently tested in the recent international CASP10 competition where the method was found to rank amongst the top few. Additionally, both the FunFOLD (2) and FunFOLDQA algorithms have been extensively benchmarked on both the CASP8 and CASP9 data sets (1,2).

Figure 2. Integration of FunFOLD (2) and FunFOLDQA (1) into the FunFOLD2 server improves predictive quality. Example of a binding site prediction from CASP10 target T0726 comparing the FunFOLD2 server with the original FunFOLD method. The green sticks represent residues in the model that have been correctly predicted as binding to the ligands. The red sticks represent residues that were incorrectly predicted as potential ligand-binding residues. The blue sticks represent the observed ligand-binding site residues in the experimental structure. The white spheres represent ligands either predicted (B and C) or observed (A). (A) An example of the observed CASP10 target T0726 (4fgm), with the observed binding site residues (273, 277 and 307) and ligand (ZN) shown. (B) The predicted binding site from the original FunFOLD method for T0726 with the predicted binding site residues (273, 277, 307 and 310) and ligands (ZN-8) shown, with a predicted MCC score of 0.872 and a predicted BDT score of 0.777. (C) An example where FunFOLD2 produces a perfect prediction for CASP10 target T0726 (4fgm), with the predicted binding site residues (273, 277 and 307) and ligands shown. In this case, the predicted MCC score is 0.882, and the predicted BDT score is 0.801.