Exploring the potential of 3D Zernike descriptors and SVM for protein–protein interface prediction

The correct determination of protein–protein interaction interfaces is important for understanding disease mechanisms and for rational drug design. To date, several computational methods for the prediction of protein interfaces have been developed, but the interface prediction problem is still not fully understood. Experimental evidence suggests that the location of binding sites is imprinted in the protein structure, but there are major differences among the interfaces of the various protein types: the characterising properties can vary a lot depending on the interaction type and function. The selection of an optimal set of features characterising the protein interface and the development of an effective method to represent and capture the complex protein recognition patterns are of paramount importance for this task. In this work we investigate the potential of a novel local surface descriptor based on 3D Zernike moments for the interface prediction task. Descriptors invariant to roto-translations are extracted from circular patches of the protein surface enriched with physico-chemical properties from the HQI8 amino acid index set, and are used as samples for a binary classification problem. Support Vector Machines are used as a classifier to distinguish interface local surface patches from non-interface ones. The proposed method was validated on 16 classes of proteins extracted from the Protein–Protein Docking Benchmark 5.0 and compared to other state-of-the-art protein interface predictors (SPPIDER, PrISE and NPS-HomPPI). The 3D Zernike descriptors are able to capture the similarity among patterns of physico-chemical and biochemical properties mapped on the protein surface arising from the various spatial arrangements of the underlying residues, and their usage can be easily extended to other sets of amino acid properties. The results suggest that the choice of a proper set of features characterising the protein interface is crucial for the interface prediction task, and that optimality strongly depends on the class of proteins whose interface we want to characterise. We postulate that different protein classes should be treated separately and that it is necessary to identify an optimal set of features for each protein class.

Protein-protein interactions (PPIs) are of particular interest as they tell us how proteins come together to construct metabolic and signalling pathways in order to fulfil their functions [2]. Dysfunction or malfunction of pathways and alterations in protein interactions have shown to be the cause of several diseases such as neurodegenerative disorders [3] and cancer [4], and hence the identification of the exact location on a protein's surface where it is likely to bind to its partners, i.e. the binding interface, has become one of the most popular targets for rational drug design [5]. In addition to practical applications, reliable identification of protein-protein interfaces is an important goal for basic research on the mechanisms of macromolecular recognition. For instance, PPI interface predictions can greatly aid protein-protein docking algorithms by being used in scoring functions or to constrain the available search space [6][7][8].
There are several experimental techniques available which can be employed for the characterisation of protein-protein interfaces at residual and even atomic level. For instance, both X-ray crystallography [9,10] and nuclear magnetic resonance (NMR) spectroscopy [11] have been used to determine protein interfaces at atomic level. Cryo-electron microscopy [12] has increasingly gained popularity as it allows the examination of native structural features of hydrated molecules in solution. Other techniques provide structural elucidation of interactions at lower resolutions. Alanine scanning mutagenesis [13], Hydrogen/Deuterium exchange [14] and chemical cross-linking [15] have been used to experimentally characterize protein-protein interfaces at residue level.
Although impressive progress has been made, there are several limitations to the existing experimental methods in the determination of protein-protein interfaces. X-ray crystallography requires crystallizing the specimen and placing them in non-physiological environments, which can be inherently difficult and occasionally lead to functionally-irrelevant conformational changes. NMR spectroscopy is suitable for macromolecules in solution (closer to real functional environments or foldings) and can yield information on the dynamics of various parts of a given the protein or complex, but its applicability is limited to small polypeptides (less than 50 kDa). Cryoelectron microscopy has no sample size constraints and can guarantee a reduced radiation damage to the sample compared to X-ray crystallography, but is generally more difficult, time consuming, and requires operating constantly at temperatures lower than -135°C. These technical challenges make such experiments both labourintensive and time-consuming, while on the other hand, the ongoing proteomics and structural genomics research continues producing large amounts of data, which need to be interpreted in a timely manner. Efficient computational methods are therefore needed to correctly predict the potential binding sites for a deeper understanding of PPIs.
Several computational methods for the prediction of PPI sites are available to date [16] which can be roughly categorised into sequence-based and structure-based approaches [17,18]. In sequence-based methods, a sliding window of fixed length (typically varying from 3 to 30 residues) is scanned across the protein sequence and a number of overlapping local sequence segments are extracted. For each of these segments, a feature vector is constructed using various amino acid properties (physicochemical, statistical and structural features), and is used as the input of a classification problem. These methods are particularly useful as they allow the PPI site prediction when a protein's structure information is not yet available.
In [19], a two-stage classifier is employed consisting of a Support Vector Machine (SVM) and a Bayesian network classifier that identifies interface residues primarily on the basis of sequence information. A 9-residue-long sliding window is employed, which is encoded using a 20 bit per residue feature vector (180 bit) for the first stage, and a 1 bit per residue (excluding the central one) feature vector (8 bit) for the second stage. In [20], a sliding window approach is combined with a Random Forests classifier to predict protein interaction sites using sequence information, both alone and in combination with structure-derived parameters. The input feature vectors were derived using a window length of 9 residues and employing 17 features per residue. Murakami and Mizuguchi predict interaction sites in protein sequences with a Naïve Bayes classifier using sequence features only: a position-specific scoring matrix (PSSM) and the predicted accessibility [21]. In [22], 24 independent neural network models are built using sparsely encoded sequence features for each amino acid (20-dimensional binary encoding for each residue) and a PSSM, and the average score of the 24 predictors is returned as the final score. Sriwastava et al. employ 21-residue-long local sequence segment pairs of protein sequences to identify interaction sites in protein complexes [23]. The input samples are built by assigning 8 properties to each residue in the local sequence segment pair, yielding 2 × 21 × 8 = 336-dimensional feature vectors classified by an SVM. In [24], a wide range of features (physicochemical properties, evolutionary conservation, amino acid distances and a PSSM) is extracted from protein sequences without using any structure data, then, a random forest-based integrative model is employed to effectively utilize these features and to deal with imbalanced data. Garcia-Garcia et al. propose a sequence-based computational method that infers possible interacting regions between two proteins by searching minimal common sequence fragments of the interacting protein pairs [25]. A two-dimensional matrix is derived by computing a score for each pair of residues that relates to the presence of similar regions in interolog protein pairs. The potential interface regions are reflected in query proteins by representing the scoring matrix as a heat map.
Structural features associated with the atomic coordinates of proteins are important discriminative attributes for PPI interface prediction, and the absence of such information is therefore expected to reduce the performance of sequence-based predictors compared to structure-based ones. For instance, most interface residues are also located on the protein surface, so structure-based methods can simply identify surface residues and ignore all internal residues. PPI interfaces are comprised of residues that can be located close to each-other in 3D space, while having distant positions in the primary sequence of the proteins. Finally, geometrical complementarity can be evaluated from 3D structures. Structure-based computational approaches offer several advantages over sequence-based ones, but are limited by the availability of protein 3D structures. However, the number and quality of available protein 3D structures has been steadily increasing over the past years and several structural repositories are available to date (i.e. Protein Data Bank (PDB) [26], The PeptideAtlas Project [27], Global Proteome Machine Database (GPMD) [28], The Proteomics Identifications database (PRIDE) [29]), enabling the development of structurebased interface predictors. Currently, most structurebased machine learning interface predictors exhibit better performance than sequence-based methods [16].
Porollo and Meller use "fingerprints" derived from the difference between the predicted and actual relative accessible surface area (rASA) of residues as features for interface prediction [30]. The prediction of PPI sites is done by a consensus method that combines the output of 10 Neural Networks with majority voting. Kufareva et al. developed an alignment-independent method of PPI interface prediction from local statistical properties of the protein surface at the atomic-group level [31]. The classification is done using a partial least-squares regression algorithm on the solvent accessibility values of 12 significantly over-represented and under-represented atomic groups at the interface, and can be further complemented by evolutionary conservation scores. In [32], interface regions for a query protein are determined by clustering and ranking the known interfaces in structural homologs. Zhang et al. propose a structural homology-based PPI interface prediction method [33]. For each query protein, its structural neighbours are identified by structural alignment, and their interface is mapped onto the query protein structure. The frequency of the mapped contacts are calculated for each residue in the query protein, and a logistic function is used to normalize the contact frequencies and generate the final prediction score for each residue. In [34], information from both proteins in a complex is used to predict pairs of interacting residues from the two proteins. Sequence (PSSM and predicted rASA) and structure (rASA, residue depth, half sphere amino acid composition, protrusion index) information about residue pairs is captured through pairwise kernels that are used for training a SVM classifier.
Experimental evidence supports the hypothesis that the location of binding sites is imprinted in the structures of proteins, and that this information can be extracted even without the knowledge of the binding partner [17,35]. Interface surface portions share common physicochemical properties which distinguish them from the non-interface ones, thus, only specific areas of the protein surface are amenable to be engaged in PPIs. It has been observed that interaction sites are characterised by a high number of hot spots, i.e. energetically critical residues that contribute significantly to the free energy of binding [36]. Clusters of hydrophobic residues [37] and aromatic side chains [38,39] are more abundant in the binding site, while hydrophilic residues are infrequent. Aromatic residues can form strong hydrophobic interactions between the bulky hydrophobic side chains, and the parallel arrangement of two aromatic rings creates tighter packing with better geometric fit. Cys-Cys residue contacts and the contacts between residues with opposite charges are more frequent in PPI sites [39]. Besides, protein interface regions are less flexible [40] and demonstrate higher sequence conservation rates [38,41] than other non-binding regions. Conserved interfaces are critical for the maintenance of PPIs throughout evolution.
There are also differences among the interfaces of the various types of PPIs [2]. Depending on the interaction type and its function, the properties that characterise interfaces can vary a lot. For instance, various classes of PPIs differ on the interface propensities of residues [42]. Interfaces of homodimers (complexes made of identical protein chains) are rich in nonpolar and aromatic residues while depleted in polar and charged residues [43], except for Arg which is not excluded in spite of its charge [44]. Interfaces of permanent complexes (i.e. complexes where the constituent proteins remain irreversibly bound after the initial interaction) are more hydrophobic if compared to those of transient complexes (the two proteins can associate and dissociate during their lifetime) [45]. Proteins forming transient complexes should be stable on their own, thus their interfaces are less hydrophobic. The interfaces of obligate complexes (i.e. stable complexes whose constituent proteins do not exhibit well-folded structure when apart) present higher sequence conservation rates [46] and are more hydrophobic [47] than transient complexes. Salt-bridges and hydrogen bonds occur more frequently in the interfaces of transient complexes [2] while covalent disulphide bridges are quite rare, as they can be found in a few, relatively small, permanent complexes [48].
Proteins belonging to the same functional category recognize their interacting partners by certain types of molecular interactions that are specific to their protein family and local environments. As a result, proteins can show specific binding interactions according to their functional classes of PPI interfaces. In [49], basic differences between homodimeric, heterodimeric, protein-antibody and enzyme-inhibitor protein complexes are explored. Cho et al. [50] showed that three functional classes of transient complexes could be distinguished by only four interaction types (NH· · ·NH, ion-ion, amine-cation and C α − H· · ·O = C). Moreover, C α − H· · ·O = C interactions were found to be predominant in proteaseinhibitor interfaces while ion-ion interactions were found to be specific to signal transduction complexes. In [51], six types of PPI interfaces were studied and significant differences were found in their residue composition and their residue-residue contact preferences, in the interactions between permanent and transient interfaces, and between interactions associating homo-oligomers and hetero-oligomers. Antibody-antigen complexes were found to exhibit quite peculiar binding mechanism, as they do not undergo correlated mutations (the antibody adapts to bind a particular antigen) and their amino acid contact propensities are quite different from those of other protein complexes [52].
Although significant research has been done in the area of protein-protein interactions, the problem of PPI interface prediction is still not fully understood [23]. The selection of an optimal set of biological and physicochemical features characterising the protein surface is one of the main unresolved issues. There are no known features which can singularly distinguish between interface and non-interface regions of the protein surface, and, the complex, non-linear combinations of features required to describe interaction sites can vary widely from one class of PPIs to another. Moreover, protein interface prediction is an imbalanced classification problem, because the the number of interacting residues of a protein is generally much smaller than that of non-interacting ones. Despite these limitations, several computational methods were reported to achieve good performance in the task of interface prediction for specific protein classes. In [53], Gao et. al. predict interface residues in enzymes with a Random Forest classifier employing the maximum relevance minimum redundancy method followed by incremental feature selection. In [54], a genetic algorithms which searches for known interface 3D templates is used to predict enzyme binding sites. In [55], B-cell epitopes (antigen interface) are predicted from the corresponding protein sequence using a combination of two classifiers, a naïve Bayesian and a random forest classifier, through a voting algorithm. Jespersen et. al. predict B-cell epitopes from antigen sequences with a random forest algorithm trained on the interfaces of known antibody-antigen protein complexes [56]. In [57], paratope (antibody interface) prediction is carried by deriving a set of consensus regions from the structural alignment of known sequentially similar antibodies. In [52], antibody-specific statistics are used to annotate residues with a score indicating their likelihood to belong to the antibody paratope.
In view of the above, we decided to perform binding interface prediction on different classes of proteins in order to gain a better understanding of the various PPI interfaces. In this work we introduce a methodology for the binding interface prediction of proteins given their experimentally-solved 3D structures (PDB files), without any knowledge on their possible binding partners. In order to effectively discriminate between interacting sites and non-interacting sites, we used a set of eight high quality amino acid indices (HQIs) of physico-chemical and biochemical properties extracted from AAindex1 dataset and first introduced in [58]. This set of properties has been employed and validated in several recent publications [23,[59][60][61][62][63]. We mapped these HQIs onto the voxelised representation of the protein surface, obtaining a geometrical representation of the latter enriched with the physicochemical and biochemical properties of the underlying residues. Spherical patches are then uniformly sampled from the protein surface and, for each patch, a rotationally invariant local descriptor based on 3D Zernike moments is computed. The 3D Zernike descriptors (3DZDs) possess several attractive features such as a compact representation, rotational and translational invariance, and have been shown to adequately capture global and local protein surface shape [64][65][66] and to naturally represent physico-chemical properties on the molecular surface [67]. 3DZDs are employed to quickly evaluate the shape and physico-chemical similarity of local surface patches, since similar patches have similar descriptors. In order to handle the class imbalance between interface and noninterface local surface patches, we used a combination of undersampling of the majority class and oversampling of the minority class. We employed the stability selection method know as Randomized Logistic Regression as a feature selection algorithm on the 3DZDs in order to reduce the overall number of features. The resulting reduced descriptors were then used as samples for a binary classification problem: Support Vector Machines were used as a classifier to distinguish interface local surface patches (surface patches belonging to the protein-protein interaction interface) from non-interface ones. This is the first time that 3D Zernike descriptors of eight HQIs mapped on the corresponding protein surfaces are employed in the prediction of PPI interfaces. The proposed method was tested and validated on 16 classes of proteins obtained from the Protein-Protein Docking Benchmark 5.0, for both their bound and unbound states and compared to other state-of-the-art protein interface predictors.

Protein surface representation
In this work we employed the voxelised representation of the Solvent Excluded surface (SES) [68], which can be defined as follows. If we imagine a probe-sphere of radius equal to the size of the solvent molecule as it rolls over the external atoms of the protein, we can define the SES as the union of two surfaces: the portion of the outer atoms' surface touched by the probe-sphere while it rolls over them, and the inward-facing surface portions of the probe when it touches two or more atoms. The SES represents a continuous functional surface of the molecule, i.e. the surface that is available to interact with. Voxelised surface representations (also known as dot-surfaces or grid-based representations), although simple, are widely appreciated for their accuracy and applicability in various contexts. A voxel (volumetric pixel) represents a single, discrete data point on a regular grid in the 3D space, and can contain multiple values in order to represent various properties of a certain portion of space in a simple and effective way.
The voxelised SES of proteins were computed with the region-growing Euclidean distance transform methodology described in our previous works [69,70] at a resolution of 64 voxels per Å 3 , using a 1.4Å radius for the solvent probe. Patch centres are extracted from each protein surface uniformly and at a minimum separation of 1.8Å, while local surface patches are extracted using a sphere with a 6.0Å radius centred at each patch centre. This ensures that there is plenty overlap among patches with neighbouring centres. The 6.0Å patch radius is a recurring value in many algorithms which employ spherical patches [66,68,[71][72][73], because it is an approximation of the radius of an amino acid [71]. The 3D Zernike Descriptors used in this work were computed up to a maximal order of 20, which corresponds a vector of 121 invariants per descriptor. 3DZDs of maximal order 20 have been shown to adequately capture shape complementarity at the protein-protein interface [66].

Interfacial regions of the protein surface
The recognition of PPI interface regions can be seen as a classification problem, i.e., each local surface patch is assigned to one of the two classes: interface surface patches, and non-interface surface patches. Consequently, the problem may be solved using statistical and machine learning techniques such as Support Vector Machines. A clear definition of interacting local surface patches is required in order to predict whether a given patch is involved in protein-protein interactions. However, many alternative definitions are being used to define an interaction site based on 3D structural data [74] which can be grouped into two main approaches: (i) inter-atomic distance between non-hydrogen atoms of different protein chains and (ii) change in accessible surface area (ASA) upon complex formation.
In this work, we used the following definition of interface and non-interface local surface patches. Let P 1 and P 2 be two proteins in a given complex whose 3D structure is known, and let SES(P 1 ) and SES(P 2 ) be the corresponding voxelised SES representations. The interface I P 1 of protein P 1 is defined as the set of voxels from SES(P 1 ) which are within a 4.5Å distance from some heavy atom in P 2 , i.e.: (1) Equivalently, the interface I P 2 of protein P 2 is defined as: A patch is an interface patch if at least 80% of its surface voxels are located in the current protein's interface, otherwise the patch is categorised as a non-interface patch.

Residue feature set
In order to reliably predict PPI interface residues, the physico-chemical characteristics (features) that can best discriminate between interacting and non-interacting sites must be identified. The choice of such features is critical for the success of a predictor [16]. The AAindex [75] is a database of numerical indices representing various physicochemical and biochemical properties of residues and residue pairs derived from published literature. An amino acid index is a set of 20 numerical values representing any of the different physicochemical and biological properties of each amino acid: the AAindex1 section of the database is a collection of 566 such indices (Release 9.2, February 2017). By using a consensus fuzzy clustering method on all available indices in the AAindex1, Saha et al. [58] identified three high quality subsets (HQIs) of all available indices (544 at the time), namely HQI8, HQI24 and HQI40. In this work we used the features of the HQI8 amino acid index set (see Table 1) which were identified as follows. Using the correlation coefficient between indices as a distance measure, Saha et al. divided all the available indices in the AAindex1 section into 8 clusters: the elements of the HQI8 subset consist of the medoids (centres) of these clusters.

3D Zernike descriptors
The 3D Zernike descriptors (3DZD) were first used as a representation of the protein surface shape in [64], and have since been employed in several tasks such as global protein structure comparison [65], surface property comparison [67], local surface classification [76], binding ligand prediction by pocket-pocket similarity detection [77][78][79] and pocket-ligand complementarity evaluation [80,81], and protein-protein docking prediction [66] with quite satisfactory results. 3DZDs present several advantages over other surface representations. For instance, they can represented protein surfaces and the corresponding properties very compactly as a vector of numbers. 3DZDs are invariant to rotations and translations, i.e. they Table 1 The HQI8 subset of amino acid indices from the AAindex database Entry name Description BLAM930101 Alpha helix propensity of position 44 in T4 lysozyme [99].
are not affected by the initial orientation of the molecular surface. Because of this property, time-consuming spatial alignments of proteins are not required and the descriptors can be precomputed and stored. The 3DZDs can be computed for any 3D image, and are thus suitable for representing physico-chemical properties on the molecular surface as the electrostatic potential or the hydrophobicity [67]. Lastly, by changing the order of the series expansion, the resolution of the surface representation can be easily controlled.
Each patch of the enriched protein surface is represented by the 3D Zernike descriptors. The 3DZD are a series expansion of a 3D function which exhibit several desirable properties such as compactness of the representation, roto-translational invariance and minimum information redundancy (orthonormality). In what follows we will provide a brief description of the 3DZD. Refer to [82] for the exhaustive mathematical derivation and to [83] for the implementation details. The 3D Zernike functions Z m nl of order n and repetition m are defined as Y m l (θ, φ) are the spherical harmonics in polar coordinates of l th degree, where l ≤ n, m ∈{−l, −l+1, −l+2, . . . , l−1, l}, with n − l an even number. R nl (r) are the radial polynomials of radius r which guarantee the orthonormality of the Z m nl (r, θ, φ) polynomials in Cartesian coordinates. The expression of Z m nl can be rewritten in Cartesian coordinates as a linear combination of monomials of order up to n: The 3D Zernike moments m nl of function f (x), x ∈ R 3 are defined as: Using Eq. 4, the 3D Zernike moments m nl of an object can be written as a linear combination of geometric moments of order up to n where M rst is the geometric moment of the object scaled to fit in the unit ball where x ∈ R 3 is the vector x = (x, y, z) . The 3D Zernike moments m nl are not invariant under rotations. In order to achieve invariance, moments are collected into (2l , and the rotationally invariant 3D Zernike descriptors F nl are defined as norms of vectors nl : Given the maximum moment order N, the number of 3D Zernike descriptors can be easily determined by using the following formula:

Patch representation using 3D Zernike descriptors
The physico-chemical and biochemical properties described in the HQI8 amino acid index set are mapped on the voxelised representation of the protein's SES. Depending on the amino acid it belongs to, each atom in the protein is assigned the corresponding numeric values of the properties scaled by the atom's radius. For a given amino acid index, each voxel in the protein's SES is assigned the corresponding value of the atom occupying that voxel. If a voxel belongs to two or more atoms (i.e. if two or more atoms overlap), then the sum of the corresponding values of the overlapping atoms is assigned to that voxel. If a voxel does not belong to the SES of the current protein, its value is set to zero. Eight 3D functions are thus defined, each describing one of the properties of the HQI8 set. For a given protein P, these functions are formally defined as follows. Let A P be the set of atoms in the current protein P, and let i : A P → R the function which assigns to each atom the numeric value of the corresponding amino acid for a given amino acid index i ∈ HQI8. Then, for a given amino acid index i ∈ HQI8, the corresponding property is mapped on the SES(P) according to the following 3D function: where r a is the radius of atom a, and 1 a (v) is the indicator function for atom a defined as: Zernike descriptors cannot be used to distinguish positive valued functions from negative valued ones (see the Additional file 1 for a concise mathematical justification). For instance, a surface patch with a certain charge distribution pattern would be indistinguishable from another patch with the same shape and inverted electrostatic charges in terms of 3DZDs. This can be avoided by considering a 3D function f (x) as the difference of its pos- , and by computing the 3DZDs of these two functions separately.
Three of the amino acid indices in HQI8 can assume both positive and negative values, namely BLAM930101, BIOV880101 and MIYS990104, while the remaining five indices assume positive values only. The positive and negative parts were considered separately for these three indices, yielding a total of 11 3DZDs describing the HQI8 properties for each local surface patch. The maximal order 20 was used for the calculation of the 3DZDs, thus, according to Eq. 9, each patch is characterised with a total of 11 × 121 = 1331 features.

Support vector machine
Support vector machine (SVM) is a binary classification technique introduced by Vapnik et al. [84][85][86]. While traditional binary classification methods generally minimize the empirical training error, SVM minimizes the upper bound of the generalization error by maximizing the margin between the separating hyperplane and the data, abiding to the structure risk minimization principle for model selection. Striking feature of SVM is the property of compacting information contained in the training data, and providing a sparse representation even when using a small number of data points.
A binary classification problem usually involves separating data into training and test sets. The instances (samples) of the training set are the pairs (x i , y i ), where x i is a vector representing the features or attributes of the given sample and y i ∈ {−1, +1} is the corresponding class label. The goal of SVM is to produce a model based on the training data which predicts the class labels of the test data given only the feature vectors of the test data. This is achieved by solving the following optimisation problem: where φ(x i ) maps x i into a higher-dimensional (and potentially even an infinite-dimensional) space, and C > 0 is the penalty parameter of the error term. In practice the dual formulation of this problem is solved instead, due to high dimensionality of the vector variable w: where e = [1, 1, . . . , 1] is the vector of all ones. After solving the dual problem, the optimal w is given by and by setting K( , the decision function is given by: Please note that there is no need to compute the mapped feature vectors φ(x) explicitly. Instead, only the dot products between mapped feature vectors are calcu- SVM can perform non-linear classification in the feature space by finding a separating hyperplane with maximal margin in the higher dimensional space generated by φ(·). This is easily done by using different kernel functions generating φ(·). The most used kernels are given in Table 2. Although the performance of SVM mostly depends on the choice of an appropriate kernel function, there is no optimal way to choose an optimal kernel function within a data-driven approach.
In this work, interface local patch descriptors are labelled as positive samples (+1) and non-interface ones are labelled as negative samples (−1). Therefore, our interface recognition problem is actually a binary classification problem which can be handled by a SVM. In this work we used the SVM implementation provided in the scikit-learn Python module for machine learning version 0.18.1 [87].

Performance measures
The PPI interface prediction based on local surface patch descriptors is a binary classification problem, thus, a number of commonly used measures can be employed to evaluate the performance. These methods include accuracy (A), precision (P), recall (R), F 1 score (F 1 ) and the Matthews correlation coefficient (MCC) (see Table 3).
The Receiver Operating Characteristic (ROC) and the Precision-Recall (PR) curve plots and their Area Under the Curve (AUC) can also be used to assess the quality of a binary classifier. The ROC curve is the most commonly used way to visualize the performance of a binary classifier, and AUC is a very good way to summarize its performance in a single number. In this work, the ROC curve of an SVM classifier is created by plotting the True Positive Rate (the fraction of true positives out of the total predicted positives) against the False Positive Rate (the fraction of false positives out of the total predicted negatives), at various threshold values of the intercept term b in Eq. 15. The PR curve is obtained by plotting the precision values against the corresponding recall for all threshold values of b.

Dataset
The Protein-Protein Docking Benchmark 5.0 (DB5) [88] was used as dataset in this work. The benchmark consist of 230 non-redundant, high quality structures of proteinprotein complexes along with the unbound structures of their components. Non-redundancy is set at the family level of SCOPe 2.03 [89]: two complexes were considered redundant when the pairs of interacting domains were the same at the SCOPe family level. Antibody-antigen complexes were considered redundant only when the SCOP families of the antigens were identical, and at least 80% of the antigen interface residues were shared between the two complexes. The complexes are divided into 8 different classes: (1)  In order to assess the predictive capabilities of the proposed method on different protein complex classes, we considered the 8 different classes in the DB5 separately. For each class, we also separated the receptor proteins from the ligand ones, thus obtaining 16 separate datasets. We maintained the separation between classes A and AB, although not being biologically different, in order to be able to evaluate the performance variations due to conformational changes upon binding, as there are no unbound structures available for the receptor proteins in the AB class. For each of the 16 datasets, we further reduced redundancy to a maximum of 90% sequence identity between pairs of different (unbound) proteins with the CD-HIT tool [90,91]. Each dataset was then randomly split into two disjoint sets: a training set of approximately 60% of the number of complexes and a test set of the remaining ∼ 40% (see Table 4).
The interaction interface generally corresponds to a small portion of a protein's surface, thus, a uniform sampling of the protein surface into local surface patches results in a highly-imbalanced classification problem where the interface patches are the minority class. Most machine learning algorithms do not perform well when the number of instances of one class far Returns a value between −1 and +1: +1 represents a perfect prediction, 0 no better than random prediction and −1 indicates total disagreement between prediction and observation.    exceeds the other, especially when classification accuracy is employed as a figure of merit. This can lead to classifiers that tend to label all the samples as belonging to the majority class, thus trivially obtaining a high accuracy measure.
In this work we used a combination of undersampling of the majority class and oversampling of the minority class in order to balance the training set. The surface of each protein in the training set was first sampled into local surface patches with a minimum separation of 4.5Å between patch centres. Then, only the interface regions were sampled with a minimum separation of 1.0Å between patch centres. This procedure yields more balanced training sets (see Table 5) and guarantees that both the interface and non-interface protein surface regions are sampled in a fairly uniform fashion. We also used the F 1 score (instead of classification accuracy) as a figure of merit during model evaluation on the training samples. The test samples, on the other hand, were obtained by uniformly sampling the surfaces of the proteins in the test set with a minimum separation of 1.8Å between patch centres, thus retaining the original distribution of positive and negative samples. Table 5 also reports the unbalanced version of the training set obtained with the same parameters.

SVM model selection
Choosing an appropriate kernel function with the corresponding best hyper-parameters (which include the penalty C and the kernel parameters) is critical for achieving good classification performance with SVMs. Although grid-search is currently the most widely used method for hyper-parameter optimisation in learning algorithms, it can be prohibitively time-consuming since not all hyper-parameters are equally important to tune. Grid search experiments might end up allocating too many trials to the exploration of dimensions with low impact on the final performance and suffer from poor coverage of the more important ones. On the other hand, randomised search experiments were recently proven more efficient in several learning algorithms and datasets [92], and have thus been gaining popularity in several applications.
Feature selection was also performed (on the training samples only) in order to reduce the number of features to a subset of relevant ones, since its benefits are manifold (model simplification, shorter training times, better generalisation and avoiding curse of dimensionality) [93]. In this work, we employed a relatively novel feature selection procedure know as Randomized Logistic Regression [94]. This method works by sub-sampling the training data and fitting a L1-regularised Logistic Regression model where the penalty of a random subset of coefficients has been scaled. By performing this double randomization several times, the method assigns high scores to features that are repeatedly selected across randomizations (see the Additional file 1 for a more detailed description of the feature selection algorithm).
After the feature selection, we performed a randomized search over the hyper-parameters for each of the kernel functions described in Table 2: each parameter was sampled from either a distribution over possible values or a list of discrete choices. The penalty parameter C was sampled from the continuous exponential distribution with mean 2000 for all kernel functions. The γ parameter was sampled from the continuous exponential distribution with mean 0.01 for the polynomial, RBF and sigmoid kernel functions. The degree d parameter of the polynomial kernel was sampled from the discrete uniform distribution U {2, 10} (the polynomial kernel of degree 1 is actually the linear kernel), while the r parameter of the polynomial and sigmoid kernels was sampled from the continuous uniform distribution U (−2, 2). The computation budget, i.e. the total number of sampled candidates or sampling iterations, was set to 200 iterations for each kernel function.
The hyper-parameter evaluation was carried out through leave-one-out cross-validation (LOOCV) at the protein level. If the training set consists of k proteins, in turn, each protein is removed from the training set, and a model is trained on the samples of the remaining k − 1 proteins. The resulting model is then validated on the samples of the protein that was left out. The performance measure reported by LOOCV is then the average of the values computed in the loop. We used the F 1 score as a performance measure throughout all experiments.

Interface residue prediction
In order to predict the set of interface residues in a target protein the predicted interface surface patches must be mapped on the underlying residues. The mapping procedure can be summarized as follows. Each residue in the query protein is assigned an initial score of 0. Then, for each predicted interface surface patch we identify the set of its underlying residues, that is, all the residues with at least one atom within a 6Å distance from the patch centre. The score of each underlying residue is incremented by 1/ (1 + d), where d is the minimum distance from its atoms to the current patch centre. At the end of the procedure, each residue in the query protein will be assigned a score which indicates its likelihood of belonging to the PPI interface. Each residue can then be classified as interacting or non-interacting by thresholding on this score. Table 6 summarises the results of the feature selection procedure with the Randomized Logistic Regression algorithm, describing the number of selected features for each amino acid index, while Table 7 describes the best model chosen by the Randomized Search with leave-one-out cross-validation procedure (see Additional file 2 for the indices of the selected features for each protein class). A relatively small portion of the overall number of features (1331) are extracted for each protein class. This is probably due to the fact that we are mapping residue-wise properties on the molecular surface, and the resulting patterns that arise on the local surface patches are relatively simple. This means that only a few terms of the 3DZDs of order 20 (121-dimensional vectors) are required in order to capture such patterns, and thus distinguish between interface and non-interface surface patches.

Model selection results
It is also worth noticing that the number of selected features of a given amino acid index varies from one protein class to another. For instance, 24 features are selected for the NAKH920108 amino acid index property for the bound version of protein class A r , while, for the bound version of protein class ES l the algorithm selects no features at all for the same amino acid index property. This is consistent with the hypothesis that interfaces of proteins belonging to different classes and carrying different functions can vary widely. Moreover, the number of selected features for a given amino acid index can be used to measure its importance in the characterisation of the PPI interface in a given protein class. The Randomized Logistic Regression algorithm only selects important features which correlate with the classification labels: if few features of an amino acid index property are selected in a given protein class, it means that the given property is not important in discriminating interface from noninterface surface regions for the current class. On the other hand, since the classification labels depend on the selected features, key properties which drive protein interactions in the current protein class will have many of their features selected by the algorithm.

Prediction results on the test set
The performance results for the proposed methodology at the surface patch level on the test set are presented in Table 8 (see Additional file 3 for the prediction results on the test set for each protein). Figure 1 describes the Receiver Operating Characteristic curve for each protein class. The performance of the proposed methodology varies widely from one protein class to another: from a very high AUC-ROC of 94% for class A r (95.4% in the bound case) to a much less satisfactory prediction for class A l (in both bound and unbound cases). The effect of the conformational changes proteins undergo upon binding can be observed in the differences between the obtained performance values for the bound and unbound versions of the protein classes: for most protein classes, the bound versions obtain better prediction results than the unbound ones.
To investigate the reasons behind the different performance rates achieved for different protein classes, we measured the average pairwise sequence identity for each protein class (see Table 9, we excluded the pairwise sequence identity measures for chains within the same protein). No particular correlation emerges between the classification performance at the patch level and the average pairwise sequence identity of the different protein classes. For instance, the average pairwise sequence identity in the unbound version of protein class A l is 41.52%, which is higher than in some other classes. However, we achieve the lowest classification performance in this class. For this reason we conclude that the performance discrepancies are due to the varying capability of the HQI8 index to adequately represent the diverse interaction patterns that characterise PPIs in the different protein classes.
In order to further demonstrate the necessity of developing class-specific protein interface predictors, we trained a generic SVM model based on all the structures in the training set, only differentiating between bound and unbound structures, and evaluated its performance on the test set structures for each protein class. The comparison of the average ROC curves of the class-specific and generic models are given in Fig. 1 for each protein class. In general, the class-specific models obtain better classification performance in terms of ROC-AUC, especially for the bound versions of protein classes A r , AB r , OG r and the unbound versions of protein classes A r , AB r , AB l , EI r , ER l , ES r , ES l . Interestingly enough, the class-specific and generic models both obtain very similar results in classes OG, OR and OX (except for the bound version of OG r ). These are the most generic classes in DB5 (i.e. Others, Table 6 The number of selected features belonging to each physico-chemical property and for each protein class. The  G-protein containing (OG), Others, Receptor containing (OR) and Others, miscellaneous (OX)), thus the benefits of using a class-specific training set are less evident.

Post-processing
By analysing the results in Table 8 we noticed that for some protein classes the prediction performance in terms of ROC-AUC and recall was high while the other prediction metrics were low. This is due to the fact that the default threshold (t = 0) used by the SVM classifier (on the b term in Eq. 15) does not yield optimal binary classification results, since the employed training set is balanced and does not reflect the natural distribution of interface and non-interface patches. We selected the best threshold value that maximises the average F 1 score on the training set proteins for each protein class: we used the unbalanced  Interface regions usually consist of continuous portions of the protein surface. For this reason, the spatial relations among the predicted interface patch centres can be exploited in order to reduce the number of false positive local surface patches. This can be achieved by retaining predicted surface patches which form continuous clusters on the protein surface while discarding the spatially isolated ones. The Isolation Forest (IF) algorithm for outlier detection [95] was used to reduce the number of spatially-isolated false positive local surface patches. Interface regions are composed of contiguous surface patches, thus isolated patches marked as positive by the SVM classifier can be safely discarded. For each query protein, an IF classifier is trained on the coordinates of the LSPs identified as interface patches by the SVM classifier, using their distances from the separating hyperplane as weights. Then, the IF classifier is used on the whole set of surface patches of the query protein to identify the ones belonging to the PPI interface. A contamination parameter must be provided to the IF algorithm: we identified the optimal parameter values for each protein class by testing all contamination values from 0.00 to 0.5 with a constant increment of 0.01, and selected the ones that Table 9 Average pairwise sequence identity (in %) for each protein class Protein class  yielded the best average F 1 score on the training set of the corresponding protein class. Because the IF for outlier detection is a random algorithm, the F 1 score was averaged over 100 runs for each contamination value. When the best average F 1 score was obtained for a contamination value equal to zero we skipped the IF step. The best contamination values obtained for each protein class are reported in the Additional file 1.

Comparison with other methods
Homology-based (or template-based) approaches constitute the best performing PPI interface prediction methods to date (given the availability of close homologous structures) [16]. These methods infer the biological properties of a query protein from its homologs based on the assumption that homologs share significant similarity in sequence, structure and functional sites. For this reason, in order to assess the prediction capabilities of the proposed methodology, we compared it with two stateof-the-art homology-based PPI interface prediction algorithms: NPS-HomPPI [96] and PrISE [97], and with the well-known structure-based approach SPPIDER [30,74]. NPS-HomPPI infers interfacial residues for a query protein from the interfacial residues of its homologs. Based on interface conservation thresholds derived from a systematic interface conservation analysis, NPS-HomPPI classifies the templates into either Safe, Twilight or Dark Zone, and uses multiple templates from the best available zone to infer interfaces for query proteins. PrISE is a family of local structural similarity-based computational methods for predicting PPI interface residues. For each target residue in a query protein structure, the spatial neighbours of the target are extracted and represented by their atomic composition and accessible surface areas. PrISE then searches its pre-calculated database for similar structural elements with experimentally determined interface information, and weights them according to their similarity with the structural element of the query protein.
SPPIDER is a consensus method that combines the output of 10 Neural Networks using the majority voting. It uses the difference between the predicted and the actual rASA in an unbound structure of a residue as a feature (fingerprint) to predict interfaces.
The assessment was carried out on the structures of the test set described in Table 4, and the performance evaluation was done separately for each protein class. We used the following common definition of the PPI interface for all methods: a residue is considered as interfacial if at least one of its heavy atoms is within a 5Å distance from any other heavy atom of the interacting protein. When possible (i.e. for NPS-HomPPI and PrISE), the interface definition parameter was set accordingly. In the homologybased methods, all homologous structures with sequence identity with the query protein of 90% or above were not considered. We also required the predictions to be expressed as scores or probabilities estimating the likelihood of a residue being in the interface. The default settings were used for all the remaining parameters.
By thresholding on the residue score values we computed the average ROC curves and average Precision-Recall curves for each method which are shown in Figs. 2 and 3 respectively. The proposed methodology outperforms the competitor predictors in both the bound and unbound versions of protein classes A r and AB r : the ROC-AUC and PR-AUC values obtained by our predictor are significantly higher than the others. In the unbound version of protein class A r , our method achieves a ROC-AUC of 94.2% and a PR-AUC of 67.7% while, for the competitors, the maximum ROC-AUC is 78.0% (for NPS-HomPPI) and the maximum PR-AUC is 12.4% (for SPPI-DER). Similarly, in the bound version of protein class A r , our method achieves a ROC-AUC of 95.4% and a PR-AUC of 56.4% while, for the competitors, the maximum ROC-AUC is 79.6% (for NPS-HomPPI) and the maximum PR-AUC is 12.2% (for SPPIDER). In the unbound version of AB r , the proposed method achieves a ROC-AUC of 84.0% and a PR-AUC of 39.2% while the maximum ROC-AUC for the competitors is 78.9% (for PrISE) and the maximum PR-AUC is 13.5% (for SPPIDER). For the bound version of AB r our method obtains a ROC-AUC score of 81.3% and a PR-AUC of 33.5%. The best ROC-AUC obtained by the competitors in the same class is 77.6% (for PrISE) and the best PR-AUC is 14.6% (for SPPI-DER). Noticeably better prediction performance is also achieved in the unbound and bound versions of class EI r : the achieved ROC-AUC values are 74.4% for the unbound The obtained results agree with the initial hypothesis that proteins belonging to different classes exhibit diverse interaction mechanisms. To this end, the choice of a correct set of physico-chemical and biochemical properties characterising the interaction site is crucial, although it might not be possible to identify a comprehensive set of features that works well for all protein classes as the recognition patterns can be very different. Our results suggest that, although a given set of features can effectively discriminate between interface and noninterface surface regions for a given protein class, it can perform very poorly when used on other protein classes. Interface prediction could be further improved by better feature representation and selection methods that can effectively capture complex protein recognition patterns in diverse types of interactions, however, different protein classes should be treated separately.
The HQI8 amino acid index set of physico-chemical and biochemical properties showed very good discriminative capabilities for the interface recognition of some protein classes (A r , AB r ) while under-performing in others (A l ). Other sets of properties could exhibit better discriminative power in the protein classes where we obtained low prediction performances. Ideally, an optimal set of features could be selected for each protein class in order to correctly identify the class-specific PPI patterns. The proposed methodology can be easily extended to other sets of amino acid properties, which can be similarly mapped on the voxelised protein surface and represented by 3DZDs.
Although not considered in this work, binding partner specificity has recently been reported to greatly affect the quality of predicted PPI interfaces [16], especially in transient protein interactions [96]. Partner-specific interface prediction methods have been shown to outperform several state-of-the-art non-partner-specific ones [22,34,98], and apparently, specific interacting partners should be considered in order to reliably predict interface regions. The methodology introduced in this work could be extended in order to predict pairs of interacting local surface patches by feeding the SVM classifier with the concatenation of the corresponding descriptors. However, class imbalance should be handled very carefully, as the number of negative samples (non-interacting patch pairs) would significantly increase with respect to the non-parter-specific case, while the number of positive samples (interacting patch pairs) would roughly remain the same as in the previous case.

Conclusions
Existing structure-based PPI interface predictors employ 3D structural information to encode statistical properties of surface patches as input feature vectors for binary classifiers while information about the spatial arrangement of atoms and residues is usually ignored. In this study we introduced a novel method for the prediction of PPI interface regions based on 3D Zernike descriptors, HQI8 amino acid index set and SVMs. We demonstrated that 3D Zernike Descriptors of physico-chemical and biochemical amino acid properties mapped on local patches of the protein surface can be used to characterise the latter in order to distinguish between interface and non-interface regions. The 3DZDs are able to capture the similarity among patterns of physico-chemical and biochemical properties mapped on the protein surface arising from the various spatial arrangements of the underlying residues. It is also worth noticing that this is the first time the physicochemical and biochemical properties of the HQI8 set were mapped directly onto the 3D representation of the protein surface instead of being used to characterise the protein sequence.
This method was tested on 16 protein classes extracted from the Protein-Protein Docking Benchmark 5.0, on both the bound and unbound versions, and was compared with three other state-of-the-art PPI interface predictors, namely SPPIDER, PrISE and NPS-HomPPI. With a resulting ROC-AUC of, respectively, over 94% and 81%, we obtained very good classification results on protein classes A r and AB r (i.e. antibodies) and even outperformed the competitors also in terms of precisionrecall. These results are very encouraging, thus we are planning to develop a specific antigen-binding interface (also known as paratope) prediction method for antibodies with known structure using the 3D Zernike descriptors and the HQI8 amino acid index set. The field of paratope residue prediction appears to be somewhat underdeveloped, with a general paucity of specific predictors, thus any future development in this direction should be quite promising.
Our results show that the choice of a proper set of features characterising the protein interface is crucial for the interface prediction task, and that the optimal set of features strongly depends on the specific protein class. For further improvement of prediction performance, it is necessary to identify an optimal set of features for each protein class or interaction type. As a future development, we plan to test several sets of features on different protein classes in order to widen the predictive capabilities of the proposed method. Including informations regarding possible binding partners in the prediction procedure is also expected to increase the overall performance, although tackling the resulting class imbalance will not be trivial.
The comparison of the class-specific interface prediction models with a generic one, trained on all training samples regardless of the protein class, also confirmed the hypothesis that interface prediction model development should be carried separately for different protein classes. In a certain way, this is similar to the homology-based interface prediction approach: for a given query protein, its closest homolog proteins with known binding sites are retrieved, and the query's binding site is determined by comparison with the known structures. However, these methods cannot yield good predictions when adequate homologs are not available. Similarly to homology-based predictors, the proposed method requires the availability of several protein classes in order to reliably predict interface regions. Given the ever increasing number of available high-resolution 3D protein structures in public repositories, we expect that more benchmark sets and databases such as the Protein-Protein Docking Benchmark 5.0 which classify proteins into biologically relevant classes will be available. As a future work, we will expand the proposed methodology to other protein classes in order to increase the coverage of its predicting capabilities to as many proteins as possible. A pre-processing step for the determination of the most adequate protein class will be required for proteins with unknown classes. The predictor will then be made available to users throughout a dedicated web server.
The majority of the available protein interface identification methods make predictions at the residue resolution level. Protein-protein docking algorithms, however, require high-resolution atomic level knowledge in order to correctly predict native binding configurations between interacting proteins. The predictions at the local surface patch level can be readily used to guide protein-protein docking methods by limiting the docking search space to the sole surface patches which were predicted as belonging to the interaction interface. Docking algorithms based on local surface descriptor matching can greatly benefit from the proposed approach since this will sensibly limit the number of candidate patch pairs to be evaluated, thus reducing the conformational search space, and consequently reducing both the number of false positives and the required calculation time.  Table 8. (XLSX 67.9 kb)