ProtNN: fast and accurate protein 3D-structure classification in structural and topological space

Background: Studying the functions and structures of proteins is important for understanding the molecular mechanisms of life. The number of publicly available protein structures has grown extremely large. Still, the classification of a protein structure remains a difficult, costly, and time-consuming task. The difficulties are often due to the essential role of spatial and topological structures in the classification of protein structures.
Results: We propose ProtNN, a novel classification approach for protein 3D-structures. Given an unannotated query protein structure and a set of annotated proteins, ProtNN assigns to the query protein the class with the highest number of votes across the k nearest neighbor reference proteins, where k is a user-defined parameter. The search for the nearest annotated structures is based on a protein-graph representation model and on pairwise similarities between vector embeddings of the query and the reference protein structures in structural and topological spaces.
Conclusions: We demonstrate through an extensive experimental evaluation that ProtNN accurately classifies several datasets with an extremely fast runtime compared to state-of-the-art approaches. We further show that ProtNN scales up to a whole PDB dataset in single-process mode with no parallelization, running thousands of times faster than state-of-the-art approaches.


Introduction
Proteins are ubiquitous in living cells. They play key roles in the functional and evolutionary machinery of species. Studying protein functions and structures is paramount for understanding the molecular mechanisms of life. High-throughput technologies are yielding millions of protein-encoding sequences that currently lack any functional characterization [1][2][3]. The number of proteins in the Protein Data Bank (PDB) [4] has more than tripled over the last decade. Alternative databases such as SCOP [5] and CATH [6] follow the same trend. However, the classification of protein structures remains a difficult, costly, and time-consuming task. Manual protein classification methods can no longer keep up with the rapid increase of data. Accurate computational and machine learning tools are an efficient alternative that could provide a considerable boost in meeting the increasing load of data.
Proteins are composed of complex three-dimensional foldings of long chains of amino acids. This spatial structure is an essential component of protein functionality and is thus subject to evolutionary pressures to optimize the inter-residue contacts that support it [7]. Existing computational methods for protein classification try to simulate the biological phenomena that define the structure and function of a protein. The most conventional technique is to perform a similarity search between an unknown protein and a reference database of annotated proteins. The query protein is assigned the class of the most similar reference protein (based on the sequence or the structure). There exist several classification methods based on the protein sequence (e.g. Blast [8], ProtFun [9], SVM-Prot [10,11]) or on the protein structure (e.g. Combinatorial Extension [12], Sheba [13], FatCat [14], Fragbag [15]). These methods rely on the assumption that proteins sharing the most common sites are more likely to belong to the same class. This classification strategy is based on the hypothesis that structurally similar proteins could share a common ancestor [16]. Another popular approach for protein functional classification is to look for relevant subsequences or substructures (so-called motifs) among known proteins, then use them as features to classify unknown proteins. Such motifs could be discriminative [17], representative [18], cohesive [7], etc. Each of the mentioned protein classification approaches suffers from different drawbacks. Sequence-based and subsequence-based classification does not incorporate spatial information about amino acids that are not contiguous in the primary structure but are interconnected in 3D space. This makes it less efficient in the classification of structurally similar proteins with low sequence similarity (remote homologues).
Both structure and substructure-based classification techniques do incorporate spatial information which makes them more efficient than sequence-based classification. However, such consideration makes these methods subject to the "no free lunch" principle [19], where the gain in accuracy comes with an offset of computational cost. Hence, it is essential to find an efficient way to incorporate 3D-structure information with low computational complexity.
In this paper, we present PROTNN, a novel approach for protein 3D-structure classification. PROTNN incorporates protein 3D-structure information via the combination of a rich set of structural and topological descriptors. This provides an informative multiview representation of the structure that considers spatial information through different dimensions. Such a representation transforms the complex protein 3D-structure into an attribute vector of fixed size, which keeps the computation efficient. For classification, PROTNN assigns to a query protein the class having the highest number of votes across the set of its k most similar reference proteins, where k is a user-defined parameter. Experimental evaluation shows that PROTNN accurately classifies different benchmark datasets while running up to 47x faster than gold standard approaches from the literature such as Combinatorial Extension [12] and FatCat [14]. We further show that PROTNN scales up to a PDB-wide dataset in single-process mode with no parallelization, where it ran thousands of times faster than state-of-the-art approaches when classifying a 3D-structure against the entire PDB.

Graph representation of protein 3D-structures
A crucial step in computational studies of protein 3D-structures is to look for a convenient representation of their spatial conformations. Graphs represent the most appropriate data structures to model the complex structures of proteins. In this context, a protein 3D-structure can be seen as a set of elements (amino acids and atoms) that are interconnected through chemical interactions [7,16,18,20]. These interactions are mainly:
- Covalent bonds between atoms sharing pairs of valence electrons,
- Ionic bonds of electrostatic attractions between oppositely charged components,
- Hydrogen bonds between two partially negatively charged atoms sharing a partially positively charged hydrogen,
- Hydrophobic interactions where hydrophobic amino acids in the protein closely associate their side chains together,
- Van der Waals forces, which represent transient and weak electrical attraction of one atom for another when electrons are fluctuating.
These chemical interactions are supposed to be the analogues of graph edges. Figure 1 shows a real example of the human hemoglobin protein and its graph representation. The Figure shows clearly that the graph representation preserves the overall structure of the protein and its components.
Protein Graph Model: Let G be a graph consisting of a set of nodes V and a set of edges E. L is a label function that associates a label l to each node in V. Each node of G represents an amino acid from the 3D-structure and is labeled with its corresponding amino acid type. Let Δ be a function that computes the Euclidean distance between pairs of nodes (u, v), ∀u, v ∈ V, and δ a distance threshold. Each node in V is defined by its 3D coordinates in ℝ³, and both Δ and δ are expressed in angstroms (Å). Two nodes u and v (∀u, v ∈ V) are linked by an edge e(u, v) ∈ E if the distance between their Cα atoms is below or equal to δ. Formally, the adjacency matrix A of G is defined as follows:
A(u, v) = 1 if Δ(u, v) ≤ δ, and A(u, v) = 0 otherwise. (1)
Fig. 1 The human hemoglobin protein 3D-structure (PDBID: 1GZX) and its corresponding graph representation. Nodes and edges represent, respectively, amino acids from the structure and links between them. Blue edges represent links from the primary structure and gray edges are spatial links between distant amino acids.
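The graph model above can be illustrated with a minimal sketch, assuming the Cα coordinates have already been extracted from a structure file. The function name `build_protein_graph` and the toy coordinates are hypothetical, and excluding self-loops is an assumption that the definition does not state explicitly.

```python
import math

def build_protein_graph(ca_coords, delta=7.0):
    """Return the adjacency matrix A with A[u][v] = 1 when the C-alpha
    atoms of residues u and v are at most `delta` angstroms apart."""
    n = len(ca_coords)
    adjacency = [[0] * n for _ in range(n)]
    for u in range(n):
        for v in range(u + 1, n):  # assumption: no self-loops (u != v)
            if math.dist(ca_coords[u], ca_coords[v]) <= delta:
                adjacency[u][v] = adjacency[v][u] = 1
    return adjacency

# Hypothetical toy coordinates in angstroms, not taken from a real PDB entry.
toy_coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (20.0, 0.0, 0.0)]
toy_adjacency = build_protein_graph(toy_coords, delta=7.0)
for row in toy_adjacency:
    print(row)
```

In this toy example the first three residues form a chain, while the fourth is farther than δ from all others and stays isolated.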

Graph embedding
Graph-based representations are broadly used in multiple application fields including bioinformatics [16,18,21]. However, they suffer from major drawbacks with regard to processing tools and runtime. Graph embedding into vector spaces is a very popular technique to overcome both drawbacks [21]. It aims at providing a feature vector representation for every graph, which bridges the gap between the representational power of graphs, the rich set of algorithms that are available for feature-vector representations, and the need for rapid processing algorithms to handle the massively available biological data. In PROTNN, each protein 3D-structure is represented by a graph according to Eq. 1. Then, each graph is embedded into a vector of structural and topological features under the assumption that structurally similar graphs should give similar structural and topological feature vectors. In this manner, PROTNN aims at both accuracy and computational efficiency.

Structural and topological attributes
In order to avoid the loss of structural information in the embedding and to preserve the accuracy of PROTNN, we use a rich set of structural and topological attributes from the literature that have been shown to be interesting and efficient in describing connected graphs [22][23][24][25][26][27]. It is important to mention that this list could be extended as needed. In the following, we list the set of attributes that are used in PROTNN:

A1-Number of nodes.
A2-Number of edges.
A3-Average degree.
A4-Density.
A5-Average clustering coefficient: The clustering coefficient of a node u is c(u) = 2 e_u / (k_u (k_u − 1)), where k_u is the number of neighbors of u and e_u is the number of connected pairs of neighbors. The average clustering coefficient of a graph G is given as the average value over all of its nodes. Formally: C(G) = (1/n) Σ_{u ∈ V} c(u), where n = |V|.
A6-Average effective eccentricity: For a node u, the effective eccentricity represents the maximum length of the shortest paths between u and every other node v in G, e(u) = max{d(u, v) : v ∈ V, u ≠ v}, where d(u, v) is the length of the shortest path from u to v. The average effective eccentricity is defined as (1/n) Σ_{u ∈ V} e(u).
A7-Effective diameter: It represents the maximum value of effective eccentricity over all nodes in the graph G, i.e., diam(G) = max{e(u) | u ∈ V}, where e(u) represents the effective eccentricity of u as defined above.
A8-Effective radius: It represents the minimum value of effective eccentricity over all nodes of G, rad(G) = min{e(u) | u ∈ V}.
A9-Closeness centrality: The closeness centrality measures how fast information spreads from a given node to other reachable nodes in the graph. For a node u, it represents the reciprocal of the average shortest path length between u and every other reachable node in the graph G, C_c(u) = (n − 1) / Σ_{v ∈ V, v ≠ u} d(u, v), where d(u, v) is the length of the shortest path between the nodes u and v. For G, we consider the average value of closeness centrality of all its nodes, (1/n) Σ_{u ∈ V} C_c(u).
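The clustering coefficient and the shortest-path attributes A6 to A9 can be computed with breadth-first search on an unweighted graph. The following is a minimal sketch, not the authors' implementation: the adjacency-dict representation, the function names, and the toy graph are assumptions, and closeness is computed over reachable nodes only (which equals n − 1 on a connected graph).

```python
from collections import deque

def shortest_path_lengths(graph, source):
    """Breadth-first search: hop distance from `source` to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist

def clustering_coefficient(graph, u):
    """c(u) = 2 * e_u / (k_u * (k_u - 1))."""
    neighbors = list(graph[u])
    k = len(neighbors)
    if k < 2:
        return 0.0  # assumption: nodes with fewer than two neighbors score 0
    links = sum(1 for i in range(k) for j in range(i + 1, k)
                if neighbors[j] in graph[neighbors[i]])
    return 2.0 * links / (k * (k - 1))

def topological_attributes(graph):
    n = len(graph)
    eccentricities, closeness = [], []
    for u in graph:
        dist = shortest_path_lengths(graph, u)
        others = [d for v, d in dist.items() if v != u]
        eccentricities.append(max(others) if others else 0)
        closeness.append(len(others) / sum(others) if others else 0.0)
    return {
        "avg_clustering": sum(clustering_coefficient(graph, u) for u in graph) / n,
        "avg_eccentricity": sum(eccentricities) / n,
        "diameter": max(eccentricities),
        "radius": min(eccentricities),
        "avg_closeness": sum(closeness) / n,
    }

# Hypothetical toy graph: a triangle (0, 1, 2) with a pendant node 3 attached to 2.
toy_graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
attrs = topological_attributes(toy_graph)
print(attrs)
```

Each call to `shortest_path_lengths` costs O(|V| + |E|), so all shortest-path attributes together cost O(|V| (|V| + |E|)) in this sketch.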

Complexity
The computational complexity of the structural and topological attributes differs from one attribute to another.

PROTNN: nearest neighbor protein functional classification
We propose PROTNN, a protein structure classification approach based on the principle of the k-nearest neighbor algorithm [28]. The general classification pipeline of PROTNN can be described as follows. First, the reference protein database is preprocessed: a graph model G_P is created for each reference protein P in the database, according to Eq. 1. A structural and topological description vector V_P is created for each graph model G_P by computing the corresponding values of each of the structural and topological attributes described in Section "Structural and topological attributes". The resulting matrix M, whose rows are the vectors V_P of all reference proteins, represents the preprocessed reference database that is used for prediction in PROTNN. In order to guarantee an equal participation of all used attributes in the classification, a min-max normalization, x_normalized = (x − min) / (max − min), where x is an attribute value and min and max are the minimum and maximum values of the attribute vector, is applied to each attribute of M independently so that no attribute dominates the prediction. It is also worth mentioning that for real-world applications M is computed once, and it can be incrementally updated with other attributes as well as newly added protein 3D-structures with no need to recompute the attributes for the entire set. This makes PROTNN flexible and easy to extend in real-world applications.
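The per-attribute min-max normalization described above can be sketched as follows. This is an illustrative example only: the function name, the toy matrix, and the handling of constant attributes (mapped to 0.0) are assumptions that the text does not specify.

```python
def min_max_normalize(matrix):
    """Rescale every column (attribute) of `matrix` to [0, 1] independently."""
    columns = list(zip(*matrix))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    normalized = []
    for row in matrix:
        normalized.append([
            # assumption: a constant attribute (max == min) is mapped to 0.0
            (x - lo) / (hi - lo) if hi > lo else 0.0
            for x, lo, hi in zip(row, lows, highs)
        ])
    return normalized, lows, highs

# Hypothetical reference matrix: one row per protein, one column per attribute.
M = [[100.0, 0.50], [300.0, 0.25], [200.0, 0.75]]
M_norm, lows, highs = min_max_normalize(M)
print(M_norm)
```

Returning `lows` and `highs` allows a query vector to be rescaled with the same per-attribute bounds as the reference database.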
The prediction step in PROTNN is described in Algorithm 1. In prediction, a query protein 3D-structure Q with an unknown function is first transformed into its corresponding graph model G_Q. The structural and topological attributes are computed for G_Q, forming its query description vector V_Q. The query protein Q is scanned against the entire reference database, where the distance between V_Q and each of the reference vectors V_P ∈ M is computed and stored in Vdist_Q, with respect to a distance measure. The k most similar reference proteins NN^k_Q are selected, and the query protein Q is predicted to belong to the class with the highest number of votes across the set of NN^k_Q reference proteins, where k is user-defined.
Algorithm 1 (prediction in PROTNN, main steps):
1. For each reference vector V_P ∈ M: Vdist_Q[P] ← the distance between the vectors of the query protein Q and the reference protein P.
2. NN^k_Q ← select the k nearest reference protein neighbors.
3. C_Q ← the class with the highest number of votes across the set of NN^k_Q reference proteins.
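The prediction step can be sketched in a few lines, assuming the description vectors have already been computed and normalized. The function name, the toy vectors, and the class labels are hypothetical. Ties in the vote are resolved in favor of the class seen first among the nearest neighbors, which is an assumption since the text does not specify a tie-breaking rule.

```python
import math
from collections import Counter

def predict_class(query_vector, reference_vectors, reference_classes, k=1,
                  distance=math.dist):
    """k-nearest-neighbor vote over precomputed description vectors."""
    # Distance from the query vector to every reference vector.
    distances = [distance(query_vector, ref) for ref in reference_vectors]
    # Indices of the k nearest reference proteins, nearest first.
    nearest = sorted(range(len(distances)), key=distances.__getitem__)[:k]
    # Majority vote among the k nearest neighbors.
    votes = Counter(reference_classes[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical normalized description vectors and their classes.
reference_vectors = [[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9]]
reference_classes = ["positive", "positive", "negative", "negative"]
query_vector = [0.15, 0.15]
print(predict_class(query_vector, reference_vectors, reference_classes, k=3))
```

Any callable taking two vectors can be passed as `distance`, which mirrors the user-defined distance measure mentioned in the text.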

Benchmark datasets
To assess the classification performance of PROTNN, we performed an experiment on six well-known benchmark datasets of protein structures that have previously been used in [17,29,30,31]. Each dataset is composed of positive protein examples that are from a selected protein family, and negative protein examples that are randomly sampled from the PDB [4]. Table 1 summarizes the six datasets.
Vertebrate phospholipase A2: Phospholipase A2 are enzymes from the class of hydrolase, which release the fatty acid from the hydroxyl of the carbon 2 of glycerol to give a phosphoglyceride lysophospholipid. They are located in most mammalian tissues.
G-proteins: G-proteins are also known as guanine nucleotide-binding proteins. These proteins are mainly involved in transmitting chemical signals originating from outside a cell into the inside of it. G-proteins are able to activate a cascade of further signaling events resulting in a change in cell functions. They regulate metabolic enzymes, ion channels, transporters, and other parts of the cell machinery, controlling transcription, motility, contractility, and secretion, which in turn regulate diverse systemic functions such as embryonic development, learning and memory, and homeostasis.
Proteasome subunits: Proteasomes are critical protein complexes that primarily function to break down unneeded or damaged proteins. They are located in the nucleus and cytoplasm. The proteasome recycles damaged and misfolded proteins and degrades short-lived regulatory proteins. As such, it is a critical regulator of many cellular processes, including the cell cycle, DNA repair, signal transduction, and the immune response.
Protein kinases, catalytic subunits: The catalytic subunits of protein kinases play a role in various cellular processes, including division, proliferation, apoptosis, and differentiation. They are mainly proteins that modify other proteins by chemically adding phosphate groups to them. This usually results in a functional change of the target protein by changing enzyme activity, cellular location, or association with other proteins. The catalytic subunits of protein kinases are highly conserved, and several structures have been solved, leading to large screens to develop kinase-specific inhibitors for the treatment of a number of diseases.

The protein data bank
In order to assess the scalability of PROTNN to large scale real-world applications, we evaluate the runtime of our approach on the entire Protein Data Bank (PDB) [4] which contains the list of all known protein 3D-structures. We use 94126 structures representing all the available protein 3D-structures in the PDB by the end of July 2014.

Protocol and settings
Experiments were conducted on a CentOS Linux workstation with an Intel Core i7 CPU at 3.40 GHz and 16 GB of RAM. All the experiments are performed in single-process mode with no parallelization. To transform each protein into a graph, we used a δ value of 7 Å. The evaluation measure is the classification accuracy, and the evaluation technique is Leave-One-Out (LOO), where each dataset is used to create N classification scenarios, N being the number of proteins in the dataset. In each scenario, one protein of the dataset is used as the query instance and the rest of the dataset is used as reference. The aim is to correctly predict the class of the query protein. The classification accuracy for each dataset is averaged over the results of all N evaluations.
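The Leave-One-Out protocol can be sketched as follows. This is a minimal illustration and not the evaluation code used in the experiments: the 1-nearest-neighbor classifier with Euclidean distance, the function names, and the toy data are assumptions.

```python
import math

def nearest_neighbor_class(query, vectors, classes):
    """Class of the single closest reference vector (k = 1, Euclidean)."""
    best = min(range(len(vectors)), key=lambda i: math.dist(query, vectors[i]))
    return classes[best]

def leave_one_out_accuracy(vectors, classes, classify=nearest_neighbor_class):
    """Use each instance once as the query and all the others as reference."""
    correct = 0
    for i in range(len(vectors)):
        ref_vectors = vectors[:i] + vectors[i + 1:]
        ref_classes = classes[:i] + classes[i + 1:]
        if classify(vectors[i], ref_vectors, ref_classes) == classes[i]:
            correct += 1
    return correct / len(vectors)

# Hypothetical description vectors; the last one lies closer to the "pos" group.
toy_vectors = [[0.0, 0.1], [0.1, 0.0], [0.9, 1.0], [1.0, 0.9], [0.45, 0.4]]
toy_classes = ["pos", "pos", "neg", "neg", "neg"]
accuracy = leave_one_out_accuracy(toy_vectors, toy_classes)
print(accuracy)
```

In this toy dataset four of the five queries are classified correctly; the last instance is misclassified because its nearest neighbor belongs to the other class.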

Results using different distance measures
The classification algorithm of PROTNN supports any user-defined distance measure.
In this section, we study the effect of varying the distance measure on the classification accuracy of PROTNN. We fixed k=1, and we used nine different well-known distance measures, namely Euclidean, standardized Euclidean (std-Euclidean), Cosine, Manhattan, Correlation, Minkowski, Chebyshev, Canberra, and Braycurtis. See [32] for a formal definition of these measures. Figure 2 shows the obtained classification results of a LOO evaluation on each of the benchmark datasets using each of the distance measures.
Overall, varying the distance measure did not significantly affect the classification accuracy of PROTNN on the six datasets. Indeed, the standard deviation of the classification accuracy of PROTNN with each distance measure did not exceed 4 % on the six datasets. A ranking based on the average classification accuracy over the six datasets suggests the following descending order: (1) Manhattan, (2) Braycurtis, (3) std-Euclidean, (4) Canberra, (5) Cosine, (6-7) Euclidean and Minkowski (tied), (8) Correlation, (9) Chebyshev.
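For illustration, a few of the distance measures compared here can be written directly from their usual definitions. This is a sketch under assumptions: the function names and the example vectors are hypothetical, and the formulas follow the common conventions (for instance, Canberra terms where both coordinates are zero are skipped); see [32] for the formal definitions.

```python
import math

def manhattan(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))

def chebyshev(u, v):
    return max(abs(a - b) for a, b in zip(u, v))

def braycurtis(u, v):
    # Undefined when all coordinate sums are zero; not handled in this sketch.
    return manhattan(u, v) / sum(abs(a + b) for a, b in zip(u, v))

def canberra(u, v):
    # Terms where both coordinates are zero contribute nothing.
    return sum(abs(a - b) / (abs(a) + abs(b)) for a, b in zip(u, v) if a or b)

# Registry of interchangeable measures; math.dist is the Euclidean distance.
DISTANCES = {
    "euclidean": math.dist,
    "manhattan": manhattan,
    "chebyshev": chebyshev,
    "braycurtis": braycurtis,
    "canberra": canberra,
}

# Hypothetical normalized attribute vectors.
u, v = [1.0, 0.0, 0.5], [0.5, 0.0, 1.0]
for name, measure in DISTANCES.items():
    print(name, round(measure(u, v), 4))
```

Because every entry has the same two-argument signature, any of them can be plugged into a nearest-neighbor search without further changes.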

Results using different numbers of nearest neighbors
In the following, we evaluate the classification accuracy of PROTNN on each of the six benchmark datasets using different numbers of nearest neighbors k ∈ [1,10]. For the sake of generalization, we perform the same experiment using each of the top-five distance measures. For simplicity, we only plot the average value of classification accuracy for each value of k ∈ [1,10] over the six datasets using each of the top-five measures. Note that the standard deviation for each value of k did not exceed 2 %. Figure 3 shows the obtained results. The number of nearest neighbors k has a clear effect on the accuracy of PROTNN. The results suggest that the "optimal" value of k lies in {1,2}. The overall tendency shows that the accuracy decreases with higher values of k. This is due to the structural similarity that a query protein may share with other evolutionarily close proteins that could belong to the same structural class but exert different functions. High values of k lead to considering too many neighbors, which may cause a misclassification. However, it is worth noting that for datasets with low intra-class similarity among protein structures, PROTNN could need a higher value of k.

Analysis of the used attributes
In the following, we study the importance of the used attributes in order to identify the most informative ones. We follow the Recursive Feature Elimination (RFE) [33] using PROTNN as the classifier. In RFE, one feature is removed at each iteration, where the remaining features are the ones that best enhance the classification accuracy. We stop the pruning when no further enhancement is observed or no more features are left. The remaining features constitute the optimal subset for that context. In Table 2, we record the ranking of the used attributes in our experiments. For more generalization, RFE was performed on each of the six datasets using a combination of each of the top-five distance measures and each of the top-five values of k. The total number of RFE experiments is 150 (6 × 5 × 5). For each attribute, we count the total number of times it appeared in the optimal subset of attributes. A score of (total count) / (number of experiments) is assigned to each attribute according to its total count.
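The elimination loop and the attribute score can be sketched as a greedy backward elimination. This is an illustration under assumptions, not the exact procedure of [33]: the function names are hypothetical, `toy_accuracy` is a made-up stand-in for the LOO accuracy of the classifier on a feature subset, and the loop stops at one remaining feature because a classifier needs at least one attribute.

```python
from collections import Counter

def backward_elimination(features, evaluate):
    """Greedily drop one feature per iteration while the accuracy improves."""
    selected = list(features)
    best_score = evaluate(selected)
    while len(selected) > 1:
        # Try removing each feature and keep the removal that helps most.
        candidates = [(evaluate([f for f in selected if f != drop]), drop)
                      for drop in selected]
        score, drop = max(candidates)
        if score <= best_score:
            break  # no further enhancement observed
        selected.remove(drop)
        best_score = score
    return selected

def attribute_scores(optimal_subsets):
    """Score = (number of optimal subsets containing the attribute) / (number of experiments)."""
    counts = Counter(attr for subset in optimal_subsets for attr in set(subset))
    return {attr: counts[attr] / len(optimal_subsets) for attr in counts}

def toy_accuracy(subset):
    # Hypothetical accuracy in percent: three attributes help, "A11" hurts.
    useful = {"A15": 40, "A17": 30, "A12": 20}
    return sum(useful.get(f, 0) for f in subset) - (10 if "A11" in subset else 0)

optimal = backward_elimination(["A11", "A12", "A15", "A17"], toy_accuracy)
scores = attribute_scores([["A12", "A15"], ["A15"], ["A15", "A17"], optimal])
print(optimal, scores)
```

With 6 datasets, 5 distance measures, and 5 values of k, `attribute_scores` would receive 150 optimal subsets, one per experiment.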
It is clear that the best subset of attributes is dataset dependent. The five most informative attributes are, in order: A15 (energy), A17 (link impurity), A12 (number of distinct eigenvalues), A16 (neighborhood impurity), and A13 (spectral radius). All spectral attributes proved to be very informative. Indeed, three of them (A15, A12, and A13) ranked in the top five, and A14 (second largest eigenvalue) ranked in the top ten (9th) with a score of 0.52, meaning that every spectral attribute was selected in the optimal subset of attributes in more than half of all the experiments. Unsurprisingly, A11 (percentage of end points) ranked last with a very low score. This is because proteins are dense molecules and thus very few nodes of their respective graphs will be end points (extremity amino acids in the primary structure with no spatial links). Label attributes also proved to be very informative. Indeed, A17, A16, and A18 (label entropy) ranked respectively 2nd, 4th, and 6th with scores of more than 0.61. This is due to the importance of the distribution of the types of amino acids and their interactions. Both have to follow a certain harmony in order to produce a particular structural form (for instance an α-helix or a β-sheet) and to exert a specific function. A9 (closeness centrality), A5 (average clustering coefficient) and A8 (effective radius) ranked in the top ten with scores of about 0.5 or more (A8 scored 0.49 ≈ 0.5). However, A1 (number of nodes), A2 (number of edges), A3 (average degree), A4 (density), A6 (average effective eccentricity), A7 (effective diameter), and A10 (percentage of central nodes) all scored less than 0.5. This is because each of them is represented by one of the top-ten attributes and thus carries redundant information. A6 and A9 are both expressed based on all shortest paths of the graph. Both A7 and A8 are expressed based on A6. A10 is expressed based on A8 and thus on A6 too. A1, A2, A3, and A4 are all highly correlated to A5.

Analysis of the used classifier
In this section, we compare the use of the KNN principle in PROTNN with the use of another classifier. We chose the Support Vector Machine (SVM) [34] for comparison and we term this approach PROTSVM. We use PROTSVM first with a linear kernel (PROTSVM(linear)), then with a non-linear RBF (Radial Basis Function) kernel (PROTSVM(rbf)). Table 3 shows the accuracy results of PROTSVM(linear), PROTSVM(rbf) and PROTNN (using the std-Euclidean distance and k=1). All three approaches are used with RFE. We notice that PROTNN scored better than both PROTSVM(linear) and PROTSVM(rbf) on the six datasets, with an average classification accuracy of 0.93 compared to 0.81 and 0.74 for PROTSVM(linear) and PROTSVM(rbf), respectively.

Comparison with other classification techniques
We compare our approach with multiple state-of-the-art approaches for protein structure classification, namely: sequence alignment-based classification (using Blast [8]), structural alignment-based classification (using Combinatorial Extension (CE) [12], Sheba [13], and FatCat [14]), and substructure (subgraph)-based classification (using GAIA [30], LPGBCMP [31], and D&D [17]). For sequence and structural alignment-based classification, we align each protein against all the rest of the dataset. We assign to the query protein the class of the reference protein with the best hit score. The selected substructure-based approaches all mine discriminative subgraphs. LPGBCMP is used with max_var = 1 and d = 0.25 for, respectively, feature consistency map building and overlapping. In [31], LPGBCMP outperformed several other approaches from the literature including LEAP [35], gPLS [36], and COM [29] on the classification of the same six benchmark datasets. GAIA was shown in [30] to outperform other state-of-the-art approaches, namely COM and graphSig [37]. D&D was shown in [17] to also outperform COM and graphSig, and to be highly competitive with GAIA. For all these approaches, the discovered substructures are considered as features for describing each example of the original data. The constructed description matrix is used for training in the classification. For our approach, we show the classification accuracy results of PROTNN with RFE using std-Euclidean distance. We also show the best results obtained by PROTNN with parameter optimization (denoted PROTNN*). Table 4 shows the obtained results.
The alignment-based approaches FatCat and Sheba outperformed CE, Blast, and all the subgraph-based approaches. Indeed, FatCat scored best on three of the first four datasets and Sheba scored best on the last two datasets. Except for CE, all the other approaches scored on average better than Blast. This shows that spatial information constitutes an important asset for protein classification by emphasizing structural properties that the primary sequence alone does not provide. Among the subgraph-based approaches, D&D scored better than LPGBCMP and GAIA in all cases except on DS1, where GAIA scored best. On average, PROTNN* ranked first with the smallest distance between its results and the best obtained accuracies on each dataset. This is because PROTNN considers both structural information and hidden topological properties that are omitted by the other approaches. However, the top four classification methods, namely PROTNN*, FatCat, PROTNN (without parameter optimization) and Sheba, all showed close and very competitive classification results.
In order to make the classification evaluation more challenging, we construct a seventh dataset out of the previous six benchmark datasets. This dataset contains seven classes: the six positive classes and a seventh class that contains all the negative instances from the six benchmark datasets. Merging all the negatives into a single large class makes the dataset imbalanced, with 29, 33, 38, 38, 35 and 41 instances respectively for the first six classes and 214 instances for the seventh class. This makes the classification even more challenging. We evaluate the classification performance of our approach (namely PROTNN and PROTNN*) compared to FatCat and CE, which are the structural alignment approaches used on the PDB website (see Endnote 1). The classification accuracies on this dataset were 0.53, 0.84, 0.88 and 0.95 respectively for CE, PROTNN, PROTNN* and FatCat. Although FatCat performed better than our approach on this dataset, overall none of the approaches showed a large variation compared to its results on the first six datasets. Combining these results with those in Table 4, PROTNN* and FatCat have equivalent average classification accuracies of respectively 0.94±0.03 and 0.94±0.07 on all the datasets, while CE and PROTNN respectively scored 0.58±0.15 and 0.91±0.07.

Scalability and runtime analysis
Besides being accurate, an efficient protein 3D-structure classification approach has to be very fast in order to be practical under the increasing load of data in real-world applications. In this section, we study the runtime of our approach and of FatCat, the most competitive approach according to our previous comparative experiments. We analyze the variation of runtime for both approaches with an increasing number of proteins, ranging from 10 to 100 3D-structures with a step size of 10. In Fig. 4, we report the runtime results in log10 scale. A huge gap is clearly observed between the runtime of PROTNN and that of FatCat. The gap gets larger with higher numbers of proteins. Indeed, FatCat took over 5570 s on the 100 proteins while PROTNN (all steps) did not exceed 118 s on the same set, which means that our approach is 47x faster than FatCat in that experiment. The average runtime of graph transformation in PROTNN was 0.8 s per protein and that of the computation of attributes was 0.6 s per protein. The total runtime of similarity search and classification for PROTNN was only 0.1 s on the set of 100 proteins. Note that in real-world applications, the preprocessing (graph transformation and attribute computation) of the reference database is performed only once, and the database can be updated with no need to recompute the existing values. This ensures computational efficiency and easy extension of our approach.

Scalability to a PDB-wide classification
We further evaluate the scalability of PROTNN in the classification of the entire Protein Data Bank (described in Section "The protein data bank"). We also show the runtime for FatCat and CE (the structural alignment approaches used on the PDB website). We recall that the experiments are run in single-process mode with no parallelization for all the approaches. Note that on the PDB website, the structural alignment is either pre-computed for structures of the database, or only performed on a sub-sample of the PDB for customized or local files. Table 5 shows the obtained results. It is clear that the computation of attributes is the most expensive part of our approach, as some attributes are very complex. However, building the graph models and computing the attributes constitute the preprocessing step and are only performed once for the reference database. The classification step took almost three hours, with an average runtime of 0.1 s for the classification of each protein against the entire PDB. The total PROTNN runtime was less than a week, with an average runtime of 5.9 s for the preprocessing and classification of each protein 3D-structure against the entire PDB. On the other hand, neither FatCat nor CE finished running within two weeks. We computed the average runtime for each approach on the classification of a random sample of 100 proteins against all the PDB. On average, FatCat and CE took respectively more than 42 and 32 h per protein, making our approach thousands of times faster than both on the classification of a 3D-structure against the entire PDB.
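The reported figures can be cross-checked with simple arithmetic. The sketch below only uses numbers stated in the text (94126 structures, 0.1 s and 5.9 s per protein, more than 42 h and 32 h per protein for FatCat and CE, and 5570 s versus 118 s on 100 proteins); the variable names are illustrative, and the speedups are lower bounds since the FatCat and CE runtimes are reported as "more than".

```python
N_STRUCTURES = 94126  # PDB structures used in the experiments

# Total classification time at 0.1 s per protein, in hours.
classification_hours = N_STRUCTURES * 0.1 / 3600

# Total preprocessing + classification time at 5.9 s per protein, in days.
protnn_days = N_STRUCTURES * 5.9 / 86400

# Per-protein speedup of PROTNN (5.9 s) over FatCat (> 42 h) and CE (> 32 h).
fatcat_speedup = 42 * 3600 / 5.9
ce_speedup = 32 * 3600 / 5.9

# Speedup on the 100-protein experiment (5570 s versus 118 s).
small_scale_speedup = 5570 / 118

print(f"classification: {classification_hours:.1f} h")
print(f"PROTNN total: {protnn_days:.1f} days")
print(f"speedup vs FatCat: {fatcat_speedup:.0f}x, vs CE: {ce_speedup:.0f}x")
print(f"100-protein speedup: {small_scale_speedup:.1f}x")
```

These values are consistent with the statements "almost three hours", "less than a week", and a per-protein speedup in the tens of thousands.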

Conclusion
In this paper, we proposed PROTNN, a new fast and accurate approach for protein 3D-structure classification. We defined a graph transformation and embedding model that incorporates explicit as well as hidden structural and topological properties of the 3D-structure of proteins. We implemented the proposed model and experimentally demonstrated that it classifies protein 3D-structures efficiently. Empirical results of our experiments showed that considering structural information constitutes a major asset for an accurate classification of proteins. They also showed that alignment-based classification as well as subgraph-based classification are very competitive approaches. Yet, as the number of pairwise comparisons between proteins grows tremendously with the size of the dataset, more detailed models would incur enormous computational costs. Here, we highlight that PROTNN could accurately classify multiple benchmark datasets from the literature with very low computational costs. For large-scale studies, it is an asset that PROTNN scales up to a PDB-wide dataset in single-process mode with no parallelization, where it ran thousands of times faster than state-of-the-art approaches when classifying a 3D-structure against the entire PDB. In future work, we aim to study and integrate more attributes in our model in order to further enhance the accuracy of our classification system.

Endnote
1 http://www.rcsb.org/pdb/.