ProtNN: Fast and Accurate Nearest Neighbor Protein Function Prediction based on Graph Embedding in Structural and Topological Space

Studying the function of proteins is important for understanding the molecular mechanisms of life. The number of publicly available protein structures has grown extremely large. Still, determining the function of a protein structure remains a difficult, costly, and time-consuming task. The difficulties are often due to the essential role of spatial and topological structures in the determination of protein functions in living cells. In this paper, we propose ProtNN, a novel approach for protein function prediction. Given an unannotated protein structure and a set of annotated proteins, ProtNN finds the nearest neighbor annotated structures based on protein graph pairwise similarities. ProtNN assigns to the query protein the function with the highest number of votes across the set of k nearest neighbor reference proteins, where k is a user-defined parameter. Experimental evaluation demonstrates that ProtNN accurately classifies several datasets with an extremely fast runtime compared to state-of-the-art approaches.


Introduction
Proteins are ubiquitous in living cells. They play key roles in the functional and evolutionary machinery of species. Studying protein functions is paramount for understanding the molecular mechanisms of life. Advances in sequencing techniques provide large amounts of biological data, including protein structures. In fact, the number of proteins in the Protein Data Bank (PDB) [1] has more than tripled over the last decade. Alternative databases such as SCOP [2] and CATH [3] are following the same trend. However, determining the function of protein structures remains a difficult, costly, and time-consuming task. Manual protein functional classification methods can no longer keep up with the rapid increase of data. Accurate computational and machine learning tools present an efficient alternative that could offer a considerable boost in meeting the increasing load of data.
Proteins are long chains of amino acids folded into complex three-dimensional structures. This spatial structure is an essential component of protein functionality and is thus subject to evolutionary pressures to optimize the inter-residue contacts that support it [4]. Existing computational methods for protein function prediction try to simulate the biological phenomena that define the function of a protein. The most conventional technique is to perform a similarity search between an unknown protein and a reference database of proteins with known functions. The query protein is assigned the functional class of the most similar (based on the sequence or the structure) reference protein. There exist several classification methods based on the protein sequence (e.g. Blast [5], ...) or the protein structure (e.g. Combinatorial Extension [6], Sheba [7], FatCat [8], ...). These methods rely on the assumption that proteins sharing the most common sites are more likely to share functions. This classification strategy is based on the hypothesis that structurally similar proteins could share a common ancestor [9]. Another popular approach for protein functional classification is to look for relevant substructures (also called motifs) among proteins with known functions, then use them as features to identify the function of unknown proteins. Such motifs could be discriminative [10], representative [11], cohesive [4], etc. Each of the mentioned protein functional classification approaches suffers from different drawbacks. Sequence-based classification does not incorporate spatial information about amino acids that are not contiguous in the primary structure but are interconnected in 3D space. This makes it less effective in predicting the function of structurally similar proteins with low sequence similarity (remote homologues).
Both structure-based and substructure-based classification techniques do incorporate spatial information in function prediction, which makes them more accurate than sequence-based classification. However, this makes them subject to the "no free lunch" principle [12], where the gain in accuracy comes at the expense of computational cost. Hence, it is essential to find an efficient way to incorporate 3D structure information with low time complexity.
In this paper, we present ProtNN, a novel approach for function prediction of protein 3D structures. ProtNN incorporates protein 3D structure information via the combination of a rich set of structural and topological descriptors that provide an informative multi-view representation of the structures, considering spatial information through different dimensions. Such a representation transforms the complex protein 3D structure into an attribute vector of fixed size, allowing computational efficiency. For classification, ProtNN assigns to a query protein the function with the highest number of votes across the set of its k most similar reference proteins, where k is a user-defined parameter. Experimental evaluation shows that ProtNN is able to accurately classify different benchmark datasets with a speedup of up to 47x compared to gold standard approaches from the literature.

Graph Representation of Protein 3D Structures
A crucial step in computational studies of protein 3D structures is to look for a convenient representation of their spatial conformations. Graphs are the most appropriate data structure to model the complex structure of proteins. In this context, a protein 3D structure can be seen as a set of elements (amino acids and atoms) that are interconnected through chemical interactions [4,9,11]. Figure 1 shows a real example of the human hemoglobin protein and its graph representation. The figure clearly shows that the graph representation preserves the overall structure of the protein and its components.
Figure 1: The human hemoglobin protein 3D-structure (PDBID: 1GZX) and its corresponding graph. Nodes and edges represent, respectively, amino acids from the structure and spatial links between them. Blue edges represent the primary structure and gray edges are spatial links between distant amino acids.
Protein Graph Model
Let G be a graph consisting of a set of nodes V and a set of edges E. L is a label function that associates a label l with each node in V. Each node of G represents an amino acid from the 3D structure and is labeled with its amino acid type. Let ∆ be a function that computes the Euclidean distance ∆(u, v) between pairs of nodes, ∀u, v ∈ V, and let δ be a distance threshold. Each node in V is defined by its 3D coordinates in ℝ³, and both ∆ and δ are expressed in angstroms (Å). Two nodes u and v (∀u, v ∈ V) are linked by an edge e(u, v) ∈ E if the distance between their Cα atoms is less than or equal to δ. Formally, the adjacency matrix A of G is defined as follows:
$$A_{u,v} = \begin{cases} 1 & \text{if } \Delta(u, v) \leq \delta \\ 0 & \text{otherwise} \end{cases} \qquad (1)$$
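As an illustration of the graph model, the following minimal sketch builds the adjacency matrix from a list of Cα coordinates. The function name `build_protein_graph` and the toy coordinates are hypothetical, and the sketch assumes that self-loops are excluded, which the text does not state explicitly.

```python
import math


def build_protein_graph(ca_coords, delta=7.0):
    """Return a 0/1 adjacency matrix for C-alpha coordinates given in angstroms.

    Two distinct residues are linked when their Euclidean distance is <= delta.
    Assumption: the diagonal is left at 0 (no self-loops).
    """
    n = len(ca_coords)
    adj = [[0] * n for _ in range(n)]
    for u in range(n):
        for v in range(u + 1, n):
            if math.dist(ca_coords[u], ca_coords[v]) <= delta:
                adj[u][v] = adj[v][u] = 1
    return adj


# Toy example: three hypothetical C-alpha positions on a line.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (20.0, 0.0, 0.0)]
adjacency = build_protein_graph(coords, delta=7.0)
for row in adjacency:
    print(row)
```

Only the first two residues are within 7 Å of each other, so they are the only linked pair; the third residue stays isolated.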

Structural and Topological Embedding of Protein Graphs
Graph Embedding
In ProtNN, each protein 3D structure is represented by a graph according to Equation 1. Then, each graph is embedded into a vector of structural and topological features, under the assumption that structurally similar graphs should give similar structural and topological feature vectors. In this manner, ProtNN guarantees accuracy and computational efficiency. It is worth noting that even though structurally similar graphs should have similar topological properties, ProtNN similarity need not give the same results as structure matching (as in structural alignment). Rather, it should enrich them, since ProtNN also considers hidden similarities (such as graph density and energy) that are not considered in structural matching.
Structural and Topological Attributes
In ProtNN, the pairwise similarity between two protein graphs is measured by the distance between their vector representations. In order to avoid the loss of structural information in the embedding, and to guarantee ProtNN accuracy, we use a set of structural and topological attributes from the literature that have been shown to be interesting and efficient in describing connected graphs [13,14]. The attributes used are listed below (see the Appendix for formal definitions): (A1) number of nodes, (A2) number of edges, (A3) average degree, (A4) density, (A5) average clustering coefficient, (A6) average effective eccentricity, (A7) effective diameter, (A8) effective radius, (A9) closeness centrality, (A10) percentage of central nodes, (A11) percentage of end points, (A12) number of distinct eigenvalues, (A13) spectral radius, (A14) second largest eigenvalue, (A15) energy, (A16) neighborhood impurity, (A17) link impurity, and (A18) label entropy.
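To make the embedding concrete, the sketch below computes the first five attributes (A1 to A5) from an adjacency matrix. It uses the standard textbook definitions of average degree, density, and local clustering coefficient for undirected simple graphs; these are assumptions here and may differ in detail from the definitions used by the authors. The function name `basic_attributes` is hypothetical.

```python
def basic_attributes(adj):
    """Compute attributes A1-A5 for an undirected simple graph (0/1 matrix)."""
    n = len(adj)
    degrees = [sum(row) for row in adj]
    m = sum(degrees) // 2  # each edge is counted twice in the degree sum
    avg_degree = 2 * m / n if n else 0.0
    density = 2 * m / (n * (n - 1)) if n > 1 else 0.0

    # Local clustering coefficient: fraction of linked pairs among a node's
    # neighbors; taken as 0 for nodes with fewer than two neighbors.
    coeffs = []
    for u in range(n):
        nbrs = [v for v in range(n) if adj[u][v]]
        k = len(nbrs)
        if k < 2:
            coeffs.append(0.0)
            continue
        links = sum(adj[a][b] for i, a in enumerate(nbrs) for b in nbrs[i + 1:])
        coeffs.append(2 * links / (k * (k - 1)))
    avg_clustering = sum(coeffs) / n if n else 0.0

    return {
        "nodes": n,
        "edges": m,
        "avg_degree": avg_degree,
        "density": density,
        "avg_clustering": avg_clustering,
    }


# Toy graph: a triangle (0, 1, 2) with a pendant node 3 attached to node 2.
adj = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]
attrs = basic_attributes(adj)
print(attrs)
```

Each protein graph yields one such fixed-size record, so proteins of very different sizes become directly comparable vectors.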

ProtNN: Nearest Neighbor Protein Functional Classification
The general classification pipeline of ProtNN can be described as follows. First, a preprocessing step is performed on the reference protein database Ω, in which a graph model $G_P$ is created for each reference protein P, ∀P ∈ Ω, according to Equation 1. A structural and topological description vector $V_P$ is created for each graph model $G_P$ by computing the values of each of the structural and topological attributes described in Section 2.2. The resulting matrix $M_\Omega$, whose rows are the vectors $V_P$, ∀P ∈ Ω, represents the preprocessed reference database that is used for prediction in ProtNN. In order to guarantee an equal participation of all used attributes in the classification, a min-max normalization, $x_{\text{normalized}} = \frac{x - \min}{\max - \min}$, where x is an attribute value and min and max are the minimum and maximum values of the attribute vector, is applied to each attribute of $M_\Omega$ independently, so that no attribute dominates the prediction. It is also worth mentioning that for real-world applications $M_\Omega$ is computed only once, and can be incrementally updated with other attributes as well as newly added protein 3D structures with no need to recompute the attributes for the entire set. This makes ProtNN flexible and easy to extend in real-world applications. The prediction step in ProtNN is described in Algorithm 1.
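The normalization and voting steps can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the toy attribute values, and the class labels are hypothetical. It assumes Euclidean distance, assumes that the query vector is scaled with the minima and maxima of the reference matrix, and breaks voting ties in favor of the class encountered first among the nearest neighbors.

```python
import math
from collections import Counter


def fit_min_max(matrix):
    """Return a function that min-max scales a vector column by column."""
    cols = list(zip(*matrix))
    mins = [min(c) for c in cols]
    maxs = [max(c) for c in cols]

    def scale(row):
        # Constant columns carry no information and are mapped to 0.
        return [
            (x - lo) / (hi - lo) if hi > lo else 0.0
            for x, lo, hi in zip(row, mins, maxs)
        ]

    return scale


def predict(query_vec, ref_vecs, ref_labels, k=1):
    """Majority vote over the k reference vectors nearest to the query."""
    order = sorted(
        range(len(ref_vecs)),
        key=lambda i: math.dist(query_vec, ref_vecs[i]),
    )
    votes = Counter(ref_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Toy reference database with two hypothetical attributes per protein.
reference = [[100, 300], [120, 360], [400, 900], [420, 1000]]
labels = ["kinase", "kinase", "lectin", "lectin"]

scale = fit_min_max(reference)
normalized_reference = [scale(row) for row in reference]
query = scale([110, 320])

print(predict(query, normalized_reference, labels, k=1))
print(predict(query, normalized_reference, labels, k=3))
```

Because the reference matrix is scaled once and reused, classifying a new query only requires embedding that single protein and computing one distance per reference vector.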

Datasets
We use six benchmark datasets of protein structures that have previously been used in [10,15,16,17]. Each dataset is composed of positive protein examples that are from a selected protein family, and negative protein examples that are randomly sampled from the PDB [1]. The selected positive protein families are Vertebrate phospholipase A2, G-protein family, C1-set domains, C-type lectin domains, and protein kinases, catalytic subunits. Table 1 summarizes the characteristics of the six datasets.
Algorithm 1 (prediction step): $V_Q$ ← $G_Q$ is embedded into a vector using the attributes; the distance between the vectors of the query protein Q and each reference protein P is computed; $NN_Q^k$ ← the k nearest reference protein neighbors are selected; $C_Q$ ← the functional class with the highest number of votes across the set of $NN_Q^k$ reference proteins.

Protocol and Settings
Experiments were performed on CentOS Linux workstations with an Intel Core i7 CPU at 3.40 GHz and 16 GB of RAM. To transform proteins into graphs, we used a δ value of 7 Å. The evaluation measure is the classification accuracy, and the evaluation technique is Leave-One-Out (LOO), where each dataset is used to create N classification scenarios, N being the number of proteins in the dataset. In each scenario, one protein is used as the query instance and the rest of the dataset is used as the reference set. The aim is to correctly predict the class of the query protein. The classification accuracy for each dataset is averaged over the results of all N evaluations.
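The Leave-One-Out protocol can be sketched as below for the k = 1 case. The helper names and the toy vectors are hypothetical, and Euclidean distance is assumed; the sketch only illustrates how the N scenarios are formed and averaged.

```python
import math


def nearest_label(query, refs, labels):
    """Label of the single reference vector closest to the query (k = 1)."""
    best = min(range(len(refs)), key=lambda i: math.dist(query, refs[i]))
    return labels[best]


def leave_one_out_accuracy(vectors, labels):
    """Hold out each protein in turn and classify it against all the others."""
    correct = 0
    for i in range(len(vectors)):
        refs = vectors[:i] + vectors[i + 1:]
        ref_labels = labels[:i] + labels[i + 1:]
        if nearest_label(vectors[i], refs, ref_labels) == labels[i]:
            correct += 1
    return correct / len(vectors)


# Toy data: two tight clusters plus one "neg" point lying closer to the "pos" cluster.
vectors = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.0], [0.5, 0.4]]
labels = ["pos", "pos", "neg", "neg", "neg"]
accuracy = leave_one_out_accuracy(vectors, labels)
print(accuracy)
```

In this toy set only the last point is misclassified, because its nearest neighbor belongs to the other class.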

ProtNN Classification Results
Results Using Different Distance Measures
We study the effect of varying the distance measure on the classification accuracy of ProtNN. We fixed k = 1 and used nine different distance measures, namely Euclidean, standardized Euclidean (std-Euclidean), Cosine, Manhattan, Correlation, Minkowski, Chebyshev, Canberra, and Braycurtis. See [18] for a formal definition of these measures. Figure 2 shows the obtained results. Overall, varying the distance measure did not

Results Using Different Numbers of Nearest Neighbors
In the following, we evaluate the classification accuracy of ProtNN on each of the six benchmark datasets using different numbers of nearest neighbors k ∈ [1,10]. The same experiment is performed using each of the top-five distance measures. For simplicity, we plot the average classification accuracy for each value of k ∈ [1,10] over the six datasets using each of the top-five measures. Figure 3 shows the obtained results. The number of considered nearest neighbors k has a clear effect on the accuracy. The obtained results suggest that the "optimal" value of k lies in {1,2}. Overall, accuracy tends to decrease with higher values of k. This is due to the structural similarity that a query protein may share with other evolutionarily close proteins that exert different functions. High values of k lead to considering too many neighbors, which may cause misclassification.

Analysis of the Used Attributes
In the following, we study the importance of the used attributes in order to identify which ones are the most informative. We follow the Recursive Feature Elimination (RFE) approach [19] with ProtNN as the classifier. In RFE, one feature is removed at each iteration, where the remaining features are the ones that best enhance the classification accuracy. The pruning stops when no further enhancement is observed or no more features are left. The remaining features constitute the optimal subset for that context. In Table 2, we record the ranking of the used attributes in our experiments. For more generality, RFE was performed on each of the six datasets using a combination of each of the top-five distance measures and each of the top-five values of k. The total number of experiments is 150. For each attribute, we count the total number of times it appears in the optimal subset of attributes. Each attribute is then assigned a score equal to its total count divided by the number of experiments. It is clear that the best subset of attributes depends on the dataset. The five most informative attributes are, in order: A15 (energy), A17 (link impurity), A12 (number of distinct eigenvalues), A16 (neighborhood impurity), and A13 (spectral radius). All spectral attributes proved to be very informative. Indeed, three of them (A15, A12, and A13) ranked in the top five, and A14 (second largest eigenvalue) ranked in the top ten (9th) with a score of 0.52, meaning that for more than half of all the experiments, all spectral attributes were selected in the optimal subset of attributes. Unsurprisingly, A11 (percentage of end points) ranked last with a very low score. This is because proteins are dense molecules and thus very few nodes of their respective graphs will be end points (extremity amino acids in the primary structure with no spatial links).
Label attributes also proved to be very informative. Indeed, A17, A16, and A18 ranked respectively 2nd, 4th, and 6th with scores of more than 0.61. This is due to the importance of the distribution of the types of amino acids and their interactions. Both have to follow a certain harmony in order to exert a particular function. A9 (closeness centrality), A5 (average clustering coefficient), and A8 (effective radius) ranked in the top ten with scores of about 0.5 or more (A8 scored 0.49 ≈ 0.5). However, A1 (number of nodes), A2 (number of edges), A3 (average degree), A4 (density), A6 (average effective eccentricity), A7 (effective diameter), and A10 (percentage of central nodes) all scored less than 0.5. This is because each of these attributes is represented by one of the top-ten attributes and thus carries redundant information. A6 and A9 are both expressed based on all shortest paths of the graph. Both A7 and A8 are expressed based on A6. A10 is expressed based on A8 and thus on A6 too. A1, A2, A3, and A4 are all highly correlated with A5.
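The elimination procedure described in this section can be sketched as a greedy backward elimination wrapped around a 1-nearest-neighbor classifier evaluated with Leave-One-Out. This is only an illustration under stated assumptions: the function names and toy data are hypothetical, Euclidean distance is used, and "no further enhancement" is interpreted as no strict increase in accuracy.

```python
import math


def loo_accuracy(vectors, labels, features):
    """Leave-One-Out accuracy of a 1-NN classifier restricted to `features`."""
    def project(v):
        return [v[f] for f in features]

    correct = 0
    for i, query in enumerate(vectors):
        nearest = min(
            (j for j in range(len(vectors)) if j != i),
            key=lambda j: math.dist(project(query), project(vectors[j])),
        )
        correct += labels[nearest] == labels[i]
    return correct / len(vectors)


def backward_elimination(vectors, labels):
    """Drop one feature per iteration while accuracy strictly improves."""
    features = list(range(len(vectors[0])))
    best_acc = loo_accuracy(vectors, labels, features)
    while len(features) > 1:
        # Score every candidate removal and keep the most beneficial one.
        candidates = [
            (loo_accuracy(vectors, labels, [f for f in features if f != drop]), drop)
            for drop in features
        ]
        acc, drop = max(candidates, key=lambda c: c[0])
        if acc <= best_acc:
            break
        features.remove(drop)
        best_acc = acc
    return features, best_acc


# Toy data: feature 0 separates the classes, feature 1 is misleading noise.
vectors = [[0.0, 0.9], [0.2, 0.1], [0.6, 0.95], [0.8, 0.0]]
labels = ["A", "A", "B", "B"]
selected, acc = backward_elimination(vectors, labels)
print(selected, acc)
```

Repeating such a run for every dataset, distance measure, and value of k, and counting how often each attribute survives, gives the per-attribute score described above.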

Comparison with Other Classification Techniques
We compare our approach with multiple state-of-the-art approaches for protein function prediction, namely: sequence alignment-based classification (using Blast [5]), structural alignment-based classification (using Combinatorial Extension (CE) [6], Sheba [7], and FatCat [8]), and substructure (subgraph)-based classification (using GAIA [17], LPGBCMP [15], and D&D [10]). For sequence and structural alignment-based classification, we align each protein against all the rest of the dataset. We assign to the query protein the function of the reference protein with the best hit score. The selected substructure-based approaches are all primarily designed for mining discriminative subgraphs. LPGBCMP is used with max_var = 1 and d = 0.25 for, respectively, feature consistency map building and overlapping. For all these approaches, the discovered substructures are considered as features for describing each example of the original data. The constructed description matrix is used for training in the classification. For our approach, we show the classification accuracy results of ProtNN with RFE using std-Euclidean distance. We also show the best results of ProtNN (denoted ProtNN*) with RFE using each of the top-five distance measures. We use k = 1 for both ProtNN and ProtNN*. Table 3 shows the obtained results.
The alignment-based approaches FatCat and Sheba outperformed CE, Blast, and all the subgraph-based approaches. Except for CE, all the other approaches scored on average better than Blast. This shows that spatial information constitutes an important asset for functional classification. Among the subgraph-based approaches, D&D scored better than LPGBCMP and GAIA in all cases except DS1, where GAIA scored best. On average, ProtNN* ranked first, with the smallest distance between its results and the best accuracies obtained on each dataset. This is because ProtNN considers both structural information and hidden topological properties that are omitted by the other approaches.
Table 3 notes: (1) Average classification accuracy of each classification approach over the six datasets. (2) Average of distances between the accuracy of each approach and the best obtained accuracy with each dataset.

Scalability and Runtime Analysis
In this section, we study the computational cost of ProtNN and FatCat, the most competitive approach. We analyze the variation of runtime for both approaches with increasing numbers of protein 3D-structures, ranging from 10 to 100 proteins with a step size of 10. In Figure 4, we report the runtime results in seconds (left) and in log10 scale (right). A huge gap is clearly observed between the runtime of ProtNN and that of FatCat. The gap gets larger with higher numbers of proteins. Indeed, FatCat took over 5570 seconds with the 100 proteins, while the ProtNN runtime did not exceed 118 seconds for the same set, which means that our approach is 47x faster than FatCat on that experiment. The average runtime of the graph transformation of ProtNN was 0.8 second, and that of the computation of attributes was 0.6 second, for each protein. The total runtime of similarity search and function prediction of ProtNN was only 0.1 second on the set of 100 proteins. In real-world applications, the graph transformation and attribute computation for the reference database are computed only once and can be updated with no need to recompute the existing values. This ensures computational efficiency and easy extension of our approach.
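The reported speedup follows directly from the two runtimes on the 100-protein set. As a worked check, using only the figures quoted above (5570 seconds as a lower bound for FatCat, 118 seconds as an upper bound for ProtNN):

```latex
\[
\text{speedup} \;\geq\; \frac{t_{\text{FatCat}}}{t_{\text{ProtNN}}}
  \;=\; \frac{5570\ \text{s}}{118\ \text{s}} \;\approx\; 47.2
\]
```

Here $t_{\text{FatCat}}$ and $t_{\text{ProtNN}}$ are shorthand introduced for this check, not symbols from the original text.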

Conclusion
In this paper, we proposed ProtNN, a new fast and accurate approach for protein function prediction. We defined a graph transformation and embedding model that incorporates explicit as well as hidden structural and topological properties of the 3D-structure of proteins. We implemented the proposed model and showed experimentally that it makes it possible to detect similarity and to predict the function of protein 3D-structures efficiently. Empirical results of our experiments showed that considering structural information constitutes a major asset for identifying the functions of proteins correctly. They also showed that alignment-based classification as well as subgraph-based classification are very competitive approaches. Yet, as the number of comparisons between pairs of proteins grows tremendously with the size of the dataset, more detailed models would incur enormous computational costs. ProtNN was shown to accurately classify multiple benchmark datasets from the literature with very low computational costs.
In future work, we aim to integrate more protein information into our model to further enhance the accuracy of our function prediction system. We also plan to apply our approach to a large-scale dataset that includes the entire PDB, which will take ProtNN beyond a theoretical proposition to become a reference bioinformatics tool for real-world applications.
A6-Average effective eccentricity: For a node u, the effective eccentricity represents the maximum length of the shortest paths between u and every other node v in G, $e(u) = \max\{d(u, v) : v \in V, u \neq v\}$. The average effective eccentricity is defined as $Ae(G) = \frac{1}{|V|} \sum_{i=1}^{|V|} e(u_i)$.
A7-Effective diameter: It represents the maximum value of effective eccentricity over all nodes in the graph G, i.e., $diam(G) = \max\{e(u) \mid u \in V\}$, where e(u) represents the effective eccentricity of u as defined above.
A8-Effective radius: The effective radius represents the minimum value of effective eccentricity over all nodes of G, $rad(G) = \min\{e(u) \mid u \in V\}$.
A9-Closeness centrality: The closeness centrality measures how fast information spreads from a given node to other reachable nodes in the graph.
A16-Neighborhood impurity: The neighborhood impurity of G is the average ImpDeg over all nodes.
A17-Link impurity: An edge {u, v} is considered to be impure if $L(u) \neq L(v)$. The link impurity of a graph G with k edges is defined as $\frac{|\{\{u, v\} \in E : L(u) \neq L(v)\}|}{k}$.
A18-Label entropy: It measures the uncertainty of labels. For a graph G with k labels, it is defined as $E(G) = -\sum_{i=1}^{k} p(l_i) \log p(l_i)$, where $l_i$ is the i-th label.
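The two label-based definitions above (A17 and A18) translate directly into code. The sketch below is illustrative only: the function names and the toy graph are hypothetical, $p(l_i)$ is taken to be the relative frequency of label $l_i$ among the nodes, and the natural logarithm is assumed because the base is not specified.

```python
import math
from collections import Counter


def link_impurity(edges, labels):
    """Fraction of edges whose two endpoints carry different labels (A17)."""
    if not edges:
        return 0.0
    impure = sum(1 for u, v in edges if labels[u] != labels[v])
    return impure / len(edges)


def label_entropy(labels):
    """Shannon entropy of the node label distribution (A18), natural log."""
    n = len(labels)
    counts = Counter(labels)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


# Toy graph: four residues with two amino acid types and four edges.
node_labels = ["ALA", "GLY", "ALA", "GLY"]
edges = [(0, 1), (1, 2), (0, 2), (2, 3)]

impurity = link_impurity(edges, node_labels)
entropy = label_entropy(node_labels)
print(impurity, entropy)
```

Three of the four edges join residues of different types, and the two labels are equally frequent, so the entropy equals the logarithm of 2.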