Network Embedding the Protein–Protein Interaction Network for Human Essential Genes Identification

Essential genes are a group of genes that are indispensable for cell survival and cell fertility. Studying human essential genes not only helps scientists reveal the underlying biological mechanisms of a human cell but also guides disease treatment. The recent publication of human essential gene data makes it possible to train a machine-learning classifier on features of the known human essential genes and to use the classifier to predict new human essential genes. Previous studies have found that the essentiality of genes closely relates to their properties in the protein–protein interaction (PPI) network. In this work, we propose a novel supervised method to predict human essential genes by network embedding the PPI network. Our approach implements a bias random walk on the network to get the network context of each node. Then, the node pairs are input into an artificial neural network to learn representation vectors that maximally preserve the network structure and the properties of the nodes in the network. Finally, the features are put into an SVM classifier to predict human essential genes. The prediction results on two human PPI networks show that our method achieves better performance than methods that use either genes’ sequence information or genes’ centrality properties in the network as input features. Moreover, it also outperforms methods that represent the PPI network by other existing approaches.


Introduction
Essential genes are a group of genes that are indispensable for cell survival and cell fertility [1]. Scientists study essential genes in bacteria to find potential drug targets for new antibiotics [2]. Identifying essential genes in prokaryotic or simple eukaryotic organisms not only helps us understand the primary requirements for cell life but also points the way to synthetic biology research that aims to create a cell with a minimal genome. In the past few years, the development of wet-lab experimental techniques, such as single-gene deletions [3], RNA interference [4], and conditional knockouts [5], has identified a number of essential genes in some simple eukaryotic organisms such as S. cerevisiae [6], C. elegans [7], and A. thaliana [8]. However, it is hard to apply these techniques to explore human essential genes due to technical difficulties and species limitations [9].
The close relationships between essential genes and human diseases found by previous research have motivated scientists to identify essential genes in humans to seek guidance for disease treatment [10][11][12]. However, it is hard to apply the computational methods that have successfully

Methods
Our method involves two steps to predict human essential genes. First, every node in the human PPI network is represented as a vector that preserves the structure of the input network. Second, the vectors are put into a classifier to train it and make predictions. Similar to the node2vec model [38], our method learns the feature representation for every node in the PPI network by the following steps. The first step is to conduct a bias random walk on the PPI network starting from a single node to obtain a node sequence that records the network context of the starting node. Next, a window slides along the node sequence to get the node pairs that have a close relationship in the network. These node pairs are put into a skip-gram-based architecture to learn the feature representation of every node. Figure 1 shows an overview of our method for identifying essential genes.

Bias Random Walk
Network embedding methods originate from learning features for words in the natural language processing field. The input of these methods is a word sequence that represents the neighborhood relationships of each word in the language context. A word and its neighbor form the input and output layers of an artificial neural network, and the weights of the hidden layer are the "node vectors" that we are trying to learn. In this work, the input is a PPI network, which is not linear and is composed of nodes with complicated neighborhood relationships. Hence, a bias random walk is simulated on the network to get an ordered node sequence starting from every node in the network.
Let G = (V, E, W) denote a PPI network, where a vertex v ∈ V is a gene, an edge e(u,v) ∈ E connects the genes u and v, and w(u,v) ∈ W is the weight of edge e(u,v), which measures the reliability of the connection between u and v. Here, all edges in the PPI network are treated equally, and their weights are set to one. Considering the modular nature of the PPI network, the walk moves from a vertex to its neighbors with different probabilities. That is, suppose the walk currently resides at vertex v and arrived there along the edge e(u,v), so that u is the vertex visited at the previous step. The walk then moves to one of the neighbors k of v according to the transition probability T(v,k) on edge e(v,k), which is defined as follows:

$$T(v,k) = \pi(u,k)\, w(v,k), \quad \text{with} \quad \pi(u,k) = \begin{cases} 1/p, & d_{uk} = 0 \\ 1, & d_{uk} = 1 \\ 1/q, & d_{uk} = 2 \end{cases} \tag{1}$$

where d_uk denotes the shortest path distance from the previous vertex u to the next vertex k. Specifically, d_uk = 0 means that u and k are the same vertex, i.e., the walk jumps back from v to its previous vertex u; d_uk = 1 means that k is a common neighbor of the vertices u and v; and d_uk = 2 means that u and k are connected only indirectly, so k is not a common neighbor of u and v. The parameter p controls the likelihood of traveling back to the previous vertex u. The parameter q controls the likelihood of moving on to a vertex k that is not adjacent to u. Setting different values for the parameters p and q guides a bias walk on the network. If the parameters p and q are larger than one, the walk is more likely to move to a vertex that is a common neighbor of v and its previous vertex u. After computing the transition probabilities for every edge in the PPI network, we row normalize the transition probability matrix T so that the walk-out probabilities of each node sum to one:

$$T_{norm}(v,k) = \frac{T(v,k)}{\sum_{j:\, e(v,j) \in E} T(v,j)} \tag{2}$$
In the bias random walk, the next vertex is drawn from the neighbors of the current vertex according to the normalized transition probabilities, so neighbors reached along edges with higher transition probability are visited more often. If a node has N neighbors, a straightforward draw needs O(N) time. This work adopts the alias sampling method, which selects the walk-out edge in O(1) time per draw once the alias tables have been built [38]. The outcome of the bias random walk is a vertex sequence that records the walking trace in the network starting from a vertex, which preserves the neighborhood relationships of the starting vertex in the network. Algorithm 1 gives the pseudocode of the bias random walk.

Algorithm 1. Bias random walk algorithm.
Input: G = (V, E, W), Len_walkLists, parameters w, p and q
Output: vertex sequence lists walkLists
T = computing transition probabilities (G, p, q, w)  // computing transition probabilities for every edge in the network
T_norm = normalizing T by Equation (2)
G' = (V, E, T_norm)
walkLists = {}
for iter = 1 to Len_walkLists do
    for every node u ∈ V do
        seq = {}  // starting a new sequence for the node u
        Append u to seq
        while len(seq) < w do
            t = seq[-1]  // getting the last node of the sequence seq
            N(t) = sort(GetNeighbors(t, G'))  // sorting the neighbor list of the current vertex in alphabetic order
            n = AliasSampling(N(t), T_norm)  // applying alias sampling with respect to the normalized transition probabilities to select the next neighbor to visit
            Append n to seq
        Append seq to walkLists
return walkLists
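The walk in Algorithm 1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in this work: the graph is an unweighted adjacency dictionary, the first step of each walk (which has no previous vertex) is drawn uniformly, and the names `build_alias`, `alias_draw`, `pi_weight`, `preprocess`, `biased_walk`, and `simulate_walks` are hypothetical. One alias table is precomputed per ordered pair (previous vertex, current vertex), in the spirit of node2vec, so that each step costs O(1).

```python
import random


def build_alias(probs):
    """Build alias tables for a discrete distribution (O(N) preprocessing)."""
    n = len(probs)
    scaled = [prob * n for prob in probs]
    accept, alias = [1.0] * n, list(range(n))
    small = [i for i, s in enumerate(scaled) if s < 1.0]
    large = [i for i, s in enumerate(scaled) if s >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        accept[s], alias[s] = scaled[s], l
        scaled[l] -= 1.0 - scaled[s]  # column l donates mass to fill column s
        (small if scaled[l] < 1.0 else large).append(l)
    return accept, alias  # leftover columns keep accept = 1.0


def alias_draw(accept, alias, rng):
    """Draw one index in O(1): pick a column, then keep it or take its alias."""
    i = rng.randrange(len(accept))
    return i if rng.random() < accept[i] else alias[i]


def pi_weight(graph, prev, nxt, p, q):
    """Search bias pi(u, k) for previous vertex u = prev and candidate k = nxt."""
    if nxt == prev:          # d_uk = 0: step back to the previous vertex
        return 1.0 / p
    if nxt in graph[prev]:   # d_uk = 1: common neighbor of u and v
        return 1.0
    return 1.0 / q           # d_uk = 2: move away from u


def preprocess(graph, p, q):
    """One row-normalized alias table per ordered pair (previous, current)."""
    tables = {}
    for prev in graph:
        for cur in graph[prev]:
            nbrs = sorted(graph[cur])
            weights = [pi_weight(graph, prev, k, p, q) for k in nbrs]
            total = sum(weights)
            accept, alias = build_alias([wt / total for wt in weights])
            tables[(prev, cur)] = (nbrs, accept, alias)
    return tables


def biased_walk(graph, tables, start, walk_len, rng):
    seq = [start]
    while len(seq) < walk_len and graph[seq[-1]]:
        cur = seq[-1]
        if len(seq) == 1:    # assumption: uniform first step
            seq.append(rng.choice(sorted(graph[cur])))
        else:
            nbrs, accept, alias = tables[(seq[-2], cur)]
            seq.append(nbrs[alias_draw(accept, alias, rng)])
    return seq


def simulate_walks(graph, num_walks, walk_len, p, q, seed=0):
    rng = random.Random(seed)
    tables = preprocess(graph, p, q)
    return [biased_walk(graph, tables, node, walk_len, rng)
            for _ in range(num_walks) for node in sorted(graph)]


toy = {"A": {"B", "C"}, "B": {"A", "C", "D"}, "C": {"A", "B"}, "D": {"B"}}
walks = simulate_walks(toy, num_walks=2, walk_len=5, p=1.0, q=3.0)
```

Here `num_walks` plays the role of Len_walkLists and `walk_len` the role of w. Precomputing a table per edge trades memory for speed; for a small graph, drawing directly from the weights would give the same distribution.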
Given an undirected graph G = (V, E, W), for every vertex v ∈ V, we can get a sequence of its neighbors {v_m} in the network by a bias random walk, where m is the number of vertices in the sequence. Let f: V → R^d be the function that maps a vertex v to its feature representation vector f(v). Here, d is the size of the feature representation vector, which is empirically set to 64 [33]. Given a vertex v in the input vertex sequence, N(v) denotes its neighborhood set. The skip-gram with negative sampling (SGNS) architecture aims to build a model that maximizes the log-probability of the neighbors N(v) around a vertex v given its feature representation f(v). The objective function is formally defined as follows:

$$\max_{f} \sum_{v \in V} \log \Pr\big(N(v) \mid f(v)\big),$$

The underlying assumption of this objective function is that vertices that are close in the original network should have similar feature representations in the latent space. The SGNS uses a two-layer artificial neural network to train the feature representation of every vertex. Figure 1 shows the workflow of the SGNS architecture. Suppose there are m distinct vertices in an input sequence; the SGNS then uses an m-dimensional one-hot vector to represent each vertex. The one-hot vector of the ith vertex is a binary vector whose elements are all zero except for the ith element, which is one. The input vertex and its neighbors are encoded into their corresponding one-hot vectors. Then, these vectors are placed at the input layer and the output layer of the artificial neural network, respectively. The stochastic gradient descent (SGD) technique is adopted to train the weights of the hidden layer. Finally, multiplying the 1 × m one-hot vector of an input vertex from the left with the m × d weight matrix yields the feature representation vector of that vertex. Since computing (4) is too time-consuming for large-scale networks, negative sampling is adopted to address the computational problem [33].
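The pieces described above can be illustrated with a small, dependency-free sketch. It is written under assumptions and is not the training code of this work: the window parameter c is treated as a radius on each side of the center vertex (as in word2vec), the loss is the standard skip-gram negative-sampling loss for a single (vertex, neighbor) pair, and the function names `window_pairs`, `embed`, and `sgns_loss` are hypothetical.

```python
import math


def window_pairs(walk, c):
    """Pair every vertex in a walk with the vertices at most c steps away."""
    pairs = []
    for i, center in enumerate(walk):
        for j in range(max(0, i - c), min(len(walk), i + c + 1)):
            if j != i:
                pairs.append((center, walk[j]))
    return pairs


def embed(index, weights):
    """Multiply a 1 x m one-hot vector by the m x d hidden-layer weights."""
    m, d = len(weights), len(weights[0])
    one_hot = [1.0 if i == index else 0.0 for i in range(m)]
    return [sum(one_hot[i] * weights[i][j] for i in range(m)) for j in range(d)]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sgns_loss(center, context, negatives):
    """Negative-sampling loss for one observed pair and its sampled negatives.

    The observed neighbor is pushed towards the center vector, and each
    sampled non-neighbor is pushed away from it.
    """
    loss = -math.log(sigmoid(dot(center, context)))
    for neg in negatives:
        loss -= math.log(sigmoid(-dot(center, neg)))
    return loss


pairs = window_pairs(["g1", "g2", "g3", "g4"], c=1)
vector = embed(1, [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])  # selects row 1
```

The one-hot multiplication in `embed` simply selects one row of the weight matrix, which is why practical implementations replace it with a table lookup.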

Classification
The previous steps help us learn feature vectors for genes that preserve their topological properties in the PPI network. The features of some labeled genes are put into a classifier to build a model. Then, the classifier determines whether the unlabeled genes are essential according to their feature vectors. A variety of classifiers are available for this prediction task. This work aims to verify that using a network embedding method to learn genes' features in the PPI network is helpful for predicting human essential genes. Hence, several popular classifiers are selected to predict the human essential genes, including random forest (RF) [40], support vector machine (SVM, RBF kernel) [6], decision tree (DT) [41], logistic regression (LR) [42], extra tree (ET), k-nearest neighbor (KNN) [43], and Naive Bayes (NB) [44].

Datasets
Two kinds of datasets were involved in this work. The first was the human essential gene dataset, which was taken from the supplementary files of [16]. The human essential genes came from three recent works [13][14][15] and comprise 1516 essential genes and 10,499 non-essential genes.
The second was the human PPI network. We tested our method on two different human PPI networks. One network (namely FIs) was downloaded from the Reactome database [45] and involves 12,277 genes and 230,243 interactions. The other network was from the InBio Map database (called the InWeb_IM dataset) [46] and consists of 17,428 genes and 625,641 interactions. To compare with Guo's method [16], only the genes that appeared in both the PPI network and the supplementary files of [16] were retained in the training and testing datasets. Consequently, the FIs dataset included 1359 essential genes and 5388 non-essential genes, and the InWeb_IM dataset included 1512 essential genes and 9036 non-essential genes. Table 1 shows the details of the two human PPI networks.

Evaluation Metrics
Five-fold cross-validation was used to test the prediction performance of our method. All genes in the benchmark dataset were partitioned into five parts at random. Four of the five parts were the training set, and the remaining part was the testing set. Several popular statistical metrics were used to measure the prediction performance against the benchmark dataset. The metrics included precision, recall, Sp (specificity), NPV (negative predictive value), F-measure, ACC (accuracy), and MCC (Matthews correlation coefficient). Their definitions are as follows:

$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}$$

$$\text{Sp} = \frac{TN}{TN + FP}, \qquad \text{NPV} = \frac{TN}{TN + FN}$$

$$\text{F-measure} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}, \qquad \text{ACC} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\text{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$

where TP (true positives) and FP (false positives) are the numbers of predicted essential genes that do and do not match known essential genes, respectively. FN (false negatives) denotes the number of known essential genes that are missed by the predictions. TN (true negatives) is the number of non-essential genes that are correctly predicted as non-essential. In addition, the area under the receiver operating characteristic (ROC) curve (AUC) and the area under the precision-recall curve (AP) measure the overall performance of each method across the different thresholds used to select candidate essential genes. Because essential and non-essential genes are imbalanced, we adopted two different strategies in the course of validation. In one strategy, each fold maintains the same ratio of essential genes to non-essential genes as in the original data, i.e., about 1:4 in the FIs dataset and 1:6 in the InWeb_IM dataset. The other strategy keeps the ratio of essential genes to non-essential genes at 1:1 in each fold.
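As a worked illustration of the threshold-based metrics above, the following sketch computes them from the four confusion-matrix counts. The function name `evaluate` and the example counts are hypothetical, and returning 0.0 for an undefined ratio is an assumption made here to keep the sketch total; it is not a convention stated in this work.

```python
import math


def ratio(numerator, denominator):
    """Safe division; an undefined ratio is reported as 0.0 (assumption)."""
    return numerator / denominator if denominator else 0.0


def evaluate(tp, fp, tn, fn):
    """Compute the threshold-based metrics from confusion-matrix counts."""
    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "precision": precision,
        "recall": recall,
        "specificity": ratio(tn, tn + fp),
        "npv": ratio(tn, tn + fn),
        "f_measure": ratio(2 * precision * recall, precision + recall),
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "mcc": ratio(tp * tn - fp * fn, mcc_den),
    }


# Hypothetical fold: 45 essential genes (40 found), 55 non-essential (10 flagged).
metrics = evaluate(tp=40, fp=10, tn=45, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Unlike accuracy, MCC uses all four counts symmetrically, which is why it is informative under the imbalanced ratios used in validation.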

Parameter Selection
As mentioned in the Methods section, the transition probability T(v,k) mainly depends on the value of π(u,k) (see Equation (1)), since, in this work, we only focus on the unweighted human PPI network. Equation (1) shows that the value of π(u,k) changes with the values of the parameters p and q. If p and q both equal one, the walk at node v has an equal probability of jumping back to its previous node u, walking to a common neighbor of v and u, or moving to a new neighbor. If p and q exceed one, the walk is more likely to move to a common neighbor of v and its previous node u. Conversely, if q is less than one, the walk is more likely to move to nodes farther away, and if p is less than one, the walk is more likely to return to its previous node. To investigate the effect of the parameters on the prediction performance of our method on the FIs dataset, we fix the values of w, c, and p to 20, 10, and one, respectively, and set q to 0.5, 0.8, and the integers from one to 10. Figure 2a shows that, when p and q equal one, the AUC values obtained with the ratio of essential genes to non-essential genes kept at 1:1 and at its original value are 0.9179 and 0.9227, and the corresponding F-measure values are 0.8415 and 0.7098. Under this condition, the performance of our method is lower than when q is 0.8 or ranges from two to six. This suggests that the bias random walk on the human PPI network can improve the performance of human essential gene prediction. Moreover, the AUC and F-measure values for q from two to 10 vary within a very narrow band and are higher than those for q less than one. In this work, q is set to three on this dataset because our method performs best at this value.
Similarly, for the parameter p, we fix the values of w, c, and q to 20, 10, and three, respectively, and set p to 0.5, 0.8, and the integers from one to 10. We notice from Figure 2b that the AUC values of our method drop quickly when p is less than one, which suggests that assigning more probability to jumping back to the previous node reduces the prediction performance. When the parameter p has the same value as the parameter q, i.e., three, the AUC values are lower than those obtained when p is one. Under most conditions, setting p to one achieves better performance than setting p to other values. Hence, in this work, the parameter p is set to one on this dataset.
The parameter w decides the walk distance in the network. The larger the w value is, the farther the walk moves away from the start node and the more global information of the network is explored. To investigate the effect of the parameter w, the values of c, p, and q are fixed to 10, one, and three, respectively, and w ranges from 10 to 45 on the FIs dataset. Figure 2c shows that the AUC and F-measure values of our method are relatively high when w is 20 on this dataset. Consequently, in the following tests, w is set to 20 on this dataset.
The parameter c controls the size of the sliding window along the node sequence. It decides which pairs of nodes in the node sequence are regarded as closely related in the PPI network and become the input and output of the artificial neural network. Setting c too small ignores many truly related gene pairs, and setting c too large introduces many unrelated gene pairs. To investigate the effect of the parameter c, we fix the values of w, p, and q to 20, one, and three, respectively, and vary c from two to 20 on the FIs dataset. Figure 2d shows that the AUC and F-measure values of our method are relatively high when c is 10 on this dataset. Consequently, c is set to 10 on the FIs dataset in this work.
For the InWeb_IM dataset, we fix the values of w, c, and p to 25, four, and one, respectively, and set q to 0.5, 0.8, and the integers from one to 10. Figure 3a shows that the AUC and F-measure values are highest when q equals four. Hence, q is set to four on this dataset. For the parameter p, we fix the values of w, c, and q to 25, four, and four, respectively, and set p to 0.5, 0.8, and the integers from one to 10. Figure 3b shows that setting p to six helps our method achieve the best performance. Similarly, we test the parameters w and c and set them to 25 and four, respectively, on the InWeb_IM dataset.

Comparison with Existing Methods
To test the effectiveness of our method, we compared it with four existing approaches: the Z-curve method [16], the centrality-based method, the DeepWalk [34] method, and the LINE [35] method. The Z-curve method is the latest method to predict human essential genes; it inputs DNA sequence features extracted by a λ-interval Z-curve method into an SVM classifier to make predictions [16]. The centrality-based method uses genes' centrality indices in the network as input features, including DC, BC, CC, SC, EC, IC, and NC. These centrality indices were calculated with CytoNCA [47]. The remaining two methods, DeepWalk [34] and LINE [35], learn the feature representation for every node in the human PPI network by different network embedding methods. All these methods use the SVM classifier to make predictions. Tables 2 and 3 show their prediction performance on the two human network datasets, i.e., the FIs dataset and the InWeb_IM dataset. The ratio of essential genes to non-essential genes in the course of validation is shown in brackets in the first column of the tables. For example, we maintained the ratio of the two kinds of genes as 1:4 on the FIs dataset or 1:6 on the InWeb_IM dataset. The best and comparable results are in bold font.
Genes 2020, 11, 153

Tables 2 and 3 show that our method has the best performance among all the compared methods on the two datasets, regardless of which ratio of essential genes to non-essential genes is maintained in the course of validation. When the ratio of the two types of genes was 1:4 on the FIs dataset, the F-measure, MCC, AUC, and AP values of our method were 0.692, 0.641, 0.913, and 0.769, which were 31%, 57%, 9.5%, and 47.3% higher than those of the Z-curve method and 23%, 38%, 21.2%, and 39.8% higher than those of the centrality method.
Compared with DeepWalk and LINE, the four indicators showed improvements of 0.6%, 2.6%, 0.9%, and 4.8%, and of 28.4%, 25%, 6.7%, and 11%, respectively. When the ratio of the two kinds of genes was 1:6 on the InWeb_IM dataset, the F-measure, MCC, AUC, and AP values of our method were 0.665, 0.641, 0.915, and 0.762, respectively, which were 40.6%, 65.2%, 8.98%, and 67.1% higher than those of the Z-curve method and 22.2%, 37.8%, 10.5%, and 49.1% higher than those of the centrality method. Compared with DeepWalk and LINE, the F-measure, MCC, AUC, and AP values of our method showed improvements of 9%, 12.3%, 1.2%, and 14.2%, and of 141.8%, 93.1%, 9.8%, and 39.3%, respectively. When the ratio of the two kinds of genes was kept at 1:1 in the course of cross-validation, the evaluation metrics of most compared methods, including precision, recall, F-measure, ACC, MCC, AUC, and AP, increased to some degree. Even compared with DeepWalk, which has the highest performance among the other existing methods, our method still possesses the highest F-measure, MCC, AUC, and AP values on the two datasets: 0.847, 0.699, 0.914, and 0.902 on the FIs dataset and 0.857, 0.713, 0.928, and 0.921 on the InWeb_IM dataset.
The Z-curve method uses the gene's sequence information as prediction features. The centrality method combines seven centrality indices of genes in the network into feature vectors to make predictions. The success of our method shows that learning the genes' latent features from the PPI network, in a way that preserves the structure of the network as far as possible, can improve the identification of human essential genes. Compared with DeepWalk and LINE, which also extract genes' features by network embedding, our method showed better performance, which suggests that it adopts a better network-embedding strategy for predicting human essential genes.

Comparison of Different Classifiers
In this work, we extract the gene's features in the PPI network by using a network-embedding method. To assess the effectiveness of the features for human essential gene prediction, besides the SVM classifier, seven other frequently used classifiers are selected to predict human essential genes in our method. These classifiers are deep neural network (DNN), decision tree (DT), Naive Bayes (NB), k-nearest neighbor (KNN), logistic regression (LR), random forest (RF), and extra tree (ET). The batch_size, epochs, learning rate, and activation function of the DNN model are 64, 20, 0.0001, and sigmoid, respectively. For the DNN with one hidden layer, the number of neurons is 32. For the DNN with three hidden layers, the numbers of neurons are 32, 16, and one, respectively. Tables 4 and 5 list the performance comparisons of our method on the two datasets with different classifiers and different ratios of essential genes to non-essential genes in cross-validation. Among all compared classifiers, the extra tree (ET) algorithm, which builds an ensemble of randomized trees, achieved the highest prediction performance. By maintaining the original ratio of essential to non-essential genes in the training and testing datasets, the F-measure, MCC, AUC, and AP values of our method with the ET classifier reached 0.727, 0.676, 0.932, and 0.806 on the FIs dataset, and 0.692, 0.659, 0.943, and 0.779 on the InWeb_IM dataset. When keeping the same ratio of essential genes to non-essential genes, the F-measure, MCC, AUC, and AP values of our method with the ET classifier are 0.7%, 1.9%, 0.98%, and 2.2% higher than those with the SVM classifier on the FIs dataset, and 1.05%, 2.2%, 0.65%, and 0.76% higher than those with the SVM classifier on the InWeb_IM dataset. The results suggest that choosing other efficient classifiers can further improve the prediction performance of our method.

Feature Representation of Human Essential Genes
Our method represents every node in the human PPI network as a latent vector and uses the vector as the features of the node to predict essential genes. To investigate whether the feature representation learned by our method can preserve the network structure and the properties of the nodes in the network, we rebuild the network according to the Euclidean distance between the nodes' feature vectors. Then, we calculate several centrality indices for the nodes in the PPI network, including DC, BC, CC, NC, and IC. Table 6 shows the Pearson's correlation coefficients between these centrality indices of the human essential genes in the original PPI network and in the rebuilt network. We note that, on both the FIs dataset and the InWeb_IM dataset, the Pearson's correlation coefficients between the centrality indices of the human essential genes in the original and the rebuilt networks are above 0.8, and some values are even close to one. This suggests that the essential genes have very similar topological properties in the two PPI networks and that the feature representation learned by our method can preserve the network structure. To probe why the feature representations can improve the performance of human essential gene prediction, we clustered the essential genes into 20 subgroups according to their features by a k-means method. For comparison, we select two different kinds of features for clustering. One type of gene feature is the connection relationship between a gene and the other genes in the PPI network, which is a row of the adjacency matrix of the network. The other type of gene feature is a gene's feature representation learned by the network embedding method. The clustering results on the two human PPI networks were visualized by a t-SNE tool. Figures 4 and 5 show that the human essential genes can be separated well into several subgroups when using the genes' feature representations.
On the FIs dataset, the largest subgroup consists of 289 essential genes, and the smallest one consists of 21 essential genes. On the InWeb_IM dataset, the largest subgroup consists of 176 essential genes, and the smallest one consists of 20 essential genes. However, when the connection relationship between a gene and the other genes in the PPI network is used as the clustering feature, most of the human essential genes are aggregated in one large subgroup while the remaining subgroups are small. On the FIs dataset, the largest subgroup consists of 902 essential genes, and the smallest one consists of six essential genes. On the InWeb_IM dataset, the largest subgroup consists of 1022 essential genes, and the smallest one consists of one essential gene. Moreover, we also use the silhouette value and the Dunn value to evaluate the quality of the clustering results under the two different clustering features. Table 7 lists the detailed information about these subgroups. The silhouette and Dunn values of the subgroups generated by the feature representation are higher than those of the subgroups generated by the connection relationship features. This suggests that selecting the feature representation as the clustering feature can successfully divide human essential genes into subgroups with high compactness within clusters and high separation among clusters.
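To make the notion of compactness and separation concrete, the sketch below computes the mean silhouette value of a clustering with Euclidean distances. It is a minimal illustration, not the evaluation code of this work: the function name `mean_silhouette` and the toy points are hypothetical, singleton clusters are given a score of zero by the usual convention, and at least two clusters are assumed.

```python
import math


def mean_silhouette(points, labels):
    """Average silhouette value over all points (Euclidean distance)."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        # (the distance of the point to itself contributes zero to the sum).
        a = sum(math.dist(point, other) for other in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster.
        b = min(
            sum(math.dist(point, other) for other in members) / len(members)
            for other_label, members in clusters.items()
            if other_label != label
        )
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


toy_points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
well_separated = mean_silhouette(toy_points, [0, 0, 1, 1])
badly_assigned = mean_silhouette(toy_points, [0, 1, 0, 1])
```

Values near one indicate compact, well-separated clusters, while values near zero or below indicate overlapping or wrongly assigned clusters; in the toy example, `well_separated` is larger than `badly_assigned`.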
To further investigate the biological functions of these essential gene subgroups, we used the DAVID online database to perform GO functional enrichment analysis for these subgroups in the GOTERM_BP_FAT category. In this work, we only present the enrichment analysis based on the GO biological process (BP) annotation, because enrichment of a module in BP annotations indicates that the genes in the module have diverse molecular functions but work together to perform a particular biological process, such as signaling in a pathway. An essential gene subgroup is regarded as biologically significant if its p-value is less than 0.01. All of the essential gene clusters generated by the two different clustering features on the two human PPI networks have significant biological functions (see Supplementary Files). As Table 7 shows, the essential gene subgroups clustered by the feature representation are more strongly enriched in biological functions (with higher Avg(-log(p-value)) values) than those clustered by the connection relationship features. Table 8 lists ten example essential gene clusters with the smallest p-values on the FIs dataset. Notably, the feature representation yields a cluster of essential genes with a p-value of 6.64E-191. It consists of 116 essential genes, 110 of which are enriched in the function of RNA splicing (GO:0000377). Another cluster includes 50 essential genes, all of which have the function of rRNA processing (GO:0006364). However, these 50 essential genes are separated into three clusters by the connection relationship features (see Supplementary Files). The biased random walk used in learning the feature representation for genes can capture the modular structure of the network. Therefore, these features contribute to the performance improvement of essential gene prediction because essential genes tend to cluster together.
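For intuition on how an enrichment p-value and the Avg(-log(p-value)) summary can be obtained, the sketch below uses a plain hypergeometric tail test. This is an assumption made for illustration: DAVID reports an EASE score, a more conservative modification of the Fisher exact test, and the base of the logarithm (10 here) is also assumed. The function names and the toy counts are hypothetical and are not taken from Table 7 or Table 8.

```python
import math


def hypergeom_tail(k, n, K, N):
    """P(X >= k) for a cluster of n genes drawn from N background genes,
    of which K carry the annotation and k of the n cluster genes carry it."""
    total = math.comb(N, n)
    upper = min(n, K)
    hits = sum(
        math.comb(K, i) * math.comb(N - K, n - i)
        for i in range(k, upper + 1)
    )
    return hits / total


def avg_neg_log10(p_values):
    """Average of -log10(p) over the clusters (base 10 is an assumption)."""
    return sum(-math.log10(p) for p in p_values) / len(p_values)


# Hypothetical toy counts: 10 background genes, 3 annotated with a GO term,
# and a cluster of 4 genes that contains all 3 annotated genes.
p_toy = hypergeom_tail(k=3, n=4, K=3, N=10)
print(p_toy)

# Hypothetical per-cluster p-values summarized as one score per clustering.
print(avg_neg_log10([1e-2, 1e-4]))
```

A clustering whose subgroups have smaller enrichment p-values receives a larger average, which is the direction of the comparison reported for the two clustering features.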

Conclusions
Extracting the features related to essential genes is the critical step in designing a machine-learning method with strong prediction performance. Compared with previous approaches that rely on topological features in the PPI network or learn features from the DNA sequence, this work adopts a network embedding method to represent the nodes in the PPI network as latent feature vectors that maximally preserve their interaction relationships in the network. To measure the power of the latent features for predicting human essential genes, we learn features from two different human PPI networks and input these features into several popular classifiers. In the course of learning, a biased random walk is first implemented on the PPI network to obtain the network context of a given node, and then the node pairs in the network context are input into an artificial neural network to learn their representation in the latent space. The human essential gene prediction results based on the two PPI networks show that our method using the SVM classifier outperforms the methods [16] that select network topological properties or the DNA sequences as input features. This shows that the network embedding method can learn feature vectors for the nodes of the PPI network that predict human essential genes effectively. The features learned by our method also perform better on human essential gene prediction than those learned by other network embedding approaches, i.e., DeepWalk [34] and LINE [35]. This suggests that adding bias when exploring neighborhoods is an effective way to learn feature representations for nodes in the human PPI network. Moreover, using the ET classifier can further improve the prediction performance of our method.
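The walk-then-pair step summarized above can be sketched as follows. This is a minimal illustration that assumes a second-order biased walk with a return parameter `p` and an in-out parameter `q`. The parameter names and defaults, the toy graph, and the helper names `biased_walk` and `context_pairs` are all assumptions. The sketch stops at producing node pairs and does not include the neural network or the classifier.

```python
import random


def biased_walk(adj, start, length, p=1.0, q=1.0, rng=None):
    """Second-order biased random walk on an undirected graph.

    adj maps each node to the set of its neighbours. A step back to the
    previous node has weight 1/p, a step to a common neighbour of the current
    and previous node has weight 1, and any other step has weight 1/q.
    """
    if rng is None:
        rng = random.Random(0)
    walk = [start]
    while len(walk) < length:
        cur = walk[-1]
        nbrs = sorted(adj[cur])  # sorted for reproducible sampling
        if not nbrs:
            break
        if len(walk) == 1:
            # The first step has no previous node, so it is unbiased.
            walk.append(rng.choice(nbrs))
            continue
        prev = walk[-2]
        weights = []
        for nxt in nbrs:
            if nxt == prev:
                weights.append(1.0 / p)
            elif nxt in adj[prev]:
                weights.append(1.0)
            else:
                weights.append(1.0 / q)
        walk.append(rng.choices(nbrs, weights=weights, k=1)[0])
    return walk


def context_pairs(walk, window):
    """(center, context) pairs for nodes within `window` steps of each other."""
    pairs = []
    for i, center in enumerate(walk):
        lo = max(0, i - window)
        hi = min(len(walk), i + window + 1)
        pairs.extend((center, walk[j]) for j in range(lo, hi) if j != i)
    return pairs


# Hypothetical toy interaction graph (not taken from FIs or InWeb_IM).
toy_adj = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C"},
}
toy_walk = biased_walk(toy_adj, "A", length=10, p=0.5, q=2.0, rng=random.Random(1))
toy_pairs = context_pairs(toy_walk, window=2)
print(toy_walk)
print(toy_pairs[:5])
```

In this parameterization, a small `p` keeps the walk close to its starting neighborhood and a small `q` pushes it outward, so the two values control which nodes end up sharing a context.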
In the future, we will focus on designing a more robust network embedding method to find the latent feature representation of the nodes in the PPI network, and on developing more effective machine-learning methods to predict human essential genes based on the features of genes. In addition, one human gene can produce many different protein isoforms, and the human PPI network contains noise. Hence, establishing reliable gene-gene networks is a potential way to improve essential gene prediction in future work.

Supplementary Materials:
The following are available online at http://www.mdpi.com/2073-4425/11/2/153/s1, Table S1: GO enrichment analysis for the essential gene subgroups clustered by the feature representation on the FIs dataset, Table S2: GO enrichment analysis for the essential gene subgroups clustered by the connection relationship features on the FIs dataset, Table S3: GO enrichment analysis for the essential gene subgroups clustered by the feature representation on the InWeb_IM dataset, Table S4: GO enrichment analysis for the essential gene subgroups clustered by the connection relationship features on the InWeb_IM dataset