A Novel Computational Approach for Identifying Essential Proteins From Multiplex Biological Networks

The identification of essential proteins can help in understanding the minimum requirements for cell survival and development. Ever-increasing amounts of high-throughput data provide opportunities to detect essential proteins from protein interaction networks (PINs). Existing network-based approaches are limited by the poor quality of the underlying PIN data, which exhibit high rates of false positives and false negatives. To overcome this problem, researchers have focused on predicting essential proteins by combining PINs with other biological data, which has led to the emergence of various types of interactions between proteins. It remains challenging, however, to use aggregated multiplex interactions within a single analysis framework to identify essential proteins. In this study, we first created a multiplex biological network (MON) by integrating PINs, protein domains, and gene expression profiles. Next, we proposed a new approach to discover essential proteins by extending the random walk with restart algorithm to the tensor, which provides a data model representation of the MON. In contrast to existing approaches, the proposed MON approach accounts for both the importance of nodes and the different types of interactions between proteins during the iteration. MON was applied to identify essential proteins within two yeast PINs. Our comprehensive experimental results demonstrated that MON outperformed 11 other state-of-the-art approaches in terms of the precision-recall curve, the jackknife curve, and other criteria.


INTRODUCTION
Essential proteins are necessary for the survival of living organisms. The identification of essential proteins can help us to understand the basic requirements of living organisms, and it can also play an important role in drug design (Dubach et al., 2017), genetic disease diagnosis (Zeng et al., 2017), and drug synergy prediction in cancers. Traditional experimental approaches, such as gene knockouts (Narasimhan et al., 2016), RNA interference (Inouye, 2016), and Knockout Sudoku (Baym et al., 2016), are time-consuming and costly. Over the last few decades, high-throughput technologies have produced a tremendous amount of protein interaction network (PIN) data, which provide new opportunities to detect essential proteins with computational approaches. A number of network topology-based centrality approaches have been proposed to predict essential proteins, including Degree Centrality (DC) (Hahn and Kern, 2004), Information Centrality (IC) (Stephenson and Zelen, 1989), Closeness Centrality (CC) (Wuchty and Stadler, 2003), Betweenness Centrality (BC) (Joy et al., 2005), Subgraph Centrality (SC) (Estrada and Rodriguez-Velazquez, 2005), and Neighbor Centrality (NC) (Wang et al., 2011).
Unfortunately, these approaches are sensitive to the noise and errors in protein-protein interaction (PPI) networks, which can bias their results and lower confidence in their predictions. To provide accurate prediction results, the integration of different types of biological data has become an important and popular strategy. A number of approaches have been developed to facilitate the prediction of essential proteins by combining PINs with multisource biological data. For example, Gene Ontology (GO) annotations were used as a bioinformatics tool to predict essential proteins in several single-cell PINs, such as those from Escherichia coli, Saccharomyces cerevisiae, and Drosophila melanogaster (Hsing et al., 2008). A prediction model called integrating orthology with PPI network (ION) (Peng et al., 2012) was proposed to infer essential proteins by integrating orthologous information and the topological characteristics of PINs. In the United complex Centrality (UC) method (Li et al., 2015), protein complexes were combined with the topological features of PINs to detect essential genes. After analyzing the correlations between domain characteristics and essential proteins, Peng et al. (2015) designed an approach named unite domain and network centrality (UDoNC) for the prediction of essential proteins in yeast PINs. Li et al. (2012) and Zhang et al. (2013) developed two prediction models, prediction of essential proteins centrality (PeC) and co-expression weighted by clustering coefficient method (CoEWC), respectively, to infer essential proteins by fusing gene expression data and the topological characteristics of PINs. In our previous studies, we proposed a prediction method called predictive model based on overlapping essential modules (POEM) (Zhao et al., 2014) to measure the essentiality of proteins by detecting overlapping essential modules based on the modularity of essential proteins. Lei et al. (2018) designed a method called AFSO_EP for the prediction of essential proteins based on the artificial fish-swarm algorithm. In this method, the network topology, gene expression, GO annotation, and subcellular localization information were utilized. Another method, named predicting essential proteins by integrating network topology, expression profile, GO annotation and subcellular localization (TEGS), was proposed to discover essential proteins by integrating network topology, gene expression profiles, GO annotation information, and protein subcellular localization information. In the fusing the dynamic PPI networks (FDP) approach (Zhang F. et al., 2019), active PINs were constructed first and then fused into a final network according to the similarities between the networks. Finally, essential proteins were identified by considering the orthologous and topological properties in the fused network.
A common characteristic and limitation of these approaches, however, is that they predict essential proteins using only a single network of relationships between proteins. Currently, PINs are not the only large-scale network datasets, as protein-DNA interactions and signaling-regulatory pathway interaction data are also stored in dedicated databases (Valdeolivas et al., 2019). Additionally, other interactions can be derived, such as the co-expression network established from gene expression profiles and the co-annotation network constructed from GO annotations. Each interaction data source has its own meaning or relevance and can play a different role in the prediction of essential proteins. The approaches mentioned above typically aggregate multiple interaction networks into a single network, which tends to dismiss the topologies and features of the individual interaction networks. The convention of representing different types of interactions in a system with a single type of link is no longer a panacea for network science (De Domenico et al., 2015). The multiplex network offers an alternative: it is a collection of networks that share the same nodes, while the edges belong to different categories or represent interactions of different natures (Didier et al., 2015). More recently, various methods have been adapted to multiplex networks. Valdeolivas et al. (2019) extended the random walk with restart algorithm to multiplex networks by building an nL × nL heterogeneous matrix in which n and L represent the number of nodes and layers of the multiplex network, respectively. Wang et al. (2018) compressed the multiple networks into two feature matrices and detected conserved functional modules by multi-view nonnegative matrix factorization.
In a newly proposed link prediction algorithm (Samei and Jalili, 2019) for multiplex networks, both intra-layer information and inter-layer information are combined based on layer relevance. In our previous work, we constructed a multilayer protein network and applied it to the detection of protein complexes (Li et al., 2016) and to the prediction of protein functions (Zhao et al., 2016a). In this study, we propose a tensorial framework to represent the newly constructed multiplex biological network, and we apply it to the identification of essential proteins by extending the random walk with restart algorithm. Our experimental results demonstrated that the proposed MON approach outperformed six centrality approaches, namely DC (Hahn and Kern, 2004), IC (Stephenson and Zelen, 1989), CC (Wuchty and Stadler, 2003), BC (Joy et al., 2005), SC (Estrada and Rodriguez-Velazquez, 2005), and NC (Wang et al., 2011), and five approaches that fuse network topological features with biological data sources, namely PeC (Li et al., 2012), CoEWC (Zhang et al., 2013), POEM (Zhao et al., 2014), ION (Peng et al., 2012), and FDP (Zhang F. et al., 2019).

MATERIALS
To estimate the performance of MON, we used it to identify essential proteins in two PINs of Saccharomyces cerevisiae, derived from the Database of Interacting Proteins (DIP) (Xenarios et al., 2002) and from the Gavin dataset (Gavin et al., 2006). The PINs from Saccharomyces cerevisiae, which have been well-characterized by a number of studies, are the most complete and comprehensive. After removing self-interactions and repeated interactions, the DIP dataset contained 5,093 proteins and 24,743 interactions, and the Gavin dataset consisted of 1,855 proteins and 7,669 interactions. The domain data for building the multiplex biological network were downloaded from the Pfam database (Punta et al., 2011). The gene expression profile of yeast (Tu et al., 2005) was derived from GSE3431 in the Gene Expression Omnibus (GEO) and contained the expression values of 6,776 genes at 36 time points; 4,985 and 1,827 of these genes were located in the DIP and Gavin PINs, respectively. The coverage of both PINs by the gene expression profile was >95% (DIP: 4,985/5,093 = 97.88%, Gavin: 1,827/1,855 = 98.49%). Information on orthologous proteins was obtained from the InParanoid database (Östlund et al., 2009) (Version 7), which consists of a collection of pairwise comparisons between 100 whole genomes. A benchmark set of 1,285 essential proteins from Saccharomyces cerevisiae was derived from the MIPS (Mewes et al., 2006), Saccharomyces Genome Database (SGD) (Cherry et al., 2011), and Database of Essential Genes (DEG) (Zhang and Lin, 2008) databases. Among the 5,093 proteins in the DIP network, 1,167 proteins were essential and 3,526 proteins were non-essential. In the Gavin dataset, the numbers of essential and non-essential proteins were 714 and 1,141, respectively. Table 1 lists the details of the two yeast PINs.

METHODS
The outline for the entire MON approach includes (1) establishing a multiplex biological network by integrating the topology of PINs, protein domains, and gene expression profile, (2) extending the random walk with restart algorithm to the tensor model corresponding to the multiplex biological network, and (3) sorting proteins in descending order, with the top K of these proteins being exported. The flowchart for the MON approach is provided in Figure 1.

Construction of Multiplex Biological Networks
For our purpose, we consider a multiplex biological network $G = (G_1, G_2, \ldots, G_L)$, where $G_i = (V, E_i)$ represents the network of the $i$-th layer. $V = \{v_1, v_2, \ldots, v_n\}$ is the set of proteins shared by all layers in $G$, and $E_i = \{e_{i1}, e_{i2}, \ldots, e_{im}\}$ is the set of interactions at the $i$-th layer of the multiplex biological network $G$.
In this study, we constructed a multiplex biological network $G = (G_1, G_2, G_3)$ by integrating PINs, gene expression profiles, and protein domain information. In the first layer, a co-neighbor network (CN) was established by analyzing the topological characteristics of PINs. In the second layer, a co-structure network was constructed from a correlation analysis of the protein domain information. In the third layer, a co-expression network was built from the co-expression relationships derived from time-course gene expression profiles.

Co-neighbor Network G 1
The CN was established by exploring the common neighbors of pairs of proteins. Intuitively, the more common neighbors two proteins share, the more credible the interaction between them. If two proteins $p_i$ and $p_j$ interact with each other in the PIN and share at least one common neighbor, they are connected to each other within the CN. The weight of the interaction between $p_i$ and $p_j$ is calculated by Equation (1), where $N_i$ and $N_j$ represent the sets of direct neighbors of $p_i$ and $p_j$, respectively, and $N_i \cap N_j$ denotes the set of common neighbors of $p_i$ and $p_j$.
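The edge-selection rule described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `co_neighbor_edges` and the toy network are hypothetical, and the sketch stops at the common-neighbor sets because the weighting formula of Equation (1) is not reproduced in this text.

```python
def co_neighbor_edges(interactions):
    """Return {(p, q): set of common neighbors} for every interacting pair
    that shares at least one common neighbor (the CN edge rule)."""
    neighbors = {}
    for p, q in interactions:
        if p == q:
            continue  # self-interactions are removed beforehand (see MATERIALS)
        neighbors.setdefault(p, set()).add(q)
        neighbors.setdefault(q, set()).add(p)

    edges = {}
    for p, q in interactions:
        if p == q:
            continue
        common = neighbors[p] & neighbors[q]  # N_i intersected with N_j
        if common:
            # Equation (1) would turn this set into a weight; it is kept raw here.
            edges[tuple(sorted((p, q)))] = common
    return edges


# Hypothetical toy PIN: a triangle A-B-C plus a pendant protein D.
toy_pin = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]
cn = co_neighbor_edges(toy_pin)
```

In this toy network the pair C-D interacts but shares no neighbor, so it is dropped from the co-neighbor layer.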

Co-structure Network G 2
Domains are sequential and structural motifs that are found independently in different proteins and act as the stable functional blocks of proteins. On this basis, we created the co-structure network from protein domain data. First, we analyzed the importance of proteins relative to the domains based on the association between proteins and domains. Given a protein $p_i$, its domain score $P\_D$ is calculated by Equation (2). In Equation (2), $D$ is the list of distinct categories of domains related to all proteins, and $NP_j$ is the number of proteins that contain the domain $d_j$. If the protein $p_i$ contains the domain $d_j$, $t_{ij}$ is assigned the value 1; otherwise, $t_{ij}$ is set to 0. The $P\_D$ score of $p_i$ is then normalized (Equation 3), so that its value falls into the interval [0, 1]. From this perspective, the $P\_D$ score of a protein can be interpreted as its probability of being an essential protein. Moreover, previous studies (Stephenson and Zelen, 1989) have indicated that essential genes or proteins tend to form essential modules through their interactions. We assumed that the essential probabilities of the proteins mentioned above were independent of each other. The probability (or weight) of the interaction between two proteins $p_i$ and $p_j$ in the co-structure network is calculated by Equation (4).
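The domain-based scoring can be illustrated with a short sketch. Because Equations (2)-(4) are not reproduced in this text, every formula below is an assumption chosen to match the surrounding definitions: the raw score is taken as the sum of $t_{ij}/NP_j$ over all domains, the normalization as min-max scaling, and the edge weight as the product of the two (assumed independent) probabilities. The function names are hypothetical.

```python
def domain_scores(protein_domains):
    """protein_domains: {protein: set of domain ids}. Returns P_D scaled to [0, 1]."""
    # NP_j: number of proteins that contain domain d_j
    n_proteins = {}
    for domains in protein_domains.values():
        for d in domains:
            n_proteins[d] = n_proteins.get(d, 0) + 1

    # ASSUMED form of Equation (2): sum over domains of t_ij / NP_j,
    # so that rare domains contribute more than common ones.
    raw = {p: sum(1.0 / n_proteins[d] for d in ds)
           for p, ds in protein_domains.items()}

    # ASSUMED form of Equation (3): min-max normalization into [0, 1].
    lo, hi = min(raw.values()), max(raw.values())
    span = hi - lo
    return {p: (s - lo) / span if span > 0 else 0.0 for p, s in raw.items()}


def co_structure_weight(scores, p, q):
    # ASSUMED form of Equation (4): product of independent probabilities.
    return scores[p] * scores[q]


# Hypothetical toy annotation: d1 is shared by A and B, d2 is unique to A.
toy_domains = {"A": {"d1", "d2"}, "B": {"d1"}, "C": set()}
pd_scores = domain_scores(toy_domains)
w_ab = co_structure_weight(pd_scores, "A", "B")
```

Under these assumptions, a protein without any annotated domain receives a score of 0 and therefore contributes no weight to the co-structure layer.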
FIGURE 1 | Flowchart of the MON approach. A multiplex biological network is first constructed by integrating protein interaction networks (PINs), gene expression profiles, and protein domain information. A restart vector is then established according to orthologous proteins and the module scores of proteins. On this basis, the random walk with restart algorithm is applied to score and rank essential proteins.

Co-expression Network G 3
Pearson's correlation coefficient (PCC) was adopted to evaluate the co-expression probability of a pair of proteins based on gene expression profiles. Let $g(p_i, j)$ denote the expression value of the gene $p_i$ at the $j$-th time point; for a pair of genes $p_i$ and $p_j$, the correlation between them is then calculated by Equation (5). Two proteins were regarded as co-expressed if they interacted with each other in the original PIN and their correlation coefficient was not zero. The weight of the interaction between $p_i$ and $p_j$ in the co-expression network was set to the absolute value of their correlation coefficient.
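The construction of the co-expression layer can be sketched as follows, using the standard definition of the PCC. The function names and toy profiles are hypothetical, and two details are assumptions of this sketch rather than statements from the text: proteins without an expression profile are skipped, and a constant profile (for which the PCC is undefined) is treated as zero correlation.

```python
import math


def pearson(u, v):
    """Standard Pearson correlation coefficient of two equal-length profiles."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    sd_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    sd_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    if sd_u == 0 or sd_v == 0:
        return 0.0  # ASSUMPTION: constant profile treated as no correlation
    return cov / (sd_u * sd_v)


def co_expression_weights(expression, interactions):
    """Keep PIN edges whose endpoints have a non-zero PCC; weight = |PCC|."""
    weights = {}
    for p, q in interactions:
        if p not in expression or q not in expression:
            continue  # ASSUMPTION: no profile means no edge in this layer
        r = pearson(expression[p], expression[q])
        if r != 0:
            weights[(p, q)] = abs(r)
    return weights


# Hypothetical profiles over four time points; D has no expression data.
toy_expr = {
    "A": [1.0, 2.0, 3.0, 4.0],
    "B": [2.0, 4.0, 6.0, 8.0],
    "C": [4.0, 3.0, 2.0, 1.0],
}
coexp = co_expression_weights(toy_expr, [("A", "B"), ("A", "C"), ("A", "D")])
```

Because the weight is the absolute value of the PCC, the perfectly anti-correlated pair A-C receives the same weight as the perfectly correlated pair A-B.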

Random Walk With Restart on Multiplex Biological Networks
To study the multiplex network systematically, it is necessary to develop a precise mathematical model and appropriate tools. In this paper, we represent the newly constructed multiplex biological network G using the tensor model and extend the random walk with restart algorithm.
Let $T = (t_{ijk}) \in \mathbb{R}^{n \times n \times m}$ denote the third-order adjacency tensor corresponding to the multiplex biological network $G = (G_1, G_2, G_3)$, where $n$ and $m$ are the number of proteins and the number of categories of interactions between proteins, respectively. For $1 \le i, j \le n$ and $1 \le k \le m$ ($m = 3$), each element $t_{ijk}$ of $T$ is defined in terms of $e_k(i, j)$, which represents the weight of the interaction between $p_i$ and $p_j$ at the $k$-th layer. We can thus extend the random walk with restart algorithm from a two-dimensional matrix to the tensor for scoring proteins. Studies show that the structural characteristics of different layers in multiplex networks are indeed correlated with each other (Jalili et al., 2017). Based on this, we propose that considering the importance of different types of interactions can improve the discovery of essential proteins. Our statistics revealed a mutually reinforcing relationship in multiplex biological networks: important nodes tend to be pointed to by important types of links, and vice versa. Let the vectors $x = [x_1, x_2, \ldots, x_n]^T \in \mathbb{R}^n$ and $y = [y_1, y_2, \ldots, y_m]^T \in \mathbb{R}^m$ denote the importance scores of the proteins and of the different categories of interactions between proteins, respectively. We formally described the relationship between $x$ and $y$ on the tensor $T$ through two functions, $f$ and $g$. The most critical task was to design reasonable functions $f$ and $g$ and to calculate $x$ and $y$, respectively. We propose to define a higher-order Markov chain by normalizing the tensor. This leads to two probability transition tensors, $T^{(1)} = (t^{(1)}_{ijk}) \in \mathbb{R}^{n \times n \times m}$ and $T^{(2)} = (t^{(2)}_{ijk}) \in \mathbb{R}^{n \times n \times m}$. Equations (8) and (9) can be interpreted as the transition probabilities of two third-order Markov chains $(X_t)_{t \in \mathbb{N}}$ and $(Y_t)_{t \in \mathbb{N}}$, respectively.
If the last state was the $i$-th node, then the next state is the $j$-th node through the $k$-th type of interaction with probability $t^{(1)}_{i,j,k}$. Similarly, $t^{(2)}_{i,j,k}$ can be considered as the probability of selecting the $k$-th type of interaction from the $j$-th node to the $i$-th node. To calculate the random variables $X$ and $Y$, the above two equations are expanded according to the law of total probability, where $P[X_{t-1} = j, Y_t = k]$ represents the joint probability distribution of $X_{t-1}$ and $Y_t$, and $P[X_t = i, X_{t-1} = j]$ denotes the joint probability distribution of $X_{t-1}$ and $X_t$. Considering the steady state of the Markov chains yields Equations (14) and (15). It is very difficult to calculate $X$ and $Y$ because they are coupled to each other and because Equations (14) and (15) contain two joint probability distributions. In this study, we assumed that the random variables $X$ and $Y$ were completely independent of each other, so that the joint distributions factorize into products of marginal distributions; this gives Equations (16) and (17). Letting $t$ tend to infinity, Equations (16) and (17) were reduced further, and on this basis we designed the functions $f$ and $g$. The random walk with restart algorithm for the multiplex biological network follows from these functions. The restart vector, RV, represents the initial probability distribution, and $\alpha$ is the restart probability. The overall framework of the random walk with restart on multiplex biological networks is given in Algorithm 1.
Step 7. Output X

Identification of Essential Proteins
Thus far, the framework for assessing the importance of proteins in multiplex biological networks has been established. We now describe the MON approach, which was designed for the identification of essential proteins from multiplex biological networks. Algorithm 2 details the MON approach. Based on a user-specified number of top-ranking proteins to output, K, our approach first constructed the multiplex biological network $G$ by integrating PINs, gene expression, and protein domains. Then, considering the conservative and modular features of proteins, a vector $DR = [dr_1, dr_2, \ldots, dr_n]^T$ was initialized from $C\_S(p_i)$ and $M\_S(p_i)$, which represent the conservative score and the modular score of the protein $p_i$, respectively. The conservative score of the protein $p_i$ is derived from information on orthologous proteins (Zhao et al., 2016b) and is defined in terms of $N(p_i)$, the number of homologous proteins that $p_i$ has in the reference organisms. The modular scores of proteins are the normalized output scores of the POEM approach (Zhao et al., 2014). Next, we applied the random walk with restart algorithm to the multiplex biological network $G$ and generated a score vector pr. Finally, proteins were sorted in descending order according to pr, with the top K of them being exported.

Algorithm 2 | MON
Input: A PIN network, protein domain, gene expression, ortholog data sets, module scores of proteins, and parameter K
Output: Top K proteins sorted by pr in descending order
Step 1. Construct a multiplex biological network G according to Equations (1)-(5)
Step 2. Calculate initial vector DR
Step 3. pr = Algorithm1(G, DR, ε)
Step 4. Sort proteins by the value of pr in descending order
Step 5. Output top K of sorted proteins
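Steps 2, 4, and 5 of Algorithm 2 can be sketched as below. The combination rule for the restart vector is an assumption: the text states only that a parameter β balances the conservative and modular scores, so the sketch uses a convex combination and rescales it to sum to 1. Which score β multiplies, the rescaling, and the function names are all hypothetical.

```python
def restart_vector(conservative, modular, beta=0.5):
    """Blend conservative scores C_S and modular scores M_S into a restart vector."""
    # ASSUMPTION: convex combination with beta on the conservative score.
    dr = [beta * c + (1 - beta) * s for c, s in zip(conservative, modular)]
    total = sum(dr)
    # ASSUMPTION: rescale to a probability distribution (uniform if all zero).
    return [v / total for v in dr] if total > 0 else [1.0 / len(dr)] * len(dr)


def top_k(proteins, scores, k):
    """Sort proteins by score in descending order and return the first k."""
    ranked = sorted(zip(proteins, scores), key=lambda pair: pair[1], reverse=True)
    return [p for p, _ in ranked[:k]]


# Hypothetical toy inputs for three proteins.
rv = restart_vector([1.0, 0.0, 0.5], [0.0, 1.0, 0.5], beta=0.5)
best_two = top_k(["A", "B", "C"], [0.2, 0.7, 0.1], 2)
```

In a full pipeline, `rv` would be passed to the random walk of Step 3 and the resulting score vector to `top_k`.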

RESULTS AND DISCUSSION
To evaluate the essentiality of proteins in PINs, the proteins were ranked in descending order by the scores computed by our MON model and by the 11 other competing essential protein prediction approaches, which included DC (Hahn and Kern, 2004), IC (Stephenson and Zelen, 1989), CC (Wuchty and Stadler, 2003), BC (Joy et al., 2005), SC (Estrada and Rodriguez-Velazquez, 2005), NC (Wang et al., 2011), PeC (Li et al., 2012), CoEWC (Zhang et al., 2013), POEM (Zhao et al., 2014), ION (Peng et al., 2012), and FDP (Zhang F. et al., 2019). After this, the top 100, 200, 300, 400, 500, and 600 ranked proteins were selected as candidate essential proteins. The number of true essential proteins among the candidates was determined from the set of known essential proteins to assess the performance of each approach. Here, we present the results for the DIP dataset in detail and those for the Gavin dataset in brief.

Effects of Parameters α and β
In this study, we introduced two user-defined parameters, α and β. The parameter α (0 < α < 1) was used to control the weight of the two scores at step 4 of Algorithm 1. The parameter β (0 < β < 1) was adopted to adjust the contribution of the conservative scores and modular scores of proteins in Equation (24). To study the effects of parameters α and β on the performance of our MON approach, we evaluated the identification accuracy by setting different values for α and β.  1, respectively. We selected the top 100, 200, 300, 400, 500, and 600 candidate proteins detected by MON. The identification accuracy was evaluated by the percentage of true essential proteins among the top candidates. Figure 2 indicates that MON achieves the highest prediction accuracy when α is 0.3 and β is 0.5. Figure 3 shows that the

Comparison With 11 Other Approaches
To validate the performance of our MON approach, we made comprehensive comparisons of MON with the 11 other competing essential protein identification approaches. Proteins were ranked in descending order according to the scores obtained from each approach, and a number of top-ranked proteins were taken as candidate essential proteins. Then, by comparison with the benchmark set, we determined how many of these candidate proteins were true essential proteins. Figure 4 shows the percentage of essential proteins detected by MON and the 11 other prediction approaches within the yeast PIN. As shown in Figure 4, MON achieves higher predictive performance than the competing methods. For the top 100 and the top 200 candidate proteins, the prediction accuracy of the MON approach was >86%. MON exhibited improvements of 70.91, 38.10, 31.87, 25.65, 21.51, and 26.45% from the top 100 to the top 600 proteins over the values achieved by NC, which possessed the highest prediction accuracy among the six network topology-based centrality methods (DC, IC, BC, CC, SC, and NC). In particular, when selecting the top 200 proteins, the accuracy of MON in predicting essential proteins was still close to 90%, which was higher than that of DC, IC, BC, CC, SC, NC, CoEWC, PeC, POEM, and ION for the top 100 proteins. Compared to FDP, which obtained the best prediction accuracy of all 11 competing approaches, the performance of MON was improved by 5.62, 6.10, 7.62, 3.21, 2.73, and 6.52% from the top 100 to the top 600 proteins, respectively.
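The two quantities used in this comparison can be written down compactly. The sketch below assumes that the reported improvements are relative improvements, which is consistent with values above 100% appearing elsewhere in the results; the function names and toy data are hypothetical.

```python
def top_k_accuracy(ranked, essential, k):
    """Fraction of the top-k ranked proteins that are known essential proteins."""
    return sum(1 for p in ranked[:k] if p in essential) / k


def relative_improvement(new, baseline):
    """Improvement of `new` over `baseline`, in percent of the baseline."""
    return (new - baseline) / baseline * 100.0


# Hypothetical toy ranking and benchmark set.
acc = top_k_accuracy(["A", "B", "C", "D"], {"A", "C"}, 2)
gain = relative_improvement(0.9, 0.6)
```

Reading the improvements as relative rather than absolute differences matters when comparing methods whose baseline accuracies differ widely.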

Validated by Precision-Recall Curves
Additionally, the precision-recall (PR) curve was adopted to evaluate the overall performance of MON and the other 11 approaches. First, the proteins in the PIN were ranked in descending order based on the scores obtained from each approach. Next, the top K proteins were selected and placed into the positive set (candidate essential proteins), while the rest of the proteins were placed into the negative set (candidate non-essential proteins). The cutoff parameter K ranged from 1 to 5,093. For each value of K, precision and recall were calculated for each approach. Finally, the PR curves were plotted from the values of precision and recall obtained as K changed from 1 to 5,093. Figure 5A shows the PR curves of MON and the six topology-based centrality methods (DC, IC, BC, CC, SC, and NC). Figure 5B shows the PR curves of MON and the other five approaches (PeC, CoEWC, POEM, ION, and FDP). Figure 5 indicates that the PR curve of MON lies clearly above those of all competing approaches.
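The procedure above amounts to sweeping the cutoff K over a ranked list. A minimal sketch follows, assuming that every benchmark essential protein appears somewhere in the ranking; `pr_curve` and the toy data are hypothetical names.

```python
def pr_curve(ranked, essential):
    """Return [(recall, precision)] for cutoffs K = 1 .. len(ranked)."""
    points, hits = [], 0
    for k, protein in enumerate(ranked, start=1):
        if protein in essential:
            hits += 1  # true positive among the top K
        precision = hits / k
        recall = hits / len(essential)
        points.append((recall, precision))
    return points


# Hypothetical toy ranking with two essential proteins, A and C.
curve = pr_curve(["A", "B", "C", "D"], {"A", "C"})
```

At the final cutoff every protein is in the positive set, so recall reaches 1 and precision equals the overall fraction of essential proteins.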

Validated by Jackknife Methodology
A further comparison between MON and the 11 other competing approaches (DC, BC, CC, SC, IC, NC, PeC, CoEWC, POEM, ION, and FDP) was performed by adopting the jackknife methodology (Holman et al., 2009). The area under the jackknife curve of each approach was used to evaluate its accuracy in identifying essential proteins. Additionally, 10 random assortments were depicted for comparison. Figure 6 illustrates the comparison results, where the horizontal axis represents the proteins ranked in descending order according to the scores calculated by each approach and the vertical axis is the percentage of essential proteins among the ranked proteins. Figure 6A shows the comparison between MON and three topology-based centrality methods (DC, IC, and SC). Figure 6B shows the comparison between MON and three further centrality methods (BC, CC, and NC). Figure 6C shows the comparison between MON and the remaining five approaches (PeC, CoEWC, POEM, ION, and FDP). As shown in Figure 6, the jackknife curve of MON is better than those of the 11 previously proposed approaches. Moreover, MON and the 11 other competing approaches all achieved better identification performance than randomized sorting.
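A jackknife-style curve and its area can be computed as in the sketch below. It is an illustration under assumptions: the curve is taken as the cumulative number of essential proteins along the ranking, and the area is approximated with the trapezoidal rule at unit spacing. The function names are hypothetical.

```python
def jackknife_curve(ranked, essential):
    """Cumulative number of essential proteins among the top 1, 2, ..., n."""
    curve, hits = [], 0
    for protein in ranked:
        if protein in essential:
            hits += 1
        curve.append(hits)
    return curve


def area_under_curve(curve):
    """Trapezoidal area with unit spacing, starting from the point (0, 0)."""
    area, previous = 0.0, 0
    for value in curve:
        area += (previous + value) / 2.0
        previous = value
    return area


# Hypothetical toy data: a good ranking versus a poor one.
essential = {"A", "C"}
good = area_under_curve(jackknife_curve(["A", "B", "C", "D"], essential))
poor = area_under_curve(jackknife_curve(["B", "D", "A", "C"], essential))
```

A ranking that places essential proteins earlier accumulates hits sooner and therefore encloses a larger area.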

Analysis of the Differences Between MON and Other Approaches
To analyze why and how MON achieves high performance in the identification of essential proteins, we investigated the relationship and differences between MON and the 11 other competing approaches on a small fraction of top-ranked proteins. For each approach, the top 100 proteins were selected and compared. First, we compared MON with DC, BC, CC, SC, IC, NC, PeC, CoEWC, POEM, ION, and FDP by counting the proteins that were detected by both MON and each of the 11 other approaches. The numbers of common and different proteins between MON and each of the other competing approaches are shown in Table 2. In Table 2, |MON ∩ Mi| represents the number of overlapping proteins identified by both MON and a competing approach Mi, {Mi - MON} denotes the set of proteins predicted by Mi and not by MON, and |Mi - MON| is the number of proteins in that set.
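The quantities reported in Table 2 are plain set operations on two top-100 lists. The sketch below shows them on toy data; `compare_top` and its dictionary keys are hypothetical names chosen for this illustration.

```python
def compare_top(mon_top, other_top, essential):
    """Overlap and difference statistics between two top-ranked protein lists."""
    mon, other = set(mon_top), set(other_top)
    only_other = other - mon  # the set {Mi - MON}
    return {
        "overlap": len(mon & other),                     # |MON ∩ Mi|
        "only_other": len(only_other),                   # |Mi - MON|
        "only_other_non_essential": len(only_other - essential),
    }


# Hypothetical toy lists of top-ranked proteins.
stats = compare_top(["A", "B", "C"], ["B", "D", "E"], essential={"A", "B", "D"})
```

When both lists have the same length, the overlap and the difference always add up to that length, which is a quick consistency check for a table of this kind.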
As illustrated in Table 2, among the top 100 proteins, the proportions of overlapping proteins identified by both MON and DC, BC, CC, SC, and IC are all <10%, while the proportions of overlapping proteins detected by both MON and NC and FDP are not more than 50%. The proportion of common proteins predicted by both MON and PeC, CoEWC, POEM, and ION is <65%. Such a small overlap between the proteins identified by MON and by the 11 other approaches indicates that MON ranks proteins differently from the other approaches. The third column in Table 2 gives the number of non-essential proteins among the proteins predicted by Mi but not by MON. We further analyzed these non-essential proteins that were identified by the 11 other approaches, and we found that more than 87% of the non-essential proteins predicted by the six network topology-based centrality measures (DC, IC, BC, CC, SC, and NC) possessed very low MON ranking scores (<0.45). Similarly, more than 50% of the non-essential proteins predicted by PeC, CoEWC, POEM, and ION possessed very low MON ranking scores (<0.45). Second, we analyzed the essentiality of the different proteins detected by MON and by the other competing approaches. Figure 7 shows the percentage of essential proteins among the different proteins detected by MON and by the 11 other competing approaches. In Figure 7, the red dashed line represents the percentage of essential proteins detected by MON but missed by Mi, and the blue solid line denotes the percentage of essential proteins predicted by Mi and not by MON. The results in Figure 7 illustrate that among these different proteins, the proportion of essential proteins identified by the MON approach is significantly higher than that predicted by the other approaches. We chose two representative approaches (BC and POEM) as examples to analyze.
BC exhibited the largest number of protein differences compared to our MON approach, and POEM the smallest. Compared to BC, among the top 100 predicted proteins, 96 different proteins were identified by our MON approach. Among these 96 different proteins identified by MON, 93.75% were essential, while only 41.67% of the different proteins predicted by BC were essential. As another example, there were 22 different proteins detected by either MON or POEM. Among these different proteins, more than 95% of those identified by MON were essential, while <64% of those identified by POEM were essential. Comparable results between MON and the other competing approaches (DC, CC, SC, IC, NC, PeC, CoEWC, and ION) indicate that the proposed MON approach can identify more essential proteins than the other approaches.
Additionally, we selected the top 10 candidate proteins identified by our approach as examples and analyzed their functional annotations. For this purpose, GO terms (Ashburner et al., 2000) were adopted to characterize these candidate essential proteins in terms of molecular function (MF), biological process (BP), and cellular component (CC). Table 3 shows the results of the functional annotation for these 10 proteins. Of the 10 candidate proteins, eight were true essential proteins, and all 10 were annotated in terms of BP, MF, and CC.

Prediction Performance of MON Based on the Gavin Dataset
To further test the performance of the proposed approach, we also identified essential proteins using the Gavin dataset. The ranking scores of proteins were computed using MON (α = 0.3, β = 0.2) and the 11 other competing approaches (DC, BC, CC, SC, IC, NC, PeC, CoEWC, POEM, ION, and FDP). The percentages of essential proteins among the top 100, 200, 300, 400, 500, and 600 proteins ranked by these approaches are listed in Table 4. The jackknife curves of each approach are illustrated in Figure 8. All of these experimental results indicate that MON still outperforms the 11 other competing approaches on the Gavin dataset. Specifically, when selecting the top 100 ranked proteins, MON achieved improvements of 95.65, 104.55, 143.24, 104.55, 119.51, 63.64, 23.29, 21.62, 11.11, 16.88, and 1.12% over the results obtained from DC, IC, CC, BC, SC, NC, PeC, CoEWC, POEM, ION, and FDP, respectively.

CONCLUSION
The detection of essential proteins is helpful for understanding the minimum requirements for cell survival and development. Many computational approaches have been proposed that integrate PINs and multi-omics data, and this has led to the identification of multiple interactions or links between proteins. Despite the advances in these approaches, designing efficient algorithms to fuse these multisource biological data remains challenging. A simple strategy is to aggregate a collection of heterogeneous data into a single network; however, this strategy can result in substantial information loss. Studies indicate that different types of biological data sources that possess inherent structural characteristics are correlated to each other. Moreover, high-throughput multi-omics biological data exhibit different degrees of quality and can play various roles in the prediction of essential proteins. The multiplex biological network provides an alternative means to address these problems. In this study, we constructed a multiplex biological network by combining PINs with multi-source biological information, and proposed a new essential proteins prediction approach named MON. In MON, we express the multiplex biological network in the tensor model and extend the random walk with restart algorithm by simulating a higher-order Markov chain. Additionally, the conservative and modular features of essential proteins are both taken into account to improve the performance of MON. The experimental results from two yeast PINs demonstrate that MON performs better than 11 other state-of-the-art approaches for predicting essential proteins.

DATA AVAILABILITY STATEMENT
Publicly available datasets were analyzed in this study. These data can be found here: https://github.com/husaiccsu/MON.

AUTHOR CONTRIBUTIONS
BZ, SH, and LW obtained the protein interaction data, domain data, gene expression profile, and information on orthologous proteins and drafted the manuscript together. BZ and SH designed the new approach, MON, and analyzed the results. XLiu, HX, XH, XLi, and ZZ participated in revising the draft. All authors have read and approved the manuscript.