Key protein identification by integrating protein complex information and multi-biological features

Abstract: Identifying key proteins based on protein-protein interaction networks has emerged as a prominent area of research in bioinformatics. However, current methods exhibit certain limitations, such as the omission of subcellular localization information and the disregard for the impact of topological structure noise on the reliability of key protein identification. Moreover, the influence of proteins outside a complex but interacting with proteins inside the complex on complex participation tends to be overlooked. Addressing these shortcomings, this paper presents a novel method for key protein identification that integrates protein complex information with multiple biological features. This approach offers a comprehensive evaluation of protein importance by considering subcellular localization centrality, topological centrality weighted by gene ontology (GO) similarity and complex participation centrality. Experimental results, including traditional statistical metrics, the jackknife methodology metric and key protein overlap or difference, demonstrate that the proposed method not only achieves higher accuracy in identifying key proteins compared to nine classical methods but also exhibits robustness across diverse protein-protein interaction networks.


Introduction
Proteins within an organism can be classified into two categories: non-key proteins and key proteins. Key proteins play crucial roles throughout the cell cycle, and their absence can result in infertility, biological dysfunction and even fatality. Furthermore, key proteins have been implicated in the pathogenesis of various diseases [1,2]. Consequently, the identification of key proteins has become a highly relevant research area within the field of bioinformatics [3-5]. Traditional experimental approaches for identifying key proteins tend to be expensive, cumbersome, inefficient and limited in scope. Conversely, key protein identification methods based on protein-protein interaction (PPI) networks [6] offer a cost-effective, efficient and reliable alternative [7].
The PPI network exhibits a scale-free nature, characterized by uneven internal connectivity. A small subset of nodes within the network possesses a large number of connections, and these nodes often correspond to key proteins. Consequently, topological centrality methods such as degree centrality, information centrality, betweenness centrality, eigenvector centrality, subgraph centrality, local average connection centrality, and neighborhood centrality have been utilized for key protein identification [1,2,6]. However, the accuracy of these centrality-based methods is contingent on the quality of the PPI network, which is prone to incompleteness and includes numerous false positives and false negatives due to experimental limitations [8]. To address this challenge, several approaches have been proposed that combine biological characteristics with network analysis. For instance, Li et al. introduced the PeC method [9], which integrates the topological structure of the PPI network with gene expression profiles to identify key proteins. Peng et al. developed the UDoNC method [10], which leverages protein-domain characteristics to identify key proteins. Shang et al. proposed the DLAC method [11], which incorporates RNA sequence data to enhance the accuracy of key protein prediction. Additionally, some researchers have combined biological characteristics with random walk methods to uncover key proteins [12]. Moreover, the JDC method [13] offers a dynamic threshold approach to binarize gene expression data based on PPI network information and gene expression profiles.
Furthermore, it has been observed that key proteins have a higher propensity for participation in protein complexes than non-key proteins. To capitalize on this characteristic, Luo et al. introduced the LIDC method [14], which predicts key proteins by considering the local interaction density and internal degree of the protein complex. Building upon this work, Qin et al. enhanced the LIDC method and proposed the LBCC method [15], which incorporates betweenness centrality to identify key proteins. The UC method [16], on the other hand, utilizes protein frequency information within the complex to identify key proteins. Shifting gears slightly, the Modality-DTA method [17] presents a novel deep learning approach for drug-target interaction prediction, leveraging the multimodal nature of both drugs and targets to enhance prediction accuracy.
The accuracy of key protein identification based on a single biological characteristic is often compromised due to variations in space-time dimensions and the influence of different physical and chemical environments [18]. Consequently, an increasing number of researchers are exploring the integration of multiple biological characteristics to improve the accuracy of key protein mining. For instance, the TEO method [19] incorporates GO annotation information, gene expression data, and network topology to identify key proteins. Similarly, the JTBC method [20] utilizes both gene expression information and domain information in the process of mining key proteins. By leveraging these diverse biological characteristics, these methods aim to enhance the accuracy and reliability of key protein identification.
While existing methods have made progress in key protein identification, they still face certain limitations. First, these methods often overlook subcellular localization information. In reality, the importance of a protein can vary depending on its specific subcellular location. Second, complex information plays a significant role in key protein identification. However, existing methods fail to account for the impact of proteins outside the complex that interact with proteins within the complex, thereby neglecting their potential influence on complex participation.
To address the aforementioned issues, this paper presents a novel method called CIBF (protein complex information and multi-biological features) for identifying key proteins. CIBF is designed to overcome the limitations of existing approaches by integrating complex information and multiple biological characteristics. The key contributions of this method can be summarized as follows:
1) Subcellular localization information is used to determine a key coefficient for each cellular location. This coefficient is then integrated with the neighborhood information of protein nodes to determine the subcellular localization centrality of each protein. By combining the localization of a protein with that of its surrounding network, the method assesses the protein's significance within its specific subcellular location.
2) The method introduces an edge clustering coefficient that accounts for differences in the participation of common neighbors, which quantifies the weight of the interaction edge between proteins. Additionally, the GO similarity between protein nodes is computed from GO information and incorporated into the edge weight, so that the weight reflects biological as well as topological evidence. On this basis, the method proposes a topological centrality measure weighted by GO similarity.
3) The proposed method takes into full consideration the interaction between proteins outside the complex and those inside the complex. It determines the centrality of protein complex participation by integrating two key factors: the in-degree of the complex for protein nodes and the frequency of complex participation.
The identification results on different PPI networks show that the CIBF method can effectively identify key proteins and has good stability.

Materials and method
In this paper, we present a novel method for identifying key proteins by integrating complex information and multiple biological characteristics. Our approach involves a comprehensive evaluation of protein nodes, considering their subcellular localization centrality, topological centrality weighted by GO similarity, and complex participation centrality. By examining key proteins from these diverse dimensions, we aim to enhance the accuracy of key protein mining within the PPI network.

Subcellular localization centrality
Subcellular localization information identifies the location of proteins in cells, which is an important spatial characteristic of proteins. Analysis of the subcellular localization distribution of proteins shows that key proteins appear more frequently in some locations than in others. Based on this phenomenon, subcellular localization information can be used to judge the spatial location centrality of proteins. There are 11 subcellular localization regions, as shown in Table 1. The key coefficient of subcellular localization region k is expressed by csl(k), and its calculation method is as follows:
\[ \mathrm{csl}(k) = \frac{\mathrm{nep}(k)}{\mathrm{sep}} \tag{2.1} \]
where nep(k) represents the number of key proteins in subcellular localization region k, and sep represents the total number of key proteins.
Table 1 shows the distribution of key proteins in the 11 subcellular localization regions and the key coefficients of these regions calculated by formula 2.1. Key proteins appear more frequently in the nucleus, mitochondrion, endoplasmic reticulum, cytosol and so on. In view of this, this paper calculates the spatial location centrality of protein v from the key coefficients of the subcellular localization regions (formula 2.2). In this formula, csl(k), k ∈ [1,11], represents the key coefficients of the 11 subcellular regions for protein v; sl represents the subcellular localization region; d(v)_k, k ∈ [1,11], represents the number of neighbor proteins of protein v in subcellular localization region k; and n_k represents the number of subcellular localization regions of protein v.
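As an illustration of the key coefficient only, the following minimal sketch assumes the ratio reading csl(k) = nep(k)/sep suggested by the definitions above. The function name, the dictionary layout and the toy data are hypothetical and are not taken from Table 1; the centrality of formula 2.2 is not reproduced here.

```python
from collections import Counter


def key_coefficients(localization, key_proteins):
    """Return csl(k) = nep(k) / sep for every region k that holds a key protein.

    localization: dict mapping protein -> set of region names (assumed layout)
    key_proteins: set of known key proteins
    """
    sep = len(key_proteins)  # total number of key proteins
    if sep == 0:
        return {}
    nep = Counter()  # nep[k]: number of key proteins located in region k
    for protein in key_proteins:
        for region in localization.get(protein, ()):
            nep[region] += 1
    return {region: count / sep for region, count in nep.items()}


# Toy data for illustration only.
localization = {
    "P1": {"nucleus"},
    "P2": {"nucleus", "cytosol"},
    "P3": {"mitochondrion"},
    "P4": {"vacuole"},
}
key_proteins = {"P1", "P2", "P3"}
csl = key_coefficients(localization, key_proteins)
```

Because a protein may be annotated with several regions, it is counted once in each of them under this reading, so the coefficients need not sum to 1.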

Topological centrality weighted by GO similarity
In this paper, we use GO similarity to weight topological centrality, which serves as an additional indicator for identifying key proteins. The gene ontology (GO) framework is employed to describe the biological characteristics of genes and their corresponding products. The relationship structure within GO is organized in a tree-like structure, where nodes closer to the root encompass broader descriptions, while nodes farther away convey more specific details. Our method calculates the GO functional similarity between proteins within the PPI network. The presence of common GO annotations indicates a closer relationship and enhances the reliability of the edges in the PPI network. Additionally, recognizing that the GO functional similarity can be influenced by the number of GO annotations associated with a protein, we introduce an adjustment factor to account for this effect. In the calculation of the GO functional similarity (formula 2.3), GO_u and GO_v represent the GO annotations of proteins u and v, and σ_u and σ_v represent the corresponding adjustment factors, which penalize proteins with fewer GO annotations and reward proteins with more GO annotations. In the calculation of the adjustment factors (formula 2.4), $\overline{GO}$ represents the average number of GO annotations in the PPI network.
In the PPI network, the edge clustering coefficient (ECC) evaluates the connection strength between two proteins from the topological structure. The calculation method is as follows:
\[ ECC(u, v) = \frac{z(u, v)}{\min(d_u - 1,\ d_v - 1)} \tag{2.5} \]
where z(u, v) represents the number of common neighbor nodes of proteins u and v, and d_u and d_v represent the degrees of proteins u and v.
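The edge clustering coefficient can be sketched as follows, assuming the commonly used form ECC(u, v) = z(u, v) / min(d_u − 1, d_v − 1). The adjacency-dictionary layout, the function name and the toy network are hypothetical, and returning 0 when the denominator is 0 is a convention chosen for this sketch.

```python
def edge_clustering_coefficient(adj, u, v):
    """ECC(u, v) = z(u, v) / min(d_u - 1, d_v - 1).

    adj: dict mapping each protein to the set of its neighbors (assumed layout)
    """
    z = len(adj[u] & adj[v])  # number of common neighbors of u and v
    denom = min(len(adj[u]) - 1, len(adj[v]) - 1)
    # Assumed convention: an edge whose endpoint has degree 1 gets ECC = 0.
    return z / denom if denom > 0 else 0.0


# Toy network: a triangle a-b-c plus a pendant edge b-d.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("b", "d")]
adj = {}
for x, y in edges:
    adj.setdefault(x, set()).add(y)
    adj.setdefault(y, set()).add(x)

ecc_ab = edge_clustering_coefficient(adj, "a", "b")
ecc_bd = edge_clustering_coefficient(adj, "b", "d")
```

In this toy network the edge a-b lies in the only triangle it could possibly form, while the pendant edge b-d cannot lie in any triangle.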
The traditional edge clustering coefficient does not consider differences in the participation degree of the common neighbors of edge e(u, v). In this paper, the common neighbor participation p_i is introduced to measure the participation of each common neighbor of edge e(u, v) (formula 2.6), where i represents a common neighbor of proteins u and v, and d_i represents the degree of protein i.
From this, we obtain the common-neighbor-difference edge clustering coefficient DnECC(u, v) (formula 2.7), in which Σ p_i represents the sum of the participation of all common neighbor proteins of edge e(u, v). The greater the value of DnECC(u, v), the higher the connection strength between the two protein nodes.
The topological centrality weighted by GO similarity is obtained by fusing the GO similarity and the common-neighbor-difference edge clustering coefficient (formula 2.8), where N represents the neighbor protein set of protein v.

Complex participation centrality
Complex information helps to identify key proteins. However, key proteins may appear inside or outside a complex. Existing methods do not consider the influence of proteins that lie outside a complex but interact with proteins inside it on complex participation. As shown in Figure 1, although protein 8 is not inside complexes A and B, it interacts with proteins 1 and 4 inside complex A, as well as proteins 5 and 6 inside complex B, so it should be taken into account when calculating protein complex participation. In view of this, this paper distinguishes between proteins inside and outside the complex to evaluate protein complex participation more accurately. The calculation uses the following quantities: d_in-pc represents the in-degree of protein v in the complex, n_in represents the number of times protein v appears in complexes, n_out represents the number of complexes connected by protein v, d_out-in represents the number of connections between protein v and the proteins in the complex, and f_in represents the frequency of protein v appearing in complexes. In the calculation of f_in, n_M represents the maximum number of times a protein appears in complexes.
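The raw quantities named above can be counted as in the following minimal sketch, which mirrors the Figure 1 situation of a protein outside two complexes. It stops short of the final centrality, whose formula is not reproduced here. The function name, the data layout, the complex memberships beyond proteins 1, 4, 5 and 6, and the choice to sum d_in-pc and d_out-in over complexes are all assumptions.

```python
def complex_participation_counts(adj, complexes, v):
    """Count the inside/outside complex quantities for protein v.

    adj: dict mapping protein -> set of neighbors (assumed layout)
    complexes: dict mapping complex name -> set of member proteins
    """
    neighbors = adj.get(v, set())
    counts = {"n_in": 0, "d_in_pc": 0, "n_out": 0, "d_out_in": 0}
    for members in complexes.values():
        links = len(neighbors & members)  # edges from v into this complex
        if v in members:
            counts["n_in"] += 1         # v appears in this complex
            counts["d_in_pc"] += links  # in-degree of v inside the complex
        elif links > 0:
            counts["n_out"] += 1         # v is outside but connected to it
            counts["d_out_in"] += links  # edges from v to complex members
    return counts


# Toy network loosely following Figure 1 (memberships partly assumed).
edges = [(8, 1), (8, 4), (8, 5), (8, 6), (1, 2), (1, 4), (5, 6)]
adj = {}
for a, b in edges:
    adj.setdefault(a, set()).add(b)
    adj.setdefault(b, set()).add(a)
complexes = {"A": {1, 2, 3, 4}, "B": {5, 6, 7}}

outside = complex_participation_counts(adj, complexes, 8)
inside = complex_participation_counts(adj, complexes, 1)
```

Here protein 8 belongs to no complex but touches both, which is exactly the case that a purely membership-based count would miss.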

Method description
In this paper, we combine subcellular localization centrality, GO-similarity-weighted topological centrality and complex participation centrality with a linear weighted model to obtain a comprehensive protein criticality evaluation method, in which α, β and (1-α-β) adjust the contribution of each part to the protein criticality. The experiment part discusses the values of α, β and (1-α-β). The CIBF method is described in Algorithm 1.
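A linear weighted combination of this kind can be sketched as below. This is an illustration rather than the authors' implementation: it assumes the three centralities are already computed and on comparable scales, the name `cpc` for the complex participation centrality is hypothetical, and the default weights are the values α = 0.2 and β = 0.4 reported in the parameter analysis.

```python
def cibf_scores(csl, cbt, cpc, alpha=0.2, beta=0.4):
    """Combine three centrality dicts as alpha*CSL + beta*CBT + (1-alpha-beta)*CPC."""
    gamma = 1.0 - alpha - beta
    if alpha < 0 or beta < 0 or gamma < -1e-9:
        raise ValueError("alpha, beta and 1 - alpha - beta must be non-negative")
    gamma = max(gamma, 0.0)  # absorb floating-point round-off
    proteins = set(csl) | set(cbt) | set(cpc)
    return {
        p: alpha * csl.get(p, 0.0) + beta * cbt.get(p, 0.0) + gamma * cpc.get(p, 0.0)
        for p in proteins
    }


def rank_proteins(scores):
    """Return proteins sorted by descending score (ties broken by name)."""
    return sorted(scores, key=lambda p: (-scores[p], p))


# Toy centralities for illustration only.
csl = {"P1": 0.9, "P2": 0.1}
cbt = {"P1": 0.2, "P2": 0.8}
cpc = {"P1": 0.5, "P2": 0.5}
scores = cibf_scores(csl, cbt, cpc)
ranking = rank_proteins(scores)
```

With these toy values P2 outranks P1, because the topological term carries twice the weight of the localization term.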

Evaluation metrics
This paper uses three evaluation metrics:
(1) Traditional statistical metrics. This paper uses the traditional evaluation metrics shown in Table 3. Among them, SN represents the proportion of correctly predicted key proteins in the total number of key proteins, and SP represents the proportion of correctly predicted non-key proteins in the total number of non-key proteins. PPV represents the proportion of predicted key proteins that are truly key proteins. F-measure is calculated from SN and PPV and gives a balanced assessment of the overall performance of different methods under these two metrics. ACC is used to evaluate the overall accuracy of each method in identifying key proteins and non-key proteins.
(2) Jackknife methodology. It is used to evaluate the identification ability and stability of different methods for key proteins.
(3) Overlap/difference analysis of key proteins. This evaluation metric determines the performance of each method by analyzing the overlap and difference of the proteins identified by different methods.
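The traditional statistical metrics in item (1) can be sketched as follows, using the standard confusion-matrix definitions and treating the top-N ranked proteins as predicted key proteins. The function name and toy data are hypothetical, and returning 0 for an empty denominator is a convention of this sketch.

```python
def classification_metrics(ranked, true_key, top_n):
    """Compute SN, SP, PPV, F-measure and ACC for a top-N prediction.

    ranked: list of all proteins, best score first
    true_key: set of known key proteins
    """
    predicted = set(ranked[:top_n])  # predicted key proteins
    rest = set(ranked[top_n:])       # predicted non-key proteins
    tp = len(predicted & true_key)
    fp = len(predicted) - tp
    fn = len(rest & true_key)
    tn = len(rest) - fn

    def ratio(num, den):
        return num / den if den else 0.0

    sn = ratio(tp, tp + fn)   # sensitivity
    sp = ratio(tn, tn + fp)   # specificity
    ppv = ratio(tp, tp + fp)  # positive predictive value
    f = ratio(2 * sn * ppv, sn + ppv)
    acc = ratio(tp + tn, tp + tn + fp + fn)
    return {"SN": sn, "SP": sp, "PPV": ppv, "F": f, "ACC": acc}


# Toy ranking for illustration only.
ranked = ["A", "B", "C", "D", "E", "F"]
true_key = {"A", "C", "E"}
metrics = classification_metrics(ranked, true_key, top_n=2)
```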

Parameter analysis
In the CIBF method, α, β and (1-α-β) adjust the contributions of spatial location centrality, biological topological centrality and complex participation centrality to protein criticality. This section analyzes the influence of different parameter settings on key protein identification performance through experiments. When α = 1, only the spatial location centrality of the protein is considered. When β = 1, only the biological topological centrality is considered. When (1-α-β) = 1, only the complex participation centrality is considered. The results on the DIP, Krogan and MIPS data sets show that the CIBF method correctly identifies the largest number of key proteins when α = 0.2 and β = 0.4. Therefore, the experiments in this paper use α = 0.2, β = 0.4 and (1-α-β) = 0.4.
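A parameter scan of this kind can be sketched as a grid search over α and β with α + β ≤ 1, scoring each setting by the number of true key proteins among the top-N ranked proteins. The grid resolution, the function name, the tie-breaking rule and the toy data are assumptions; the source does not state how the scan was carried out.

```python
def grid_search(csl, cbt, cpc, true_key, top_n, steps=10):
    """Return (hits, alpha, beta) for the first setting with the most hits.

    csl, cbt, cpc: dicts with the same protein keys (assumed precomputed)
    steps: number of grid intervals per weight (0.1 spacing when steps=10)
    """
    best = None
    for i in range(steps + 1):
        for j in range(steps + 1 - i):
            alpha = i / steps
            beta = j / steps
            gamma = (steps - i - j) / steps  # weight of complex participation
            scores = {
                p: alpha * csl[p] + beta * cbt[p] + gamma * cpc[p] for p in csl
            }
            top = sorted(scores, key=lambda p: (-scores[p], p))[:top_n]
            hits = len(set(top) & true_key)
            if best is None or hits > best[0]:
                best = (hits, alpha, beta)
    return best


# Toy data: each protein is favoured by exactly one centrality.
csl = {"A": 1.0, "B": 0.0, "C": 0.0}
cbt = {"A": 0.0, "B": 1.0, "C": 0.0}
cpc = {"A": 0.0, "B": 0.0, "C": 1.0}
best = grid_search(csl, cbt, cpc, true_key={"B"}, top_n=1)
```

Several settings usually tie on the hit count, so in practice the tie-breaking rule, or a secondary metric such as F-measure, decides which setting is reported.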

Ablation analysis
The CIBF method involves three aspects in identifying key proteins: spatial location centrality, biological topological centrality and complex participation centrality. Through ablation experiments, this section examines the identification ability of the method when only one or two factors are considered, demonstrating the need for each component.
Tables 4-6 show the F-measure values when only a single factor is considered in the DIP, Krogan and MIPS networks.

Jackknife methodology evaluation results
The jackknife methodology metric analyzes the changes in the number of key proteins correctly identified by each method as the number of true key proteins increases.
Figure 2 shows the jackknife methodology evaluation results of different methods in the DIP, Krogan and MIPS networks under the TOP-600 gradient. The x axis in the figure represents the cumulative number of key proteins, and the y axis represents the number of correctly identified key proteins. The experimental results show that, as the number of key proteins increases, the CIBF method achieves higher accuracy and stability than the other methods.
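A jackknife curve of the kind plotted in Figure 2 can be sketched as a cumulative count: for each prefix of the ranked list, count how many of the proteins so far are true key proteins. The function name and toy data are hypothetical.

```python
def jackknife_curve(ranked, true_key, top_n):
    """Return the cumulative number of true key proteins among the top 1..top_n.

    ranked: list of proteins, best score first
    true_key: set of known key proteins
    """
    curve = []
    hits = 0
    for protein in ranked[:top_n]:
        if protein in true_key:
            hits += 1
        curve.append(hits)  # y value at x = len(curve)
    return curve


# Toy ranking for illustration only.
ranked = ["A", "B", "C", "D", "E", "F"]
true_key = {"A", "C", "E"}
curve = jackknife_curve(ranked, true_key, top_n=5)
```

A method whose curve lies above another's over the whole range identifies more true key proteins at every cut-off.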

Overlap/Difference analysis results of key proteins
This section analyzes the overlap/difference of key proteins between different methods under the TOP-600 gradient. The experimental results are shown in Tables 13-15.

Effectiveness and stability
To verify the performance of CIBF, PPI networks with various topological properties were selected for key protein identification, and CIBF was compared with other methods on them. The experimental results show that CIBF identifies more key proteins, and its performance on multiple evaluation indicators (such as F-measure and ACC) in different networks also shows that the CIBF method has good stability and effectiveness.

Limitations and deficiencies
Although the CIBF method has made some progress in identifying key proteins, it still has the following limitations: (1) Non-central key protein recognition. CIBF and existing methods identify key proteins mainly on the basis of node centrality, but some key proteins have low centrality in the network, so the accuracy of recognizing such key proteins still needs to be improved; (2) High-quality PPI network construction. The environment of proteins in the organism is constantly changing, which affects the accuracy of key protein identification, so higher-quality PPI networks could be built, for example dynamic PPI networks that incorporate temporal characteristics; (3) More effective biological feature fusion methods. Existing methods use protein biological features in a relatively simple way, and the effectiveness of biological feature fusion needs to be improved. Graph representation learning performs well on graph data and could be considered to improve the accuracy of key protein identification.

Conclusion
Identification of key proteins in the PPI network is not only helpful for analyzing biological tissue structure but also important for predicting pathogenic genes and discovering drugs. In this paper, we propose a key protein identification method that combines complex information and multiple biological characteristics. This method comprehensively evaluates the importance of proteins from the perspectives of subcellular localization centrality, GO function centrality, and complex participation centrality. The experimental results on different PPI networks show that the CIBF method can effectively identify key proteins and has good stability.

Use of AI tools declaration
The authors declare they have not used Artificial Intelligence (AI) tools in the creation of this article.

Figure 1. The relationship between proteins and complexes.

Algorithm 1: CIBF Method
Input: PPI network G=(V, E); Subcellular localization data; GO data; Protein complex data; Parameter α, β
Output: The rank list of protein nodes
for i=1 to n do
    Calculate CSL(i) by formula (2); //subcellular localization centrality
end
for each e∈E do
    Calculate AGOsim of e by formula (3);
    Calculate DnECC of e by formula (7);
end
for i=1 to n do
    Calculate CBT(i) by formula (8); //topological centrality weighted by GO similarity
end
for i=1 to n do

Table 1. Subcellular localization and coefficient of subcellular localization.

Table 4. Single factor F-measure in DIP network.

Table 12. Evaluation results of MIPS network.
Figure 2. Jackknife results of DIP, Krogan and MIPS networks.

Tables 13-15. Ms represents the methods other than the CIBF method. |CIBF∩Ms| represents the key proteins recognized by both the CIBF method and the other method. |CIBF-Ms| represents the proportion of true key proteins recognized by the CIBF method but not by the other method. |Ms-CIBF| represents the proportion of true key proteins