Detection of spreader nodes in human-SARS-CoV protein-protein interaction network

The entire world is witnessing the coronavirus pandemic (COVID-19), caused by a novel coronavirus (n-CoV) generally distinguished as Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2). SARS-CoV-2 promotes fatal chronic respiratory disease followed by multiple organ failure, ultimately putting an end to human life. International Committee on Taxonomy of Viruses (ICTV) has reached a consensus that SARS-CoV-2 is highly genetically similar (up to 89%) to the Severe Acute Respiratory Syndrome Coronavirus (SARS-CoV), which had an outbreak in 2003. With this hypothesis, current work focuses on identifying the spreader nodes in the SARS-CoV-human protein–protein interaction network (PPIN) to find possible lineage with the disease propagation pattern of the current pandemic. Various PPIN characteristics like edge ratio, neighborhood density, and node weight have been explored for defining a new feature spreadability index by which spreader proteins and protein–protein interaction (in the form of network edges) are identified. Top spreader nodes with a high spreadability index have been validated by Susceptible-Infected-Susceptible (SIS) disease model, first using a synthetic PPIN followed by a SARS-CoV-human PPIN. The ranked edges highlight the path of entire disease propagation from SARS-CoV to human PPIN (up to level-2 neighborhood). The developed network attribute, spreadability index, and the generated SIS model, compared with the other network centrality-based methodologies, perform better than the existing state-of-art.


INTRODUCTION
The COVID-19 pandemic registered its first case on 31 December 2019 (World Health Organization, 2020b). First, it laid its foundation in the Chinese city of Wuhan (Hubei province) . Soon, it made several countries worldwide (Centers for Disease Control and Prevention (CDC), 2021) its victim by community spreading which ultimately compelled the World Health Organization (World Health Organization (WHO), 2019) to declare a global health emergency on 30 January 2020 (World Health Organization (WHO), 2005b) for the massive outbreak of COVID-19. Owing to its expected fatality rate, which is about 4%, as projected by WHO (World Health Organization (WHO), 2005a), researchers from nations all over the world have joined their hands to work together to understand the spreading mechanisms of this virus SARS-CoV-2 (Heymann, 2020;Huang et al., 2020;Zhou et al., 2020) and to find out all possible ways to save human lives from the dark shadow of  Coronavirus belongs to the family Coronaviridae. This single-stranded RNA virus affects not only humans but also mammals and birds too. Due to coronavirus, common fever/flu symptoms are noted in humans, followed by acute respiratory infections. Nevertheless, coronaviruses like Middle East Respiratory Syndrome (MERS) and Severe Acute Respiratory Syndrome (SARS) can create a global pandemic due to their infectious nature. Both of these coronaviruses are the member of genus Betacoronavirus under Coronaviridae. SARS started a significant outbreak in 2003, originating from Southern China. Seven hundred seventy-four deaths were reported among 8098 globally registered cases resulting in an estimated fatality rate of 14%-15% (World Health Organization (WHO), 2003). While MERS commenced in Saudi Arabia, creating an endemic in 2012. The world witnessed 858 deaths among 2494 registered positive cases. It generated a high fatality rate of 34.4% in comparison to SARS.
SARS-CoV-2 is under the same Betacoronavirus genus as that of MERS and SARS coronavirus (Lu et al., 2020). It comprises several structural and non-structural proteins. The structural proteins include the envelope (E) protein, membrane (M) protein, nucleocapsid (N) protein, and the spike (S) protein. Though SARS-CoV-2 has been identified recently, there is an intense scarcity of data and necessary information needed to gain immunity against SARS-CoV-2. Studies have revealed that SARS-CoV-2 is highly genetically similar to SARS-CoV based on several experimental genomic analyses (Hoffmann et al., 2020;Letko, Marzi & Munster, 2020;Lu et al., 2020;Zhou et al., 2020). This is also the reason behind the naming of SARS-CoV-2 by the International Committee on Taxonomy of Viruses (ICTV) (World Health Organization (WHO), 2020a). Due to this genetic similarity, the immunological study of SARS-CoV may lead to the discovery of SARS-CoV-2 potential drug development.
Due to the high morbidity and mortality of SARS-CoV2, it has been felt that there is a pressing need to properly understand the way of viral infection transmission from SARS-CoV-2 PPIN to human PPIN. This paper considers SARS-CoV PPIN for this research study due to its high genetic similarity with SARS-CoV-2. Another primary motivation is to study the spreadability pattern of the ancestral strain of nCoV. In the proposed methodology, at first, SARS-CoV-Human PPIN (up to level-2) is formed from the collected datasets (Agrawal, Zitnik & Leskovec, 2017;Pfefferle et al., 2011). Once created, the spreader nodes are first identified in the SARS-CoV PPIN. Then its level-1 and level-2 interactors in the human PPIN are extracted using a new network attribute, i.e., spreadability index, which is a combination of three different network features: (1) edge ratio (Samadi & Bouyer, 2019) (2) neighborhood density (Samadi & Bouyer, 2019) and (3) node weight (Wang & Wu, 2013). The detected spreader nodes in the human PPIN are validated by the Susceptible, Infected, and Susceptible (SIS) epidemic disease model (Bailey, 1975). Then the edges connecting two spreader nodes are ranked based on the average spreadability index. Thus, the ranked edges highlight the path through which viral infection gets mediated from SARS-CoV to human PPIN (up to level-2). The entire methodology can be categorized into 3-steps for (1) identifying the spreader nodes in the SARS-CoV and human PPIN using spreadability index, (2) validation of spreader nodes by SIS model, and (3) ranking of the spreader edges.
Developing the spreadability index for raking edges in a host-pathogen PPIN to analyse the host's viral infection propagation path is the primary contribution of this work. Furthermore, considering the current investigation on SARS-CoV and the notable similarity with its successor virus, we also attempt to shed light on the propagation pattern of viral infection of SARS-CoV2 in human PPIN.
In the following, we first describe the theory and methods for different network properties used to extract the PPIN characteristics. Then we describe the 3-step methodology. First, the methodology has been described using a synthetic PPIN (generated by Cytoscape; Shannon et al., 2003). Then, in the experimental results section, we have employed the developed method on the human-SARS-CoV PPIN to identify the SARS-CoV viral infection propagation path in the human PPIN. Finally, in the discussion section, we attempt to relate our findings with the ancestral virus, i.e., SARS-CoV, with its successor, i.e., SARS-CoV2, to study the SARS-CoV2 disease propagation may follow the pattern from SARS-CoV.

THEORY & NOTATIONS
The viral infection gets mediated from one part of the PPIN to another through spreader nodes and edges (Brito & Pinney, 2017). Generally, in disease-specific PPIN models, at least two entities are involved: pathogen/Bait and host/Prey (Saha et al., 2017). In this research work, SARS-CoV takes the role of the former while human the latter one. Viral proteins of SARS-CoV tend to target their corresponding interaction with human proteins, which target its next level of proteins. So, the establishment of interactions between SARS-CoV and human occurs through connected nodes and edges of PPIN. But mostly, these viral proteins try to interact more with the central/hub proteins rather than the other proteins (Brito & Pinney, 2017). Thus, proper identification of central nodes (i.e., spreader nodes) is required. It is also confirmed that the interaction is not possible without the edges connecting two spreader nodes. Thus, these connecting edges are called spreader edges. The proposed methodology involves a proper study and assessment of various existing established PPIN features followed by identifying spreader nodes, which the SIS model has also verified. Before going into the detailed study about the proposed work, various network-based terminologies which are used in this work are discussed below:

Protein-protein interaction network (PPIN)
When one protein interacts with another protein, it forms a network-like structure known as PPIN. Generally, it is portrayed as a graph where proteins are represented as nodes, and their corresponding connecting edges represent their interactions. Mathematically, PPIN can be highlighted as a graph G nv , which consists of a set of vertices v(nodes) connected by edges e (links). Thus, G nv = (v,e) (Saha et al., 2014;Saha et al., 2019a).

Level-1 and Level-2 proteins
In a PPIN, level-1 proteins of a node are those proteins that are in direct connection with that node, i.e., its immediate neighbors, whereas level-2 proteins are those proteins that are indirectly connected with level-1 proteins of that node, i.e., its indirect neighbors (Saha et al., 2014;Saha et al., 2019a).

Graph centrality
Graph centrality is one of the essential aspects for the identification of significant nodes in a PPIN. The centrality of a node defines how relevant the node is in a PPIN or how much a node is centrally located in a PPIN.

Betweenness centrality (BC)
BC (Anthonisse, 1971) is one of the ways of measuring a node's impact on the transmission of information between every pair of nodes in a graph, considering that this transmission is always executed over the shortest path between them. Mathematically, it is defined as: where ρ(s,t ) is the total number of shortest paths from node s to node t , and ρ(s,u,t ) is the number of those paths that pass through u.

Closeness centrality (CC)
CC (Sabidussi, 1966) is a procedure for detecting nodes that transmit information within a network efficiently. Nodes with high closeness centrality values are considered to have the shortest distance to all available nodes in the network. It can be mathematically expressed as: where |N u | denotes the number of neighbors of node u and dist (u,v) is the distance of the shortest path from node u to node v.

Degree centrality (DC)
DC (Jeong et al., 2001) is considered the simplest among the available centrality measures that only count the degree of a node, i.e., the number of directly connected neighbors. Nodes having a high degree are said to be the highly connected module of the network. It is defined as: where |N u | denotes the number of neighbors of node u.

Local average centrality (LAC)
LAC (Li et al., 2011) of a node represents how close its neighborhood proteins are. It is defined to be the local metric to compute the essentiality of the node for transmission ability by considering its modular nature, the mathematical model of which is highlighted as: , the number of neighbors of node u) and deg w c u isthe total number of nodes that are directly connected in C u .

Ego network
Ego network of node i (S i ) (Samadi & Bouyer, 2019) is defined as the grouping of node i itself along with its corresponding level-1 neighbors and interconnections. N (S i ) (Samadi & Bouyer, 2019) consists of the set of nodes which belong to the ego network, S i i.e., {i} ∪ (i).

Edge ratio
The edge ratio of node i (Samadi & Bouyer, 2019) is defined by the following equation: where E S i out is the total number of interactions between the ego network S i and the proteins outside it. E S i in is the total number of interactions among node i's neighbors. (i) denotes the level-1 neighbors of node i.S i is considered to be Ego network. S i (j) denotes node j's neighbors which belongs S i . In the edge ratio, E S i out is positively related to the non-peripheral location of node i. A large number of interactions resulting from the ego network denotes that the node has a high level of interconnectivity between its neighbors. On the other hand, E S i in is negatively related to the inter-module location of node i. It represents the fact that the interconnectivity between neighbors is usually connected to the number of structural holes available around the node. Thus, when the neighbor's interconnectivity is low, the root or the central node i gains more control of transmission flow among the neighbors.

Jaccard dissimilarity
The similarity between two nodes is determined by Jaccard dissimilarity (Jaccard, 1912) based on their common neighbors. Jaccard dissimilarity of node i and j (dissimilarity(i,j)) is defined as: where | (i) ∩ j | refers to the number of common neighbors of i and j.| (i) ∪ j | is the total number of neighbors of i and j. The similarity degree between i and j is considered more when they have more common neighbors. Whereas, when dissimilarity between the neighbors of a node is high, it guarantees that the only common node among the neighbors is the central node, which is termed a structural hole situation (Samadi & Bouyer, 2019).

Neighborhood diversity
The neighborhood diversity (Samadi & Bouyer, 2019) is a significant parameter of a graph that is based on Jaccard dissimilarity. When the dissimilarity of the neighbors of a node is high, it assures that the central node is the only neighbor common among the neighbors of that node, i.e., it represents the structural hole situation. On the other hand, when a node's neighborhood diversity reaches its greatest value, it reveals that the neighbors have no other closer path. Hence, the neighbors should transmit or communicate through this node. Mathematically, it is defined as:

Node weight
Node weight (Wang & Wu, 2013) is a graph parameter used to assign weightage to a node in a graph. Node weight w v of node v ∈ V in PPIN is interpreted as the average degree of all nodes in G V , a sub-graph of a graph G V . It is considered as another measure to determine the strength of connectivity of a node in a network. Mathematically, it is represented by are the randomly generated sample PPINs (nodes with edges) used for the detailed analysis and testing of the proposed methodology (for example, please see Fig. 1). The algorithm of the same is discussed in the supplementary document. Biological PPINs are the complete PPINs generated from the above datasets on which the proposed methodology is executed after testing (for example, please the complete PPIN view of SARS-CoV and human PPIN added at the end of the Experimental Results and Discussion section).

METHODOLOGY
The proposed work can be mainly categorized into three sub-sections: (1) Identification of spreader nodes by spreadability index, (2) Validation of spreader nodes by SIS model, and (3) Ranking of spreader edges.

Identification of spreader nodes by spreadability index
The spreadability index of node i is defined as the ability of node i to mediate a viral infection in a PPIN. Mathematically it can be defined as: Nodes having a high spreadability index are termed as spreader nodes, i.e., if the viral proteins establish interactions with these nodes, then the viral infection can be mediated to a more significant number of nodes in a much short amount of time compared to the other nodes in PPIN. Figure 1 represents a sample PPIN where each protein is denoted as a node while edges mark its interactions with other proteins. The PPIN consists of 33 nodes and 53 edges. The PPIN data and the protein names and interactions are given as input to the Cytoscape, which generates the network view as highlighted in Fig. 1. Cytoscape is open-source software that is used for PPIN generation and visualization (Shannon et al., 2003). The spreadability index is computed on the synthetic PPIN, shown in Fig. 1, using essential PPIN characteristics in this PPIN, as stated earlier. The same is compared to DC, BC, CC, and LAC, highlighted in Tables 1 to 5.
In Fig. 1, it can be observed that nodes 1 and 24 are the essential spreaders. Node 1 connects the four densely connected modules of the PPIN, making this node the topper with the highest spreadability index. This node has been correctly ranked by all the methods except LAC and DC. Node 24, though, has a moderate edge ratio and node weight but is one of the most densely connected modules itself despite getting isolated from the main PPIN module of node 1. Moreover, node 24 has the highest neighborhood density. It establishes that the only path of transmission of information for nodes 26,27,25,28,29,30,31,32, and 33 is node 24. Thus, if viral proteins of SARS-CoV establishes interaction with node 24, then all the connected nodes will be indirectly coming under the interaction of viral proteins as the connected nodes have no interactions with other central nodes except node 24. So, node 24 holds the second position for the spreadability index in our proposed methodology. Node 24 is not correctly identified as the second most influential spreader node by the other methods. Further assessment of the remaining nodes highlights the fact that the performance of the new attribute spreadability index in our proposed methodology is relatively better in comparison to the others.

Validation of spreader nodes by SIS model
To design the mathematical model for this infectious disease, the SIS Epidemic Model (Bailey, 1975) is used in this proposed methodology by classifying the proteins in SARS-CoV-human PPIN based on their interactivity status (for more details, please see ''Studied Models in epidemiology'' section of the supplementary document). SIS refers to Susceptible, Infected and Susceptible states, which are generally considered the three probable protein states in a PPIN. (1) S -The susceptible states are the states of those human proteins with which viral proteins have not yet interacted, but they are at risk of getting interacted. In general, every protein in PPIN is initially in a susceptible state. (2) I -These infected states are the states of those human proteins with which viral proteins have interacted, and the Table 1 Computation of spreadability index of synthetic Fig. 1 and computation  (3) S -The susceptible states are the states of those human proteins that have lost their interaction with the viral proteins (due to antiviral therapies or change in interface residues (Brito & Pinney, 2017)) and again become susceptible. The interaction rate of the viral proteins with human proteins, the loss rate of interactivity of the human protein with the viral proteins (general assumption is that any protein after coming out of the infected state gets into a susceptible state again in one day), and the   Table 1) in comparison to others for their corresponding top 10 spreader nodes in the synthetic PPIN, as shown in Fig. 1.

Ranking of Spreader edges
To show the ranking of interacting spreader edges, two synthetic PPINs: PPIN-1 and PPIN-2, have been considered in Fig. 2. Node D, E, and F are the selected top spreader nodes in PPIN-1 by spreadability index, similarly explained with a synthetic PPIN in Fig. 1.
To avoid the complexity in the diagram, the top 5 nodes in PPIN-2 (see Table 1) are selected as spreader nodes. Red-colored edges are the interconnectivity within PPIN-1, Table 5 Computation of DC of synthetic Fig. 1 and computation of spreadability rate of selected top 10 spreader nodes by the SIS model.  edges are ranked based on the average spreadability index of its connected spreader nodes. The ranked spreader edges in Fig. 2 are highlighted in Table 6.

EXPERIMENTAL RESULTS & DISCUSSION
The proposed methodology leads to the identification of spreader nodes and edges through a network characteristic, called spreader index which has also been checked and validated  Fig. 3. In Fig. 3A, at first, SARS-CoV PPIN is displayed in which each protein is marked in red. After that, spreader nodes in SARS-CoV PPIN are identified by the spreadability index. They are denoted as blue nodes among the red. Once the spreader nodes are active (Fig. 3B), the viral infection gets mediated through its corresponding direct partners, i.e., human-level-1proteins (marked in deep green). Then, in Fig. 3C, spreader nodes are identified in SARS-CoV level-1 human proteins (marked in yellow). The same will continue to SARS-CoV level-2 human proteins (light green nodes are the spreaders).     Fig. 4, SARS-CoV PPIN has been highlighted. There are mainly nine proteins, including E, M, ORF3A, ORF7A, S, N, ORF8A, ORF8AB, and ORF8B. The computed spreadability index of these proteins and the corresponding validation by the SIS model are highlighted in Table 7. It is also compared with other central/ influential spreader node detection methodologies like DC, CC, LAC, and BC, shown in Tables 8-11. Similarly, spreader nodes are also identified in SARS-CoV's level-1 neighbors and level-2 neighbors (see Figs. 5 and 6).
The spreadability index plays a vital role in this proposed methodology. Spreader nodes are successfully identified by this scoring technique which covers all the aspects through which viral infection gets mediated from one node to another in a PPIN (Brito & Pinney, 2017). It should be mentioned here that while identifying spreader nodes in SARS-CoV level-2 human proteins, it has been noted that the number of nodes is getting increased significantly with the increment of successive levels. So, high, medium, and low thresholds   Table 12. It can be observed that threshold application is only implemented at SARS-CoV level-2 human proteins, not on others. This is because of the availability of a smaller number of nodes and edges. Therefore, only nodes and edges having a shallow spreadability index have been discarded at the first level.
Besides the identification of spreader nodes, spreader edges are also identified. The ranked edges between SARS-CoV spreaders and its level-1 human spreaders are highlighted in Table 13. In contrast, the ranked edges between SARS-CoV s level-1 and level-2 human spreaders at high, medium, and low thresholds are highlighted in the Tables S1-S3, respectively. The supplementary document is available online here. The complete PPIN  -1 and level-2). The PPIN consists of the interaction between SARS-CoV and human proteins. The blue node represents SARS-CoV spreaders, while the yellow and green nodes represent SARS-CoV s level-1 and level-2 human spreaders. The thickness of the edges varies with the order of ranking.
In the above-generated PPIN views, the blue, yellow, and green colors represent SARS-CoV spreaders, level-1 human spreaders, and its level2 human spreaders. The remaining nodes are in indigo.

CONCLUSION
The spreadability index is thus proved to be effective in detecting spreader nodes and edges in SARS-CoV-human PPIN and the cross-validation by the SIS model. Spreader nodes are the central nodes in the PPIN through which viral infection gets mediated to their successors. Simultaneously, if the spreader nodes are not connected with spreader edges, that would not have been possible. In a nutshell, it can be said that the proposed work exploits the possibility of understanding how viral infection gets mediated from the SARS-CoV PPIN to the human PPIN. It should be borne in mind that SARS-CoV2 is ∼89% genetically similar to its predecessor SARS-CoV (Chan et al., 2020;CIDRAP, 2020). Therefore, it strongly reveals that the human proteins chosen as spreaders of SARS-CoV might be the potential targets of SARS-CoV2. So, the same concept of the