Identifying influential nodes in a wound healing-related network of biological processes using mean first-passage time

In this study we offer an approach to network physiology, which proceeds from transcriptomic data and uses gene ontology analysis to identify the biological processes most enriched in several critical time points of wound healing process (days 0, 3 and 7). The top-ranking differentially expressed genes for each process were used to build two networks: one with all proteins regulating the transcription of selected genes, and a second one involving the proteins from the signaling pathways that activate the transcription factors. The information from these networks is used to build a network of the most enriched processes with undirected links weighted proportionally to the count of shared genes between the pair of processes, and directed links weighted by the count of relationships connecting genes from one process to genes from the other. In analyzing the network thus built we used an approach based on random walks and accounting for the temporal aspects of the spread of a signal in the network (mean-first passage time, MFPT). The MFPT scores allowed identifying the top influential, as well as the top essential biological processes, which vary with the progress in the healing process. Thus, the most essential for day 0 was found to be the Wnt-receptor signaling pathway, well known for its crucial role in wound healing, while in day 3 this was the regulation of NF-kB cascade, essential for matrix remodeling in the wound healing process. The MFPT-based scores correctly reflected the pattern of the healing process dynamics to be highly concentrated around several processes between day 0 and day 3, and becoming more diffuse at day 7.


Introduction
Properties of random walks have previously been used in biophysics to characterize various biological processes [1]. They have also been used as a basis of diverse numerical descriptors of chemical compounds and biological networks [2][3][4] like random walk betweenness centrality [5], communicability and modular structure [6,7], as well as complexity measures [8][9][10] and descriptors used for evaluating network connections [11][12][13]. In the context of biological networks, random walks have been intensively used for estimating node influence. The influence between nodes in protein-protein interaction networks served as a way of inferring protein function [14], finding driver mutations in cancer [15,16], or finding disease-related genes [17]. The influence has been defined in terms of a diffusion kernel [18], diffusion with loss [19] or a heat kernel [20]. However, the diffusion kernel and heat kernel are both defined for undirected graphs, which reduces their use for directed networks such as a kinase-substrate protein signaling network or gene regulatory network. More importantly, the above measures ignore the time axis in their measurements of the spread of signal from node to node.
Mean first-passage time (MFPT) captures the progression of the random walk. In computational biology, it has been used previously for analyzing state transition graphs in probabilistic Boolean networks to identify gene perturbations that quickly lead to a desired state of the system [21]. It was also used in physical chemistry for finding reaction paths from a reactant to a product [22], for example to uncover the path of excitation migration after photon absorption in the photosynthetic complex. The study of statistical properties of first-passage times on various domains has a long history in physics [23], including recent results on regimes that led to different forms of the distribution and different behavior of the mean as the size of the complex system increases [24]. Distribution of first-passage times MFPT on undirected graphs has been well-characterized [25][26][27] and, recently, it has been shown that for directed graphs, MFPT can also be obtained analytically [28,29].
In this study, we build on those results to investigate the nodes which are influential and essential in a network of biological processes involved in skin wound healing. Focusing on a network of biological processes is a new direction for wound healing studies that expands on our previous in-silico analyses of healing [30][31][32]. Our new approach might be regarded as a computationally oriented branch of the newly open field of network physiology [33][34][35] or, more generally, of the new 'network of networks' field [36]. While each node in network physiology represents one of a set of interrelated experimentally characterized physiological processes, we focus on a network linking biological processes, as defined in gene ontology [37] along with a set of genes/proteins characteristic for the process. More specifically, we apply the MFPT method to three essential time points during healing of skin wounds in humans to analyze the interrelation of processes involved, with emphasis on those significantly enriched during the process of healing. Wound healing is a complex physiological process that involves extracellular as well as intracellular signaling and remodeling in an environment composed of mixtures of cells of different types. Operating at the level of biological processes instead of at the level of individual genes or proteins offers a chance for a more comprehensive and compact view of the key characteristics of wound healing.

Analysis of networks of biological processes
2.1. Wound-healing data and enrichment analysis We proceeded from a gene microarray dataset that captures normal epidermal wound healing in eight human subjects [38]. The wounds were a result of harvesting a skin graft from patients' thigh. We analyzed three groups of samples, representing different time points during wound healing. The first time point refers to acute wound, and includes samples biopsied from the skin graft site right after harvesting. The second time point characterizes the inflammatory phase of the healing process, based on samples biopsied on the third day after the skin graft was harvested. The third time point represents the re-epithelialization phases of healing, and includes samples biopsied on the seventh day. Samples of a biopsy of intact, unwounded skin from the graft site, collected immediately prior to harvesting of the skin graft, serve as a control.
For all samples, we used normalized transcriptomic data captured using the Affymetrix Human Genome U133 Plus 2.0 array. We used a t-test to detect genes up-and down-regulated in an acute wound (day 0) compared to control, at day 3 compared to control, and at day 7 compared to control. In all three cases, we performed enrichment analysis using 1000 top-ranking differentially expressed genes. We used the DAVID [37] tool to obtain a list of biological processes that were enriched in the top-ranking genes, using a cut-off of p = 0.05. We found 66, 80 and 84 biological processes enriched in the wound versus control for days 0, 3, and 7, respectively. The top enriched processes are presented in table 1.

Network of biological processes important in wound healing
We constructed networks linking biological processes separately for bioprocesses enriched at days 0, 3 and 7. The connectivity of the networks was based on the interactions between the genes that are involved in the enriched processes. For each biological process, we obtained from DAVID a set of genes that are involved in the process, and are present on our list of top-ranking differentially expressed genes for a specific day. We mapped those genes onto two networks of different types. The TRANSFAC network [39] contains validated information about transcription regulation. For each gene, it lists all proteins that regulate its expression by acting as transcription factors and binding to the gene's promoter region. The PhosphoNet network [40] captures cellular signaling at the protein level, by providing information about kinases and the substrates they phosphorylate. Uniting both networks is essential for capturing a comprehensive view of wound healing, a process that spans multiple time scales, from immediate processing of stimuli through signaling involving protein-protein interactions to slower response that involves changes in gene expression through transcription factor-DNA regulation.
For each pair of biological processes, we counted how many regulatory and signaling interactions connect genes from one process to genes from the other, and added a directed edge with a weight proportional to the count. We also counted the number of genes that are present in both processes, and added weighted bidirectional edges in both directions between each pair of such processes. Edge weights were normalized by dividing the counts by the maximum possible number of interactions, that is, the product of the cardinalities of gene sets related to both biological processes. In effect, we obtained two networks, one representing regulation and signaling, and another representing common genes. For further analyses, we normalized both networks to have the same total sum of weights, and merged them into a single weighted directed network of bioprocesses enriched at a specific day of healing.
The networks obtained for days 0, 3 and 7 are dense networks, with around 50-60% pairs of processes connected (figure 1). The underlying undirected graphs for both networks each form a single connected component. The distribution of link strength shows few strong and many weak ones, but is more concentrated than a power law (figure 2). The distribution of node weighted in-degrees, that is, the sum of weights of edges going into a node, is also highly concentrated, as is the distribution of weighted out-degrees (figure 2).

Identifying influential and essential biological processes
Identification of important biological processes in a dense enrichment network is not straightforward. While the enrichment score can point to the most enriched processes based on expression data, it does not take network topology into account. Here, we propose the use of an approach based on random walks that also takes temporal aspects of spread of signal over a network into account. The MFPT H i j ( , ), known also as expected hitting time, from node i to j in a strongly connected, directed graph is defined as the expected number of steps it takes for a random walker starting from node i to reach node j for the first time, where the walk is Markov chain defined by transition probabilities resulting from the graph connectivity. The average is taken over the number of transitions, that is, lengths L of all paths → s i j ( ) from i to j that do not contain a cycle involving j, with respect to probabilities P of the paths: Compared to the shortest distance from i to j, the MFPT takes multiple paths and node degrees into consideration. For example, paths through hub nodes increase H, since the walker has a high probability of moving to nodes connected to the hub that are not on the shortest path to the target. The MFPT has been well-characterized for undirected graphs [25]. Recently, it has been shown that for directed graphs, MFPT H i j ( , ) can be obtained analytically in closed form given the adjacency matrix and the vector node stationary probabilities π in a random walk in the graph [28]. More specifically, let A be the, possibly weighted, adjacency matrix of a strongly connected, directed graph, D the diagonal matrix of node out-degrees, and I the identity matrix. Then the following matrices can be defined: Table 1. Top five most highly enriched biological processes for days 0, 3, and 7 compared to control. For each process, we provide the process name, and the last five digits from its gene ontology ID (GO:00xxxxx).

Enrichment rank
Biological process 1 and the expected hitting time can be calculated as [28]: k V Figure 1. Networks for days 0 (a), 3 (b), and 7 (c). Rectangular shape with red border denotes top five biological processes in terms of essentiality (see table 3 for details). Red fill color denotes top five nodes with highest influence score (see table 2). We use the last five digits from the gene ontology ID (GO:00xxxxx) as node ID.
To find the most influential nodes in a network, we calculate H i j ( , ) for all pairs of nodes. To satisfy the assumption of strong connectivity of the graph, and to deal with imperfect knowledge of biological networks, for each node we add a low probability (0.001) of a jump to any other node in the network. Based on H i j ( , ), we calculate for each node the median of the MFPT to all other nodes, and treat nodes with the lowest values of this statistic as most influential. The results are presented in table 2.
There is a general trend of a slight increase in the lowest median of the MFPT score of the top five biological processes from days 0 to 3, followed by a considerable decrease with the advancement of the healing process in day 7. Also, different processes emerge as most influential during the three different time points in the healing process. For day 0, four out of the top five biological processes identified using MFPT are directly related to skin pointing to differentiation of cells in the epidermis, the outer layer of skin, which is formed by epithelial cells including karatinocytes. In day 3, NF-κB signaling is identified and it is known to influence matrix remodeling in wound healing [41]. In day 7, 'response to wounding' is discovered as the second most influential process, and the top five also include processes related to immune response that is active through first two weeks of wound healing. Interestingly, the most highly enriched processes do not end up being the most influential.
We have also used MFPT to define essentiality of each biological process, by removing the process from the network and evaluating the change in the average MFPT between all pairs of nodes. As the graph is smaller by one node, the times should decrease, except for processes that are highly essential for the spread of signal through the network. Indeed, the average MFPT decreases by 1.05 for the day 0, 1.6 for day 3, and by 1.25 for day 7 network. However, especially for day 0 and 3, removal of some processes results in an increase of the mean  MFPT between nodes (see table 3). Effects of system size on MFPT were studied previously [24], indicating that depending on the system class, MFPT may converge monotonically in the limit of infinite size of the complex system, or may diverge in a linear or sublinear way. The non-monotonic behavior we observe indicates a substantial change in the system resulting from the node removal. As can be seen in figure 3, the measures of node essentiality and influence are only moderately correlated. The Wnt receptor signaling pathway that is identified as the most essential in day 0 has long been known as playing crucial role in regulating wound healing [42]. The NF-κB signaling pathway discovered as essential for day 3 is known to influence matrix remodeling in wound healing [41]. The very low scores at day 7 indicate that no single process is essential to healing at that stage.

Discussion
Recent studies of a wide class of undirected graphs have shown that MFPT is highly influenced by the degree of the target node of the walk, and by the source-target distance [27]. Our experimental results indicate that this Table 3. Top five biological processes with positive essentiality score. For each process, we provide the last five digits of its gene ontology ID (GO:00xxxxx).

Rank
Biological process Score Enrichment rank  relationship also holds for directed, weighted graphs. As seen in figure 4, the MFPT for pairs of nodes is inversely proportional to the sum of weights of in-edges of the target node. However, the correlation is not perfect being expressed the best at the end of the 7-day healing period. When the MFPTs are aggregated for each source node, the resulting node influence score does not correlate to the node's sum of weights of in-edges (see figure 5).
Recently, a centrality score based on average MFPT to a target node was proposed [29], but it is highly dependent on the in-degree of the target node. For bioprocess networks analyzed in this study the target-based score identifies basic cellular processes relating to cell cycle, DNA replication, and eikosanoid signaling. In addition to node degrees, we also analyzed the relationship between source-target distance and the MFPT for that node pair. To account for edge weights that model the probability of transitions in a random walk, we identified the most probable path from source i to target j, and used the negative logarithm of the probability of the walk along that path as a measure of distance, , ) k k 1 denotes the probability of transition from kth to + k ( 1)th node on the most probable path. The results show that the MFPT is influenced by node distance, but the effect is much smaller than that of the target node weighted in-degree (see figure 6).
Finally, we also analyzed which graph quantities are most determinant of the node essentiality measured as the average increase of MFPT upon node removal. The essentiality is inversely correlated with the stationary probability π of a node in a random walk on the graph (see figure 7). The relationship is strong for nodes with negative essentiality, that is, those which slow down the spread of information in the network. Interestingly, this relationship breaks down for nodes which are essential, that is, their removal increases the average MFPT between all nodes in the network.
Results from the experiments show that the response to a wound is more centered around several important processes early during the healing. Observations for days 0 and 3, while differing in the specific processes that are most influential and essential, show similar characteristics in terms of network properties. By day 7, the network undergoes topological transition into a structure with higher weighted in-and out-degrees, and much lower variability of the influence and essentiality scores across nodes. Variance of random walk stationary probabilities  of nodes is also lower. While the number of enriched processes is actually growing in time, with 66, 80 and 84 processes at days 0, 3 and 7, respectively, indicating that healing is still in progress, these results show that the contribution of the processes becomes more evenly spread at later phases of healing. Figure 6. Relationship between pairwise source-target MFPT, target weighted in-degree, and source-target distance expressed as a negative logarithm of the product of probabilities of edges along the most probable source-target path, depicted for days 0 (a), 3 (b), and 7 (c). Each point corresponds to a source-target pair, points of the same color correspond to pairs with the same source nodes.

Conclusion
The paradigm for discovery and modeling of biological systems and their pathologies shifted in the past decade from focusing on isolated genes to pathways and networks. Knowledge about signaling or regulatory interactions among genes or proteins improves interpretability of results from statistical analyses of data captured with high-throughput techniques such as microarrays or next-generation sequencing. Connectivity structure of molecular networks can also point to novel diagnostic or therapeutic targets, and can serve as a regularizing factor narrowing the search space for statistical models that discriminate between different states of biological systems. However, focus on molecular networks ignores the multi-scale nature of biological systems. Translation of results from the level of genes and proteins to the level of tissue composed of dynamic populations of cells at multiple times is a challenge.
In this study we presented an approach for translating molecular-level networks and experimental data into a physiological-level network that captures interactions between biological processes. We also proposed how to use MFPT in these directed networks to measure two aspects of each biological process: its influence on other processes, and its essentiality for interactions between other processes. We demonstrated the proposed approach by studying the progression of healing of skin wounds. The processes identified as influential or essential in our analysis are consistent with previous knowledge of wound healing. More importantly, the aggregated analysis of the networks for this physiological processes allowed for making quantitative observations at the level above individual genes or pathways.