A Preference Random Walk Algorithm for Link Prediction through Mutual Influence Nodes in Complex Networks

Link prediction in complex networks has been an essential topic in data mining and scientific discovery over the past few years. The task is to identify future, deleted, and redundant links from the existing links of a graph. Local random walk is one of the best-known quasi-local methods. It traverses the network with a traditional random walk of limited length, selecting one adjacent node at random in each step, so that all neighbors are treated as equally important. The method then uses the transition probability between node pairs to calculate their similarity. However, on most datasets it does not score highly similar nodes accurately. This article proposes an efficient method that improves local random walk by encouraging the walker to move, in every step, towards the node that has a stronger influence, so that the next node is selected according to the influence of the source node. To this end, the concept of asymmetric mutual influence of nodes is introduced using mutual information. The proposed method is compared with other similarity-based methods (local, quasi-local, and global) on 11 real-world networks, where it achieves higher prediction accuracy than the other link prediction approaches.


Related Works:
Recently, numerous algorithms have been proposed for link prediction, and several excellent surveys cover the problem [9,17,18,30]. These methods can be classified into similarity-based algorithms, maximum likelihood methods, and probabilistic models. Maximum likelihood methods and probabilistic models provide higher accuracy than similarity-based algorithms; however, they have some intrinsic drawbacks [15]. Probabilistic models often depend on node attributes in addition to the network structure, so their applicability is considerably restricted [12]. Furthermore, the number of parameters to be fixed is so large that, even when a considerably precise model is built, it gives little insight into the network organization. Maximum likelihood methods are time-consuming and can only handle networks with hundreds of nodes [31], whereas many real networks contain millions to billions of nodes. In this paper, we focus on structure-based similarity approaches, which use the topological features of the network to assign similarity scores to unconnected node pairs. These methods fall into three categories: local, quasi-local, and global [26]. Overall, similarity-based algorithms, in particular those based solely on quasi-local topological information, have found the widest applications. Local similarity approaches use only the information of paths of length 2 between a pair of nodes. They are divided into two main classes: common neighbor-based and clustering coefficient-based approaches.
In the common neighbor-based class, two disconnected nodes are more likely to become connected if they share more common neighbors. Examples are the Common Neighbors Index (CN) [32], which directly counts the common neighbors; the Adamic-Adar Index (AA) [33] and the Resource Allocation Index (RA) [13], which penalize high-degree common neighbors; the Sørensen Index [34]; and the Leicht-Holme-Newman Index [12], which penalizes large-degree endpoints. Other approaches, such as the CAR-based Common Neighbor Index (CAR), the Node Clustering Coefficient (CCLP), and the Node and Link Clustering Coefficient (NLC), consider not only the common neighbors of a node pair but also the local clustering among those common neighbors. In [35], the author considered the number of edges among the common neighbors and presented the CAR index, based on the assumption that an edge between two nodes is more likely if their common neighbors are members of a local community (the local-community-paradigm (LCP) theory). Wu et al. [36] designed the CCLP index, which is also based on the local clustering coefficient: the local clustering coefficients of all common neighbors of a seed node pair are computed and summed to give the final similarity score of the pair. The same authors developed the NLC index, which combines node and link clustering information to obtain the final similarity [37]. The main advantage of local similarity indices is their low computational complexity. However, considering only the immediate neighbors leads to weak prediction performance. In contrast, global similarity indices measure similarity from the global structure of the network. They include the Katz Index [38], which counts all paths between two nodes and favors shorter ones, and Random Walk with Restart (RWR), a direct application of the PageRank algorithm [23].
In RWR, a random walker starting from node i iteratively moves to a random neighbor with a fixed probability and otherwise returns to node i. Denote by q_ij the probability that this walker is located at node j in the steady state. Quasi-local indices do not rely on global information, but they use more topological information than local methods to obtain a good trade-off between computational complexity and performance. They can be divided into two categories: local path and random walk with finite steps. Local Path (LP) [39] takes into account all 2-step and 3-step paths, giving preference to the 2-step paths. Effective Path (EP), Significant Path (SP), and Resources from Short Paths (RSP) [40] are improved versions of LP. Xuzhen et al. investigated the effective influence of endpoints and proposed EP, which models the influence between two nodes as the connectivity of the paths between them, where the connectivity of a path is defined as the product of the transfer probabilities of its links [41]. Zhu et al. presented the SP index, derived from the intuition that short paths provide better evidence of a missing link connecting their two ends (they call such paths significant); paths through low-degree intermediate nodes are examples. In practice, the SP index uses only paths of lengths 2 and 3 [42]. Yabing et al. [40] considered the interactions of paths with different lengths based on the resource-traffic flow mechanism on networks and proposed the RSP index. Random walks with a finite number of steps are very useful for calculating the similarity and proximity between nodes. The Local Random Walk (LRW) [21] and Superposed Random Walk (SRW) [21] indices are two well-known finite-step random walk similarity indices.
The LRW index limits a random walker to a local range, while the SRW index, built on LRW, continuously releases walkers at the starting node to emphasize the nodes near the target node. Quasi-local (semi-local) methods provide a trade-off between computational complexity and accuracy and have therefore been recognized as among the most efficient approaches to the link prediction problem. Among them, the local random walk algorithm is popular and effective at estimating the probability that a link exists between a pair of nodes. However, it suffers from a significant drawback in accuracy: link prediction methods based on random walks treat all links and nodes as equally important, which makes them inefficient at traversing the graph structure. Here we take a different approach from previous works. We use a new concept, mutual influence, to compute the transition probability between node pairs, so that the nodes visited by the walk are not chosen purely at random. Based on the results of the experiments reported below, we claim that the proposed algorithm is one of the most efficient algorithms in the quasi-local category owing to its high performance in the link prediction task.

Background and notation:
In this section, before getting to the algorithm, some fundamental definitions and concepts in the proposed algorithm are reviewed.

Definition 1 (Mutual Information):
In information theory, mutual information measures the amount of information that one random variable carries about another; here it is also applied to indicate the relationships between the information of nodes. Consider a pair of random variables X and Y with joint probability mass function p(x, y) and marginal probability mass functions p(x) and p(y) [43]. The mutual information MI(X, Y) is defined as follows:

$$ MI(X, Y) = \sum_{x \in X} \sum_{y \in Y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)} \tag{1} $$

MI(X, Y) measures the amount of information gained about each of the random variables by observing the other, and it has three significant features:
- MI(X, Y) is always non-negative.
- MI(X, Y) is zero if and only if the random variables X and Y are independent of each other.
- MI(X, Y) = MI(Y, X), i.e., mutual information is a symmetric function.
Owing to these properties, MI(X, Y) captures both linear and nonlinear dependence between the random variables X and Y.
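As a small numerical illustration of Equation (1), the following sketch computes mutual information from a joint probability table. The function name `mutual_information`, the dictionary layout of the joint distribution, and the use of base-2 logarithms are assumptions made for this example only; the text does not fix a logarithm base.

```python
from math import log2


def mutual_information(joint):
    """Mutual information of two discrete variables.

    `joint` maps (x, y) outcomes to their joint probability p(x, y).
    Base-2 logarithms are an assumption of this sketch.
    """
    px, py = {}, {}
    # Marginals p(x) and p(y) are obtained by summing the joint table.
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    total = 0.0
    for (x, y), p in joint.items():
        if p > 0.0:  # terms with p(x, y) = 0 contribute nothing
            total += p * log2(p / (px[x] * py[y]))
    return total


# Two independent fair bits: MI is 0.
independent = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
# Two identical fair bits: MI is 1 bit.
identical = {(0, 0): 0.5, (1, 1): 0.5}
```

The two toy distributions correspond to the second property above (independence gives zero) and to the fully dependent case.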

Definition 2 (Asymmetric Mutual Influence (AMI)):
In social networks, nodes differ in influence and importance, and each node can influence its neighbors or be influenced by them. The concept of social influence affects various aspects of social network interactions and can be studied from different perspectives. Here, we investigate its role in the problem of link prediction. More specifically, we use mutual influence to measure how much a node can affect its neighbors and use the influence between nodes to tackle link prediction. The concept is implemented using the network's structural information and the quasi-local information of nodes. We introduce a quantity representing the mutual influence of nodes that builds on the mutual information of Equation (1), modified for our purpose: the influence between a pair of nodes is measured using their first-order neighbors and the intersection of those neighborhoods. The mutual influence between two nodes is calculated using Equation (2.d), where N_i is the number of first-order neighbors of node i and N is the total number of nodes in the network. P_i refers to the probability of node i getting influence from other nodes of the network.
The common-neighbor term for nodes i and j refers to the number of nodes directly connected to both i and j, in addition to both nodes themselves, and P_ij denotes the occurrence probability of the intersection of node i and node j. In fact, P_ij is computed as the number of common neighbors of node i and node j divided by the total number of nodes in the network, and it can be interpreted as the probability of the node pair i and j being influenced by a set of common nodes in the network. This formula measures the mutual influence between a pair of nodes in terms of the fraction of the neighborhood that they share. Consequently, the influence that a node exerts on its neighbor equals the influence it receives from it. In real-life situations, however, this does not hold. According to [44], influence between two social entities is asymmetric and depends on various factors, e.g., an individual's importance and role in the network. We assume that the more influence a node has on its neighbor, the greater its chance of being visited from that neighbor. Hangal et al. [44] provided a quantitative definition of influence between two entities: the influence that i has on j is the amount of investment of j in i divided by the amount of investment of j in all the other entities. Investment can be interpreted as the time or effort that one person spends on another. In this paper, we adapt the concept provided by [44] to our purpose. The new asymmetric mutual influence, an asymmetric version of Equation (2.d), is computed via Equation (4.d), where P_ij, i.e., the joint probability of i and j, is the ratio of the number of common neighbors between i and j to the total number of common neighbors between node j and all of its neighbors, multiplied by a further factor, and Γ(j) denotes the first-order neighborhood of node j ∈ V.
This equation means that the influence a node exerts on a neighbor depends not only on the number of common neighbors it has with that neighbor but also on the number of common neighbors it has with its other neighbors. More specifically, in Equation (4.d), the maximum score for nodes i and j is reached when both have low degrees and many shared neighbors. The minimum score is reached when both have high degrees and no shared neighbors; under these conditions, the nodes are independent and do not affect each other. Consider the example network shown in the accompanying figure. In that network, node E receives the strongest influence from node A, while node A receives the least influence from node E. This is because node E has a lower degree than node A and, in addition, shares fewer common neighbors with its adjacent nodes than node A does with its own; node E therefore invests more resources in A than A invests in E.
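The asymmetry described above can be illustrated with a minimal sketch of the investment ratio alone: the number of common neighbors of i and j divided by the total number of common neighbors that j shares with all of its neighbors. This is only one ingredient of Equation (4.d), not the full AMI score. The helper names, the toy graph, and the use of a plain common-neighbor count (without adding the two endpoints) are assumptions for illustration; the toy graph is not the network from the paper's figure.

```python
def common_neighbors(adj, i, j):
    """Number of nodes adjacent to both i and j."""
    return len(adj[i] & adj[j])


def influence_share(adj, i, j):
    """Share of j's total common-neighbor 'investment' that goes to i.

    Illustrative ratio only; it is not the complete Equation (4.d).
    """
    total = sum(common_neighbors(adj, j, k) for k in adj[j])
    if total == 0:
        return 0.0
    return common_neighbors(adj, i, j) / total


# Hypothetical toy graph stored as adjacency sets.
toy = {
    "A": {"B", "C", "D", "E"},
    "B": {"A", "C", "E"},
    "C": {"A", "B", "D"},
    "D": {"A", "C"},
    "E": {"A", "B"},
}

a_on_e = influence_share(toy, "A", "E")  # 1 / (1 + 1) = 0.5
e_on_a = influence_share(toy, "E", "A")  # 1 / (2 + 2 + 1 + 1) = 1/6
```

In this toy graph the low-degree node E devotes half of its shared neighborhood to A, whereas A devotes only one sixth of its shared neighborhood to E, so the two directions receive different values.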

Mutual Influence Random Walk (MIRW) algorithm:
Structural similarity between vertices, which is normally hidden, captures how alike nodes are on the basis of the topological and structural information of the graph. If the structural similarity between two nodes is high, a link between them is very likely to form. In random-walk-based methods, the more efficiently the network structure is traversed, the more accurately the similarity score between nodes is calculated. The local random walk (LRW) and superposed random walk (SRW) are among the most effective and efficient random walk approaches. Their significant advantage over other random-walk-based methods, e.g., random walk with restart, is that they use only quasi-local information of the network and therefore significantly reduce computational complexity. The key contribution of LRW and SRW is limiting the number of steps that the walker can take. In these methods, the transition probability matrix P is computed using the following rule [21]:

$$ P_{xy} = \frac{a_{xy}}{k_x} $$

where a_xy equals 1 if nodes x and y are connected and 0 otherwise. After the transition probability matrix is obtained, the probability π_xy(t) that a walker starting from node x reaches node y after t steps is computed as follows:

$$ \vec{\pi}_x(t) = P^{T}\, \vec{\pi}_x(t-1) $$

In the above setting, π_x(0) is an N × 1 vector with the x-th element equal to 1 and all other elements equal to 0. According to LRW, the similarity between a pair of nodes is then computed using the following formula:

$$ s^{LRW}_{xy}(t) = \frac{k_x}{2|E|}\, \pi_{xy}(t) + \frac{k_y}{2|E|}\, \pi_{yx}(t) $$

where k_x and |E| are the degree of node x and the number of existing links in the network, respectively. SRW improves on LRW by continuously releasing walkers from the source node, which results in a higher similarity between node pairs that are near each other. The similarity resulting from SRW is obtained as follows:

$$ s^{SRW}_{xy}(t) = \sum_{\tau=1}^{t} s^{LRW}_{xy}(\tau) $$

Even though LRW and SRW have advantages over earlier works, such as lower computational complexity and higher accuracy in predicting missing links, their main drawback is that the transition probability matrix depends only on the degree of the source node. The walk is therefore entirely random and is not accurate enough in capturing the network structure and finding similarities between pairs of nodes. However, each node can have specific importance in its neighborhood and should not be treated like every other node. To overcome this limitation, we propose a biasing function that distinguishes between the relationships of a node with each of its neighbors. The biasing function introduced in this article builds on the concept of mutual information and measures the influence that each node has on its neighbors. The probability that a walker located at a node chooses a given neighbor for the next step is thus proportional to the influence it receives from that neighbor. The mutual influence between a pair of nodes can be calculated by Equation (3). As previously stated, this quantity is symmetric: it assumes that the influence a node has on its neighbor equals the influence it receives from that neighbor. To make the biasing function more effective, we instead treat the influence between nodes as asymmetric and use AMI rather than MI. In this way, the probability of moving from node i toward node j is not the same as that of moving from node j toward node i. We therefore use Equation (4.d) to compute a transition probability for each pair of nodes and, consequently, introduce a new transition matrix according to Equation (9).
In this way, by tuning the parameters of the biasing function, one can force the walk to preferentially visit nodes with high values of asymmetric mutual influence.
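The difference between a uniform and a biased walk can be sketched as follows. With no weight function, the code applies the standard LRW rule of [21] (each neighbor is chosen with probability 1/k_x); with a weight function, each step is taken in proportion to the supplied weights. The `weight` callable is a hypothetical stand-in for an asymmetric influence score, and reusing the LRW similarity form with a biased transition matrix is an assumption of this sketch, not a reproduction of Equations (9) and (10).

```python
def transition_matrix(adj, weight=None):
    """Row-stochastic transition probabilities as nested dicts.

    weight(x, y) is a non-negative preference for stepping from x to y;
    weight=None gives the uniform rule P_xy = 1 / k_x.
    """
    P = {}
    for x, nbrs in adj.items():
        w = {y: (1.0 if weight is None else weight(x, y)) for y in nbrs}
        total = sum(w.values())
        if total == 0:  # all weights zero: fall back to uniform steps
            w = {y: 1.0 for y in nbrs}
            total = len(nbrs)
        P[x] = {y: v / total for y, v in w.items()}
    return P


def walk_distribution(P, start, steps):
    """Distribution of a walker after `steps` steps from `start`."""
    pi = {start: 1.0}
    for _ in range(steps):
        nxt = {}
        for x, mass in pi.items():
            for y, p in P[x].items():
                nxt[y] = nxt.get(y, 0.0) + mass * p
        pi = nxt
    return pi


def lrw_similarity(adj, P, x, y, steps):
    """LRW-style score: k_x/(2|E|) * pi_xy(t) + k_y/(2|E|) * pi_yx(t)."""
    two_e = sum(len(n) for n in adj.values())  # sum of degrees = 2|E|
    pxy = walk_distribution(P, x, steps).get(y, 0.0)
    pyx = walk_distribution(P, y, steps).get(x, 0.0)
    return len(adj[x]) / two_e * pxy + len(adj[y]) / two_e * pyx


def srw_similarity(adj, P, x, y, steps):
    """SRW-style score: sum of the LRW scores for t = 1..steps."""
    return sum(lrw_similarity(adj, P, x, y, t) for t in range(1, steps + 1))


# Path graph a - b - c with a hypothetical bias that favors node c.
path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
uniform = transition_matrix(path)
biased = transition_matrix(path, weight=lambda x, y: 3.0 if y == "c" else 1.0)
```

On this path graph, a uniform walker starting at `a` is at `c` after two steps with probability 0.5, while the biased walker is there with probability 0.75, which is the kind of preferential traversal described above.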
Our proposed algorithm, Mutual Influence Random Walk (MIRW), uses an asymmetric mutual influence matrix, so the walker is more likely to move towards a node by which it is more strongly affected. The MIRW index is defined by Equation (10). By defining an appropriate weight for each pair of vertices (i, j), the walker jumps from a node to a neighboring node with a preference for the link with the higher weight. The pseudo-code of the proposed method is given below.

AUC and Precision
Begin algorithm 1:
    For each pair of nodes (i, j) in Gtrain do
        Compute the Asymmetric Mutual Influence (i, j)
    End for
    For each unconnected pair of nodes (x, y) in Gtrain do
        Compute the similarity score of the edge (x, y) as Sxy using Eq 6.
    End for
    Arrange the list of all Sxy in descending order
    Insert top-L edges from the ordered list to Gtrain. // L is the number of removed edges from the original network
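The ranking part of the pseudo-code (score every unconnected pair, sort in descending order, keep the top L) can be sketched as below. The function name `predict_links` and the `score` callable are hypothetical; in the example a plain common-neighbor count stands in for the similarity score, so this is not the MIRW score itself.

```python
from itertools import combinations


def predict_links(adj, score, top_l):
    """Rank unconnected node pairs by score(x, y) and return the top_l pairs."""
    nodes = sorted(adj)
    # Candidate pairs are all node pairs without an edge in the training graph.
    candidates = [(x, y) for x, y in combinations(nodes, 2) if y not in adj[x]]
    ranked = sorted(candidates, key=lambda pair: score(*pair), reverse=True)
    return ranked[:top_l]


# Path graph a - b - c - d used as a tiny training graph.
g_train = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}


def cn_score(x, y):
    """Stand-in score: number of common neighbors of x and y."""
    return len(g_train[x] & g_train[y])


top_two = predict_links(g_train, cn_score, top_l=2)
```

Here the pairs (a, c) and (b, d) each share one neighbor and are ranked above (a, d), which shares none.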

Experimental analysis:
In this section, to investigate the efficiency of the proposed method, we conducted experiments and report their results. The proposed method's performance is evaluated against several state-of-the-art link prediction methods, which are categorized as local, global, or quasi-local according to the structural information they use. In the following sections, we describe the datasets used for performance analysis, the compared methods, the evaluation metrics, and the results and comparisons. All experiments were performed on a desktop PC equipped with a quad-core Intel i7 2.20GHz processor and 16GB RAM.

Datasets:
The proposed approach is evaluated on real-world datasets. These real-world networks are characterized by features such as the number of nodes and edges, the average clustering coefficient, and the average shortest path length. A detailed description of these properties can be found in Table 1, whose columns are, from left to right: network name, number of nodes (|V|), number of edges (|E|), average degree (〈K〉), average clustering coefficient (〈C〉), average shortest path length (ASPL), and diameter (D). The datasets were collected from different domains for research and analysis purposes. Zachary Karate Club is a network of 34 members of a university karate club, in which each edge describes a friendship relation [45]. FOOTBALL is a network of football games between college teams [46]. DOLPHINS is a network representing relationships between dolphins [47]. CELEGANS is the neural network of the nematode Caenorhabditis elegans [48]. PHYSICIANS is a network of 246 physicians who are friends with or trust each other [49]. Food is a food web consisting of 128 nodes and 2075 edges [50]. SmaGri is a citation network in which nodes are documents and a link is formed if one document cites another [51]. Yeast is a network describing interactions between proteins [52]. NetScience is a co-authorship network connecting scientists [53]. King James is a network of words co-occurring in the same sentences [54]. CA-GrQc is a collaboration network covering scientific collaborations between the authors of papers [55].

The Evaluation Criteria:
For assessing the efficiency of the proposed method against compared methods, we need some evaluation metrics to measure how well each method is working. The two metrics used here are the area under the receiver operating characteristic curve (AUC) and precision. In the following subsections, we briefly introduce each metric separately, and then we describe the evaluation process.

AUC [56]:
The AUC is the most common metric for measuring how well a method distinguishes missing links, i.e., links that will appear in the future, from non-existent edges, i.e., pairs of nodes that are not going to be connected. Almost all link prediction methods have been evaluated using this metric. In theory, the metric ranks all the non-observed links by their scores and then counts the number of times a randomly selected missing edge is ranked higher than a randomly chosen non-existent edge. Because this is time-consuming, in practice we do not rank all the non-observed edges; instead, at each trial we randomly select one missing edge and one non-existent edge and compare their scores.
In n independent comparisons, if n' is the number of times the missing edge has a higher score than the non-existent edge and n'' is the number of times the two have the same score, the AUC is calculated as follows:

$$ AUC = \frac{n' + 0.5\, n''}{n} $$

If a link prediction model assigns scores to non-observed links randomly, the AUC will be approximately 0.5. A score above 0.5 therefore means that the model performs better than chance.
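The sampling procedure for AUC described above can be sketched in a few lines. The function name `sampled_auc`, the default of 10,000 comparisons, and the fixed random seed are assumptions for illustration; the text does not state how many comparisons were used.

```python
import random


def sampled_auc(missing_scores, nonexistent_scores, n=10000, seed=0):
    """Estimate AUC from n random (missing, non-existent) score comparisons."""
    rng = random.Random(seed)
    higher = ties = 0
    for _ in range(n):
        m = rng.choice(missing_scores)      # score of a random missing edge
        u = rng.choice(nonexistent_scores)  # score of a random non-existent edge
        if m > u:
            higher += 1
        elif m == u:
            ties += 1
    # (n' + 0.5 * n'') / n
    return (higher + 0.5 * ties) / n
```

A perfect ranking gives 1.0, identical scores for all pairs give 0.5, and a fully inverted ranking gives 0.0.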

Precision:
The precision metric measures how well the model predicts missing edges among its top-ranked candidates. To measure the precision of a model, we first rank all the non-observed edges by their scores in descending order. Then, among the top-L node pairs with the highest scores, we count how many are missing edges. Suppose L_r missing edges appear among the top-L node pairs. The precision of the model is then:

$$ Precision = \frac{L_r}{L} $$
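As an illustration of the top-L precision just described, the following sketch ranks scored pairs and counts how many of the top L are held-out edges. The function name `precision_at_l` and the dictionary layout are assumptions made for this example.

```python
def precision_at_l(scores, missing, top_l):
    """Fraction of the top_l highest-scored pairs that are missing edges.

    scores:  {pair: score} over all non-observed pairs
    missing: set of held-out (removed) pairs
    """
    ranked = sorted(scores, key=scores.get, reverse=True)[:top_l]
    hits = sum(1 for pair in ranked if pair in missing)
    return hits / top_l


example_scores = {
    ("a", "b"): 0.90,
    ("a", "c"): 0.80,
    ("b", "c"): 0.10,
    ("c", "d"): 0.05,
}
example_missing = {("a", "b"), ("c", "d")}
```

With `top_l=2`, only one of the two best-scored pairs, (a, b), is a held-out edge, so the precision is 0.5.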

Determination of random walk length:
According to [21], there is a positive correlation between the average shortest path distance and the appropriate length of the walk. Thus we find the best value of random walk length with respect to the average shortest path.

Comparison methods:
To evaluate our proposed method, we consider several baseline and state-of-the-art link prediction methods from different categories, i.e., local, quasi-local, and global. These methods are introduced in this section.

Local methods:
- Jaccard coefficient: this method computes the similarity of a node pair as the fraction of common neighbors they share relative to the total number of their neighbors. The Jaccard coefficient for a pair of nodes is computed as follows [57]:

$$ JC(i, j) = \frac{|\Gamma(i) \cap \Gamma(j)|}{|\Gamma(i) \cup \Gamma(j)|} $$

where Γ(i) denotes the first-order neighborhood of node i ∈ V.
- Resource allocation: this metric also uses common neighbors to compute the similarity between a pair of nodes but penalizes common neighbors with a higher degree. Resource allocation for a pair of nodes is calculated as follows [13]:

$$ RA(i, j) = \sum_{z \in \Gamma(i) \cap \Gamma(j)} \frac{1}{|\Gamma(z)|} $$

- Adamic-Adar coefficient: this metric works in a similar way to resource allocation, in that common neighbors with lower degrees contribute more to the similarity; the difference between the two methods is the way they penalize nodes with higher degrees. The Adamic-Adar coefficient is computed as follows [58]:

$$ AA(i, j) = \sum_{z \in \Gamma(i) \cap \Gamma(j)} \frac{1}{\log |\Gamma(z)|} $$

- CCLP: this metric also uses the common neighbors of a node pair, but instead of weighting all common neighbors equally, it weights each one by its clustering coefficient. CCLP for a pair of nodes is computed as follows [36]:

$$ CCLP(i, j) = \sum_{z \in \Gamma(i) \cap \Gamma(j)} CC(z) $$

where CC(z) is the local clustering coefficient of node z.
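The four local indices above can be sketched on a graph stored as adjacency sets. The function names, the toy graph, and the use of the natural logarithm in the Adamic-Adar index are assumptions for illustration; the clustering coefficient is the usual ratio of existing to possible links among a node's neighbors.

```python
from math import log


def jaccard(adj, i, j):
    union = adj[i] | adj[j]
    return len(adj[i] & adj[j]) / len(union) if union else 0.0


def resource_allocation(adj, i, j):
    return sum(1.0 / len(adj[z]) for z in adj[i] & adj[j])


def adamic_adar(adj, i, j):
    # A common neighbor of two distinct nodes has degree >= 2, so log > 0.
    return sum(1.0 / log(len(adj[z])) for z in adj[i] & adj[j])


def clustering(adj, z):
    """Local clustering coefficient: links among neighbors / possible links."""
    k = len(adj[z])
    if k < 2:
        return 0.0
    links = sum(1 for u in adj[z] for v in adj[z] if u < v and v in adj[u])
    return 2.0 * links / (k * (k - 1))


def cclp(adj, i, j):
    return sum(clustering(adj, z) for z in adj[i] & adj[j])


# Hypothetical toy graph; D and E are unconnected and share only neighbor A.
graph = {
    "A": {"B", "C", "D", "E"},
    "B": {"A", "C", "E"},
    "C": {"A", "B", "D"},
    "D": {"A", "C"},
    "E": {"A", "B"},
}
```

For the pair (D, E), the only common neighbor is A with degree 4 and three links among its four neighbors, so the Jaccard score is 1/3, RA is 1/4, AA is 1/log 4, and CCLP is 0.5.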

Quasi-local methods:
- Local random walk: this index measures the similarity between a pair of nodes using random walks of limited length [21]:

$$ s^{LRW}_{xy}(t) = \frac{k_x}{2|E|}\, \pi_{xy}(t) + \frac{k_y}{2|E|}\, \pi_{yx}(t) $$

In this formulation, π_xy(t) is the probability of reaching node y from node x in t steps, k_x is the degree of node x, and |E| is the number of links in the network.
- Superposed random walk: this method builds on the local random walk but gives higher scores to nearby nodes [21].
- Local path: this is a path-based method that uses paths of lengths 2 and 3 to compute the similarity between node pairs, with paths of length 2 weighted more heavily [39]:

$$ S^{LP} = A^{2} + \varepsilon A^{3} $$

where A is the adjacency matrix and ε is a free parameter that weights the paths of length 3.

Global methods:
- Random walk with restart [23]: in this method, to find the similarity between a node and other nodes, a random walk is started from that node, and at each step the walker chooses the next node using the transition probabilities of the edges. The walker may also return to the start node with probability α. Finally, the similarity between the start node and another node is determined by the probability of the walker reaching that node.
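The random walk with restart can be sketched with a simple power iteration, in which the walker returns to the start node with probability α and otherwise follows a uniformly chosen edge. The function names, the default α = 0.15, the fixed number of iterations, and the symmetrized score q_xy + q_yx are assumptions for illustration; the text does not state the parameter values used in the experiments, and the sketch assumes a graph without isolated nodes.

```python
def rwr_scores(adj, start, alpha=0.15, iters=100):
    """Approximate stationary distribution of a walk restarting at `start`.

    Iterates q <- (1 - alpha) * P^T q + alpha * e_start, where P is the
    uniform transition matrix and alpha is the return probability.
    """
    q = {v: 0.0 for v in adj}
    q[start] = 1.0
    for _ in range(iters):
        nxt = {v: 0.0 for v in adj}
        for x, mass in q.items():
            for y in adj[x]:
                nxt[y] += (1.0 - alpha) * mass / len(adj[x])
        nxt[start] += alpha  # restart mass returns to the start node
        q = nxt
    return q


def rwr_similarity(adj, x, y, alpha=0.15):
    """Symmetric score built from the two directed visiting probabilities."""
    return rwr_scores(adj, x, alpha)[y] + rwr_scores(adj, y, alpha)[x]
```

For a graph with a single edge a - b and α = 0.5, the iteration converges to q_a = 2/3 and q_b = 1/3, which can be checked by solving the two-node fixed-point equations by hand.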

Experimental results:
To evaluate the proposed method against the other methods, we randomly remove 10% of the edges from each dataset and treat them as missing edges. The remaining 90% of the edges form the training set. All other unconnected node pairs are treated as non-existent edges. The union of the missing and non-existent edges forms the non-observed edge set. After each method has scored all the non-observed edges, we evaluate it using AUC and precision. This process is repeated ten times for each dataset, and the averages are reported as the final results. Table 2 presents the results of the proposed algorithm and the compared methods on eleven real-world datasets. The best AUC for each dataset is highlighted in bold. Although the quasi-local methods (LRW, SRW, and LP) and the global method (RWR) are computationally more expensive than the local methods (JC, RA, AA, and CCLP), they achieve a significant advantage on almost all the networks. For example, on the Physicians network, global and quasi-local methods achieve over 10% higher AUC than local methods. MIRW in turn significantly outperforms the local, quasi-local, and global methods, which indicates a clear advantage over all of them. In particular, compared with the global method RWR, MIRW improves AUC by 10%, 7%, and 11% on the Karate, Dolphins, and Yeast networks, respectively. Compared with the local methods, MIRW increases AUC by 22%, 10%, and 8% on the Football, Dolphins, and SmaGri networks, respectively, considerably outperforming all the baseline local methods. The results against the quasi-local methods are also notable.
More specifically, on most of the networks MIRW significantly outperforms both LRW and SRW; the exceptions are King James and NetScience, where its performance is competitive. This is important because it shows that using mutual influence in the computation of transition probabilities is beneficial for the link prediction task.
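The evaluation protocol described above (random removal of a fraction of the edges, with the rest used for training) can be sketched as follows. The function name `split_edges` and the seeding scheme are assumptions for illustration; the sketch does not check that the training graph stays connected, which the text does not discuss.

```python
import random


def split_edges(edges, test_fraction=0.1, seed=0):
    """Randomly split an edge list into (train, missing) parts."""
    rng = random.Random(seed)
    shuffled = sorted(edges)  # sort first so the split depends only on the seed
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# A ring of 20 nodes: 20 edges, so 2 are held out and 18 are kept for training.
ring = [(i, (i + 1) % 20) for i in range(20)]
train_edges, missing_edges = split_edges(ring, test_fraction=0.1, seed=0)

# Repeating the split with different seeds mirrors the ten repetitions;
# a metric would be computed on each split and then averaged.
splits = [split_edges(ring, 0.1, seed=s) for s in range(10)]
```

Each call returns disjoint training and missing sets whose union is the original edge list.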

ROC Curve:
A receiver operating characteristic (ROC) curve is a graphical plot that shows how well a method identifies true positive samples and distinguishes them from negative samples. It is obtained by plotting the true-positive rate against the false-positive rate at varying thresholds. Figure 2 illustrates the ROC curves for each network and compares the performance of the proposed method, MIRW, with the other methods. MIRW outperforms the local, quasi-local, and global methods on almost all the datasets and reaches the best area under the curve. These curves suggest that using mutual influence to calculate edge weights can greatly improve link prediction performance.

Table 3 summarizes the accuracy of each method according to the precision metric. The best precision for each network is highlighted in bold. In most networks, the proposed method has a significant advantage over the local methods. In particular, on the Karate, Celegans, and Food networks, the proposed method reaches a precision of 0.3999, 0.1625, and 0.18, respectively, which is higher than that of all the local methods. In some cases, however, the local methods outperform all the quasi-local and global methods in terms of top-100 precision. Compared with the quasi-local and global methods, the proposed method shows competitive performance and is both precise and efficient. More specifically, on almost all the networks MIRW attains higher precision than LRW and SRW, which indicates that mutual influence improves the method's precision. For instance, on the Food, Football, and King James networks, MIRW performs 5%, 19%, and 47% better than LRW and 5%, 9%, and 36% better than SRW, respectively. On the other networks, MIRW achieves acceptable results compared with LRW and SRW.

The varying size of the training set:
Figure 3 illustrates the effect of different training set sizes on the performance of the proposed method and the other methods. In general, prediction accuracy improves as the training size increases. On almost all datasets, MIRW attains a higher AUC than the local, quasi-local, and global methods across the different training set sizes. This is important because it shows that, even with access to only a small fraction of the observed edges, MIRW can still predict the non-observed edges more accurately than the state-of-the-art methods. In particular, on most of the networks, the proposed method has a significant advantage over the local methods when the training size is very small. For example, on the Food, Physicians, and Dolphins datasets, MIRW has 7%, 12%, and 9% higher accuracy than the local methods. Compared with the other quasi-local random-walk-based methods, i.e., LRW and SRW, the results for the Food, Dolphins, and Physicians networks show approximately 6%, 5%, and 7% improvement in AUC for MIRW, which emphasizes the role of mutual influence in measuring similarity with random walks. MIRW also has a noticeable advantage over the global method, RWR, when the training set is very small, and outperforms it on almost all the networks.

Conclusion:
In this research, a new similarity metric is proposed for link prediction that considers the mutual influence of nodes, i.e., the asymmetric interaction between two nodes. The proposed method takes the mutual influence of a node's neighbors into account when choosing the next step of the random walk and steers the walk toward the node by which the source node is affected; this results in higher efficiency compared with SRW. To assess the performance of the proposed approach, a comparative experiment was performed on eleven real-world networks, in which its advantages were clearly observed. The experimental findings on networks of various sizes indicate that the proposed method yields better results than the other algorithms. In future studies, the proposed method could be extended to multilayer, weighted, directed, and bipartite networks. Furthermore, devising an approach to determine a proper random walk length for the proposed metric is a promising topic for future work.