Research Article
Link Prediction Based on the Derivation of Mapping Entropy

Algorithms based on topological similarity play an important role in link prediction. However, most traditional algorithms that account for the influence of nodes consider only the degrees of the endpoints and ignore the differences in the contributions of their neighbors. We propose the DME (derivation of mapping entropy) model, which uses the mapping relationship between a node and its neighbors to assess the influence of the node more appropriately. Experiments on nine real networks suggest that the model improves precision in link prediction and outperforms traditional algorithms without increasing time complexity.


Introduction
A large number of complicated systems in nature can be described by complex networks [1]. The nodes in the network represent the individuals in the real system, and the edges connecting two nodes represent the relationships between those individuals. Existing networks can be divided into social networks, biological networks [2,3], and so on. Link prediction estimates the likelihood of a link between two nodes from the known network structure or node attributes. Through link prediction, we can find links that exist but are unobserved in a network with missing data. We can also predict the links likely to appear as the network evolves [4]. Link prediction plays an important role in practical applications. For instance, it can predict unknown interactions between proteins, which avoids high experimental costs [5]. It also plays a role in user recommendation [6].
In the early period, most researchers engaged in link prediction focused on the similarity of node attributes such as age, occupation, and interests [2,7] to judge the likelihood of links. This method can achieve high-precision prediction. However, it is difficult to extract the attributes of nodes in complex networks, and the reliability of the information is hard to ensure [8]. Researchers therefore turned their attention to network structure [9,10], which has relatively low computational complexity.
Algorithms based on the similarity of the network structure can be divided, according to path length, into three categories: local, global, and quasilocal similarity algorithms [11]. The core idea of local similarity is common neighbors. On this basis, a variety of local similarity indices have been derived by considering the influence of endpoints from different angles. For instance, the common neighbor (CN) index [12] assumes that the more common neighbors two nodes have, the more likely they are to be connected. The Adamic-Adar (AA) index [13] holds that a common neighbor with a small degree makes a greater contribution, so each common neighbor is given a weight accordingly. Because these local similarity indices consider only the local structure of the network, which leads to low precision, third-order and higher-order path similarity indices such as the Katz index [14] were proposed. The Katz index considers all paths between two unconnected nodes while giving priority to short paths, which greatly increases the computational complexity. Compared with these two kinds of algorithms, quasilocal similarity algorithms, with moderate complexity and precision, are increasingly widely applied. The superposed random walk (SRW) index [15] is one such index, based on the Markov model.
In traditional algorithms, only the degree of an endpoint is considered when evaluating its influence. This treats all neighbor nodes as equally influential and loses the impact of indirect neighbors [16,17]. In fact, because neighbors have different degrees, their influence on the endpoint should differ: the larger the degree, the greater the influence. However, taking all nodes in the network into account would increase the complexity of the algorithm, and the result is not necessarily better, because the influence of an endpoint is limited and is significant only for nearby neighbors. Therefore, this paper proposes to use the derivation of mapping entropy (DME) of a node to represent its influence; DME captures the mapping relation between a node and its neighbors. It considers not only the weight of the node but also the weights of its neighbors. Figure 1 gives an illustration.
On the basis of the above discussion, we improve the SRW model by taking the influence of indirect neighbors into account.
Extensive experiments on nine complex networks show that DME achieves higher precision than traditional algorithms in most cases. The rest of the paper is organized as follows. In Section 2, we propose a new model based on the DME index. In Section 3, we introduce the nine complex networks and the experimental approach. In Section 4, five classic models are introduced as references. In Section 5, results and analysis are presented. In Section 6, we conclude our study.

Network Model. A network is defined as $G(V, E)$, where $V$ is the node set and $E$ is the edge set. The total number of nodes is $N$ and the total number of edges is $|E|$. The universal set $U$ contains all $N(N-1)/2$ possible links. Link prediction assigns a score $s_{xy}$ to each pair of unconnected nodes, indicating the likelihood that the two nodes are connected. All unconnected node pairs are then ranked in descending order of score; the pairs at the top are the most likely to form a link. To test the precision of the algorithm, the known edge set $E$ is divided into a training set $E^T$ and a testing set $E^P$. Only the training set can be used to calculate scores. Clearly, $E = E^T \cup E^P$ and $E^T \cap E^P = \emptyset$. We define an edge belonging to $U$ but not to $E$ as a nonexistent edge. In this paper, we use precision [18] to measure the accuracy of a link prediction algorithm; it is the proportion of real links among the top-$L$ links with the highest scores. If there are $m$ real links among the top-$L$ links, the precision of the algorithm is
$$\mathrm{Precision} = \frac{m}{L}. \qquad (1)$$
To simplify the model, we use undirected and unweighted networks.
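The precision measure described above can be illustrated with a minimal sketch. The function name `precision_at_L` and the toy scores are hypothetical and chosen only for illustration; "real links" are taken to be the held-out test edges.

```python
def precision_at_L(scores, test_edges, L):
    """Fraction of the L highest-scored candidate pairs that are held-out links."""
    # Store pairs as frozensets so (x, y) and (y, x) denote the same undirected link.
    test = {frozenset(e) for e in test_edges}
    ranked = sorted(scores, key=scores.get, reverse=True)[:L]
    m = sum(1 for pair in ranked if frozenset(pair) in test)
    return m / L


# Toy example with made-up scores (not data from the paper).
scores = {(0, 1): 0.9, (0, 2): 0.7, (1, 3): 0.4, (2, 3): 0.1}
test_edges = [(1, 0), (2, 3)]
print(precision_at_L(scores, test_edges, L=2))  # 0.5: one of the top-2 pairs is a test edge
```

Ties in the scores are broken here by insertion order, which is an arbitrary choice of this sketch.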

Superposed Random Walk (SRW) Model.
The SRW model, inspired by the LRW model, considers random walks between endpoints $x$ and $y$, making nearby nodes more likely to connect to the target node [19]. It is defined as
$$s_{xy}^{\mathrm{SRW}}(t) = \sum_{\tau=1}^{t} \left[ \frac{k_x}{2|E|}\,\pi_{xy}(\tau) + \frac{k_y}{2|E|}\,\pi_{yx}(\tau) \right],$$
where the initial density vector is $\vec{\pi}_{xy}(0) = \vec{e}_x$ and it evolves as $\vec{\pi}_{xy}(t+1) = P^{T} \vec{\pi}_{xy}(t)$. $P$ is the transition probability matrix with $p_{xy} = a_{xy}/k_x$, where $a_{xy} = 1$ if the link exists and $a_{xy} = 0$ otherwise, and $t$ denotes the number of time steps.
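As an illustration of the random walk iteration, the following sketch computes SRW scores for a small graph. It assumes the commonly used form of the index, in which the walker density is weighted by the endpoint term $k_x/(2|E|)$ and summed over steps $1$ to $t$; the function name `srw_scores` and the adjacency-dictionary input are choices made for this example only.

```python
def srw_scores(adj, t):
    """SRW scores for all node pairs.

    adj: dict mapping each node to the set of its neighbours
         (undirected, unweighted, no isolated nodes).
    Assumed form: s_xy(t) = sum_{tau=1..t} [q_x * pi_xy(tau) + q_y * pi_yx(tau)],
    with q_x = k_x / (2|E|).
    """
    nodes = sorted(adj)
    two_m = sum(len(adj[v]) for v in nodes)              # sum of degrees = 2|E|
    q = {v: len(adj[v]) / two_m for v in nodes}
    # pi[x][y] is the probability that a walker started at x is at y.
    pi = {x: {y: 1.0 if y == x else 0.0 for y in nodes} for x in nodes}
    total = {x: {y: 0.0 for y in nodes} for x in nodes}  # running sum over steps
    for _ in range(t):
        for x in nodes:
            nxt = {y: 0.0 for y in nodes}
            for u, mass in pi[x].items():
                if mass:
                    share = mass / len(adj[u])           # p_uv = 1 / k_u per neighbour
                    for v in adj[u]:
                        nxt[v] += share
            pi[x] = nxt
            for y in nodes:
                total[x][y] += nxt[y]
    return {(x, y): q[x] * total[x][y] + q[y] * total[y][x]
            for i, x in enumerate(nodes) for y in nodes[i + 1:]}


# Path graph 0 - 1 - 2 as a toy input.
path = {0: {1}, 1: {0, 2}, 2: {1}}
print(srw_scores(path, t=2))
```

The dense dictionaries keep the sketch short; for large networks a sparse matrix representation would be the more practical choice.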

Derivation of Mapping Entropy (DME) Model.
Inspired by Shannon entropy, the information entropy [20] of the network can be expressed as $-\sum_{i=1}^{N} DC_i \log DC_i$, where $DC_i$ is the degree centrality of node $i$. A node and its neighbors form a subnetwork. The local entropy (LE) [21] of the subnetwork originating from endpoint $v_i$ is
$$LE_i = -\sum_{j \in M} DC_j \log DC_j,$$
where $DC_j$ is the degree centrality of node $v_j$, which belongs to the neighbor set $M$ of node $v_i$. Taking the mapping relation between a node and its neighbors into account, we obtain the mapping entropy (ME):
$$ME_i = -DC_i \sum_{j \in M} \log DC_j,$$
where $DC_i$ is the degree centrality of node $v_i$ and $DC_j$ is the degree centrality of one of the neighbors of node $v_i$. Inspired by the ME index, we introduce the derivation of mapping entropy (DME), which is defined by interleaving the degrees of nodes $v_i$ and $v_j$.
This definition considers both the degree of the node and the degrees of its neighbors, and thus takes the influence of indirect neighbors into account. This may be useful for distinguishing the importance of neighbors. Based on the SRW model, we use the DME index in place of the influence of the endpoint; in experiments based on the superposed random walk, this performs better than the ME model introduced below. For comparison, we also apply the ME index to the SRW model in the same way, which gives the ME model.
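To make the entropy quantities in this section concrete, the sketch below computes local entropy and mapping entropy on a toy graph. It assumes the forms $LE_i = -\sum_{j \in M} DC_j \log DC_j$ and $ME_i = -DC_i \sum_{j \in M} \log DC_j$, with degree centrality $DC_i = k_i/(N-1)$ and natural logarithms; these choices are assumptions of the example. DME itself is not sketched because its formula is not reproduced in the text here.

```python
import math


def degree_centrality(adj):
    """DC_i = k_i / (N - 1); this normalisation is an assumption of the sketch."""
    n = len(adj)
    return {v: len(nbrs) / (n - 1) for v, nbrs in adj.items()}


def local_entropy(adj, i, dc):
    """Assumed form: LE_i = -sum over neighbours j of DC_j * log(DC_j)."""
    return -sum(dc[j] * math.log(dc[j]) for j in adj[i])


def mapping_entropy(adj, i, dc):
    """Assumed form: ME_i = -DC_i * sum over neighbours j of log(DC_j)."""
    return -dc[i] * sum(math.log(dc[j]) for j in adj[i])


# Star graph: hub 0 connected to leaves 1, 2, 3.
star = {0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0}}
dc = degree_centrality(star)
print(local_entropy(star, 0, dc), mapping_entropy(star, 0, dc))
```

In this toy graph the hub and each leaf have different degrees, so the two measures separate them even though every leaf has the same single neighbour.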

Experimental Data
To confirm the validity of the DME model, we conduct experiments on nine real networks. They are listed as follows: (1) US Air (USAir), describing the network of the US air transportation system [22]; (2)
Our model is applied to undirected and unweighted connected networks. Accordingly, we convert arcs into undirected links and delete loops and multiple connections. The maximal connected subgraph is then extracted from each raw dataset to guarantee connectivity.
Before the experiment, the edge set $E$ of each of the nine networks is randomly divided into two parts, $E^T$ and $E^P$. The training set $E^T$ contains 90% of the edges and the testing set $E^P$ contains the remaining 10%. The connectivity of $E^T$ is guaranteed by adding edges randomly to the minimum spanning tree until the training set contains 90% of the links. For each network, 30 independent splits of the same size are generated in this way, and the precision is averaged over them to reduce the randomness of the results.
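The splitting procedure can be sketched as follows. This is a minimal illustration under stated assumptions: because the networks are unweighted, any spanning tree is a minimum spanning tree, so the sketch builds one with a union-find pass over randomly shuffled edges. The function name `split_edges` and its parameters are hypothetical.

```python
import random


def split_edges(edges, train_frac=0.9, seed=0):
    """Split an undirected connected edge list into (train, test), keeping train connected."""
    rng = random.Random(seed)
    edges = list(edges)
    rng.shuffle(edges)

    parent = {}

    def find(v):
        # Union-find root lookup with path halving.
        parent.setdefault(v, v)
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    tree, rest = [], []
    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv          # edge joins two components: keep it in the tree
            tree.append((u, v))
        else:
            rest.append((u, v))      # edge closes a cycle: candidate for either set

    # Top up the tree with random non-tree edges until the training size is reached.
    n_train = max(len(tree), round(train_frac * len(edges)))
    extra = n_train - len(tree)
    return tree + rest[:extra], rest[extra:]


# Toy input: complete graph on 7 nodes (21 edges).
edges = [(i, j) for i in range(7) for j in range(i + 1, 7)]
train, test = split_edges(edges, train_frac=0.9, seed=1)
print(len(train), len(test))
```

Calling the function with 30 different seeds would give the 30 independent splits over which precision is averaged.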

Reference Standard
In order to highlight the superiority of our algorithm, five classic methods are listed as follows.
(1) In common neighbor (CN) [12], the similarity is the number of neighbors shared by node $x$ and node $y$:
$$s_{xy}^{\mathrm{CN}} = |\Gamma(x) \cap \Gamma(y)|,$$
where $\Gamma(x)$ represents the set of neighbors of node $x$ and $|\Gamma(x) \cap \Gamma(y)|$ is the number of common neighbors of nodes $x$ and $y$.
(2) Preferential attachment (PA) [29] assumes that the probability of a new link attaching to node $x$ is proportional to $k_x$, so the probability of a link between nodes $x$ and $y$ is proportional to $k_x \times k_y$. The index is defined as
$$s_{xy}^{\mathrm{PA}} = k_x \times k_y.$$
This index does not require information about the neighborhood of each endpoint and therefore has low computational complexity.
(3) In Adamic-Adar (AA) [13], the idea is that a common neighbor with a small degree contributes more, so each common neighbor $z$ is given a weight of $1/\log k_z$, where $k_z$ is the degree of $z$. The similarity is defined as
$$s_{xy}^{\mathrm{AA}} = \sum_{z \in \Gamma(x) \cap \Gamma(y)} \frac{1}{\log k_z}.$$
(4) Resource allocation (RA) [30], derived from AA, considers resource allocation in the network. Each common neighbor $z$ is given a weight of $1/k_z$, and the similarity is defined as
$$s_{xy}^{\mathrm{RA}} = \sum_{z \in \Gamma(x) \cap \Gamma(y)} \frac{1}{k_z}.$$
(5) Superposed random walk (SRW) has been discussed in detail in Section 2.
Figure 1: Sketch maps of influence based on the derivation of mapping entropy. Panels (a) and (b) show nodes A and B, respectively. The degree of node A in subgraph (a) is equal to the degree of node B in subgraph (b). In traditional models, they are considered to have the same influence. In the DME model, however, the differences in the contributions of their neighbors are taken into account, and node B has more influence. By calculation, the influences of A and B are quantified as 3.77 and 4.81.
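The four neighbor-based reference indices (CN, PA, AA, and RA) described in this section can be sketched in a few lines. The function name `local_indices` and the adjacency-dictionary representation are assumptions made for illustration, and natural logarithms are used for AA.

```python
import math


def local_indices(adj, x, y):
    """CN, PA, AA and RA scores for the node pair (x, y).

    adj: dict mapping each node to the set of its neighbours (undirected, unweighted).
    """
    common = adj[x] & adj[y]
    return {
        "CN": len(common),
        "PA": len(adj[x]) * len(adj[y]),
        # A common neighbour is adjacent to both x and y, so k_z >= 2 and log(k_z) > 0.
        "AA": sum(1 / math.log(len(adj[z])) for z in common),
        "RA": sum(1 / len(adj[z]) for z in common),
    }


# Toy graph: nodes 0 and 1 are not linked but share neighbours 2 and 3.
toy = {0: {2, 3}, 1: {2, 3}, 2: {0, 1, 3}, 3: {0, 1, 2}}
print(local_indices(toy, 0, 1))
```

All four scores need only the neighbor sets of the two endpoints and their common neighbors, which is why they are cheap compared with path-based indices.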

Results and Analysis
To demonstrate the effectiveness of the DME model, experiments have been carried out on nine real networks. The results are as follows.
In Figure 2, we plot the average precision against the number of random walk steps for SRW, ME, and DME on the nine networks with $L = 100$. DME clearly outperforms SRW in 8 of the 9 networks. Furthermore, compared with both the SRW and ME models, the DME model achieves the maximum precision in 6 of the 9 networks. Because the ME index reflects the robustness of the local network, it is better suited to representing the importance of nodes under network attacks. We therefore conclude that DME achieves the highest accuracy in most cases when the random walk step $t$ is optimal. Moreover, it reaches its maximum precision in the fewest steps, which reduces the computation needed for the same precision. Table 2 gives the detailed values behind Figure 2 and also compares our model with the five classical models. The maximum precision is shown in bold and the corresponding step is given in parentheses. Under the condition $L = 100$, the DME model reaches the highest precision in 6 of the 9 networks compared with the other models.
To make the evaluation more complete, we also conduct experiments with $L = 50$. The results are shown in Table 3. Values are italicized where the DME model is more precise than SRW, which again holds for 6 networks. However, the advantage is not obvious when DME is compared with all of the other models.
This means the DME model performs better on the top 100 links than on the top 50. In practice, $L$ is often set to a large number to reduce random error. The DME model performs well because it takes the mapping relationship between a node and its neighbors into account. In this way, the differences in the contributions of neighbors (i.e., the influences of indirect neighbors) are included, so the model can better assess the importance of an endpoint. Though the DME model performs well on most of the datasets we tested, it shows no advantage on a few networks such as Jazz. By analyzing the topological characteristics of these networks, we find that they share common features: the proposed model may not be suitable for networks with a high assortativity coefficient and a high clustering coefficient. We infer that the differences in the contributions of neighbor nodes cannot be well reflected in such networks.
Furthermore, time complexity is also a significant factor in evaluating an index. For instance, the CN index has $O(N^3)$ time complexity.
Table notation: $|V|$ denotes the number of nodes in the network, $|E|$ the total number of links, $\langle k \rangle$ the average degree of all nodes, $\langle d \rangle$ the average distance between nodes, $C$ the clustering coefficient, $r$ the assortativity coefficient, and $H$ the degree heterogeneity, defined as $H = \langle k^2 \rangle / \langle k \rangle^2$.
Figure 2: Precision of SRW (green circles), ME (blue rectangles), and DME (red triangles) versus the number of random walk steps $t$ with $L = 100$ on the 9 real datasets. The highest precision in each network is marked by a black five-pointed star. DME clearly outperforms SRW in 8 of the 9 networks. Furthermore, compared with the SRW and ME models, the DME model achieves the maximum precision in 6 of the 9 networks.

Conclusions
Existing link prediction algorithms based on structural similarity mostly focus on paths or on the influence of nodes measured only by their degrees. Because the differences in the contributions of neighbors are not considered, the precision of these algorithms is limited. We propose the derivation of mapping entropy (DME) model, which interleaves the degree of a node with the degrees of its neighbors. We compare our model with the CN, PA, AA, RA, SRW, and ME models on nine real datasets.
The results indicate that the DME model performs better than the other six models in most cases and can achieve its maximum precision in the minimum number of steps, which reduces the computation needed for the same precision. Furthermore, the DME model does not increase time complexity. The DME model shows that distinguishing the differences in neighbors' contributions is effective, and this finding can provide a reference for future research. However, we take only the influence of indirect neighbors into account and ignore other factors such as coreness and H-index, which can describe the maximal connected subgraph. In addition, the performance of the DME model in weighted and directed networks is unknown. The results are relevant to practical applications: the model can be applied to recommendation systems, social cooperation networks, information and communication technology, the prediction of potential interactions in biological networks, and so on. This work can also inspire further work that adds other factors such as H-index to our model and optimizes the DME model for weighted and directed networks.
Data Availability
The datasets used in this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.
Table note: Each data point is an average over 30 independent datasets, each of which is randomly divided into a training set and a test set with probabilities of 90% and 10%. The value in parentheses is the step corresponding to the optimal precision. Italicized values indicate that the DME model performs better than the SRW model.