A Novel Similarity Measure of Link Prediction in Multi-Layer Social Networks Based on Reliable Paths

Online social networks are an integral element of modern societies and significantly influence the formation and consolidation of social relationships. These networks are in fact multi-layered, so the same user may have multiple links on different social networks. In this paper, the link prediction problem for the same users in a two-layer social network is examined, where the layers are the Twitter and Foursquare networks. Information from both layers is used to predict links in the Foursquare network. Link prediction aims to discover spurious links or to predict the emergence of future links from the current network structure. Many link prediction algorithms exist for unweighted networks, but only a few have been developed for weighted networks. Based on topological features extracted from the network structure and on reliable paths between users, we develop a novel similarity measure for link prediction. Reliable paths have been proposed to extend unweighted local similarity measures to weighted measures. Using these measures, both the existence of links and their weights can be predicted. Empirical analysis shows that the proposed similarity measure outperforms existing approaches and predicts future relationships more accurately. In addition, the proposed two-layer method gives better results than single-layer networks. Experiments show that the proposed similarity measure improves precision by 1.8% over the Katz and FriendLink measures.


Introduction
Many real-world systems can be described as networks whose nodes represent objects [1]. These networks contain links between nodes that represent relationships or interactions between objects. Therefore, the study of complex networks has become a common focus of many branches of science [2]. As part of recent research on large and complex networks, social network analysis (SNA) has become necessary owing to the rapid growth of these networks [3]. However, social networks are very dynamic. They grow and change rapidly with the addition of nodes and links. As a result, predicting links in these networks is an interesting and challenging problem that has recently attracted increasing attention [4].
For example, it may be of interest to find a potential friendship between two users on a social network or a potential collaboration between two scientists. This problem is commonly known as the link prediction problem.
The link prediction problem estimates the probability of a link between two nodes in a network that are not currently linked [5]. In this problem, the social network $G$ is observed at two consecutive times $t_0$ and $t_1$. We look for a set of links that do not exist in $G[t_0]$ but are likely to appear in $G[t_1]$. The network $G[t_0]$ is used for training and the network $G[t_1]$ for testing. The correctness of the suggestions can be evaluated by comparing the predicted links with the actual links. The link prediction process is shown in Fig. 1.
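The setup above can be illustrated with a minimal Python sketch. The function names and the toy edge lists are hypothetical and are not taken from the paper; the sketch only shows how candidate pairs (pairs not linked at the first time) and the ground-truth new links (present only at the second time) could be derived from two snapshots of an undirected network.

```python
from itertools import combinations

def candidate_pairs(nodes, edges_t0):
    """Return all unordered node pairs that are NOT linked at time t0."""
    existing = {frozenset(e) for e in edges_t0}
    return [(a, b) for a, b in combinations(sorted(nodes), 2)
            if frozenset((a, b)) not in existing]

def new_links(edges_t0, edges_t1):
    """Links present at t1 but absent at t0 (ground truth for testing)."""
    old = {frozenset(e) for e in edges_t0}
    return {frozenset(e) for e in edges_t1} - old

# Toy snapshots (assumed data, for illustration only).
nodes = {"a", "b", "c", "d"}
edges_t0 = [("a", "b"), ("b", "c")]
edges_t1 = [("a", "b"), ("b", "c"), ("a", "c")]

candidates = candidate_pairs(nodes, edges_t0)
truth = new_links(edges_t0, edges_t1)
```

A link predictor would rank the candidate pairs by a similarity score and be judged on how many of its top-ranked pairs fall in the ground-truth set.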
Link prediction algorithms include those based on local/global similarity measures (assigning a similarity rank to node pairs), maximum likelihood approaches and probabilistic models [6]. Classical approaches mainly take into account the similarity of the local structure when predicting links [5]. These methods use a similarity measure such as Common Neighbors, FriendLink or Katz to estimate the probability of new links being added to the network [7]. However, most of them are designed for a specific domain, and for this reason they are associated with the "algorithmic small world hypothesis" [5]. Social networks are very large, with many users connected to each other through various types of links. Therefore, predicting these links is still challenging, and a predictive method with acceptable precision is needed. Although the link prediction problem has been studied extensively and various solutions have been presented [7][8][9], the problem of how to combine information optimally and effectively to describe future communications remains largely unresolved.
In [10], the analysis of users' demographic features is used for link prediction in social networks. The results show that the clustering coefficient and the shortest path are effective in link prediction. In [11], a new similarity measure is proposed for link prediction based on local structures in social networks. This measure is calculated through a supervised learning model that estimates the similarity of source and destination nodes on a large database. In [12], a link prediction method based on the Deep Belief Network (DBN) is proposed for signed social networks. Since the DBN distributes the learning over all instances, the proposed links can be expected to be distributed along with their model tags. Here, the Bhattacharyya kernel is used to measure the similarity of the k-dimensional Gaussian distributions. Finally, an SVM classifier is trained to predict links based on user similarity information. In [13], link prediction in weighted social networks using learning automata is presented. Here, for each test link, a learning automaton is provided to estimate the actual weight of the link based on the weights of links in the current network. Then, each learning automaton is rewarded or punished according to its influence on estimating the true weights in the training set.
In recent years, the link prediction problem has become popular on large networks. Researchers have proposed various methods to find missing links [7][8][9]. Most of these methods are based on a similarity measure computed over neighboring nodes [11]. These methods have limitations, because all common neighbors of a node pair are assigned the same value. In this paper, an efficient solution to the problem of link prediction in multi-layer social networks is presented, in which a novel similarity measure is used to calculate similarity. By assigning weights to links, the proposed similarity measure assigns different values to the common neighbors of a node pair.
The remainder of the paper is organized as follows. Section 2 presents an overview of the link prediction problem in multi-layer networks. Section 3 introduces some of the classical similarity measures. Section 4 presents the details of the proposed method, and experimental results and discussion are given in Section 5. Finally, conclusions are presented in Section 6.

Link prediction in multi-layer networks
Typically, modeling a social networking platform as a single layer creates problems [14], because all nodes of a single-layer network are considered to be of the same type and all communications between the nodes are assumed to be equal. Therefore, this modeling method may lead to incorrect descriptions of some real-world phenomena. Some real-world platforms have multi-layered structures, and social networks clearly reflect such a structure [15]. Users of these networks may be in different groups or, in some cases, on different platforms such as Facebook and Twitter. A user probably has different communication structures on the Facebook and Twitter networks. Multiplex and heterogeneous networks are two well-known categories of multi-layer networks [16]. Multi-layer networks consist of interconnected nodes of the same type with different types of communications. The nodes communicate through inter-layer and intra-layer links.
In this paper, a novel similarity measure for link prediction in a two-layer platform is presented. Here, link prediction is performed for the same users on two social networks, Twitter and Foursquare. In general, a multi-layer social network has different types of links. A social network architecture with two layers (i.e., $L_1$ and $L_2$) and three types of links (i.e., $E_1$, $E_2$ and $E_3$) is shown in Fig. 2, where both layers are considered undirected. The link types are two-layer links ($E_1$), single-layer links on $L_1$ ($E_2$) and single-layer links on $L_2$ ($E_3$). In this paper, the link prediction problem on two-layer networks is considered. Generally, link prediction is more useful in multi-layer networks than in single-layer networks, because multiple layers may provide more information about a node than a single layer.
It is important to study link prediction in multi-layer networks. Multi-layer networks consist of several layers that share the same nodes [6]. The information from these layers may be used to predict missing links in one layer. The use of inter-layer information for solving the link prediction problem in multi-layer networks has already been considered in a number of studies [17]. In [18], an iterative degree penalty (IDP) algorithm for link prediction in multi-layer social networks is presented. IDP performs better than current methods based on network structure when the average network degree and the node overlap rate are low. In [19], hyperbolic geometry is used to predict links in multiplex networks. Here, link prediction is performed using two new similarity measures based on hyperbolic distance. In [20], a decision tree classification model is proposed for link prediction in a multiplex collaboration network with three layers. In [6], a supervised classification model is used for link prediction in a two-layer network consisting of Twitter and Foursquare. Weighted networks, in which each link has a specific weight, may provide more information than unweighted networks [21]. These weights help to predict links more accurately [21]. Currently, multi-layer networks with weighted links are increasingly used for solving link prediction problems [22].

Classical link prediction measures
Most link prediction methods assign a weight value to each node pair $(u, v)$ based on a similarity measure [11][12][13]. This value is a score for predicting missing links between the two nodes. The node pairs with the highest similarity scores are more likely to be linked in the future. In this section, some of the most popular classical similarity measures for the link prediction problem are introduced.

Common Neighbors (CN) measure:
In the CN measure, the link prediction score is the number of common neighbors that are directly connected to the two nodes under evaluation [23]. The CN measure is given by Eq. (1).

$$CN(u, v) = |\Gamma(u) \cap \Gamma(v)| \tag{1}$$

where $u$ and $v$ are nodes, and $\Gamma(u)$ and $\Gamma(v)$ denote the neighbors of nodes $u$ and $v$, respectively.

Jaccard (JA) measure:
This measure was developed in 1901 as a statistic for comparing the similarity and diversity of sample sets [24]. The JA similarity measure is the ratio of the common neighbors of nodes $u$ and $v$ to all neighbors of $u$ and $v$, and it prevents high-degree nodes from obtaining high similarity with other nodes. The JA measure is given by Eq. (2).

$$JA(u, v) = \frac{|\Gamma(u) \cap \Gamma(v)|}{|\Gamma(u) \cup \Gamma(v)|} \tag{2}$$

Adamic-Adar (AA) measure:
This measure is related to the JA measure and is used to compute the closeness of nodes based on their common neighbors [25]. The AA measure gives more importance to common neighbors that have fewer neighbors. The AA measure is given by Eq. (3).

$$AA(u, v) = \sum_{z \in \Gamma(u) \cap \Gamma(v)} \frac{1}{\log k_z} \tag{3}$$

where $z$ is a common neighbor of nodes $u$ and $v$, and $k_z$ is the degree of node $z$.

Katz (KT) measure:
The KT measure is a similarity index based on global structure and considers all paths between two nodes in calculating the similarity score [26]. This measure introduces the concept of node centrality. The KT measure is given by Eq. (4).

$$KT(u, v) = \sum_{l=1}^{\infty} \beta^{l} \, |paths_{u,v}^{\langle l \rangle}| \tag{4}$$

where $|paths_{u,v}^{\langle l \rangle}|$ is the number of paths of length $l$ between nodes $u$ and $v$, and $\beta$ is a damping factor used to control path weights, with $0 < \beta < 1$. In effect, $\beta$ reduces the influence of long paths on the similarity score.

FriendLink (FL) measure:
The FL measure is a similarity index based on quasi-local structure and uses paths of length 2 or more to calculate similarity [5]. The FL measure is given by Eq. (5).

$$FL(u, v) = \sum_{l=2}^{L} \frac{1}{l-1} \cdot \frac{|paths_{u,v}^{\langle l \rangle}|}{\prod_{j=2}^{l} (n-j)} \tag{5}$$

where $n$ is the number of nodes in the network, $L$ is the maximum path length considered, and $1/(l-1)$ is the attenuation factor that weights a path according to its length $l$. In addition, $\prod_{j=2}^{l}(n-j)$ is the number of possible paths of length $l$ from $u$ to $v$.
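The classical measures above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the paper: the graph is an undirected adjacency dictionary of sets, paths are taken to be simple paths counted by depth-first search, the Katz sum is truncated at a maximum length `max_len`, and all function names and the toy graph are hypothetical.

```python
import math

def common_neighbors(g, u, v):
    """Number of neighbors shared by u and v."""
    return len(g[u] & g[v])

def jaccard(g, u, v):
    """Shared neighbors divided by all neighbors of u and v."""
    union = g[u] | g[v]
    return len(g[u] & g[v]) / len(union) if union else 0.0

def adamic_adar(g, u, v):
    """Sum of 1/log(degree) over common neighbors (degree > 1 only)."""
    return sum(1.0 / math.log(len(g[z])) for z in g[u] & g[v] if len(g[z]) > 1)

def count_paths(g, u, v, length):
    """Count simple paths from u to v that contain exactly `length` links."""
    def dfs(node, remaining, visited):
        if remaining == 0:
            return 1 if node == v else 0
        if node == v:  # reached v too early; a simple path cannot return
            return 0
        return sum(dfs(nxt, remaining - 1, visited | {nxt})
                   for nxt in g[node] if nxt not in visited)
    return dfs(u, length, {u})

def katz(g, u, v, beta=0.05, max_len=3):
    """Katz score truncated at max_len (assumption: simple paths only)."""
    return sum(beta ** l * count_paths(g, u, v, l) for l in range(1, max_len + 1))

def friendlink(g, u, v, max_len=3):
    """FriendLink score with attenuation 1/(l-1) and path-count normalization."""
    n = len(g)
    score = 0.0
    for l in range(2, max_len + 1):
        possible = math.prod(n - j for j in range(2, l + 1))
        score += count_paths(g, u, v, l) / ((l - 1) * possible)
    return score

# Toy undirected graph (assumed data): a and b are not linked directly.
g = {"a": {"c", "d"}, "b": {"c", "d"},
     "c": {"a", "b", "d"}, "d": {"a", "b", "c"}}
```

On this toy graph, nodes `a` and `b` share two neighbors and are connected by two paths of length 2 and two of length 3, which is enough to exercise every measure.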

The proposed similarity measure
In this section, an efficient solution to the link prediction problem in the two-layer social network is presented. The proposed method performs link prediction for the same users on the Twitter and Foursquare networks. First, the same users are identified based on the maximum similarity of their profiles. Then, the inter-layer and intra-layer links of the users are configured. Next, users are assigned to a training set ($D_{train}$) and a testing set ($D_{test}$). The purpose is to apply link prediction to Foursquare based on topological information from both the Twitter and Foursquare layers. The flowchart of the proposed method is shown in Fig. 3.
In this paper, the similarity between users is calculated based on four topological features: the number of common neighbors ($f_1$), the number of common posts ($f_2$), the number of multi-layer paths ($f_3$) and the number of common multi-layer paths ($f_4$). Features are extracted for each user pair $u$ and $v$, where $u \in D_{test}$ and $v \in D_{train}$. The $f_1$ feature is the number of users that link to both $u$ and $v$. The $f_2$ feature is the number of common keywords used in the posts of users $u$ and $v$. The $f_3$ feature is the number of multi-layer paths between users $u$ and $v$. A multi-layer path of length 2 between users $u$ and $v$ is defined as $u \xrightarrow{L_1} z \xrightarrow{L_2} v$, where there is a link between users $u$ and $z$ in layer $L_1$ (Twitter network) and user $z$ is linked to user $v$ in layer $L_2$ (Foursquare network). The number of multi-layer paths between two users $u$ and $v$ is thus the number of intermediate users $z$, where the path length is assumed to be 2. For example, in Fig. 4, there are two multi-layer paths of length 2 between users $u_1$ and $u_2$; hence, the value of this feature is 2. The $f_4$ feature is similar to the number of multi-layer paths, except that one of the links must appear in both layers. Thus, a common multi-layer path of length 2 between users $u$ and $v$ is a multi-layer path through an intermediate user $z$ in which one of the two links exists in both $L_1$ and $L_2$. The number of common multi-layer paths between two users $u$ and $v$ is again counted over the intermediate users $z$, where the path length is assumed to be 2. For example, in Fig. 4, there is one common multi-layer path of length 2 between users $u_1$ and $u_2$; hence, the value of this feature is 1.
In the next step, the similarities between users are calculated based on the extracted features. The network is mapped to a weighted graph, where the weight of each pair of users is the Pearson correlation coefficient of their features, calculated as in Eq. (6).

$$w(u, v) = \frac{\sum_{i=1}^{m} (x_i^{u} - \bar{x}^{u})(x_i^{v} - \bar{x}^{v})}{\sqrt{\sum_{i=1}^{m} (x_i^{u} - \bar{x}^{u})^2} \, \sqrt{\sum_{i=1}^{m} (x_i^{v} - \bar{x}^{v})^2}} \tag{6}$$

where $x_i^{u}$ and $x_i^{v}$ are the $i$-th features for users $u$ and $v$, respectively, $\bar{x}^{u}$ and $\bar{x}^{v}$ are the averages of all the features for users $u$ and $v$, respectively, and $m$ is the total number of features extracted.
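The two multi-layer features and the Pearson weight can be sketched as follows. This is a hedged illustration, not the paper's code: each layer is assumed to be an undirected adjacency dictionary of sets, "one of the links must appear in both layers" is interpreted as either link of the path also existing in the other layer, and the function names and toy layers are hypothetical.

```python
import math

def multilayer_paths(layer1, layer2, u, v):
    """Count length-2 paths u - z - v with (u, z) in layer1 and (z, v) in layer2."""
    middles = (layer1.get(u, set()) & layer2.get(v, set())) - {u, v}
    return len(middles)

def common_multilayer_paths(layer1, layer2, u, v):
    """As above, but one of the two links must exist in both layers (assumed reading)."""
    count = 0
    for z in (layer1.get(u, set()) & layer2.get(v, set())) - {u, v}:
        first_in_both = z in layer2.get(u, set())   # (u, z) also in layer2
        second_in_both = z in layer1.get(v, set())  # (z, v) also in layer1
        if first_in_both or second_in_both:
            count += 1
    return count

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length feature vectors."""
    m = len(x)
    mean_x, mean_y = sum(x) / m, sum(y) / m
    num = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    den = (math.sqrt(sum((a - mean_x) ** 2 for a in x))
           * math.sqrt(sum((b - mean_y) ** 2 for b in y)))
    return num / den if den else 0.0

# Toy layers (assumed data): layer1 plays the role of Twitter, layer2 of Foursquare.
twitter = {"u1": {"a", "b"}, "a": {"u1"}, "b": {"u1", "u2"}, "u2": {"b"}}
foursquare = {"a": {"u2"}, "b": {"u2"}, "u2": {"a", "b"}}

f3 = multilayer_paths(twitter, foursquare, "u1", "u2")
f4 = common_multilayer_paths(twitter, foursquare, "u1", "u2")
```

In the toy layers, both `a` and `b` give a multi-layer path from `u1` to `u2`, but only the path through `b` has a link (`b`, `u2`) that exists in both layers.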
In the following, a novel measure is developed to calculate the similarity between users based on reliable paths. The proposed similarity measure is designed for weighted networks and builds on the KT measure. Local similarity measures, such as CN and JA, address the link prediction problem using all paths of length 2 between two users. Quasi-local and global similarity measures, such as FL and KT, solve the link prediction problem using all paths between two users with a length of 2 or more. These measures have been shown to provide more accurate link prediction than local similarity measures, because they consider the diversity of communication paths [5,26]. However, quasi-local and global similarity measures only consider the number and length of the different paths and do not consider the weights of the path links. A reliable path is obtained by generalizing a quasi-local or global similarity measure from unweighted networks to weighted networks, where the importance of each link in the path, based on its weight, is considered in calculating the final similarity [7]. Thus, the reliable path between two users is the path with the highest weight, which reflects the similarity between them. In general, the weight of a link can be interpreted as the probability that the link is safe on the path, which in turn gives the reliability of that path. Hence, the reliability of a path is a combination of the probabilities of all links in that path, which can help link prediction in social networks.
Various studies have shown that the reliability of a path between two users can be calculated by multiplying the weights of the links on the path [7]. This is because the sum of the link weights cannot express the importance of the path with respect to the path length. However, this technique has so far been used only with local similarity measures, and this paper is the first to apply it to quasi-local and global similarity measures. Here, a novel similarity measure based on the KT global similarity measure is presented. In KT, only the number and length of paths between users are considered, whereas in the proposed similarity measure, the effect of each link in the path is also applied through the path weight. Therefore, the proposed measure replaces the number of paths in KT with the total weight of the paths. Eq. (7) shows the proposed similarity measure for calculating the similarity between users $u$ and $v$.

$$PM(u, v) = \sum_{l=1}^{L} \beta^{l} \sum_{p \in paths_{u,v}^{\langle l \rangle}} \; \prod_{(x, y) \in p} w(x, y) \tag{7}$$

where $paths_{u,v}^{\langle l \rangle}$ is the set of paths of length $l$ between users $u$ and $v$, and $p$ is a path in $paths_{u,v}^{\langle l \rangle}$. $(x, y)$ are two consecutive nodes of the path $p$ that form a link, and $w(x, y)$ is the weight associated with the link $(x, y)$. The damping factor $\beta$ is defined as in the KT measure to reduce the impact of longer paths. In addition, $L$ is the maximum path length.
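A minimal Python sketch of the weighted-path idea described above is given below. It is an illustration under assumptions, not the authors' MATLAB implementation: the graph is a dictionary mapping each node to a dictionary of neighbor weights, paths are simple paths enumerated by depth-first search up to `max_len`, each path contributes the damping factor raised to its length times the product of its link weights, and the function name and toy weights are hypothetical.

```python
def reliable_path_similarity(weights, u, v, beta=0.05, max_len=3):
    """Sum over simple paths u..v of beta**length * product of link weights.

    weights: {node: {neighbor: link weight}} for an undirected graph.
    Any direct u-v link should be removed beforehand if it must be ignored.
    """
    def dfs(node, depth, visited, product):
        total = 0.0
        for nxt, w in weights.get(node, {}).items():
            if nxt in visited:
                continue
            if nxt == v:
                # Path ends here with (depth + 1) links.
                total += beta ** (depth + 1) * product * w
            elif depth + 1 < max_len:
                total += dfs(nxt, depth + 1, visited | {nxt}, product * w)
        return total

    return dfs(u, 0, {u}, 1.0)

# Toy weighted graph (assumed data): no direct link between a and d.
weights = {
    "a": {"b": 0.5, "c": 0.2},
    "b": {"a": 0.5, "c": 1.0, "d": 0.8},
    "c": {"a": 0.2, "b": 1.0, "d": 0.5},
    "d": {"b": 0.8, "c": 0.5},
}

score = reliable_path_similarity(weights, "a", "d", beta=0.1, max_len=3)
```

Here the two paths of length 2 have weights 0.4 and 0.1, and the two paths of length 3 have weights 0.25 and 0.16, so the score is 0.1^2 × 0.5 + 0.1^3 × 0.41. Setting every weight to 1 reduces the function to a truncated path-counting Katz score.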
For a better understanding, consider the graph in Fig. 5. In this example, assuming $\beta = 0.05$ and $L = 3$, the similarity between the two users $u_2$ and $u_5$ is first calculated with the KT measure. Based on 2 paths of length 2 (i.e., $\langle u_2 \to u_3 \to u_5 \rangle$ and $\langle u_2 \to u_6 \to u_5 \rangle$) and 1 path of length 3 (i.e., $\langle u_2 \to u_3 \to u_4 \to u_5 \rangle$), the final similarity score is given by Eq. (8).

$$KT(u_2, u_5) = (0.05^2 \times 2) + (0.05^3 \times 1) = 0.0051 \tag{8}$$

The result shows that the KT measure does not consider the differences between the links in a path, although there may be a strong link between two users in the path that has a high impact on future communication between users $u_2$ and $u_5$. With the proposed similarity measure, the final similarity score is given by Eq. (9).

$$PM(u_2, u_5) = [0.05^2 \times ((0.4 \times 0.9) + (0.3 \times 0.1))] + [0.05^3 \times (0.4 \times 0.9 \times 0.6)] = 0.0010 \tag{9}$$

We now increase the link weight between users $u_2$ and $u_5$ to 0.8 to make this path more important for future communication between these users. The similarity is then calculated according to Eq. (10). It is clear that the proposed similarity measure increases the likelihood of a future link between users $u_2$ and $u_5$ because of the increased link weights. Therefore, this technique takes safe and strong links into account when calculating the similarity between users, which can be effective for link prediction.
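The arithmetic of this example can be checked with a short Python sketch. Only the path weights quoted in the text are used; the variable names are hypothetical, and the graph itself is not reconstructed.

```python
from math import prod

beta = 0.05

# Link weights along each path between the two users, grouped by path length.
paths = {
    2: [[0.4, 0.9], [0.3, 0.1]],
    3: [[0.4, 0.9, 0.6]],
}

# Katz-style score: every path of length l counts as beta**l.
katz_score = sum(beta ** l * len(ps) for l, ps in paths.items())

# Weighted score: every path counts as beta**l times the product of its weights.
weighted_score = sum(beta ** l * sum(prod(p) for p in ps)
                     for l, ps in paths.items())
```

The unweighted score evaluates to 0.005125 and the weighted score to 0.001002, which rounds to the 0.0010 quoted in the text.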
Using the proposed similarity measure, the similarity between each pair of users $u$ and $v$ is calculated, where $u \in D_{test}$ and $v \in D_{train}$. Then, for each user $u$, the $k$ users $v$ with the highest similarity scores are suggested. When predicting a link between a user $u$ and a user $v$, any direct link between them (whether directed or undirected) must not be considered. The purpose is to predict the existence of a direct link between these two users based on the other links.
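The suggestion step can be sketched as below. This is a hedged illustration: the helper names, the weighted-graph layout (a dictionary of neighbor-weight dictionaries) and the example scores are assumptions, and the similarity scores are taken as already computed.

```python
import heapq

def remove_direct_link(weights, u, v):
    """Return a copy of the weighted graph without the direct u-v link."""
    pruned = {node: dict(nbrs) for node, nbrs in weights.items()}
    pruned.get(u, {}).pop(v, None)
    pruned.get(v, {}).pop(u, None)
    return pruned

def suggest_top_k(scores, k):
    """Return the k candidates with the highest similarity score."""
    return heapq.nlargest(k, scores, key=scores.get)

# Toy data (assumed): u and v are directly linked, z is a shared neighbor.
weights = {"u": {"v": 0.9, "z": 0.4},
           "v": {"u": 0.9, "z": 0.7},
           "z": {"u": 0.4, "v": 0.7}}
pruned = remove_direct_link(weights, "u", "v")

scores = {"a": 0.3, "b": 0.9, "c": 0.5}
top2 = suggest_top_k(scores, 2)
```

Copying the graph before removing the link keeps the original weights intact, so the same graph can be reused for other user pairs.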

Simulation analysis
In this section, we perform extensive experiments on real datasets to evaluate the effectiveness of the proposed method. The proposed method is simulated in MATLAB R2019a. All experiments were performed on a PC with a 3.2 GHz Intel Core i7 CPU, 32 GB of RAM and the Windows 10 operating system. For a more accurate evaluation and fair comparisons, all results are obtained with 10-fold cross-validation. According to this technique, users are divided into a training set ($D_{train}$) and a testing set ($D_{test}$) so that $D = D_{train} \cup D_{test}$ and $D_{train} \cap D_{test} = \emptyset$, where $D$ covers all links between users on both layers. The purpose is to recommend the $k$ highest-ranked users of the $D_{train}$ collection to users in the $D_{test}$ collection, where this process is performed for all users of the $D_{test}$ collection who have links to at least one user from the $D_{train}$ collection. This section consists of 4 subsections: (1) dataset description, (2) evaluation criteria, (3) parameter analysis, and (4) results and comparisons.
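The 10-fold partitioning can be sketched in Python as follows. This is a generic illustration of k-fold cross-validation under assumptions (random shuffling with a fixed seed, folds of near-equal size, hypothetical function name); the paper does not state how its folds were drawn.

```python
import random

def k_fold_splits(items, folds=10, seed=0):
    """Yield (train, test) lists; each item appears in exactly one test fold."""
    items = list(items)
    random.Random(seed).shuffle(items)
    parts = [items[i::folds] for i in range(folds)]
    for i in range(folds):
        test = parts[i]
        train = [x for j, part in enumerate(parts) if j != i for x in part]
        yield train, test

# Example with 25 hypothetical user ids.
users = list(range(25))
splits = list(k_fold_splits(users, folds=10, seed=1))
```

Each split keeps the training and testing sets disjoint while their union covers all items, and the reported metrics would be averaged over the ten splits.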

Dataset description
The dataset used in this paper is a two-layer social network consisting of Twitter (a microblogging social network) and Foursquare (a location-based social network). The data from both social networks were collected in November 2012 and are available from https://data.world/datasets [27,28]. The Twitter network allows users to share tweets (messages) of at most 140 characters. In this network, users can follow other users' tweets, and the links from followers to followed users form a directed network. Foursquare is an undirected network that allows users to share their location with friends by "checking in" at a given place using their smartphone. In this paper, the social communications of the same users in the Twitter and Foursquare social networks are considered, and link prediction is performed for Foursquare network users based on the communications in the Twitter network. The same users were identified by requiring a similarity score above a threshold value over all profile pairs. About 45,000 potentially identical users were identified from the profiles alone. All paths between users on both networks can be found by depth-first search (DFS). A collection of 1517 users is used to simulate the proposed method. Statistical data on these two datasets are given in Table 1.

Evaluation criteria
Different evaluation criteria, namely Precision, Recall and F-measure, are used to assess the performance of the proposed method [29]. Precision is defined as the ratio of the number of correctly suggested users to the total number of users suggested ($k$). If $S_u$ denotes the $k$ users from the $D_{train}$ collection suggested to target user $u$, and $R_u$ contains the users from the $D_{train}$ collection that actually link to the target user (actual related users), then the precision criterion is given by Eq. (11).

$$Precision_u = \frac{|S_u \cap R_u|}{k} \tag{11}$$

Recall is defined as the ratio of the number of correctly suggested users to the total number of actual related users, as given by Eq. (12).

$$Recall_u = \frac{|S_u \cap R_u|}{|R_u|} \tag{12}$$

Finally, the F-measure can be interpreted as a harmonic mean of precision and recall, which accounts for both false positives and false negatives. This criterion is defined according to Eq. (13).

$$F\text{-}measure_u = \frac{2 \times Precision_u \times Recall_u}{Precision_u + Recall_u} \tag{13}$$

To calculate the evaluation criteria, precision, recall and F-measure are first computed for each user of the $D_{test}$ collection (target users), and the final results are then reported as the average over all users.
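The three criteria and their per-user averaging can be sketched in Python. The sketch assumes the standard definitions (hits over suggested users, hits over actual related users, and their harmonic mean); the function names and the toy suggestions are hypothetical.

```python
def precision_recall_f(suggested, relevant):
    """Precision, recall and F-measure for one target user."""
    suggested = list(suggested)
    relevant = set(relevant)
    hits = sum(1 for s in suggested if s in relevant)
    precision = hits / len(suggested) if suggested else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure

def average_metrics(suggestions, ground_truth):
    """Average the three criteria over all target users."""
    rows = [precision_recall_f(suggestions[u], ground_truth[u])
            for u in suggestions]
    if not rows:
        return (0.0, 0.0, 0.0)
    return tuple(sum(col) / len(rows) for col in zip(*rows))

# Toy example (assumed data): two target users with k = 2 suggestions each.
suggestions = {"u1": ["a", "b"], "u2": ["c", "d"]}
ground_truth = {"u1": {"a"}, "u2": {"c", "d", "e", "f"}}
avg_p, avg_r, avg_f = average_metrics(suggestions, ground_truth)
```

Because the number of suggestions is the precision denominator and the number of actual related users is the recall denominator, suggesting more users tends to lower precision and raise recall.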

Parameter analysis
In this section, the parameters of the proposed method are analyzed to improve link prediction performance. These parameters include the path length ($L$), the path length impact factor ($\beta$), the similarity calculation coefficient, the two-layer platform, the extracted features, and the attenuation factors. The analysis is performed to find the optimal value of each parameter in the proposed method. When one parameter is varied in the simulation, the other parameters are set to their default values of $L = 3$, $\beta = 0.05$ and $k = 5$.
First, the analysis of the proposed similarity measure with different path lengths is reported in Table 2. The results show that the proposed similarity measure performs best with a maximum path length of 3. However, the results for $L = 2$ and $L = 4$ are also promising. Optimal values are shown in bold type.
Next, the value of the path length impact factor in the proposed similarity measure is investigated in Table 3. The results clearly show that the path length impact factor is most effective at a value of 0.05. However, $\beta = 0.01$ also gives suitable results. The value shown in bold in the table indicates the best value.
In the proposed method, the Pearson correlation coefficient is used to calculate the similarity of two users based on the extracted features. However, other coefficients, such as Cosine and Jaccard, can also be used to calculate similarity. Here, the Pearson coefficient is compared with the Cosine and Jaccard coefficients in terms of link prediction performance. The results of this comparison in Table 4 show the superiority of the Pearson coefficient, although the margin is negligible.
In the proposed method, a two-layer platform consisting of Twitter and Foursquare is used to predict undirected links in the Foursquare layer, with each network treated as a separate layer. Therefore, to predict links in the Foursquare network, communication information from both the Twitter and Foursquare networks is required. In other words, the communication topology of the Twitter and Foursquare networks is used to predict links in Foursquare. Here, we show that using information from both layers improves link prediction performance compared with using single-layer information alone. In the single-layer platform, only the Foursquare network topology information and the $f_1$ and $f_2$ features are used for link prediction. The results of this comparison in Table 5 clearly show the superiority of link prediction in the two-layer platform, where the improvement in the precision criterion is more than 19%. Optimal values are shown in bold type.
Next, the effect of the 4 extracted features on link prediction performance is investigated. Here, the effect of each feature is measured by deactivating that feature in the proposed method. The results of this experiment are shown in Fig. 7 for the various criteria. The results show that deactivating the $f_3$ feature (i.e., the number of multi-layer paths) reduces the accuracy of link prediction more than deactivating any other feature. This feature therefore has the greatest impact on the performance of the proposed similarity measure. However, the proposed method performs best when all the features are considered.

Fig. 7. The effect of extracted features on the efficiency of the proposed method
The proposed similarity measure is designed with the attenuation factor $\beta^{l}$. The purpose of this factor is to assign lower similarity scores to longer paths. Next, the efficiency of this factor is compared against a number of different attenuation factors. The results of this comparison are given in Table 6.

Results and comparisons
The proposed similarity measure uses only the topological information of the network, so its results should be compared with other classical similarity measures for the link prediction problem. Here, five classical similarity measures, namely Common Neighbors (CN), Jaccard (JA), Adamic-Adar (AA), Katz (KT) and FriendLink (FL), are used for comparison. For each number of suggested users ($k$), the evaluation criteria (precision, recall and F-measure) are calculated as the average over all users of the $D_{test}$ collection.
Figure 8 compares the proposed similarity measure with the classical similarity measures in terms of the precision criterion. Comparisons are presented for $k$ from 1 to 30. Here, PM refers to the proposed similarity measure. The results show that the proposed similarity measure has better precision than the classical similarity measures in most cases. At best, when $k = 1$, the precision of the proposed method is 0.877. However, as the number of suggested users increases, precision decreases. This is clearly visible for all similarity measures. The reason for this decrease is that the denominator of the precision criterion is the value of $k$.

Fig. 8. Comparison results based on precision criterion with different number of suggested users
In another experiment, the proposed similarity measure (PM) was compared with the classical similarity measures in terms of the recall criterion. The results of this comparison for $k$ from 1 to 30 are shown in Fig. 9. The results of this experiment also show the superiority of the proposed similarity measure in most comparisons. At best, when $k = 30$, the recall of the proposed method is 0.635. However, recall decreases as the number of suggested users decreases. This is clearly visible for all similarity measures. The reason for this decrease is that the denominator of the recall criterion is the number of actual related users, which does not depend on $k$.
Figure 10 shows the F-measure results for the proposed similarity measure (PM) and the classical similarity measures CN, JA, AA, KT and FL. The F-measure criterion makes it easier than ROC to visualize and compare the performance of link prediction methods under different operating conditions [30]. Here, the results are reported for numbers of suggested users from 1 to 30. The results show that when $k = 5$, the proposed method achieves the best F-measure of 0.645. After the proposed method, the KT, FL, JA, CN and AA measures follow in that order in terms of efficiency. These results demonstrate the superiority of the proposed similarity measure over the classical similarity measures.

Fig. 3. Flowchart of the proposed method
Input: User information from Twitter and Foursquare networks
Select the same users from Twitter and Foursquare
Users configuration based on multi-layer networks
Assign users to training and testing sets
Training part: Calculate similarities between users through different features; network mapping to weighted graphs based on Pearson coefficient; find reliable paths between users through weighted graph; develop the novel similarity measure for link prediction
Testing part: Calculate similarities between testing and training users through different features; calculate the links weight between testing and training users based on Pearson coefficient; find reliable paths between testing and training users through weighted graph; calculate the final similarity between testing and training users based on the novel similarity measure; suggest $k$ users with the highest similarity for each test user of Foursquare layer
Check the correctness of the suggested users with different criteria
Output: Evaluation results of link prediction model

For example, in the KT similarity measure, $|paths_{u,v}^{\langle l \rangle}|$ considers only the number of paths of length $l$ that exist between users $u$ and $v$, not the weights of the path links.

Fig. 5. An example of the similarity calculation in the KT measure

Fig. 6. An example of the similarity calculation in the proposed measure

Fig. 9. Comparison results based on recall criterion with different number of suggested users

Fig. 10. Comparison results based on f-measure criterion with different number of suggested users

Table 6. Proposed similarity measure analysis with different attenuation factors
The results in Table 6 show the best performance for the attenuation factor $\beta^{l}$.