Extending Adamic Adar for cold-start problem in link prediction based on network metrics

The cold-start problem arises when a new node joins a network with no available information, or when a node is isolated. Most studies use topological network information with the Triadic Closure principle to predict links in future networks. However, methods based on the Triadic Closure principle cannot predict a future link when the predicted node pair has no common neighbors. Adamic Adar is one such method. This paper proposes three methods that extend Adamic Adar based on network metrics. The main objective is to use the network metrics to attract an isolated or new node into new relationships in the future network. The proposed methods are the extended Adamic Adar index based on Degree Centrality (DCAA), Closeness Centrality (CloCAA), and Clustering Coefficient (CluCAA). Experiments were conducted by sampling 10% of each dataset as testing data. The proposed methods were examined on four real-world networks by comparing AUC scores. The experiment results show that DCAA and CloCAA can predict up to 99% of node pairs with a cold-start problem. DCAA and CloCAA outperform the benchmark, with an AUC score of up to 0.960. This finding shows that the extended Adamic Adar index can overcome prediction failures on node pairs with cold-start problems. In addition, prediction performance improves over the original Adamic Adar. The results are promising for future research because the proposed methods both improve prediction performance and overcome the cold-start problem.


Introduction
The cold-start problem is a common problem in recommendation systems (RS) [1]-[3]. User cold-start problems [4], [5] and product cold-start problems [6], [7] are the two categories of cold-start problems in RS. Both arise because no information is available, and both cause the recommender system to underperform when handling sparse data [8]. In addition, the recommender system cannot provide specific recommendations when information is insufficient [9]. For prediction, information is usually collected over a particular period as training data to provide appropriate recommendations [10]. Therefore, the recommender system must recognize a user or product with a cold-start problem, which needs special handling in the prediction process.
The cold-start problem also occurs in link prediction [11]-[13]. Link prediction is a method for identifying future links based on existing information [14]-[17]. Leroy et al. [18] introduced link prediction with the cold-start problem in 2010. The cold-start problem is classified into partial and pure

Several studies have proposed solutions to the cold-start problem. Leroy et al. [18] proposed a method based on a probabilistic bootstrap graph. Ge & Zhang [24] presented pseudo-cold-start link prediction with a two-phase method for predicting social structure when only a small subgraph and multiple heterogeneous sources of the social network are available. The first phase generates a feature selection scheme, and the second phase applies a regularization method to control the risk of over-fitting. Yan et al. [25] investigated friend recommendations based on cross-platform social relationships and behavior information. Zhang et al. [26] accommodated the information distribution by transferring additional information about old users from auxiliary sources to the new user. Rohani et al. [27] proposed an algorithm that incorporates social features together with the ratings of faculty mates and friends in social networks for academics.
Furthermore, Han et al. [28] examined users' attributes and proposed a prediction method for new users that applies a Support Vector Machine to users' social features. Wang et al. [29] presented possible connections between cold-start users and existing users based on topological information extraction. Their method used a latent-feature representation model and established relationships from both topological and non-topological information. Zhu et al. [30] proposed a recommendation system that combines auxiliary and heterogeneous information from multiple networks. The latest approaches are multi-relational networks [13], [31] and community-specific learning [32], [33]. Wu et al. [31] summarized a complex system as multi-relational networks and used a latent space network model to extract low-dimensional sub-network factors. A regression over target and auxiliary sub-networks was also proposed for predicting the potential links of cold-start nodes in target sub-networks [13]. Meanwhile, Xu et al. [32], [33] presented two models that learn community similarity metrics using community detection, named the community-weighted ranking (CWR) and community-weighted probability (CWP) models.
Auxiliary networks and community information are generally chosen to address the cold-start problem. Both approaches have derivative problems: personalization, privacy, and overlapping communities. When auxiliary networks are lacking, cross-platform data from other networks is also needed, and existing links must be transferred from the auxiliary networks to the target network [28]. Besides, both approaches require complex computation. Therefore, this research proposes a simple method that extends the Adamic Adar index and uses network metrics to attract low-degree or isolated nodes. Several previously proposed methods have also extended Adamic Adar, but they did not aim to overcome the cold-start problem. Nassar et al. [34] proposed pairwise link prediction to predict new triangles by extending Jaccard similarity and Adamic Adar. Later, Liu et al. [35] proposed weighted similarity indices by extending unweighted ones: Common Neighbor, Adamic Adar, Salton similarity, Jaccard similarity, Resource Allocation, and Local Path.
In summary, the contributions of this research are:
1. To propose three novel methods that extend the Adamic Adar index based on network metrics, combining local and global measures of the network: degree centrality, closeness centrality, clustering coefficient, and network density. The proposed methods are called the extended Adamic Adar index based on Degree Centrality (DCAA), the extended Adamic Adar index based on Closeness Centrality (CloCAA), and the extended Adamic Adar index based on Clustering Coefficient (CluCAA).
2. To conduct experiments and measurements on four networks to examine whether the proposed methods achieve better performance in link prediction.
The rest of the paper is arranged as follows. Section 2 introduces the extended Adamic Adar, the experiment results are discussed in Section 3, and the conclusion is presented in Section 4.

Network Metrics
Network metrics are mathematical properties that quantify network topology information. Topological information is classified into local and global information. Local information quantifies a single node, and global information relates to the network as a whole. Degree centrality, closeness centrality, and clustering coefficient are local measures. Degree centrality shows a node's importance based on the number of nodes connected to it [36]. A high-degree node acts as a hub and has a higher impact on influencing other nodes. The degree centrality d(u) of node u is defined in Equation (1).
where m_uv is equal to 1 or 0 to indicate whether a link exists between nodes u and v.
Closeness centrality shows a node's impact on receiving and sending information to other nodes by calculating the path length between nodes [36]. The closeness centrality c(u) of a node u is defined in Equation (2).
where d_uv is a value between 0 and 1 that reflects the shortest-path length from node u to node v.
The clustering coefficient C(u) of a node u measures how likely two neighbors of u are to be linked, relative to all possible links among those neighbors. It quantifies the tendency of nodes to form cliques, communities, or clusters and is defined in Equation (3).
where k_u is the degree of u, k_u(k_u − 1) accounts for the maximum possible links among the neighbors of u, and T(u) is the number of triangles through node u.
Network density is a global measure with a value between 0 and 1 that relates the number of links in a network to the number in a complete network. A dense network has many connections, and a sparse network has few links. The undirected and directed network densities D(G) are defined in Equations (4) and (5).
where m = |E| is the number of edges and n = |V| is the number of nodes.
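The four metrics can be illustrated on a toy graph. The following is a minimal sketch, not the authors' implementation: it uses only the Python standard library and assumes the standard textbook forms d(u) = Σ_v m_uv, C(u) = 2T(u) / (k_u(k_u − 1)), and D(G) = 2m / (n(n − 1)) for an undirected graph. The closeness normalisation (reachable nodes minus one, divided by the sum of shortest-path lengths) is likewise an assumption, and all function names are hypothetical.

```python
from collections import deque


def build_adjacency(edges):
    """Build undirected adjacency sets from a list of (u, v) edges."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def degree(adj, u):
    """d(u): number of links incident to u (the sum of m_uv over v)."""
    return len(adj[u])


def closeness(adj, u):
    """Assumed form: (reachable nodes - 1) / sum of shortest-path lengths."""
    dist = {u: 0}
    queue = deque([u])
    while queue:  # breadth-first search from u
        x = queue.popleft()
        for y in adj[x]:
            if y not in dist:
                dist[y] = dist[x] + 1
                queue.append(y)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total > 0 else 0.0


def clustering(adj, u):
    """C(u) = 2 T(u) / (k_u (k_u - 1)), with T(u) the triangles through u."""
    nbrs = list(adj[u])
    k = len(nbrs)
    if k < 2:
        return 0.0
    triangles = sum(
        1
        for i in range(k)
        for j in range(i + 1, k)
        if nbrs[j] in adj[nbrs[i]]
    )
    return 2 * triangles / (k * (k - 1))


def density(adj):
    """Undirected density 2m / (n (n - 1))."""
    n = len(adj)
    m = sum(len(nbrs) for nbrs in adj.values()) // 2
    return 2 * m / (n * (n - 1)) if n > 1 else 0.0


# Toy graph: a triangle a-b-c with a pendant node d attached to c.
adj = build_adjacency([("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")])
print(degree(adj, "c"), closeness(adj, "d"), clustering(adj, "c"), density(adj))
```

In this toy graph, node c is the hub: it has the highest degree and closeness, while its clustering coefficient is lowered by the pendant neighbor d.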

Adamic Adar Index
Adamic/Adar (AA) was proposed by Lada Adamic and Eytan Adar [37] to calculate a similarity score between two web pages. The AA index down-weights each common neighbor according to its node degree and is defined in Equation (6).
where Γ(u) ∩ Γ(v) represents the set of common neighbors of the node pair u and v, and Γ(w) represents the neighbors of each common neighbor w.
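As an illustration of the index described above, the following sketch assumes the commonly used form AA(u, v) = Σ 1 / log|Γ(w)|, summed over the common neighbors w of u and v. The adjacency dictionary and the function name are hypothetical, and the pair is assumed to consist of two distinct nodes of an undirected graph, so that every common neighbor has degree at least 2 and the logarithm is never zero.

```python
import math


def adamic_adar(adj, u, v):
    """Sum of 1 / log(degree) over the common neighbors of u and v (u != v)."""
    common = adj.get(u, set()) & adj.get(v, set())
    return sum(1.0 / math.log(len(adj[w])) for w in common)


# Hypothetical toy graph: triangle a-b-c, pendant node d, isolated node e.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
    "e": set(),
}

score_ad = adamic_adar(adj, "a", "d")  # one common neighbor, c, of degree 3
score_ae = adamic_adar(adj, "a", "e")  # no common neighbor
print(score_ad, score_ae)
```

The pair (a, e) receives a score of zero because e has no neighbors, which is the failure case that motivates the extension in the next section.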

Extended Adamic Adar index based on the Network Metrics
Most link prediction methods are based on the Triadic Closure principle and cannot predict future links for a node in a cold-start condition. As a simple analogy, consider a new participant at an international conference who does not know anyone there. Naturally, this participant first gets acquainted with someone popular and influential at the conference, such as a keynote speaker, committee member, or moderator, and can then get acquainted with that person's friends. Accordingly, network metrics are used to measure the edge weight, and network density is used for the distance between nodes. The proposed method is defined in Equation (7).
Equation (7) applies the extended index if the path length between u and v equals 0 and the original Adamic Adar index otherwise. Node pairs with a cold-start problem are indicated by the condition that the two nodes are not connected through neighbors. In other words, the path length between the two nodes equals 0, or a node has no neighbors. The extensions examined in this research are the extended Adamic Adar based on Degree Centrality (DCAA), the extended Adamic Adar based on Closeness Centrality (CloCAA), and the extended Adamic Adar based on Clustering Coefficient (CluCAA). The difference from the original Adamic Adar is that the proposed method adds cold-start detection based on the path length of the predicted node pair. If a node pair is found to have a cold-start problem, the prediction is made with the extended Adamic Adar based on network metrics. The proposed methods are defined in Equations (8), (9), and (10).
The Adamic Adar index is chosen to predict future links for nodes without a cold-start condition because the experiments of Liben-Nowell and Kleinberg show that Adamic Adar is at least as good as other common-neighbor predictors [38]. Besides, Adamic Adar considers both the common neighbors and the degree of each common neighbor. Fig. 2 shows the proposed extended Adamic Adar based on network metrics: network density, degree centrality, closeness centrality, and clustering coefficient are extracted from the examined graph. The path length of each predicted node pair is evaluated according to Eq. (7): the extended Adamic Adar applies when the path length is zero, and the similarity score is calculated with the original Adamic Adar when the path length is more than zero. The pseudo-code in Algorithm 1 shows the experiments conducted to implement the proposed method. The experiments are conducted using NetworkX 2.4 and Python 3.6. The algorithm's inputs are four real-world network datasets, three proposed methods, and nine existing local similarity-based methods as the baseline. The outputs of the algorithm are the similarity score of each examined method and the measurement results, namely the AUC score and the ROC curve, as shown in Fig. 3.

INPUT:
Edges: list of edges u and v
OUTPUT:
Score: similarity score
ALGORITHM:
d(u,v) = path length u and v
4. D(G) = Network density using Eq 5
5.
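The cold-start check that precedes scoring can be sketched as follows. This is a hedged illustration rather than the paper's Algorithm 1: it assumes that a path length of 0 means no path exists between two distinct nodes in the training graph, it takes the extension as a caller-supplied function, and `hypothetical_extension` is a placeholder invented for this example that does not reproduce Equations (8) to (10).

```python
import math
from collections import deque


def path_length(adj, u, v):
    """BFS shortest-path length; returns 0 when v is unreachable (assumed convention)."""
    if u not in adj or v not in adj:
        return 0
    dist = {u: 0}
    queue = deque([u])
    while queue:
        x = queue.popleft()
        if x == v:
            return dist[x]
        for y in adj[x]:
            if y not in dist:
                dist[y] = dist[x] + 1
                queue.append(y)
    return 0


def adamic_adar(adj, u, v):
    """Original Adamic Adar over common neighbors (u and v distinct)."""
    common = adj[u] & adj[v]
    return sum(1.0 / math.log(len(adj[w])) for w in common)


def similarity(adj, u, v, extension):
    """Use the extension for cold-start pairs, Adamic Adar otherwise."""
    if path_length(adj, u, v) == 0:
        return extension(adj, u, v)
    return adamic_adar(adj, u, v)


def hypothetical_extension(adj, u, v):
    # Placeholder only: sums the two degrees so that the pair gets a
    # non-zero score whenever at least one node has neighbors.
    return float(len(adj[u]) + len(adj[v]))


# Hypothetical toy graph: triangle a-b-c, pendant node d, isolated node e.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
    "e": set(),
}

cold = similarity(adj, "a", "e", hypothetical_extension)  # extension branch
warm = similarity(adj, "a", "d", hypothetical_extension)  # Adamic Adar branch
print(cold, warm)
```

Passing the extension as a parameter keeps the detection step separate from the scoring step, so any of the three proposed indices could be plugged in once its formula is available.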

Experiment Design
The experiment is conducted in three stages: graph generation, score computation, and result measurement, as shown in Fig. 2. In the first stage, the dataset graph is created from the list of edges in the dataset files. The edges are then split into training edges and testing edges using scikit-learn [39], with a test size of 10% for each dataset. The training graph is created from the connected training edges and the list of all nodes in the dataset graph. The confusion matrix needs at least two classes to measure the prediction results: connected and not connected. The not-connected testing edges are therefore obtained using scikit-learn from the list of node pairs that are not connected in the training graph. The test size of this second train-test split depends on the number of connected testing edges. Lastly, the connected and not-connected testing edges are merged into the edge sample, which is the set of test data used to compute the similarity scores of the proposed and benchmark methods. The experiment design is shown in Fig. 4.

Fig. 4. Experiment Design
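The first stage can be sketched with the standard library alone. The paper performs the split with scikit-learn; the version below substitutes `random` for it and is only an approximation of that procedure. The function name, the seed, and the ring-shaped example graph are assumptions made for illustration, and the test count is rounded up, which reproduces the 660 testing edges reported for the Power dataset if that network has 6,594 edges.

```python
import math
import random


def split_edges(edges, nodes, test_size=0.1, seed=0):
    """Hold out a share of the edges and sample as many unlinked pairs."""
    rng = random.Random(seed)
    shuffled = list(edges)
    rng.shuffle(shuffled)
    n_test = math.ceil(test_size * len(shuffled))
    test_connect = shuffled[:n_test]
    train_edges = shuffled[n_test:]

    # Node pairs that are not connected in the training graph.
    train_pairs = {frozenset(e) for e in train_edges}
    ordered = sorted(nodes)
    not_connected = [
        (u, v)
        for i, u in enumerate(ordered)
        for v in ordered[i + 1:]
        if frozenset((u, v)) not in train_pairs
    ]
    test_not_connect = rng.sample(not_connected, n_test)
    return train_edges, test_connect, test_not_connect


# Hypothetical example: a ring of 20 nodes with 20 edges.
nodes = range(20)
edges = [(i, (i + 1) % 20) for i in nodes]
train_edges, test_connect, test_not_connect = split_edges(edges, nodes)

# Edge sample: connected pairs labeled 1, not-connected pairs labeled 0.
edge_sample = [(u, v, 1) for u, v in test_connect] + [
    (u, v, 0) for u, v in test_not_connect
]
print(len(train_edges), len(edge_sample))
```

Because the negative pairs are drawn from the training graph's unlinked pairs, as the text describes, a held-out positive edge can occasionally be sampled as a negative. Enumerating all unlinked pairs is quadratic in the number of nodes, so a larger network would need rejection sampling instead.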
The second stage is similarity score computation, which determines the score of each predicted node pair. Every proposed and benchmark method is computed to obtain a prediction score together with an actual label and a predicted label. A predicted node pair is labeled actual true or false based on the edge sample label, whereas it is labeled predicted true if the prediction score is more than 0 and predicted false otherwise. The last stage is measurement, which uses the AUC score and the ROC curve.

Benchmark Methods
The proposed method is compared with local similarity-based, or node neighborhood, methods for link prediction. The local similarity-based methods are chosen as benchmark methods because most researchers consider the similarity-based method an appropriate research approach: it has a relatively low computational cost and does not require complicated network analysis stages. Most researchers also use local similarity-based methods as benchmarks for their proposed methods. The benchmark methods are defined in Table 1.
Table 1 (partial): Adamic/Adar (AA), Resource Allocation (RA) [46] [47]
International Journal of Advances in Intelligent Informatics, ISSN 2442-6571, Vol. 8, No. 3, November 2022, pp. 271-284. Yuliansyah et al. (Extending adamic adar for cold-start problem in link prediction based on network metrics)

Datasets
The experiment was conducted using four real-world network datasets from several network categories to compare the proposed cold-start link prediction methods. The datasets are available from http://snap.stanford.edu/data/index.html [48], http://networkrepository.com [49], and http://konect.uni-koblenz.de/networks [50]. The datasets are as follows:
• Power [51] is a network that contains the Western States Power Grid topology of the United States.
• Firm-hi-tech [49] is a network of hi-tech firms.
• Wiki-vote [52] is Wikipedia voting data in which nodes represent Wikipedia users and a directed edge represents a vote cast by one user on another.
• Adolescent [53] is a network from a survey that asked each student to select their five best female and five best male friends.
The basic topological features of the datasets are reported with the following notation: Net, |V|, |E|, Kn, NE, 〈k〉, C, and ρ denote the network category, nodes, edges, complete graph, unobserved node pairs, average degree, average clustering coefficient, and network density, respectively. The datasets cover two network categories, infrastructure networks and social networks, as shown in Table 2. One characteristic of the cold-start problem is that the path length of the predicted node pair is equal to zero. The path length data is obtained from the testing data, and each real-world network dataset has node pairs with a cold-start problem. Table 3 shows the path length of the testing data for the experiments.
Power 5934 660 660 1320 311 119 890 1320
Firm-hi-tech 81 10 10 20 1 12 7 20
Wiki-vote 2622 292 292 584 41 193 350 584
Adolescent 9409 1046 1046 2092 13 585 1494 2092
The experiment results of the proposed and benchmark methods are reported in Tables 4-6.
Common neighbors-based methods are the collection of benchmark methods based on common neighbors: CN, AA, RA, JC, SA, SO, LHNI1, HPI, and HDI. The closeness- and degree-based methods are the proposed methods based on closeness centrality and degree centrality. The cluster coefficient-based method is the proposed method based on the clustering coefficient.

Number of Prediction Ratio Results
The ratio is the number of predicted node pairs with a path length of zero divided by the number of such node pairs in the testing data. The closeness- and degree-based methods can predict as many as 99% of these node pairs. This result shows that they outperform the baseline and cluster coefficient-based methods in predicting node pairs with a cold-start problem. A more in-depth analysis of the detailed data was conducted to find out why some ratios do not reach 100%. The analysis shows that, in the unpredicted node pairs, both nodes are isolated from relationships with other nodes, so there is no attraction between the two nodes. Such isolated node pairs also occur for the proposed method based on the clustering coefficient. In addition, the cluster coefficient-based method shows that some nodes fail to act as an influencer for nodes in a cold-start condition: even though the attracting node is connected to other nodes, its ability to solve cold-start problems is not as good as with the closeness- and degree-based methods.
The prediction ratio results in Table 4 indicate that extending Adamic Adar can reduce cold-start problems: the proposed methods predict node pairs with cold-start problems at ratios of up to 46% for the cluster coefficient-based method and up to 99% for the closeness- and degree-based methods. The cluster coefficient-based method does not perform as well as the closeness- and degree-based methods because the clustering coefficient still relies on community information from the network. Furthermore, measurements with AUC were conducted to examine the performance of the proposed method.
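The prediction ratio discussed above can be sketched as a short function. The sketch assumes that a cold-start pair counts as predicted when its similarity score is greater than 0, in line with the labeling rule of the experiment design. The function name and the example scores are hypothetical and are not taken from the paper's tables.

```python
def prediction_ratio(cold_start_scores):
    """Share of cold-start test pairs that receive a score above 0."""
    if not cold_start_scores:
        return 0.0
    predicted = sum(1 for score in cold_start_scores if score > 0)
    return predicted / len(cold_start_scores)


# Hypothetical scores for four cold-start pairs; one pair stays at zero,
# as happens when both nodes of the pair are isolated.
example_scores = [0.4, 0.0, 1.2, 0.7]
ratio = prediction_ratio(example_scores)
print(ratio)
```

Under this reading, the original Adamic Adar yields a ratio of zero on cold-start pairs, because every such pair scores 0.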

AUC Score Results
The experiment used the AUC score to evaluate the proposed cold-start link prediction methods [54]. The AUC score is calculated by repeatedly selecting one of the tested links at random and comparing its score with that of a randomly selected non-existent link [55]. AUC is also widely used to measure link prediction performance [56]. The ideal AUC score is 1, and the score is defined in Equation (11).
$$\mathrm{AUC} = \frac{n_1 + 0.5\, n_2}{n} \qquad (11)$$
where $n$ is the number of independent comparisons, $n_1$ is the number of comparisons in which the missing link has a higher score than the non-existent link, and $n_2$ is the number of comparisons in which the missing link and the non-existent link have equal scores [57]. Besides the AUC, the ROC curve is a visual analysis related to the calculation of the AUC and is used to obtain information about different loss matrices of two error types and about the classifier's behavior under different loss matrices [58]. The AUC measurement results show that the proposed method outperforms the benchmark methods on all real-world network datasets, as shown in Table 4. The AUC score is compared with the dataset information, as shown in Table 2, and with the path length of the testing data, as shown in Table 3. The AUC score indicates that the proposed method is more suitable for networks with a high number of isolated nodes and low values of average degree, clustering coefficient, and network density, besides being able to solve the cold-start problem. The experiment results for the Power, Firm-hi-tech, Wiki-vote, and Adolescent datasets are shown in Table 5 and Table 6. Based on the AUC scores and a comparison with the prediction ratio results, the proposed method can outperform the benchmark methods in addition to solving the cold-start problem of the predicted node pairs, although none of the proposed methods is consistently superior on all datasets. This finding is essential for future research because the proposed method is a new link prediction method.
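The AUC computation can be sketched as follows, assuming the usual form AUC = (n1 + 0.5 · n2) / n with n, n1, and n2 as defined in the text. Instead of drawing comparisons at random, the sketch compares every tested link with every non-existent link, which is the exhaustive limit of the random procedure. The function name and the example scores are hypothetical.

```python
def auc_score(link_scores, non_link_scores):
    """AUC = (n1 + 0.5 * n2) / n over all link / non-link comparisons."""
    n = n1 = n2 = 0
    for s_link in link_scores:
        for s_non in non_link_scores:
            n += 1
            if s_link > s_non:
                n1 += 1  # tested link scored higher
            elif s_link == s_non:
                n2 += 1  # tie between the two scores
    return (n1 + 0.5 * n2) / n if n else 0.0


# Hypothetical scores: the third tested link is a cold-start pair scored 0.
link_scores = [0.9, 0.4, 0.0]
non_link_scores = [0.0, 0.0, 0.5]
auc = auc_score(link_scores, non_link_scores)
print(auc)
```

In this example, n = 9, n1 = 5, and n2 = 2. The two ties come from the zero-scored cold-start link, which illustrates how unpredicted cold-start pairs pull the AUC toward 0.5.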

Receiver Operating Characteristics (ROC) Curve
The experiment results are also plotted as ROC curves to further evaluate the proposed methods and the benchmarks, as shown in Fig. 5. The ROC curves visualize the AUC results presented in Table 4 and Table 5. They show that the proposed methods can handle sparse networks that contain nodes with cold-start problems. Adding a filter that detects node pairs with the cold-start problem to the original Adamic Adar is the key difference behind these findings. Therefore, the extended Adamic Adar is presented as a novel and original contribution to the link prediction research area. The results of this research are promising for future research because the proposed method improves prediction performance and solves the cold-start problem.

Conclusion
This research has proposed three novel methods that extend the Adamic Adar index based on network metrics. The proposed methods are called the extended Adamic Adar index based on Degree Centrality (DCAA), Closeness Centrality (CloCAA), and Clustering Coefficient (CluCAA). The AUC value achieved by the proposed methods is up to 0.960. Furthermore, the proposed methods based on closeness and degree can predict node pairs with cold-start problems up to a ratio of 99%. The experiment results demonstrate that DCAA and CloCAA outperform the benchmark methods and can predict node pairs with a cold-start problem better than the original Adamic Adar. However, the drawback of the proposed method is that the prediction formula is more complex than the original Adamic Adar, because a condition check is performed beforehand to find out whether the predicted node pair has a cold-start problem. If the node pair has a cold-start problem, the predictor uses its extension function; otherwise, it uses the original Adamic Adar. The proposed methods (DCAA, CloCAA, CluCAA) are more suitable for networks with a high number of isolated nodes and low values of average degree, clustering coefficient, and network density. In future research, the proposed method can be combined with machine learning and ensemble learning approaches and examined on more varied datasets from several domains, such as social networks, terrorist networks, co-authorship networks, and others.