Link prediction in social network based on local information and attributes of nodes

Link prediction is essential to both research and practical applications. To make full use of the information in a network, we propose a new method for predicting links in social networks. First, we extract topological information and node attributes from the social network. Second, we integrate them into feature vectors. Finally, we use an XGB classifier to predict links from these feature vectors. Experiments on a co-authorship network suggest that expanding the information source in this way can improve the accuracy of link prediction significantly.


Introduction
Link prediction plays an important role in the field of data mining [1]. Traditional methods are based on topological information and have been widely used in fields such as social recommendation and information retrieval [1]. Link prediction also helps to address the problem of information overload [2] [3]. Traditional methods fall into two categories: those based on local information, such as CN (common neighbors) [4], and those based on global information, such as Katz [5].
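To make the distinction between local and global indexes concrete, the following is a minimal sketch in Python. It assumes an undirected, unweighted graph stored as a dictionary of neighbor sets. The helper names (`build_graph`, `common_neighbors`, `katz_truncated`) and the parameters `beta` and `max_len` are illustrative choices, not taken from the cited works. The Katz score is approximated by truncating the sum over walk lengths, which is a common simplification rather than the exact closed form.

```python
from typing import Dict, Hashable, Iterable, Set, Tuple

Graph = Dict[Hashable, Set[Hashable]]


def build_graph(edges: Iterable[Tuple[Hashable, Hashable]]) -> Graph:
    """Build an undirected graph as a mapping node -> set of neighbors."""
    graph: Graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)
    return graph


def common_neighbors(graph: Graph, u: Hashable, v: Hashable) -> int:
    """Local index: number of neighbors shared by u and v."""
    return len(graph[u] & graph[v])


def katz_truncated(graph: Graph, u: Hashable, v: Hashable,
                   beta: float = 0.05, max_len: int = 4) -> float:
    """Global index: sum over l = 1..max_len of beta**l * (#walks of length l from u to v).

    Assumption: the infinite Katz sum is cut off at max_len.
    """
    walks = {u: 1}  # walks[x] = number of walks of the current length from u to x
    score = 0.0
    for length in range(1, max_len + 1):
        next_walks: Dict[Hashable, int] = {}
        for node, count in walks.items():
            for neighbor in graph[node]:
                next_walks[neighbor] = next_walks.get(neighbor, 0) + count
        walks = next_walks
        score += (beta ** length) * walks.get(v, 0)
    return score
```

CN only inspects the immediate neighborhoods of the two nodes, while the Katz sketch also counts longer walks between them, which is why it is usually grouped with the global methods.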
Centrality describes the importance of nodes in a network [6] [7]. A node with higher centrality than others attracts attention more easily and is more likely to form new links with other nodes, so researchers have used centrality to improve the accuracy of link prediction methods [8]. Some researchers construct feature vectors from node characteristics, including centrality, to predict links [9] [10], and others use machine learning methods to improve the performance of link prediction [11].
Social media websites have made it more convenient for people to communicate with each other and have produced a huge amount of information. To improve the accuracy of link prediction, researchers have begun to pay attention to the relationship between the content and the structure of social networks and to design new methods that exploit it [12] [13]. Many researchers have used characteristics of social networks to improve the accuracy of link prediction, with good results [14] [15]. According to [12], people who belong to the same social network tend to form a local community, so the authors defined a new index, CAR, which significantly improves on the classical CN index. CAR also outperforms the AA index [14] and the RA index [15].
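The neighborhood-based indexes named above can be sketched as follows. AA and RA are written in their usual textbook forms (sums of 1/log(degree) and 1/degree over common neighbors). The CAR function assumes the form commonly given in the literature, the number of common neighbors multiplied by the number of links among those common neighbors. This is an assumption on our part and may differ in detail from the definition used in [12]. The graph representation (a dictionary of neighbor sets) and all function names are illustrative.

```python
import math
from typing import Dict, Hashable, Set

Graph = Dict[Hashable, Set[Hashable]]


def adamic_adar(graph: Graph, u: Hashable, v: Hashable) -> float:
    """AA: sum of 1 / log(degree) over the common neighbors of u and v."""
    # A common neighbor of two distinct nodes has degree >= 2, so log(degree) > 0;
    # the guard only protects against malformed input.
    return sum(1.0 / math.log(len(graph[z]))
               for z in graph[u] & graph[v] if len(graph[z]) > 1)


def resource_allocation(graph: Graph, u: Hashable, v: Hashable) -> float:
    """RA: sum of 1 / degree over the common neighbors of u and v."""
    return sum(1.0 / len(graph[z]) for z in graph[u] & graph[v])


def car_index(graph: Graph, u: Hashable, v: Hashable) -> float:
    """CAR (assumed form): |CN(u, v)| * number of links among the common neighbors."""
    shared = graph[u] & graph[v]
    # Each link between two common neighbors is seen from both ends, hence the // 2.
    local_community_links = sum(len(graph[z] & shared) for z in shared) // 2
    return float(len(shared) * local_community_links)
```

Under this assumed form, CAR is zero whenever the common neighbors are not linked to each other, which reflects the local-community idea: shared neighbors count for more when they themselves form a tight group.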
Relationships in social networks change over time and with the network context [9]. In a co-authorship network, if two authors have not co-authored a paper so far, how can we predict whether they will do so in the future? In this paper, we focus on the topology and context of the network, consider the attributes of nodes, and design feature vectors to describe the characteristics of nodes. This kind of framework is suitable for most social networks because the features are easy to compute. Our method is based on classification using an XGB classifier, and experiments on a part of DBLP (Digital Bibliography Library Project) show that it can predict links effectively.

Related Definitions and Concepts
To formalize the characteristics of a social network, researchers usually model it using graph theory. Each individual in the social network is treated as a node, and each relationship between individuals is treated as an edge. Here we give some basic definitions.
Definition 1. Social network.
If nodes $v_i$ and $v_j$ have no direct edge at the moment, then the CAR index [12] is defined as in equation (2).
We choose the XGB classifier [16] to predict links. Babajide [17] proposed an XGB-based algorithm for predicting molecular activities, and experiments show that it is more accurate than several currently popular algorithms, such as support vector machines, random forests and naive Bayes.

Methods and Experiments Setup
Common neighbors of two users often influence the formation of new links between strangers. We therefore first propose the concept of relevant force (RF) to describe the influence of CN nodes, and design a modified CN index (MCN) that incorporates relevant force. We then design further features to describe the attributes of nodes. Finally, we construct feature vectors from these features to predict links. The RF index is defined as in equation (5):

Our
For any node belongs to ) , (

Experiments Setup
DBLP is a bibliography of English-language computer science literature that lists information about authors and papers. Link prediction on DBLP means predicting future co-author relationships between authors based on information from a specific period. Each author corresponds to a node, and each co-author relationship corresponds to an edge. We take authors' research fields into account and integrate co-authorship information into MCN. Here are some definitions based on DBLP.

Similarity of research area.
In the field of computer science, different international academic conferences focus on different research areas. We select 15 international conferences in DBLP and divide them into four areas, as follows: 1) Data mining: KDD, ICDM, PAKDD; 2) Database: SIGMOD, VLDB, ICDE, PODS; 3) Information retrieval: SIGIR, ECIR, WWW, CIKM; 4) Machine learning: ICML, NIPS, AAAI, CVPR. We assume that if an author has published a paper at a conference belonging to a specific research area, there is a link between the author and that research area. We then regard the number of links between an author and each research area as a feature that expresses the author's interests, and use the cosine similarity of these feature vectors to represent the similarity between authors. Let A = (A1, A2, ..., An) and B = (B1, B2, ..., Bn) be the feature vectors of two authors; then the cosine similarity is defined as in equation (6):
$$\cos(A, B) = \frac{\sum_{k=1}^{n} A_k B_k}{\sqrt{\sum_{k=1}^{n} A_k^2}\,\sqrt{\sum_{k=1}^{n} B_k^2}} \quad (6)$$
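The research-area feature can be sketched in a few lines of Python. The mapping from conferences to areas follows the list above. Counting one author-area link per published paper is our reading of "the number of links" and should be treated as an assumption, as should the function names and the convention of returning 0 when one of the vectors is all zeros.

```python
import math
from typing import Iterable, List, Sequence

# Conference-to-area grouping as listed in the text.
AREAS = {
    "data mining": {"KDD", "ICDM", "PAKDD"},
    "database": {"SIGMOD", "VLDB", "ICDE", "PODS"},
    "information retrieval": {"SIGIR", "ECIR", "WWW", "CIKM"},
    "machine learning": {"ICML", "NIPS", "AAAI", "CVPR"},
}
AREA_ORDER = list(AREAS)  # fixes the position of each area in the vector


def area_vector(venues: Iterable[str]) -> List[int]:
    """Count an author's papers per research area (one venue entry per paper)."""
    venues = list(venues)
    return [sum(1 for venue in venues if venue in AREAS[area]) for area in AREA_ORDER]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumed convention: no publications means no measurable similarity
    return dot / (norm_a * norm_b)
```

For example, an author with two data mining papers and one database paper has the vector [2, 1, 0, 0], and comparing it with [1, 1, 0, 0] via `cosine_similarity` gives a value close to 1, while an author who only publishes in machine learning gets similarity 0 with both.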
We define the vector Confer to represent an author's interests, so the similarity of research areas, ConfSim, is the cosine similarity of two authors' Confer vectors, calculated as in equation (7).
Unlike the MCN index, CoPub and ConfSim are indexes designed according to the characteristics of DBLP. Depending on the specific setting, different kinds of domain knowledge can be extracted from different fields and integrated into the feature vectors, which helps to improve the accuracy of link prediction methods.

Results and Discussion
After obtaining the information such as network topology and attributes of nodes, we can calculate the

Data Set
We select a part of DBLP, which we call DBLP_fourarea, containing 1500 authors and 6303 co-author relationships. It includes information about authors and papers published at the 15 international conferences mentioned in Section 3.2.1 during the period from 2004 to 2010. We use the DBLP_fourarea data from 2004 to 2008 as the training set and the data from 2009 to 2010 as the test set. Our aim is to predict co-author relationships that do not exist in the training set but appear in the test set.
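A temporal split of this kind can be sketched as below. The record format (author, author, year), the function names, and the choice to consider only authors who already appear in the training period as candidates are assumptions made for illustration. The text does not specify how candidate pairs are enumerated or sampled.

```python
from typing import Dict, FrozenSet, Iterable, Set, Tuple

Pair = FrozenSet[str]


def split_by_year(records: Iterable[Tuple[str, str, int]],
                  train_years: Tuple[int, int],
                  test_years: Tuple[int, int]) -> Tuple[Set[Pair], Set[Pair]]:
    """Split co-authorship records (author_a, author_b, year) into two edge sets."""
    train_edges: Set[Pair] = set()
    test_edges: Set[Pair] = set()
    for author_a, author_b, year in records:
        pair = frozenset((author_a, author_b))
        if train_years[0] <= year <= train_years[1]:
            train_edges.add(pair)
        elif test_years[0] <= year <= test_years[1]:
            test_edges.add(pair)
    return train_edges, test_edges


def label_candidates(train_edges: Set[Pair], test_edges: Set[Pair]) -> Dict[Pair, int]:
    """Label each author pair that is unlinked in the training period.

    Label 1 means the pair becomes linked in the test period, 0 means it does not.
    Assumption: only authors seen in the training period are considered.
    """
    authors = sorted({author for edge in train_edges for author in edge})
    labels: Dict[Pair, int] = {}
    for index, author_a in enumerate(authors):
        for author_b in authors[index + 1:]:
            pair = frozenset((author_a, author_b))
            if pair not in train_edges:
                labels[pair] = int(pair in test_edges)
    return labels
```

With the periods used here, the call would be `split_by_year(records, (2004, 2008), (2009, 2010))`. Pairs that already co-authored in the training period are excluded from the candidates, since a link that already exists is not a prediction target.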

Evaluation Standards
In this paper, we choose the F-value and AUC to evaluate the performance of our method. The F-value is the harmonic mean of precision and recall, so it is more reliable than either measure alone. The traditional F-value is defined as in equation (9):
$$F = \frac{2 \cdot precision \cdot recall}{precision + recall} \quad (9)$$
The area under the ROC curve (AUC) describes the ability of the classifier to correctly classify the patterns submitted to it. It is defined as in equation (10), where $x_i$ and $y_i$ represent the X-axis and Y-axis coordinates of the ROC curve respectively.
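Both measures can be computed with a short Python sketch. The F-value follows the harmonic-mean definition directly. For AUC, the sketch builds ROC points (false positive rate on the X-axis, true positive rate on the Y-axis) and integrates them with the trapezoidal rule, which is one common way to obtain the area from such points. Whether this matches equation (10) term by term is an assumption. All function names are illustrative.

```python
from typing import List, Sequence, Tuple


def f_value(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Harmonic mean of precision and recall for binary labels (1 = link)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # precision and recall are both zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def roc_points(y_true: Sequence[int], scores: Sequence[float]) -> List[Tuple[float, float]]:
    """ROC points obtained by lowering the threshold over the distinct scores.

    Requires at least one positive and one negative example.
    """
    positives = sum(y_true)
    negatives = len(y_true) - positives
    ranked = sorted(zip(scores, y_true), key=lambda item: -item[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        j = i
        # Examples with tied scores cross the threshold together.
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            tp += ranked[j][1]
            fp += 1 - ranked[j][1]
            j += 1
        points.append((fp / negatives, tp / positives))
        i = j
    return points


def auc_trapezoid(points: Sequence[Tuple[float, float]]) -> float:
    """Area under a piecewise-linear curve through (x_i, y_i), by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))
```

The F-value needs hard 0/1 predictions, whereas AUC is computed from the classifier's scores and therefore does not depend on a particular decision threshold.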

Experiments and Analysis
The experimental results for each single index on DBLP_fourarea are shown in Table 1. As can be seen from Table 2, the feature vector that combines several indexes improves the performance of link prediction dramatically, which indicates that expanding the information source is effective. In addition, the feature vector that incorporates node attributes improves on the single indexes, so node attributes and network context should be used to enhance the accuracy of link prediction.

Conclusion
We propose a new method to predict links in social networks. We calculate several indexes that represent topological information and node attributes, construct feature vectors from them, and predict links with an XGB classifier. Experiments show that our method improves the accuracy of link prediction effectively. In addition, the components of the feature vector are easy to calculate, so the method can be applied to many types of networks and allows us to pay more attention to domain knowledge.

Acknowledgements
This work was supported by National Natural Science Foundation of China (NSFC Grant No.