Content-Enhanced Network Embedding for Academic Collaborator Recommendation



Introduction
In the era of big scholarly data, information overload has become a serious problem, and extracting useful information from such a volume of data is challenging [1,2]. Prior studies show that collaboration among researchers can increase their productivity and spark unprecedented inspiration [3,4]. Academic collaborator recommendation, which aims to find suitable collaborators for a target researcher, therefore plays an important role in complex academic tasks.
Academic information can be described as an academic collaborator network with attributes (as shown in Figure 1). The methods of academic collaborator recommendation fall into three categories: network-based recommendation, content-based recommendation, and hybrid recommendation. For network-based recommendation, the structure of the network is utilized to improve the performance of recommending researchers [5]. Probability theory and graph theory have been used to model and analyze coauthor networks [6]. Another network-based approach involves the classic random walk model, which can extract useful information from the academic collaborator network. For content-based recommendation, the interest of the researcher is an important attribute that characterizes research topics, fields, and other personalized features [7][8][9]. It can be analyzed and mined from the papers that the researcher publishes every year, and the relationships among researchers can also be established through interest detection. Compared with the above methods, which consider the academic collaborator network structure and the interests of the researcher separately, combining network topology and text information is more effective. For hybrid recommendation, utilizing both text information and network structure can improve learning of the latent representation for each researcher [10][11][12]. Existing hybrid models typically learn the feature representation of each researcher from text information and from network structure independently and then combine the two feature representations into a unified latent representation. They do not utilize the complex relationships between text information and network structure [13]. Such hybrid methods can improve academic collaborator recommendation, but they ignore the latent semantic relationships formed by the text information in an academic collaborator network.
To capture the latent semantic relationships and improve academic collaborator recommendation, we utilize the text information of each researcher to build a content-enhanced network and propose the CNEacR model for academic collaborator recommendation. CNEacR builds a content-enhanced academic collaborator network that contains both the intrinsic collaboration relationships and the latent semantic relationships formed by the text information. Firstly, CNEacR obtains the weighted text representation of each researcher and then builds a content-enhanced academic collaborator network based on the similarity of these representations. Secondly, a high-quality latent representation is obtained by network embedding. Finally, the similarity between researchers is calculated as the cosine similarity of their latent representations. Experimental results on real-world datasets demonstrate that CNEacR improves precision, recall, F1, and normalized discounted cumulative gain (NDCG) over all baseline methods. The main contributions of this paper are summarized as follows:
(1) A context-enhanced academic collaborator network that contains not only intrinsic collaboration relationships but also the latent semantic relationships formed by the text information is built using the similarity among the weighted text representations of researchers.
(2) To get a context-enhanced academic collaborator network, the weighted text representation of each researcher is obtained from the text information. Edges are added between each node and its semantically similar nodes, and then the context-enhanced academic collaborator network is built.
(3) Experimental results on the datasets demonstrate that CNEacR performs better than other methods of academic collaborator recommendation.

Figure 1: The extraction of a collaboration network with attributes from academic data. A is a list of researchers, P is a list of papers, and T is a list of topics. Each researcher has a distribution of topics. In this paper, we regard each word in the title of a paper as a topic. The figure shows that if two researchers coauthor a paper, there is a link between them, such as (David, Jessica). If a researcher has coauthored more papers, he or she will have more links with others, such as (David, Mike), (David, Sam), and (David, Jessica).

Content-Enhanced Academic Collaborator Network.
The academic collaborator network can be denoted as $G = (A, E)$, where $A = \{a_1, a_2, \ldots, a_{|N|}\}$, $a_i$ is the $i$-th researcher, and $E = \{e_{i,j} \mid a_i \in A, a_j \in A\}$. Each $e_{i,j} \in \{0, 1\}$ indicates whether a collaborative relationship exists: $e_{i,j} = 1$ denotes that there is a collaborative relationship between researcher $a_i$ and researcher $a_j$; otherwise, there is none. In the academic collaborator network, we firstly use the TF-IDF model to evaluate the importance of each term and then embed each term into a vector by Word2vec. The TF-IDF-weighted combination of these term vectors is the weighted text representation of each researcher. Secondly, for each $a_i$, the relevant researchers $TK = \{a_1, a_2, \ldots, a_k\}$ are listed based on cosine similarity. Finally, we can get a relationship set $(e'_{i1}, e'_{i2}, \ldots, e'_{ik})$. If the relationship between $a_i \in TK$ and $a_j \in TK$ exists, $e'_{ij} = 1$. The content-enhanced academic

Methodology
In this section, we explain CNEacR in detail. CNEacR builds the weighted text representation of each academic researcher and constructs a context-enhanced academic collaborator network. Then, we maximize the co-occurrence probability to obtain the high-quality latent representation of each researcher. Finally, the top-k researchers are recommended for a target researcher via the similarity of the high-quality latent representations. Some important notations are shown in Table 1. We summarize the framework of our proposed CNEacR in Figure 2 and show the whole algorithm in Algorithm 1.
As shown in Figure 2, all nodes in the dataset belong to the test set. To validate the effectiveness of our algorithm, the real collaborative relationships among nodes are divided into two classes, collaborative relationships and unknown relationships, following [13]. The collaborative relationships are the edges of the academic collaborator network, that is, its structure. The unknown relationships do not participate in the algorithm process; they are compared with the recommended top-k collaborative relationships. The ratio R of collaborative relationships to unknown relationships is discussed in Section 4.4.3.

Building Weighted Text Representation of the Academic Researcher.
Representing text is fundamental to many natural language processing (NLP) tasks. There are many methods to extract the feature representation of the researcher from text information, including probabilistic latent semantic analysis (pLSA) [14], latent Dirichlet allocation (LDA) [15], Word2vec [16], and BERT [17]. Word2vec is widely used to generate accurate feature representations from text information in a specific scenario, so we choose Word2vec to get the weighted text representation.
Given an academic researcher set $D = \{d_1, d_2, \ldots, d_{|N|}\}$, where $d_i$ represents the text information of the $i$-th researcher, composed of the titles of his or her published papers, each term is weighted by TF-IDF:

$$w_{d_i^t, d_i} = \mathrm{tf}(d_i^t, d_i) \times \log \frac{|D|}{\left|\{d_j \in D : d_i^t \in d_j\}\right|}, \tag{1}$$

where $D$ is the text information set of all researchers, $|D|$ is the total number of researchers in the dataset, $d_i$ represents the text information of each researcher, $d_i^t$ represents the $t$-th term of $d_i$, $\mathrm{tf}(d_i^t, d_i)$ is the term frequency, and the logarithmic factor is the inverse document frequency. The frequent occurrence of a term in a researcher's text information means that this term is important to the researcher. However, if a term appears in many researchers' text information at the same time, it is common to all texts and is less important to each researcher. $w_{d_i^t, d_i}$ therefore weights the importance of a term in the text information of each researcher. As described above, the weights of the terms in the text information of researcher $a_i$ can be defined as follows:

$$W_{a_i} = \left(w_{d_i^1, d_i}, w_{d_i^2, d_i}, \ldots, w_{d_i^{|M|}, d_i}\right). \tag{2}$$

The weighted text representation of researcher $a_i$ can be defined as follows:

$$RW_{a_i} = \frac{1}{|M|} \sum_{t=1}^{|M|} w_{d_i^t, d_i} \cdot \mathrm{vec}(d_i^t). \tag{3}$$

Since each researcher has a different amount of text information, we normalize the weighted text representation of each researcher. $|M|$ is the number of terms in the text information of each researcher, $\mathrm{vec}(d_i^t)$ is the vector of the $t$-th term of the $i$-th researcher learned by Word2vec, and $w_{d_i^t, d_i}$ is the weight of the $t$-th term of the $i$-th researcher.
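The weighting described in this subsection can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: term frequency is taken as the relative frequency of a term in a researcher's text, the inverse document frequency is an unsmoothed natural logarithm, the average runs over distinct terms, and the two-dimensional term vectors are hand-made stand-ins for vectors that would normally be learned by Word2vec. The function names `tfidf_weights` and `weighted_text_representation` are hypothetical.

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """Return {researcher: {term: weight}} for tokenised title text.

    Assumption: tf is the relative frequency of the term in the researcher's
    text and idf is log(|D| / document frequency), with no smoothing.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for terms in docs.values():
        doc_freq.update(set(terms))
    weights = {}
    for researcher, terms in docs.items():
        counts = Counter(terms)
        weights[researcher] = {
            term: (count / len(terms)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return weights


def weighted_text_representation(term_weights, vectors):
    """Average the weighted term vectors over the distinct known terms."""
    terms = [t for t in term_weights if t in vectors]
    dim = len(next(iter(vectors.values())))
    rep = [0.0] * dim
    for term in terms:
        for k, value in enumerate(vectors[term]):
            rep[k] += term_weights[term] * value
    return [x / len(terms) for x in rep] if terms else rep


# Toy, hypothetical data: two researchers and hand-made 2-d term vectors.
docs = {"David": ["graph", "embedding"], "Jessica": ["graph", "quantum"]}
vectors = {"graph": [1.0, 0.0], "embedding": [0.0, 1.0], "quantum": [1.0, 1.0]}
weights = tfidf_weights(docs)
rw_david = weighted_text_representation(weights["David"], vectors)
```

In this toy input the term "graph" appears in every researcher's text, so its weight is zero and it contributes nothing to the representation, which matches the intuition that common terms are less informative.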

Constructing the Context-Enhanced Academic Collaborator Network.
Given an academic collaborator network $G = (A, E)$, where $A = \{a_1, a_2, \ldots, a_{|N|}\}$ is the researcher set and $E = \{e_{i,j} \mid a_i \in A, a_j \in A\}$ represents collaborative relationships among researchers, we calculate the similarity of any two nodes from their weighted text representations using the widely used cosine similarity:

$$\mathrm{CosSim}(RW_{a_i}, RW_{a_j}) = \frac{RW_{a_i} \cdot RW_{a_j}}{\|RW_{a_i}\| \, \|RW_{a_j}\|}. \tag{4}$$

For each researcher $a_i$, we obtain a relationship set $(e'_{i1}, e'_{i2}, \ldots, e'_{i|N|})$; each $e'_{ij}$ is defined as follows:

$$e'_{ij} = \begin{cases} 1, & a_j \in \mathrm{topKList}(a_i), \\ 0, & \text{otherwise}, \end{cases}$$

where $\mathrm{topKList}(a_i)$ is the list of the top SK researchers in the similarity list of researcher $a_i$ and SK is a hyperparameter. If $e'_{ij} = 1$, $e'_{ij}$ is a new relationship. We add these new relationships to $G$ and obtain a new academic collaborator network $G' = (A, E_c)$, where $E_c = E \cup E'$, which is our context-enhanced academic collaborator network.

ALGORITHM 1: CNEacR.
(1) for $a_i \in A$ do
(2)     calculate $RW_{a_i}$ by equation (3)
(3) end for
(4) for $a_i \in A$ do
(5)     for $a_j \in A$, $a_i \neq a_j$ do
(6)         calculate $\mathrm{CosSim}(RW_{a_i}, RW_{a_j})$ by equation (4)
(7)     end for
(8)     choose the SK most similar researchers for $a_i$
(9) end for
(10) construct $G' = (A, E_c)$
(11) map $G'$ into a low-dimensional space to get the latent representation $X$ of all researchers
(12) Testing process:
(13) for $a_i \in A$ do
(14)     for $a_j \in A$, $a_i \neq a_j$ do
(15)         calculate $\mathrm{CosSim}(X_{a_i}, X_{a_j})$ by equation (4)
(16)     end for
(17)     $K \leftarrow$ top-k most similar collaborators for $a_i$
(18) end for
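The construction of the enhanced edge set can be sketched as follows. This is an illustrative sketch rather than the authors' code: the function name `content_enhanced_edges`, the representation of undirected edges as `frozenset` pairs, and the tie-breaking by researcher name are all assumptions made for the example, and the toy vectors stand in for real weighted text representations.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def content_enhanced_edges(reps, edges, sk):
    """Return the union of the coauthor edges and the top-SK semantic edges.

    reps:  {researcher: weighted text representation}
    edges: set of frozenset({a_i, a_j}) coauthor relationships
    sk:    number of most similar researchers linked to each researcher
    """
    enhanced = set(edges)
    for a_i, rep_i in reps.items():
        scored = [
            (cosine(rep_i, rep_j), a_j)
            for a_j, rep_j in reps.items()
            if a_j != a_i
        ]
        # Highest similarity first; ties broken by name (an assumption).
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        for _, a_j in scored[:sk]:
            enhanced.add(frozenset((a_i, a_j)))
    return enhanced


# Toy, hypothetical input: one coauthor edge and four text representations.
reps = {
    "David": [1.0, 0.0],
    "Jessica": [0.9, 0.1],
    "Mike": [0.0, 1.0],
    "Sam": [0.1, 0.9],
}
coauthor_edges = {frozenset(("David", "Mike"))}
enhanced_edges = content_enhanced_edges(reps, coauthor_edges, sk=1)
```

With `sk=0` the function returns the original coauthor edges unchanged, which corresponds to using the network structure alone.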

Network Embedding.
The latent representation of each researcher is the input feature of many downstream tasks, such as classification, link prediction, clustering, and visualization. Network embedding aims to learn a function $f: N \rightarrow \mathbb{R}^d$ that maps each node into a low-dimensional space $\mathbb{R}^d$, $d \ll |N|$. Let $\Theta = (\theta_1, \theta_2, \ldots, \theta_{|N|})$ denote the embedded vectors in the latent space; $\Theta$ should preserve as much of the original network topology information as possible. There exist many network embedding methods, such as DeepWalk [18], LINE [19], Node2Vec [20], and GCN [21]. In this paper, the local information and global information are equally important for each target researcher, so DeepWalk is suitable for obtaining a high-quality latent representation.
Given a context-enhanced academic collaborator network, we use the DeepWalk model to represent the relationships of academic collaboration. Intuitively, for academic collaborator recommendation, local information (namely, the neighborhood) and global information are equally important. We use the latent academic collaborative relationships obtained from random walks to learn the latent representation of each researcher. For each walk sequence $s = \{a_1, a_2, \ldots, a_s\}$, following skip-gram, we aim to maximize the probability of the neighbors of researcher $a_i$ in this walk sequence as follows:

$$\max_{\phi} \Pr\left(\{a_{i-\omega}, \ldots, a_{i+\omega}\} \setminus a_i \mid \phi(a_i)\right),$$

where $\omega$ is the window size, $\phi(a_i)$ is the current representation of researcher $a_i$, and $\{a_{i-\omega}, \ldots, a_{i+\omega}\} \setminus a_i$ is the set of local context researchers of $a_i$.
Finally, we use hierarchical softmax [22] to obtain the embedding vector of each researcher, $X = (X_1, X_2, \ldots, X_{|N|})$, $X_i = (x_{1,i}, x_{2,i}, \ldots, x_{d,i})$. The latent representation of each researcher fuses the researcher's text information and network structure. $d$ is a hyperparameter that sets the dimension of the latent representation of the researcher. For each target researcher, we can get the top-k similar collaborators according to equation (4).
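The walk-and-window step of this subsection can be illustrated with a short sketch. It only generates a truncated uniform random walk and the (center, context) pairs that a skip-gram model would be trained on; it does not train embeddings or implement hierarchical softmax. The function names, the adjacency-list format, and the fixed random seed are assumptions for the example.

```python
import random


def random_walk(adj, start, length, rng):
    """Uniform random walk of at most `length` nodes starting at `start`."""
    walk = [start]
    while len(walk) < length:
        neighbours = adj[walk[-1]]
        if not neighbours:  # isolated node: the walk cannot continue
            break
        walk.append(rng.choice(neighbours))
    return walk


def context_pairs(walk, window):
    """(center, context) pairs within `window` positions of each center."""
    pairs = []
    for i, centre in enumerate(walk):
        lo = max(0, i - window)
        hi = min(len(walk), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((centre, walk[j]))
    return pairs


# Toy, hypothetical network given as adjacency lists.
adj = {"David": ["Jessica", "Mike"], "Jessica": ["David"], "Mike": ["David"]}
walk = random_walk(adj, "Jessica", length=5, rng=random.Random(0))
pairs = context_pairs(["a", "b", "c"], window=1)
```

In practice many walks are started from every node, and the resulting sequences are passed to a skip-gram trainer; an off-the-shelf Word2vec implementation is one common choice for that step.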

Experiment
In this section, we evaluate our proposed CNEacR model on two real-world datasets. We introduce the datasets, baselines, evaluation criteria, and experimental results in detail.

Datasets.
PRB (Physical Review B) from the APS (American Physical Society) (https://journals.aps.org/datasets) consists of articles on physics. First, we perform name disambiguation on authors from 1893 to 2015 based on [23]. Authors who have fewer than 2 collaborators from 2006 to 2010 are removed. Finally, we extract 34,905 authors and 14,055 papers to evaluate our proposed CNEacR. We also adopt AMiner, a larger-scale dataset: we randomly choose 14,000 papers, which involve 20,057 researchers who have more than 10 papers. Table 2 shows the details of the datasets. Some necessary cleaning is done, such as removing excess code fragments, removing stop words, tokenization, and lemmatization.
To evaluate the performance of CNEacR, we treat all researchers in the dataset as target researchers. A fraction R of the collaborator relationships of each researcher is used as the training samples, and the remaining fraction 1 − R is used as the test target, following [13]. In the experiments, we sample relationships with ratio R 10 times to ensure that the selected relationships cover as many authors as possible. All experiments are performed on a 64-bit Linux operating system (Ubuntu 16.04) with a 64-duo 2.10 GHz Intel CPU and 1 TB of memory. All programs are implemented in Python.
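A simple way to produce such a split is sketched below. It is a simplification and an assumption, not the authors' procedure: the edge set is shuffled and cut globally at ratio R, whereas the text describes taking a fraction R of each researcher's relationships. The function name `split_relationships` and the seed parameter are hypothetical.

```python
import random


def split_relationships(edges, ratio, seed=0):
    """Split undirected edges into (training, test) sets at the given ratio.

    edges: iterable of (a_i, a_j) pairs
    ratio: fraction R of edges kept as known collaborative relationships
    seed:  seed for a reproducible shuffle
    """
    # Normalise each pair so (a, b) and (b, a) count as the same edge.
    ordered = sorted({tuple(sorted(edge)) for edge in edges})
    rng = random.Random(seed)
    rng.shuffle(ordered)
    cut = round(ratio * len(ordered))
    return set(ordered[:cut]), set(ordered[cut:])


# Toy, hypothetical edge list with ten coauthor relationships.
all_edges = [("r0", f"r{i}") for i in range(1, 11)]
train_edges, test_edges = split_relationships(all_edges, ratio=0.3, seed=42)
```

Repeating the call with different seeds gives the repeated samplings mentioned in the text; results would then be averaged over the runs.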

Baselines.
We compare CNEacR with the following six methods, the first of which is the classic method for academic collaborator recommendation:
(1) MVCWalker: MVCWalker [24] is a random walk model built on random walk with restart for collaborator recommendation. It combines three academic factors: coauthor order, latest collaboration time, and times of collaboration.
(2) TNERec-G: TNERec-G is the part of TNERec that only uses the structure of the academic collaborator network to get the feature representation of the researcher for collaborator recommendation.
(3) CTPF: CTPF [25] is a probabilistic model of articles that represents researchers by their preferences for topics. It integrates two ideas: collaborative topic regression and Poisson factorization.
(4) TNERec: TNERec [13] is an academic collaborator recommendation method that learns one feature representation from the interests of the researcher based on a topic model and another from the structure of the academic collaborator network using network embedding, and then fuses them using a spectral technique for better collaborator recommendation.
(5) CNEacR-G: CNEacR-G is the part of CNEacR that only uses the structure of the academic collaborator network to get the feature representation of the researcher for collaborator recommendation (it does not use any semantic relationship).
(6) CNEacR-T: CNEacR-T is the part of CNEacR that only uses the text information of the researcher to recommend collaborators, that is, text-based recommendation.

Evaluation Criteria.
We use the most common evaluation criteria in information retrieval as the evaluation metrics for academic collaborator recommendation. Precision@k is the ratio of correctly recommended collaborators to the top-k recommended candidates when recommending k candidate collaborators for the target researcher. Precision@k is defined as follows:

$$\mathrm{Precision@}k = \frac{1}{m} \sum_{a} \frac{|R_a \cap T_a|}{|R_a|}.$$

Recall@k is the ratio of correctly recommended collaborators to all real collaborators of the target researcher in the test set when recommending k candidate collaborators. The recall value is computed as follows:

$$\mathrm{Recall@}k = \frac{1}{m} \sum_{a} \frac{|R_a \cap T_a|}{|T_a|},$$

where $m$ is the number of target researchers, $R_a$ is the top-k recommended researcher list for the target researcher, and $T_a$ is the set of real collaborators of the target researcher in the test set. F1 is the harmonic mean of precision and recall and is defined as follows:

$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}.$$

IDCG is the DCG of the best possible recommendation list, and NDCG is the normalized score of the recommended list. We define $r_i$ as the rating of the $i$-th researcher in the recommended researcher list: $r_i = 1$ if the recommended collaborator is relevant, and $r_i = 0$ otherwise. NDCG@k is defined as follows:

$$\mathrm{DCG@}k = \sum_{i=1}^{k} \frac{2^{r_i} - 1}{\log_2(i + 1)}, \qquad \mathrm{NDCG@}k = \frac{\mathrm{DCG@}k}{\mathrm{IDCG@}k}.$$

Table 3 shows the performance comparison: CNEacR outperforms all baselines on precision, recall, F1, and NDCG. Besides, we present the results of CNEacR-G and CNEacR-T on PRB. CNEacR-G only uses the collaborator relationships in the network, and CNEacR-T only uses the text information of each researcher. To make the results more convincing, we also run the experiment on AMiner and compare CNEacR with the two kinds of methods, content-based recommendation and network-based recommendation. Table 4 shows the results of the experiment on AMiner. From Tables 3 and 4, we see that CNEacR-G, which does not use the text information, gives poor results, and CNEacR-T, which does not use the network structure, is not good enough either. Utilizing both text information and network structure therefore plays an important role in academic collaborator recommendation. We demonstrate the performance for different recommendation list lengths and analyze the results for different training set ratios on PRB. As an auxiliary experiment, we only demonstrate the performance at R = 0.3.
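The evaluation criteria of this subsection can be computed with a few lines of Python. The sketch below is illustrative and rests on assumptions: precision and recall are averaged over target researchers and F1 is then taken from the two averages, relevance is binary, and the ideal list for IDCG places all relevant collaborators first. The function names and the toy recommendation lists are hypothetical.

```python
import math


def precision_recall_f1(recommended, relevant):
    """Mean Precision@k and Recall@k over targets, and F1 of the two means.

    recommended: {target: ranked list of recommended researchers}
    relevant:    {target: set of real collaborators in the test set}
    """
    m = len(recommended)
    hits = {a: len(set(recommended[a]) & relevant[a]) for a in recommended}
    precision = sum(hits[a] / len(recommended[a]) for a in recommended) / m
    recall = sum(hits[a] / len(relevant[a]) for a in recommended) / m
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def ndcg_at_k(ranked, relevant, k):
    """NDCG@k with binary relevance (gain 1 for a real collaborator)."""
    dcg = sum(
        1.0 / math.log2(i + 2)  # position i is 0-based, so rank = i + 1
        for i, researcher in enumerate(ranked[:k])
        if researcher in relevant
    )
    ideal_hits = min(k, len(relevant))
    idcg = sum(1.0 / math.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


# Toy, hypothetical recommendations for two target researchers.
recommended = {"x": ["a", "b"], "y": ["c", "d"]}
relevant = {"x": {"a"}, "y": {"c", "d", "e", "f"}}
precision, recall, f1 = precision_recall_f1(recommended, relevant)
ndcg_first = ndcg_at_k(["a", "b"], {"a"}, k=2)
ndcg_second = ndcg_at_k(["b", "a"], {"a"}, k=2)
```

The two NDCG calls show the rank sensitivity of the metric: the same single hit scores lower when it appears in second position than in first.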

Parameter SK.
We analyze the parameter SK, which is used to build the relationships among researchers, on the two datasets. Similar to [13], we set the length of the recommendation list top-k to 5 and choose SK as 0, 1, 2, 3, 4, and 5, respectively, to build the content-enhanced academic collaborator network. Figure 3 shows the comparison results of CNEacR for different SK on the two datasets. From Figure 3, we find that the best SK differs between datasets: CNEacR performs best with SK = 2 on PRB and with SK = 1 on AMiner. Figure 3 also shows that SK has a large influence on the performance of CNEacR. As SK increases, the number of uncertain relationships increases, which impairs the ability of CNEacR to capture real collaborative relationships.

Influence of the Recommendation List.
We analyze the performance of CNEacR with different lengths of the recommendation list. We set the training set ratio R = 0.3, the dimension of the researcher vector to 100, and the parameter SK to 2. Figure 4 compares our proposed model with the other methods on precision, recall, F1, and NDCG. As the length of the recommendation list top-k increases, the precision of CNEacR, CNEacR-T, CNEacR-G, TNERec, TNERec-G, and CTPF shows a downward trend, whereas the precision of MVCWalker first goes up and then goes down. The recall of all methods shows an upward trend. The F1 of all methods first increases and then decreases. The NDCG of all methods remains steady. Network-based and content-based collaborator recommendations each work well on their own, and the experimental results verify that our method, which utilizes both the weighted text representation and the academic collaborator network, performs well compared with all the above methods.

Influence of Ratio R.
To rule out chance in the experimental results, we evaluate the performance of CNEacR with training sets of different sizes. We vary the ratio R from 20% to 80% and set the recommendation list size k to 3. We also set the dimension of the latent representation of the researcher to 100 and the parameter SK to 2. Figure 5 shows the performance compared with other methods for different R in terms of precision, recall, F1, and NDCG. CNEacR clearly outperforms the other methods on all four metrics regardless of R. In Figure 5, the methods follow the same trends, except for the network-based methods CNEacR-G and TNERec-G. CNEacR is always better than the network-based, content-based, and hybrid recommendation baselines.

Case Study.
Table 5 shows the case study of different methods for collaborator recommendation. We randomly select a researcher (F.Ishikawa) from the test set for demonstration. We use three methods to recommend the top 5 collaborators for the target researcher F.Ishikawa. From the table, we can see that CNEacR-G correctly provides only one collaborator, T.Naka, which indicates that CNEacR-G captures the information of the network structure. CNEacR-T correctly recommends one more collaborator, A.Matsushita, than CNEacR-G, which indicates that CNEacR-T can capture the information of semantic relationships. Our method CNEacR correctly recommends four collaborators, including two more, Y.Takaesu and T.Nakane, than CNEacR-T. This indicates that utilizing the weighted text representation together with the intrinsic collaborative relationships yields better performance than content-based or network-based recommendation alone. The four collaborators correctly recommended by CNEacR include those recommended by both CNEacR-G and CNEacR-T, which indicates that CNEacR can capture both the semantic relationship and the collaborative relationship to recommend latent academic collaborators for the target researcher.

Related Work
At present, it is common for researchers to collaborate in research [26]. A researcher who collaborates with others has much higher scientific productivity than one who always does research independently [27]. Finding proper collaborators from complex and unstructured data is therefore essential for researchers. Recently, much work has been done on helping researchers find proper collaborators. These works on academic collaborator recommendation fall into three categories: network-based recommendation, content-based recommendation, and hybrid recommendation.
In an academic collaborator network, academic collaborator recommendation is usually modeled as a link prediction problem. The key to predicting relationships from structural features of the academic collaborator network is to calculate the similarity among researchers. In [28], Jeh and Widom used SimRank scores, based on a simple and intuitive graph-theoretic model, to measure the similarity between two researchers. However, SimRank cannot exploit all paths of different lengths in the network. To overcome this problem, more accurate and faster friend recommendations were provided by traversing all paths of limited length [29]. Recently, new measurements such as relative entropy [30] and network motif [31] were proposed. The most popular model in the field of collaborator recommendation is the random walk [32]. Some works build on the random walk for academic collaborator recommendation, and it has proved competent for calculating the rank score of researchers in the academic collaborator network [33]. These methods use the weights on edges to guide the random walker on the academic collaborator network [24,34,35]. The weights are derived from the affiliated institution of the researcher or from academic factors, such as coauthor order, latest collaboration time, and times of collaboration. MVCWalker used the rich information of both nodes and links to dig out the similarity structure of the academic collaborator network based on probability [33,36]. However, a random walker can only extract information from the academic collaborator network.
Using structural features alone is not sufficient for academic collaborator recommendation. Models for computing the similarity between two researchers have been based on expertise profiles extracted from their publications and academic home pages [7]. Kong et al. held that the interest of each researcher is very important for academic collaborator recommendation: a topic model was used to mine the text information of researchers each year to obtain topic information, and the topics were then clustered as the researchers' interests [8]. The cross-domain topic learning model used topic layers to replace author layers to alleviate the sparseness issue and topic skewness in collaborations across disciplines [9]. Text information and network structural information are equally important to academic collaborator recommendation.
There exist some hybrid recommendation models. Structural information and user-generated content were combined, and a generative model was then introduced to help people find friends on Twitter and Flickr [10]. CCRec clustered the topics of each researcher's text information and utilized the structure of the academic collaborator network to find the most relevant and latent collaborators [11]. A hybrid algorithm with eight measures was proposed to recommend latent academic collaborators across different disciplines [37]. It generated high-quality researcher profiles by integrating researchers' expertise, coauthor network characteristics, and researchers' institutional connectivity into a unified framework with SVM-rank [38]. It was applied in the ScholarMate system, a virtual academic community for promoting researchers' collaboration. Coauthor relationships have also been predicted based on content, social, and hybrid recommendation algorithms [12]. Kong et al. argued that fusing a topic model with academic relationships could improve the performance of academic collaborator recommendation [13]. However, a topic model captures the probability distribution of words and documents, which only reveals their implied topics. The title of a paper is short, but it contains the main idea of the whole paper and can distinctly express the research field of a researcher. Word2vec [16] is based on the text information (i.e., semantic and syntactic) of a researcher and can express the researcher's feature representation in specific application scenarios. In this paper, we use the weighted text representation to represent each researcher. Then, a context-enhanced network is built according to the similarity between every two researchers to predict collaborative relationships.

Conclusion
In this paper, we propose a novel CNEacR method to recommend academic collaborators. CNEacR utilizes the weighted text representation to build a content-enhanced academic collaborator network that contains not only intrinsic collaborative relationships but also the latent semantic relationships formed by the text information. From this network, we use network embedding to get high-quality latent representation, which captures the latent semantic relationships among researchers. Extensive experiments on the real-world datasets demonstrate the effectiveness of CNEacR and its superiority over several existing methods.
We only pay attention to strong relationships (the paper content and academic relationships), while weak-tie relationships, such as publishing in the same conference or journal, should also be considered. Because two papers from the same conference or journal share the same research field, their authors are likely to build a collaborative relationship in the future. Thus, we will take weak-tie relations into account in future work.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.