Uncovering Research Topics of Academic Communities of Scientific Collaboration Network

In order to improve the quality of applications, such as recommendation or retrieval in knowledge-based service system, it is very helpful to uncover research topics of academic communities in scientific collaboration network (SCN). Previous research mainly focuses on network characteristics measurement and community evolution, but it remains largely understudied on how to uncover research topics of each community. This paper proposes a nonjoint approach, consisting of three simple steps: (1) to detect overlapping academic communities in SCN with the clique percolation method, (2) to discover underlying topics and research interests of each researcher with author-topic (AT) model, and (3) to label research topics of each community with top N most frequent collaborative topics between members belonging to the community. Extensive experimental results on NIPS (neural information processing systems) dataset show that our simple procedure is feasible and efficient.

As we all know, most real-world networks contain groups in which nodes are more highly connected to each other than those to the rest of the network [13]. The sets of such nodes are usually called communities, clusters, cohesive groups, or modules [13,20]. Similar to the real-world networks, SNs also include many communities based on common location, interests, and occupation. As one kind of SNs, it should be no exception for SCN [2,7,13,19,20]. According to whether or not a node is allowed to be a member of more than one community, the communities can be further divided into two types: overlapping and nonoverlapping. Most real-world networks are characterized by well-defined statistics of overlapping and nested communities [13].
Although real motivations for a link in the SCN are still not well understood at present, there usually exists one or more than one relationship (such as co-colleagues, advisoradvisee, classmates, coproject, friends, or many others) in the real world if two researchers coauthored some papers. Moreover, in order to follow the frontier research or borrow main ideas from other fields [18], an active researcher might involve multiple fields. Intuitively, it is unreasonable to limit one researcher to belong to only one community. Therefore, it convinces us of the fact that overlapping communities also exist in the SCN.
It is increasingly important to detect communities in SN in modern applications, ranging from bioinformatics, enterprise organization management, to bibliometrics [21]. Many approaches have been proposed to detect communities in the SN [22,23], such as traditional methods based on clustering like k-means and other applications, division algorithms based on hierarchical clustering, modularity-based algorithms, spectral algorithms, dynamic algorithms, statistical inference-based methods, multiresolution methods, and 2 International Journal of Distributed Sensor Networks lastly methods to find overlapping communities and other miscellaneous methods [23]. But most existing methods for finding communities just discover the separated sets in networks and ignore the overlapping phenomenon [13,23].
In fact, it is the first step to identify network structure if one wants to provide a valuable insight into how network function and topology affect each other. In a knowledgebased service system, users may be interested in not only the link structure in a network but also in the reason why they form a community. However, most present methods merely focus on detecting the structures or monitoring evolution of communities. There are few literatures on uncovering research topics of academic communities and providing particular information for searching a group of researchers with similar interests. To the best of our knowledge, only Ichise et al. [18] put forward a method to detect academic communities with topic identification in literature. On closer examination, one can see that the word assignment technique was utilized for obtaining the communities. Unfortunately, the method has the limitation of trust in the keywords.
To overcome these problems, the paper proposes a nonjoint approach, which integrates community detection method and author-topic (AT) model. Specifically, it consists of three simple steps: (1) to detect overlapping academic communities in SCN with the clique percolation method, (2) to discover underlying topics and research interests of each researcher with author-topic (AT) model, and (3) to label research topics of each community with top most frequent collaborative topics between members belonging to the community, where common topics between researchers are seen as collaborative topics.
The remainder of the paper is organized as follows. Section 2 provides related works on community detection models and topic models. Section 3 illustrates the analysis framework in the study and then introduces each unit of the framework. Section 4 describes and discusses experimental results. Finally, the conclusion is made.

Community Detection Methods.
Community detection is the organization of nodes in a network into subsets of nodes such that nodes within a subset are more densely connected internally than those within the other subsets. Another way to say this from a graph theoretic perspective is that, given a graph ( , ) with a set of nodes and a set of edges, community detection is to classify the node set into multiple subsets = { } | | =1 with ⊆ , such that nodes belonging to a subset are all closely related [24]. Here | ⋅ | denotes the number of the elements in a set.
Because the number of communities underlying a network is typically unknown in advance and the sizes or densities of communities are often uneven, it is not trivial to find automatically community structure. Several community detection approaches have been developed and employed with varying levels of success [25], including hierarchical clustering algorithm, Girvan-Newman algorithm [26], modularity maximization algorithm [27], and clique-based methods [13]. It is worth noting that only the last one can deal with the overlapping phenomenon.
Hierarchical clustering is a simple algorithm which employs some type of similarity metrics between node pairs to group similar nodes into communities. Girvan-Newman algorithm identifies edges that lie between communities and then removes them, just leaving behind the communities themselves. Though Girvan-Newman is popular in a number of standard software packages, its time complexity is O(| | 2 × | |), making it impractical for networks with more than a few thousand nodes [26]. Modularity maximization algorithm defines a benefit function to measure the quality of a particular division of a network into communities [27], and can be used for large-scale network problem. However, since approximate optimization is utilized in modularity maximization algorithm, it often fails to detect clusters smaller than some scale depending on the size of network.
Clique-based methods build up the communities from the cliques in a network [28]. By clique, we mean the complete subgraphs in a network that are not parts of larger complete subgraphs. Specifically, the general procedure of the methods is to find cliques firstly and then to unite the cliques bigger than a minimum number of nodes to define a subgraph of original network, and finally components (disconnected parts) of the defined subgraph are used to define communities [22,23]. The alternative of the method is to use -cliques, which are complete subgraphs with nodes, to construct line graph known as clique graph [20,28]. In fact, clique graph is a hypergraph of original graph, the nodes of which are -cliques, and the edges of which record the overlap of the cliques in the original graph. The difference between -cliques and cliques is that -cliques can become subsets of larger complete subgraphs. A typical approach based on -cliques is the clique percolation method (CPM) [13,29], which defines communities as percolation clusters ofcliques. CPM algorithm runs in the time O( ln | | ), where and are constant values [30].

Topic Models.
Topic models are a family of statistical models for discovering a mixture of "components" in a collection of documents [31]. In these models, each topic is modeled as a probability distribution over words in the vocabulary of corpus and each document in corpus is modeled as a mixture of topics given by a multinomial distribution over the topics [32]. An early topic model called probabilistic latent semantic indexing (pLSI) was proposed by Hofmann [33]. While Hofmann's work is a useful step toward probabilistic modeling of text, it is incomplete in the fact that it provides no probabilistic model at the level of documents. In order to overcome this problem, Blei and his coworkers developed latent Dirichlet allocation (LDA) model [34]. LDA is similar to pLSI, except that in LDA model the topic distribution is assumed to have a Dirichlet prior. In practice, the assumption usually brings about more reasonable mixtures of topics in a document. Subsequent topic models, such as author-topic (AT) model [35], topic over time (ToT) model [36], authortopic over time (AToT) model [31,32,37], and conferenceauthor-relation topic (CART) model [38], are generally extensions on LDA.
International Journal of Distributed Sensor Networks 3 To discover collaborative topics between authors  Figure 1: Framework of the method.
As a famous topic model, LDA [34] is a generative probabilistic model for collections of discrete data such as text corpora [39]. LDA model is based upon the idea that the probability distribution over words in a document can be expressed as a mixture of topics. It means that each document may be viewed as a mixture of various topics. LDA model can be viewed as a generative process. A document can be generated in following three steps: (1) to sample a mixture proportion from a Dirichlet distribution, (2) to sample a topic index according to the mixture proportion for each word in the document, and (3) to sample a word token from a multinomial distribution over words specific to the sampled topic.
AT is also a generative model that extends LDA model to include authorship information [40,41]. The model provides a relatively simple probabilistic model for exploring the relationships between authors, documents, topics, and words. In the model, each author is represented by a multinomial distribution over topics and each topic is represented by a multinomial distribution over words. The words in a document coauthored by multiauthors are assumed to be the result of a mixture of topic mixture of each author. Then, the topic-word and author-topic distributions are learned from text corpus. Compared with LDA, AT can give the increase of salient topics and more reasonable researchers interest patterns [40]. AT model has been proved to be an essential way to uncover the research interests of each researcher [40,41].

Method
The analysis framework of proposed approach is illustrated in Figure 1. The framework is composed of four parts: to preprocess data, to detect communities in scientific collaboration network, to discover collaborative topics between authors, and to uncover topics of academic communities. We describe each part in detail in the following subsections.

To Preprocess Data.
In this part, words and authors are extracted from papers collected. First word terms are extracted to build vocabulary and all stop-words are eliminated. Then word frequency and inverse document frequency of each word in vocabulary are computed for following author-topic model. Next, author names are extracted and the name disambiguation algorithm is used to process ambiguous names, such as an author with multiple names or multiple authors with the same name, and then all author names are normalized to a standard name and assigned a unique ID number.

To Detect Communities in SCN.
From a "topological" point of view, network can be divided into four categories: undirected binary network, directed binary network, weighted directed network, and weighted undirected network [10,14]. In the part, a SCN is first created by following the principle of undirected binary network, in which each node represents an author and each edge represents the coauthorship between two linked authors. Specifically, if two authors coauthored one paper at least, an edge with unit weight will be created. In other words, no matter how many papers two authors coauthored, there is only one edge between them. For example, if 1 , 2 , and 3 coauthored one paper and 1 and 2 coauthored another paper, three edges will be created, that is, 12 , 13 , and 23 , in the constructed network.
Then, cliques are extracted from constructed SCN and communities are detected with -clique-community detection algorithm [13]. The community definition in the algorithm is based on the observation that a typical node in a community is linked to many other nodes, yet not necessarily to all other nodes. A -clique-community is the union of all -cliques that can be reached from each other through a series of adjacent -cliques, where adjacency means sharing − 1 nodes. When = 2, the -clique communities are equivalent to the connected subgraphs, which are also called components in complex network analysis.

To Discover Collaborative Topics between Authors.
Here, AT model is used to uncover the research interest of each author. The graphical model representation for AT model is shown in Figure 2. The following notations are used in this study. Let and be the set of papers and unique words in the corpus, respectively. For each ∈ {1, 2, . . . , | |}, or are denoted by all the word tokens or the set of authors in the paper and | | means the length of the paper . or are multinomial distribution of topics or words specific to the author or the paper . For more elaborate and detailed descriptions on AT model we refer the readers to [35,40,41].
In this work, collapsed Gibbs sampling algorithm, which runs over the three periods, initialization, burn-in, and sampling with iterations in total, is used for inference , since it provides a simple method for obtaining parameter estimates under Dirichlet priors. The time complexity of the AT model is as follows: In our paper, each topic is represented with the top 10 words most likely to be generated conditioned on the topic, and research interest of each author is represented with the top 10 most likely topics. From the results of AT model, we can build the relation matrix between authors and topics. Each element of the matrix is the association probability between an author and a topic. With the matrix, it is easy to get collaborative topics between any two authors with coauthorship. Formally, for each ∈ {1, 2, . . . , | |}, let = { 1 , 2 , . . . , 10 } be the research topics of author . Then the

To Uncover Topics of Academic Communities.
In Section 3.2, we have detected communities in SCN. In Section 3.3, we have got topics of authors and we have also obtained the collaborative topics between any two collaborated authors in detected communities. Here, we will integrate both results to uncover the topics of academic communities by ranking topics and selecting the most frequently collaborated ones.
where the indicator function ( ) = 1 if is true and 0 otherwise. Once we have got collaborative frequencies of all topics in community , we rank them by sorting frequency descendingly and then select the top topics as the research topics of the community.
To illustrate the process clearly, let us take a simple example in Figure 3 with a community consisting of four nodes. Given author topics as follows: 1 = { 1 , 2 , 10 , Finally, if we rank the frequencies and select top 2 topics as research topics of the community, the result is { 1 , 4 }.

Experimental Results and Discussion
4.1. Data. NIPS proceeding dataset is utilized to evaluate the performance of proposed framework, which consists of the full text of the 13 years of proceedings from 1987 to 1999 Neural Information Processing Systems (NIPS) Conferences (http://www.cs.toronto.edu/∼roweis/data.html). The dataset contains 1,740 research papers and 2,037 unique authors. Because all the author names have been processed and normalized, we need not run name disambiguation algorithm in the step of preprocessing data. Based on coauthorship, we count the collaborative numbers between coauthored researchers. The distribution of the numbers of author pairs over collaborative numbers is shown in Table 1. It shows that the maximum collaborative number between authors is 9 corresponding to author pair (Smola A ID: 1475 and Scholkopf B ID: 1504).
In addition to downcasing and removing stop-words and numbers, we also remove the words appearing less than five times in the corpus. After the preprocessing, the dataset Table 1: Distribution of the number of author pairs over collaborative numbers in NIPS dataset.

Scientific Collaboration Network.
Based on coauthorship, we construct the scientific collaboration network containing 1897 nodes and 3130 edges. That is to say, there are 140 authors who did not collaborate with any other authors. The constructed network graph is shown in Figure 4. From Figure 4, we find that the NIPS network is composed of a larger subgraph (in the center of the picture) and many smaller subgraphs.

Component Analysis.
Using component analysis approach [42] on the network, we found 235 components totally. That is to say, the network contains 235 separated subgraphs. The number of author nodes in the top 10 components is 1061, 37, 27, 22, 19, 15, 11, 10, 10, and 9, respectively. We select the largest one (1061 authors) as analysis object in the following experiments. Figure 5 shows the graph of the largest component.

Cliques.
The -clique-community detection algorithm in NetworkX tools [42] is used to discover all cliques in the network. The size and number of cliques in the network and the largest component are shown in Table 2. The size of the largest clique in the network is 10, which means that  it contains 10 authors, while the size of largest clique in the largest component is 9.

Community Analysis.
The detected communities depend on the value of parameter , where refers to the size of cliques. Typically, the value of is between 3 and 6 [13]. Increasing makes the communities smaller and more disintegrated, but, at the same time, also more cohesive [13]. Different value of will give rather different results and thus give us flexibility when providing research community service for users in knowledge-based system. The cliques found in Section 4.4 are used to detect communities. Communities detected under different value are presented as follows.
(a) = 2. If = 2, communities detected will be the largest component of the network shown in Figure 5 (note that we just use the largest component to do the community analysis).  Tables 3 and 6 overlapping nodes: 77, 156, 383, 726, 1475, and 1504, which are underlined in Table 3. Their names are provided in Table 6. Comparing with the results of overlapping nodes on = 3, authors 77, 156, and 383 are still the overlapping nodes; however, other 20 nodes when = 3 are not overlapping nodes when = 4; therefore the three authors maybe played more important role in interaction between communities or they have more diverse research interests.
(d) = 5. Setting = 5, we obtain 6 communities shown in Table 4, and again the IDs of overlapping nodes are underlined. There is only one overlapping node left which is a member of three communities, that is, communities 4, 5, and 6, and the author ID of the node is 156. In our experiment, the author (name of whom is Sejnowski T) is the only overlapping nodes when = 3, 4, and 5. He maybe has more diverse research interests so that he can play "bridge" role in multiple communities.
The subgraphs of the 6 communities ( = 5) in NIPS network are shown in Figure 6. In Figure 6, nodes in the same community are painted with the same color and shape except the overlapping node. The overlapping node is painted with red color and circle shape. From Figures 6(a)-6(c), one can see that communities 1 and 3 are connected but they do not have overlapping nodes, community 2 is separated from other communities, and communities 4, 5, and 6 are connected by overlapping node 156.

Topics and Collaborative Topics between Authors.
By running AT model on NIPS dataset, we obtain 100 topics and assign each topic an ID number range from 0 to 99. In Table 5, we list some typical topics and top 10 hot words for each topic with corresponding probabilities. Among them, there are some representative domains in NIPS in the listed topics, including support vector machine (SVM) and kernel methods, neural network, speech recognition, image and vision, EM and mixture model, and independent component analysis (ICA). After running AT model on NIPS dataset, we also obtain research topics of each author. We select the 10 most likely topics as research topics for each author, and each topic has a probability value indicating the possibility that the author is related to the topic. Table 6 shows research topics of the authors corresponding to the overlapping nodes in communities under the condition = 4. In the column "Topics ID2 and probabilities" of Table 6, the decimal value in parentheses after each topic is the probability specific to the author. For example, "19(0.18351)" denotes the probability author Koch C specific to topic 19 is 0.18351.
From the results of authors' research topics, we can easily obtain the collaborative topics between two collaborated authors by finding their common topics. In the results of AT model, we got the probabilities between authors and topics. If two authors collaborated a topic, we use the smaller value of probabilities to express the possibility of the topic they are related to. For example, Vapnik V (726) and Smola A collaborated in topic 77, and the probability values they are related to the topic are 0.26370 and 0.33208, respectively, so the probability they collaborated in the topic is 0.26370. In Table 3, we have listed all authors in community 2 ( = 4).   In the community, there are three overlapping nodes, 726, 1475, and 1504. In Table 6, we have listed the hot topics of the three authors. In this way, we can obtain a symmetric matrix of collaborative topics between them. In each cell except the diagonal cells, the collaborative topics are the intersection of topics of corresponding two authors. Collaborative topics between them are reported in Table 7.

Community Topics.
In this subsection, we will integrate the results from Sections 4.5 and 4.6 to uncover the research topics of communities. Our main idea is to find out the most frequent and possible collaborative topics in each community.
To be specific, for each community, we count all collaborated topics in all edges of the subgraph using (1) and then rank them by sorting both collaborative frequencies and probabilities descendingly. We use the minimal collaborative probabilities as probabilities for each topic in the counting process. We show the result of community 1 ( = 5; see Table 4) in Table 8. If two topics have the same frequencies, the topic with larger probability is ranked before another one.
With ranked collaborative topics of communities, we can select the most outstanding topics or select top topics as research topics for each community. In this paper, we use top topics to represent the research interests for all detected communities. Here, we set = 3 and list the topics of all communities under the condition = 5 (shown in Table 9). In column 2 of Table 9, topics are split by comma, and each topic is listed as the sequence of topic ID, topic name, and collaborative frequency.
According to the results, we can speculate the research interests for each community. It seems that the main interest of community 1 is related to "pattern recognition" and "outlier detection, " community 2 "speech recognition, " community 3 "SVM & kernel method, " community 4 "EM & mixture model" and "neural network and Boltzmann machine, " community 5 "independent component analysis" and EEC, and community 6 "image and vision" and "independent component analysis. " In order to investigate the effectiveness of proposed method, we check all the papers collaborated by authors in each detected community in Table 4. We found that the uncovered topics for most communities are closely related to topics of the papers written by the community authors. For example, there are 16 papers collaborated by authors in community 3 ( = 5), and the topic of the community seems to be "support vector machine and kernel method" according to the results of proposed method. We have listed the titles of these papers in Appendix A. We found that there are 8 papers, the titles of which contain "support vector, " and 4 papers, the titles of which contain "kernel. " Although there exist other topics judging from the titles of rest papers, it still assures us that the research interests of the community are mainly related to "SVM and kernel method. " Finally, we examine the function of the overlapping nodes. In our experiments, when = 5, we have an overlapping node, 156 (author name: Sejnowski T), between communities 4, 5, and 6. It is interesting that all the uncovered topics for the three communities contain topic 24 (visual position), and topics of both communities 5 and 6 contain topic 89 (independent component analysis). In order to find whether or not overlapping nodes play a bridge role between the three communities, we check all the 43 papers of author 156. The titles of his papers are provided in Appendix B. We found that the research interests of Sejnowski T covered almost all found topics of the three communities. From the titles of the papers, it is not difficult to see that there are more than 12 papers covering topic 24, about 6 papers covering topic 89, and about 3 papers covering topic 92. Thus, we have reason to believe that author 156 has multiple research interests and so plays a bridge role between the three communities.

Conclusion
In this work, a method of uncovering research topics of communities in scientific collaboration network is proposed. The method integrates community detection model using -clique-community algorithm and the author-topic model. The approach of -clique-community algorithm is to detect overlapping communities in scientific collaboration network, while the approach of AT model is to discover topics and authors' topics. We use common topics of coauthored researchers as their collaborative topics. Finally, we count all collaborative topics and select the most frequent collaborated topics among authors as research topics of communities. Experimental results on NIPS dataset show that our method is feasible and efficient.
In a knowledge-based system, it will be useful after obtaining the information of communities and their research topics. This information will help users position aninteresting academic community quickly, and then they can be led to   Table 6: Author topics of overlapping nodes ( = 4).

AID Author
Topic IDs and probabilities 77 Koch C find interesting topics, researchers, and papers by using the topics and coauthorship of authors in the community. This information will also help us improve the application effect and user experience in academic recommendation system and provide researchers in a community with information they really need. Therefore, the more interesting problems related to the method are how to use it and its results in a knowledge-based system and how the parameter in -clique-community algorithm affects user selection in the practical application.
There are some challenge problems for future studies. One is to develop an algorithm to obtain collaborative topics between authors directly by extending AT model. Another one is to analyze topic evolution of communities and the functions of overlapping nodes in the evolution process.  CNo is the abbreviation of community number. TID is the abbreviation of topic ID. Freq. is the abbreviation of frequency.