The enhancement of the TextRank algorithm by using Word2Vec and its application to topic extraction

TextRank is a traditional method for keyword matching and topic extraction, but its drawback is that it ignores the semantic similarity among texts. Using the word embedding technique Word2Vec, we incorporated semantic similarity into traditional TextRank and carried out four simulation tests for model comparison. The results showed that the hybrid combination of the Word2Vec and TextRank algorithms achieved better keyword/topic extraction on our test dataset.


Introduction
With the booming development of new media and the Internet, unstructured and semi-structured news text data is proliferating. Extracting effective information from such complex and irregular texts is of great significance, as it can markedly improve daily reading efficiency.
PageRank is a ranking algorithm for webpages [1]. The algorithm assigns a heavier weight to a webpage that is cited more frequently by other webpages. In other words, the importance metric of a webpage relies on the number of linking sources. Let $V = \{V_1, V_2, \ldots, V_m\}$ be the set of webpages, $S(V_j)$ the weight of webpage $V_j$, $In(V_i)$ the set of webpages that link to $V_i$, and $Out(V_j)$ the set of webpages that $V_j$ links to. The importance metric is

$$S(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)} \frac{S(V_j)}{|Out(V_j)|} \qquad (1)$$

where $d$ is a damping factor, normally set to 0.85. A drawback of this algorithm is that an outlier can significantly affect the result; e.g., a dramatic change of one webpage in $In(V_i)$ brings a dramatic change of the value of $V_i$. Hence, several enhancements of PageRank have been proposed [2,3].
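To make the update rule concrete, the following is a minimal sketch of iterating formula (1) over a toy link structure; the function name and the three-page example are illustrative, not from the paper:

```python
# Iterate the PageRank update of formula (1) until the scores settle.
def pagerank(links, d=0.85, iters=50):
    """links[j] is the set of pages that page j links out to."""
    pages = list(links)
    s = {p: 1.0 for p in pages}  # equal initial weights
    for _ in range(iters):
        for i in pages:
            # In(V_i): pages j linking to i; each contributes S(V_j)/|Out(V_j)|
            s[i] = (1 - d) + d * sum(
                s[j] / len(links[j]) for j in pages if i in links[j]
            )
    return s

print(pagerank({"A": {"B"}, "B": {"A", "C"}, "C": {"A"}}))
```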
In light of PageRank, Mihalcea and Tarau proposed TextRank in 2004 [4]. In TextRank, an article is divided into basic text units, i.e., words or phrases. Just as a webpage is treated in PageRank, each text unit maps to a vertex in a graph, and an edge between vertices represents a link between text units.
The aim of this paper is to introduce semantic similarity into sentence-level text units so as to achieve a better topic representation of a text. Word2Vec [5] is used for semantic embedding. Four simulation tests are carried out after the algorithm design and coding. The data and code are available on GitHub: https://github.com/zuoxiaolei/TextRankPlus. The topics were selected based on the authors' expertise in these wide-ranging disciplinary fields, in which we have published extensively, making it reliable to define the relevant keywords/topics for each text. From the view of our expertise, the keywords for each topic are safely pre-set, as shown in the last column of Table 1, and all of the topics are tested in the simulation tests in the Results of Comparison section. The evaluation texts are of suitable length to test the accuracy of keyword/topic extraction. The data is also available on GitHub for free download (https://github.com/JingboXia/Enhancement_of_TextRank).

Method
TextRank algorithm for keywords ranking. TextRank [4] is built on a graph-based unsupervised learning framework, and it has been widely used in keyword extraction and automatic abstracting. The core of TextRank comes from vertex voting, where a voting action corresponds to an edge between two vertices. A keyword is mapped to a higher value if the vertex representing it has higher relevance to the remaining vertices. In our research, the idea of TextRank is used for keyword ranking among four types of scientific abstracts.
Analogous to the PageRank algorithm for page ranking, each vertex is initially assigned an equal weight, and a recurrent calculation then updates the weights through voting. Let $G = (V, E)$ be the graph, with $V$ being the vertex set and $E$ being the edge set. The importance metric of each vertex is given by

$$WS(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)} \frac{w_{ji}}{\sum_{V_k \in Out(V_j)} w_{jk}} WS(V_j) \qquad (2)$$

where $w_{ji}$ is the weight of the edge from $V_j$ to $V_i$. In the initialization step, the weight of each text unit is one, and all of the weights converge after recurrent calculation by formula (2). The text units at the top of the ranking list are considered to be keywords of the text.
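The following is a minimal sketch of how the keyword graph behind formula (2) can be built: text units that co-occur within a sliding window are linked by an edge, and every vertex starts with weight one. The window size and toy token list are illustrative assumptions:

```python
# Build an undirected co-occurrence graph over text units.
from collections import defaultdict

def build_cooccurrence_graph(tokens, window=2):
    edges = defaultdict(set)
    for i, u in enumerate(tokens):
        for v in tokens[i + 1 : i + 1 + window]:
            if u != v:  # link distinct units that fall in the same window
                edges[u].add(v)
                edges[v].add(u)
    return edges

tokens = ["graph", "based", "ranking", "algorithm", "for", "text", "ranking"]
print(dict(build_cooccurrence_graph(tokens)))
```

Formula (2) is then iterated on this graph until the vertex scores converge.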
The flowchart of the algorithm is shown in Figure 1.

Figure 1. Classic TextRank algorithm workflow
The advantage of TextRank is that it is an unsupervised learning algorithm that requires no huge training corpus. This makes it easy to adopt for handling other text resources efficiently.

Word embedding (CBOW and Skip-gram).
The disadvantage of TextRank is that it omits keywords that appear infrequently yet are meaningful in context. A natural way to enhance TextRank is to use the semantic similarity of words and thus avoid missing vital keywords.
Word2Vec is a word embedding algorithm proposed by Google in 2013, and it has two variants: CBOW and Skip-gram. The main idea of Word2Vec is to find a numerical vector representation of each word by using neural networks.
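Both variants are available off the shelf; the following is a minimal sketch of training each with Gensim (the 4.x API is assumed), on a toy corpus that is purely illustrative:

```python
# Train CBOW (sg=0) and Skip-gram (sg=1) variants of Word2Vec with Gensim.
from gensim.models import Word2Vec

sentences = [["textrank", "extracts", "keywords", "from", "text"],
             ["word2vec", "learns", "numerical", "word", "vectors"]]

cbow = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=0)
skipgram = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)
```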
The idea of Word2Vec comes from Bayesian occurrence estimation. Let $T = w_1 w_2 \cdots w_n$ be a sentence of $n$ words; the probability of occurrence of the sentence $T$ is

$$p(T) = p(w_1, w_2, \ldots, w_n) = \prod_{i=1}^{n} p(w_i \mid w_1, \ldots, w_{i-1})$$

Similarly, the Bayesian estimation of the occurrence chance of the $i$-th word is $p(w_i \mid \mathrm{Context}(w_i))$. In CBOW, the context information of each word is considered within a $c$-width window. The purpose of CBOW training is to maximize the probability of the target word given its context and to minimize the probability of the negative-sample words. Here, the probability of the occurrence of $w$ based on $\mathrm{Context}(w)$ is

$$p(w \mid \mathrm{Context}(w)) = \frac{\exp(U_w^{\top} x_w)}{\sum_{w'} \exp(U_{w'}^{\top} x_w)} \qquad (3)$$

where $U$ is the weight matrix connecting the hidden layer and the softmax layer, and $x_w$ refers to the sum of the numerical vectors of the words flanking the target word:

$$x_w = \sum_{-c \le j \le c,\, j \ne 0} v(w_{i+j})$$

The likelihood function of the model is

$$L = \prod_{w \in T} p(w \mid \mathrm{Context}(w)) \qquad (4)$$

A negative sampling strategy is used to approximate $p(w \mid \mathrm{Context}(w))$, which makes for a quicker implementation, and the gradient descent algorithm is used for parameter optimization.
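To make formula (3) concrete, the following is a minimal NumPy sketch of one CBOW forward pass on a toy vocabulary; the names (V_in, U, x_w), vector size, and context words are illustrative assumptions, not from the paper:

```python
import numpy as np

vocab = ["text", "rank", "graph", "word", "vector"]
dim = 8
rng = np.random.default_rng(0)

V_in = rng.normal(size=(len(vocab), dim))  # input (projection) word vectors
U = rng.normal(size=(len(vocab), dim))     # hidden-to-softmax weight matrix

# x_w: sum of the vectors of the words flanking the target "graph"
context = ["text", "rank", "word", "vector"]
x_w = V_in[[vocab.index(w) for w in context]].sum(axis=0)

# Softmax over the vocabulary gives p(w | Context(w)), as in formula (3)
scores = U @ x_w
p = np.exp(scores - scores.max())
p /= p.sum()
print({w: round(float(pi), 3) for w, pi in zip(vocab, p)})
```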
Similar to CBOW, Skip-gram predicts the neighboring words of a target word, and its objective function is to maximize the average log-probability

$$\frac{1}{n} \sum_{i=1}^{n} \sum_{-c \le j \le c,\, j \ne 0} \log p(w_{i+j} \mid w_i) \qquad (5)$$

where $c$ refers to the window width. As shown in Figure 2, CBOW and Skip-gram share a similar idea, and both were considered in our research.

Enhanced TextRank algorithm. The classic TextRank algorithm only considers the occurrence information of text units in sentences and omits their semantic meaning. Furthermore, the same initial weight is assigned to every text unit, without differentiating their semantic representations. Hence, a word that appears more often is readily selected as a keyword, regardless of its actual importance. To address this drawback, Word2Vec and TextRank are combined to form our proposed algorithm. In this way, a low-dimensional numerical vector is assigned to each word, and the semantic similarity between text units is retained.
For implementation, the popular Python package Gensim is used for Word2Vec training and model construction (https://radimrehurek.com/gensim/), and the Wikipedia corpus is selected as the training corpus. The Wikipedia corpus is domain-free, which ensures a model with better generalization ability. Each text unit is treated as a vertex of the graph, and the similarities among text units, computed by the built-in Word2Vec functions of Gensim, serve as the edges between vertices. Finally, in the integration of TextRank and Word2Vec, the edge value between words is set to their semantic similarity. In detail, in formula (2) of TextRank, the weight value $w_{ji}$ is reset to the semantic similarity computed by the Word2Vec model.
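The following is a minimal sketch of this reweighting, assuming a trained Gensim model `model`; the function name and the clipping of negative similarities to zero are our illustrative assumptions, not details from the paper:

```python
# Iterate formula (2) with w_ji taken from Word2Vec cosine similarity.
import itertools

def enhanced_textrank(words, model, d=0.85, iters=50):
    words = [w for w in set(words) if w in model.wv]
    # Edge weight w_ji is the Word2Vec similarity of the word pair
    sim = {(a, b): max(float(model.wv.similarity(a, b)), 0.0)
           for a, b in itertools.permutations(words, 2)}
    score = {w: 1.0 for w in words}  # uniform initial weights
    out_sum = {w: sum(sim[w, k] for k in words if k != w) for w in words}
    for _ in range(iters):
        for wi in words:
            rank = sum(sim[wj, wi] / out_sum[wj] * score[wj]
                       for wj in words if wj != wi and out_sum[wj] > 0)
            score[wi] = (1 - d) + d * rank
    return sorted(score, key=score.get, reverse=True)  # keywords first
```

In this sketch the fully connected similarity graph replaces the plain co-occurrence graph, so rarely occurring but semantically central words can still rank highly.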

Results of Comparison
The results of the comparison among the four selected topics are shown in Table 2.