Rhetorical Sentence Categorization for Scientific Paper Using Word2Vec Semantic Representation

One way to summarize scientific papers is to exploit the rhetorical structure of their sentences. Determining the rhetorical category of a sentence is a text categorization task. To obtain good performance, several works in text categorization have exploited semantically similar words. This paper therefore presents rhetorical sentence categorization of scientific papers using selected features, the previous label as an additional feature, and Word2Vec to capture semantically similar words. The paper also reports the results of resampling, which balances the number of instances per class, and of combining resampling with the Word2Vec representation. Every experiment is evaluated with two classifiers, IBk and the J48 tree. The results show that the previous label, Word2Vec (Skip-Gram), and resampling each improve performance. Across all experiments under 10-fold cross-validation, the highest F-measure, 84.97%, is achieved by combining Word2Vec (Skip-Gram), all features, and resampling.


Introduction
Scientific papers are one kind of document that continues to grow in number as academics and researchers conduct more research. Researchers typically use at least one scientific paper to collect relevant information for their own arguments [1]. The abstracts of those papers can be read first to obtain a summary of the meaningful information [2]. However, abstracts do not always give all the important information readers need, and they do not expose how scientific papers relate to each other. Because of this, readers prefer a summary of a collection of scientific papers in the form of an outline of certain points. A scientific paper has a rhetorical structure, in which every segment of the text (i.e., each sentence) has a meaningful category within each section [3]. Sentences classified by rhetorical category are easier to structure into a summary that suits readers' needs [4].
Determining rhetorical sentences involves sentence categorization. This process produces a high-dimensional feature space that represents the text in the document via a string-to-word-vector transformation [5]. The vector representation can be turned into a distributional semantic model that captures the meaning of each word [6]. For example, because the words "aim" and "purpose" have similar meanings, these two words should also have high word similarity. Therefore, the more words with similar meanings a scientific paper contains, the larger the effect on classification performance that can be obtained. In this paper, we conduct rhetorical sentence categorization of scientific papers using Word2Vec as a tool for computing word vector representations of semantically similar words [6]. Since the rhetorical dataset is imbalanced, this paper also addresses the handling of the imbalanced dataset.
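As a sketch of this intuition (not the paper's implementation), the similarity between two word vectors is usually measured with cosine similarity, so that vectors trained on similar contexts, such as those for "aim" and "purpose", score close to 1. The vectors below are hypothetical toy values for illustration only:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two word vectors: 1.0 for identical
    directions, near 0.0 for unrelated (orthogonal) vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 3-dimensional vectors (real Word2Vec vectors in this paper
# would have length 200; these values are made up).
aim = [0.9, 0.1, 0.2]
purpose = [0.8, 0.2, 0.1]
table = [0.1, 0.9, 0.0]
```

With these toy vectors, `cosine_similarity(aim, purpose)` is far higher than `cosine_similarity(aim, table)`, which is the property the classifier exploits.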
The rest of this paper is organized as follows. The next section reviews related work on rhetorical sentence categorization of scientific papers and applications of the Word2Vec semantic representation. Our method is explained in Section 3, where we also define the rhetorical categories and the feature sets used for classifying sentences. We describe our experiments in Section 4, analyze the results in Section 5, and present conclusions and further work in Section 6.
Word2Vec groups words with similar meanings into a vector representation. The model was proposed by Mikolov et al. [6] and released by Google in 2013. Word2Vec has two architectures, continuous bag-of-words (CBOW) and Skip-Gram: CBOW predicts the current word from its context, while Skip-Gram predicts the surrounding words given the current word [11].
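The difference between the two architectures can be sketched by the training examples each one extracts from a sentence. This is an illustrative sketch of the prediction tasks, not the training code of any Word2Vec library:

```python
def training_pairs(tokens, window=2):
    """Return (CBOW, Skip-Gram) training examples for a token list.

    CBOW:      (context words) -> center word
    Skip-Gram: center word     -> each context word
    """
    cbow, skipgram = [], []
    for i, center in enumerate(tokens):
        context = [tokens[j]
                   for j in range(max(0, i - window),
                                  min(len(tokens), i + window + 1))
                   if j != i]
        cbow.append((tuple(context), center))       # many-to-one
        for ctx in context:
            skipgram.append((center, ctx))          # one-to-many
    return cbow, skipgram

cbow, sg = training_pairs(["the", "aim", "of", "this", "paper"])
```

For the first token, CBOW learns to predict "the" from ("aim", "of"), while Skip-Gram learns the pairs ("the", "aim") and ("the", "of").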
Heffernan and Teufel [12] employed the Word2Vec representation to identify problem statements in scientific text. They used 18,753,472 sentences from a biomedical corpus based on all full-text PubMed articles and then built a model from the 200 words most semantically similar to "problem". Their results showed that the Word2Vec model yielded a significant performance increase, because the Word2Vec attributes had the greatest information gain compared with the other features.
Rahmawati and Khodra [13] employed the Word2Vec representation in multi-label classification of Indonesian news articles. They experimented with CBOW and Skip-Gram at vector lengths from 200 to 500, obtaining test F-measure values from 76.73% to 81.63% and confirming that Word2Vec performs better than the TF-IDF representation used in their previous research. Putra and Khodra [14] likewise showed that text representation using a semantic model achieves higher accuracy than a lexical model that ignores the semantic meaning of words; they obtained their best accuracy, 82.94%, using an ANN with Word2Vec CBOW.
We use the sentence features content, absolute location, explicit structure, sentence length, citation, and sequential label (previous label). These features, except the previous label, are adapted from [7] and [10]; they are explained in Table 1. We employ the Word2Vec implementation from the Medallia library. To build the word vectors we use the 75 scientific papers from [10], which were taken from the ACL Anthology Reference Corpus (ACL-ARC). In addition, resampling is included in the experiments to handle the imbalanced data [15]. Resampling is conducted using WEKA: it oversamples the smaller categories and undersamples the larger ones, so that every category has a balanced number of instances.
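The balancing idea behind this resampling step can be sketched as follows. This is a hypothetical illustration of the over/under-sampling scheme described above, not WEKA's Resample filter itself:

```python
import random
from collections import defaultdict

def resample_balanced(instances, labels, target=None, seed=42):
    """Balance a labeled dataset: categories above the target size are
    under-sampled; categories below it are over-sampled with replacement."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for x, y in zip(instances, labels):
        by_label[y].append(x)
    # Default target: an even share of the data per category.
    target = target or len(instances) // len(by_label)
    out = []
    for y, xs in by_label.items():
        if len(xs) >= target:
            chosen = rng.sample(xs, target)                  # under-sampling
        else:
            chosen = [rng.choice(xs) for _ in range(target)]  # over-sampling
        out.extend((x, y) for x in chosen)
    rng.shuffle(out)
    return out
```

For example, a dataset of 80 "a" instances and 20 "b" instances comes back with 50 of each.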
Our method consists of four main processes. The first is preprocessing, followed by constructing vector representations of the dataset vocabulary using Word2Vec. The second is producing all the features. The third is applying resampling before building the classification model. The last is 10-fold cross-validation to obtain the F-measure of each rhetorical category using two classifiers, IBk and J48, in WEKA.
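The evaluation step can be sketched as a plain k-fold split, in which each fold serves once as the test set while the remaining nine folds train the classifier. This is illustrative only; the experiments themselves use WEKA's built-in 10-fold cross-validation:

```python
def k_fold_indices(n, k=10):
    """Split range(n) into k contiguous folds of near-equal size."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_validate(n, k=10):
    """Yield (train_indices, test_indices) for each of the k folds."""
    for test_idx in k_fold_indices(n, k):
        held_out = set(test_idx)
        train_idx = [j for j in range(n) if j not in held_out]
        yield train_idx, test_idx
```

With the 10880 sentences of this dataset and k = 10, each fold holds 1088 test sentences.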
We apply four preprocessing steps: case folding, tokenization, stemming, and stop-word elimination, implemented with the Apache Lucene library and WEKA. These steps are fixtures applied in all experiments. We then feed the annotated, preprocessed sentences to Word2Vec (CBOW and Skip-Gram) to produce a model containing word vectors, with the layer size set to a vector length of 200. Finally, all classification models are produced in the last process.

Experiment
We conducted rhetorical sentence categorization on the scientific-paper dataset from [10]. It contains 75 annotated scientific papers that have been split into 10880 sentences. The number of sentences in every category before and after resampling is given in Table 2. We conducted five experiments, which differ in the use of the previous label as a sequential feature, resampling, and the CBOW and Skip-Gram variants of Word2Vec:
1. Baseline. The features used are content, absolute location, explicit structure, sentence length, and citation, without the sequential label (previous label).
2. Scenario 1. All features are used, including the sequential label (previous label), to determine whether the previous label increases performance.
3. Scenario 2. Scenario 1 is repeated with resampling added. This method is used because [16] showed that resampling is the best method for handling an imbalanced dataset.
4. Scenario 3. All features are used together with the Word2Vec representation, comparing the CBOW and Skip-Gram algorithms. Resampling is not used here, because this scenario only measures the effect of CBOW and Skip-Gram on performance.
5. Scenario 4. After the better Word2Vec algorithm is identified from the results of Scenario 3, this experiment combines all features, the better Word2Vec variant (Skip-Gram), and resampling. Only Skip-Gram is used here because it yields a better F-measure than CBOW when compared against the Scenario 1 results.
Every scenario is evaluated with two classifiers, IBk and the J48 tree. These classifiers were chosen based on [16], which showed that IBk and J48 were the two highest-performing classifiers in its final experiment. From all the experiments, we can then see how the features affect the F-measure of each rhetorical category.

Result
First, we built the word vectors from the dataset, which consists of 75 scientific papers split into 10880 sentences. We extracted all the features without the previous label and evaluated them with two classifiers, IBk and the J48 tree. Table 4 shows the results of the experiments. We can see that the previous label and resampling significantly increase performance. In the first experiment (baseline), the average F-measure was 17.39% with the IBk classifier and 18.79% with J48. In Scenario 1, which adds the previous label, these improved to 30.07% for IBk and 39.12% for J48, roughly an 18-point increase over the baseline. This confirms that the previous label has a large effect on classification performance, as concluded in [16]: as a sequential feature, the previous label captures patterns that recur in rhetorical sentence categorization and that are correlated with the target classes [17].
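Assuming the reported averages are macro averages over the rhetorical categories (an assumption on our part, since the paper does not state the averaging scheme), the F-measure average can be sketched as per-class precision and recall combined into F1 and averaged across the classes:

```python
def macro_f1(gold, pred, classes):
    """Macro-averaged F1: compute F1 per class, then average over classes."""
    f1s = []
    for c in classes:
        tp = sum(1 for g, p in zip(gold, pred) if g == c and p == c)
        fp = sum(1 for g, p in zip(gold, pred) if g != c and p == c)
        fn = sum(1 for g, p in zip(gold, pred) if g == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)
```

Because every class contributes equally regardless of its size, this measure rewards classifiers that also perform well on the rare rhetorical categories, which is why resampling helps it so much.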
The dataset we obtained has imbalanced numbers of instances per category, so we have to balance it. The third experiment (Scenario 2) therefore measures how much handling the imbalanced dataset, using resampling in WEKA, affects performance. The average F-measure reached 79.91% with the IBk classifier and 72.1% with J48, both significant increases over Scenario 1. However, this scenario differs from the previous two: in the first two experiments J48 always performed better, but here IBk becomes the highest. This matches the conclusion in [16] that resampling suits the IBk classifier better than the J48 tree.
Next, we ran the last two scenarios, which involve the Word2Vec representation. In Scenario 3, we compare the CBOW and Skip-Gram algorithms without resampling. The results are compared with those of Scenario 1, because both use all the extracted features. Most results in this scenario are not good. With the CBOW algorithm, the average F-measure was only 27.58% for the IBk classifier and 33.77% for J48, a decrease of 2.5 to 5.5 percentage points from Scenario 1. Skip-Gram behaves differently: although with J48 its F-measure decreased to 31.36%, below the CBOW result, with IBk Skip-Gram outperformed CBOW. In fact, the only result in Scenario 3 that improves on Scenario 1 is obtained with Skip-Gram and the IBk classifier. In [13], Skip-Gram was likewise better than CBOW. This may be because Skip-Gram predicts the surrounding words given the current word: in this research we measure the similarity between the words in a sentence and the words in the Word2Vec vocabulary, so every word in a sentence receives a clear semantic-similarity weight against the vocabulary. CBOW differs in that it predicts the current word from its context, where a sentence can serve as that context. Therefore, in Scenario 4 we use only the Skip-Gram algorithm, combined with resampling, and compare the results with Scenario 2. The average F-measure reached 84.97% with the IBk classifier and 80.04% with J48. With resampling, performance always increased. This means that combining the Skip-Gram Word2Vec representation with resampling gives the highest performance for rhetorical sentence categorization: resampling handles the imbalanced dataset, while Skip-Gram Word2Vec captures semantically similar words in it.

Conclusion and Further Work
First, the previous label is used as a sequential feature to capture recurring class patterns that are correlated with the target class, so that it can help predict the rhetorical category of the next sentence. Second, resampling improves performance because it balances the number of instances per class; in this research, resampling performed best with the IBk classifier, which means it suits IBk better than the J48 tree. Our work also focused on the Word2Vec semantic representation, which has two architectures, Skip-Gram and CBOW; in our experiments, Skip-Gram performed better than CBOW. Finally, combining Word2Vec and resampling gives the highest performance. In this work, we still build the Word2Vec representation on the annotated dataset itself and have not yet separated training and testing; both choices can affect performance for better or worse, and we will consider them in further work to achieve good performance in rhetorical sentence categorization. Moreover, the dataset should be enlarged to obtain higher performance.