Exploiting Contextual Word Embedding of Authorship and Title of Articles for Discovering Citation Intent Classification

The number of scientific publications is growing exponentially. Research articles cite other work for various reasons and have therefore been studied extensively to associate documents. It is argued that not all references carry the same level of importance, so it is essential to understand the reason for a citation, called the citation intent or function. Text information can contribute well if new natural language processing techniques are applied to capture the context of text data. In this paper, we have used contextualized word embedding to find a numerical representation of text features. We further investigated the performance of various machine-learning techniques on this numerical representation. The performance of each classifier was evaluated on two state-of-the-art datasets containing the text features. In the case of the unbalanced dataset, we observed that the linear Support Vector Machine (SVM) achieved 86% accuracy for the "background" class, where the training data were extensive. For the rest of the classes, including "motivation," "extension," and "future," the machine was trained on fewer than 100 records each; therefore, the accuracy was only 57 to 64%. In the case of the balanced dataset, the classes achieve similar accuracy, as each is trained on the same amount of training data. Overall, SVM performed best on both datasets, followed by the stochastic gradient descent classifier; SVM can therefore produce good results for text classification.


Introduction
The growth of scientific article publication has made finding important, relevant research difficult for researchers. Citations have long been studied for the identification of influential studies [1]. However, not all the citations within a research article play the same role. There may be different reasons for citing a research article, and therefore, the intensity of relatedness may vary. Moravcsik and Murugesan [2] argue that most of the references within articles are made to understand the work and provide background knowledge about the research problem. Teufel et al. [3] have categorized citations into three classes with a positive, weak, or neutral relationship with the citing paper. Jurgens et al. [1] have claimed that a citation may be made for six different reasons and that the strength of relevancy of these categories differs.
Various attempts have been made in order to understand the reason and intent of a citation. The most recent techniques have used deep networks for reading the citation context of a citation [4][5][6][7].
They set a window for extracting the citation context. The window boundaries typically contain the paragraph in which the citation has been made and may also include the sentences before and after it. An example of a citation context is given in Figure 1, where the work is cited for comparison with the proposed methodology.
Other approaches have utilized the bibliographic information of research articles, creating a network of citation nodes whose edges represent their mutual links through citation [9].
These approaches reasonably find the relationship among the cited papers but usually fail to provide the reason for a citation, as they give the same weight to each reference. Meta-information has been used extensively for citation intent extraction. Studies based on text features are limited to the statistical similarity of the articles and normally do not study the internal context of those features [10]. New advances in natural language processing, especially word embedding, have made it possible to understand the text context and label it with a class of intent [11].
This paper has evaluated a number of classification methods after converting the text information to its numerical representation. We have used the Association for Computational Linguistics Anthology Reference Corpus (ACL-ARC) and Science Citation (SciCite) datasets, discussed in the next section, to extract text features related to citation records. The ultimate goal of classification was to find the citation intent based on our selected list of text features. The experiments show that the linear support vector machine (Linear-SVM) classifier performed well on both datasets. We also evaluated the classifiers for the prediction of individual citation intent classes. The results show that the algorithms performed well particularly for those classes where the training set was large; for example, in the case of Linear-SVM, the "background" class has an F1 score of 86%, while the other classes, including "future" and "extension," have 65% and 61%, respectively.
The overall objectives of this study include the following: (1) understanding the impact of text features on citation intent classification while using contextual encoding; (2) evaluating the results and comparing the classification models for citation intent labelling; (3) understanding the impact of training-set size on classifiers' bias towards individual citation classes; and (4) exploiting the authorship and titles for citation intent classification. The rest of the paper is organized as follows: in Section 2, we introduce existing citation intent classification methods and the number of labelled classes. Section 3 discusses the proposed study framework; the details of each step are further discussed in its subsections. Section 4 evaluates the classification models and compares the results. Finally, we conclude our study in Section 5.

Related Work
The citation intent, also called the reason for a citation or the citation function, has long been studied to analyze relationships between research articles. As each article has, on average, 40 references, and the number of referenced articles within a research paper is growing with time [12], it is essential to understand why a paper has been cited. This section discusses various attempts made to identify the citation reason.
Roman et al. [4] used contextual embedding to capture the context of the citation context. They used an automated method for annotating an unannotated dataset with citation intent and achieved good precision, recall, and F1 scores.
They also developed a vast dataset, named C2D-I, containing one million labelled citation contexts. The authors claimed the dataset as a new state-of-the-art resource for designing citation intent approaches. C2D-I annotates the intent in three classes: background, method, and result. Although they successfully developed the vast labelled dataset required for deep learning, they did not develop a recommender system to identify the citation reason. Their method was merely for dataset annotation, not for citation reason identification.
Hassan et al. [13] proposed a deep-learning-based approach for classifying the importance of a citation from a list of referenced papers. They argued that not all references have the same measure of relevancy. They used a Long Short-Term Memory-(LSTM-) based [14] deep-learning model to distinguish between important and unimportant citations.
They also presented a classification model based on machine learning, selecting the best-performing features using a Random Forest (RF) classifier [15]. The authors listed 14 features of a citation context describing the reason for citation, apart from whether it is an important or unimportant citation.
Cohan et al. [16] criticized predefined hand-engineered features such as linguistic patterns extracted from paper content and borrowed the idea of scaffolding from Swayamdipta et al. [17].
They assumed that better representations could be obtained directly from the data. They proposed a multitasking framework to incorporate knowledge from the paper structure. Their framework incorporates two tasks as structural scaffolding: (1) predicting the section title and (2) predicting whether a citation is needed. Their scaffolding also predicts the citation intent of a citation as the background, method, or result class.
They also created the SciCite dataset, containing 11,020 citation instances from 6,627 papers, annotated by crowdsourcing.
The authors compared their model with the previous state-of-the-art method of Jurgens et al. [1] for citation intent classification and achieved better results in terms of precision, recall, and F1 measures. The authors used pattern-based features, including sequences of phrases, parts of speech, lexical categories depicting positive or negative sentiments, and specific categories such as the phrases "we extended" and "compared with the previous state-of-the-art method." They borrowed the list of patterns from Teufel [18] and extended it with newly identified patterns and categories.
Figure 1: A sample of citation context from a research article [8].
They further exploited topic-based features, arguing that a topic's thematic framing can point to the citation function. For example, a citation context describing the methodology is more likely related to the "uses" function, whereas a citation context providing a definition belongs to the "background" class.
They also explored prototypical argument features and investigated a list of arguments that reflect a class of citations. For prototypical argument featuring, they identified frequently occurring arguments in syntactic positions. For example, the words "follow," "unfold," and "extend" frequently occur for the "extend" class of citations. A vector representing the occurrence of an argument is created, and the average of those occurrences decides the similarity of a citation towards a citation class. This study used natural language processing features in detail to measure the citation reason and importance and has proved to be state-of-the-art research in this area. It demonstrated that authors are sensitive to discourse structure and publication venue when citing a research paper.
Table 1 provides the list of citation intent classes. The table also lists the dataset in which each of the classes is used, along with citation context examples taken from the available datasets that belong to those citation intent classes.

Proposed Study Framework
In this section, we discuss the various steps of the proposed study, as depicted in Figure 2. The flow of the proposed study starts with the data processing and cleaning step, followed by converting the text data to a numeric representation. After the conversion, we apply different classification algorithms by feeding this data to the input layer of the classifiers. Finally, we gather the results and compare various evaluation measures to compare the effects of the classification algorithms. In the next subsection, we discuss the data preparation and preprocessing step in detail.

Data Preparation.
The data preparation step starts with the extraction of data for our study. We used two state-of-the-art datasets, ACL-ARC and SciCite. These datasets are publicly available and widely used for citation intent classification. The ACL Anthology Reference Corpus (ACL-ARC) is a citation intent classification dataset [1,19] with around 2,000 records. It has a number of features, including the citation context where the in-text citation has been placed; the citing and cited paper_id, which can be used to access the paper details using a web service; publication years; paper titles; author ids; an extended context including more information on the in-text citation context; the section number; the section title; citation marker offsets; the sentence before the citation context; and, finally, the most crucial feature, the citation intent specifying the reason for a reference. The citation intent in the ACL-ARC dataset has six classes, described in Table 2.
The second dataset that we have used is the SciCite dataset [3]. This dataset achieved a 13 percent increase in the F1 score in comparison to ACL-ARC. The dataset includes, along with some other less important features, the name of the section in which the in-text citation is placed, the citing and cited paper id, the citation context, the citation intent class, and the confidence level of the annotated citation intent class. The features included in the dataset are minimal, and only a few match the features listed in ACL-ARC. This second state-of-the-art dataset contains the citation intent annotation in only three classes: background, method, and result. It is roughly five times larger than the ACL-ARC dataset, with 9,159 instances, whose citation intent distribution is listed in Table 2.
In order to keep the datasets consistent and to compare and evaluate the results on both of them, we made a balanced version of SciCite, which includes the missing features required for our study. As the name suggests, the balanced version of SciCite has an equal number of instances in each class. We used the Semantic Scholar API (https://api.semanticscholar.org/) by passing the citing and cited paper IDs to extract the missing feature information.
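As a rough sketch of this enrichment step, the function below builds a metadata request for one paper; the endpoint path and field names follow the public Semantic Scholar Graph API and are assumptions about what the study actually queried.

```python
import json
import urllib.request

API_BASE = "https://api.semanticscholar.org/graph/v1/paper/"

def build_url(paper_id, fields=("title", "authors", "year")):
    """Build the request URL for one paper; this field selection is hypothetical."""
    return API_BASE + paper_id + "?fields=" + ",".join(fields)

def fetch_paper(paper_id):
    """Fetch metadata for a citing/cited paper (requires network access)."""
    with urllib.request.urlopen(build_url(paper_id)) as resp:
        return json.load(resp)

# Example call (not executed here, as it needs network access):
# fetch_paper("649def34f8be52c8b66281af98ae884c09aef38b")
```

Looping `fetch_paper` over the citing and cited paper IDs would fill in the missing title and author features described above.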

Preparation of Textual Information.
This study is based on the features selected from both of the datasets discussed in the previous section. Table 3 provides the list of features selected from both datasets for our study. The table also provides the reason for choosing these particular features as input for the machine-learning classifiers.
The features contain information in text form and, therefore, need natural language preprocessing steps to make them ready to be taken as inputs. The following operations are performed as data preparation steps.

Tokenization.
This task breaks paragraphs or sentences into words, using whitespace or a special character as the token separator.
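A minimal tokenizer along these lines, using only the Python standard library (a simplified stand-in for NLTK's tokenizers):

```python
import re

def tokenize(text):
    """Split text into word tokens at whitespace and punctuation boundaries."""
    return re.findall(r"[A-Za-z0-9]+", text)

tokens = tokenize("Citations have long been studied [1].")
# -> ['Citations', 'have', 'long', 'been', 'studied', '1']
```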

Stop Word Removal.
Stop words are words that occur frequently in text but have no significant impact on the topic under discussion; they are typically function words such as articles, prepositions, and conjunctions. The Natural Language Toolkit (NLTK) [27] defines an extensive list of stop words in sixteen different languages.

Removing Punctuation and White Spaces.
We extended the NLTK stop word list in Python by adding numbers and special characters to it, so that punctuation and whitespace artifacts are dropped along with the stop words.
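A sketch of this combined filtering step, using a deliberately tiny stop list (NLTK's real English list is far larger) and an invented set of extra characters:

```python
# A tiny sample of NLTK's English stop word list; the real list is far larger.
STOP_WORDS = {"the", "a", "an", "is", "are", "for", "of", "in", "and", "to"}

# Extend the list with punctuation and special characters, mirroring the
# extension described above.
STOP_WORDS |= {",", ".", ";", ":", "!", "?", "(", ")", "[", "]"}

def remove_stop_words(tokens):
    # Drop stop words, punctuation, and purely numeric tokens.
    return [t for t in tokens if t.lower() not in STOP_WORDS and not t.isdigit()]

print(remove_stop_words(["The", "reason", "for", "a", "citation", ",", "2021"]))
# -> ['reason', 'citation']
```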

Case Conversion.
Regardless of a word's position in a sentence, we converted all text to lowercase so that letter case does not affect the meaning of the text.

Stemming.
Kantrowitz et al. [28] have studied the effects of stemming on word embedding using TF-IDF and have shown that it produces remarkable results. Stemming is a language-specific task that converts words from their derived form to their root form. We have used the NLTK package for stemming the terms of our text data.
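For illustration, a toy suffix-stripping stemmer; this is a greatly simplified stand-in for NLTK's PorterStemmer, and the suffix list is invented for this sketch.

```python
# Longest suffixes first, so "ations" is tried before "ation" and "s".
SUFFIXES = ("ization", "ations", "ation", "ings", "ing", "ers", "er", "es", "s")

def simple_stem(word):
    """Strip the longest matching suffix, keeping a stem of at least 3 letters."""
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[: -len(suf)]
    return word

print([simple_stem(w) for w in ["citations", "citing", "classifiers"]])
# -> ['cit', 'cit', 'classifi']
```

Note how "citations" and "citing" collapse to the same stem, which is the effect stemming is meant to achieve before embedding.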
Once the text data are in a cleaner form, we need to convert the nlp_input to a numerical form, as machine-learning algorithms require numerical input.

Numerical Representation of Text Data.
The raw data in text format are converted into a numerical representation such that similar words are closer to each other in the vector space. We used word embedding for the numeric representation. Table 4 discusses various types of word embeddings along with their strengths and weaknesses. We selected BERT word embedding because BERT is good at capturing the contextual information of a text and has been used by Roman et al. [4] for a similar task. BERT uses the transformer model [35,36] for encoding the vector representation. We used the Transformers library [37] for the BERT implementation, using the Python language on the Kaggle platform (https://www.kaggle.com/).
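The intuition that related words sit closer together in the embedding space can be illustrated with cosine similarity over toy vectors; real BERT vectors are 768-dimensional, and the values below are invented for the sketch.

```python
import math

# Toy 4-dimensional vectors standing in for contextual BERT embeddings
# (BERT-base vectors have 768 dimensions); the values are illustrative only.
vectors = {
    "cite":      [0.9, 0.1, 0.3, 0.0],
    "reference": [0.8, 0.2, 0.4, 0.1],
    "banana":    [0.0, 0.9, 0.0, 0.8],
}

def cosine(u, v):
    """Cosine similarity: 1.0 for identical directions, near 0 for unrelated ones."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Semantically related words end up closer in the vector space:
assert cosine(vectors["cite"], vectors["reference"]) > cosine(vectors["cite"], vectors["banana"])
```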

Classification Models.
Once the data have been converted to a numeric representation, with similar words close together in the vector space, we are ready to feed this information to the citation classification models and evaluate the results to determine the best classification algorithm for citation intent class prediction.
The classification methods assign predefined classes to the feature data. To define our problem, consider a training dataset D = {r_1, ..., r_n} of records. Each record r_i is assigned a citation class c_i from a set of citation intent classes C. The task is to find the best classification method m that can assign an accurate citation intent to a new instance r.
To study the accuracy of the classifiers, we selected a number of classification algorithms that have proved effective for natural language processing tasks, listed in Table 5. The steps performed in this stage are listed below and depicted in Figure 3.
(1) The classification models were provided with the input parameters, listed in Table 5, from the ACL-ARC and SciCite datasets; 80% of the records were provided as training data. (5) To guard against jumping to a conclusion without enough evidence, we calculated the average accuracy by repeating the experiments multiple times.
After setting the general guidelines and executing the steps discussed above, we performed the experiments and compared the selected machine-learning algorithms, as discussed in the next section.
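The train/test split and repeated-experiment averaging described above can be sketched as follows, with synthetic Gaussian blobs standing in for the BERT feature vectors; the class names and data are illustrative only, not the study's actual setup.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

# Synthetic 8-dimensional "embeddings" for three citation intent classes;
# each class is a Gaussian blob standing in for real BERT features.
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(60, 8)) for c in range(3)])
y = np.repeat(["background", "method", "result"], 60)

accuracies = []
for seed in range(5):  # repeat the experiment and average, as in step (5)
    # 80/20 train/test split, as in step (1)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=seed)
    clf = LinearSVC().fit(X_tr, y_tr)
    accuracies.append(clf.score(X_te, y_te))

print(f"average accuracy: {np.mean(accuracies):.2f}")
```

On well-separated synthetic blobs the accuracy is near perfect; on real citation data the spread across classes is what Section 4 analyzes.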

Result Analysis and Comparisons
After training the models listed in Table 5, we performed experiments on the testing part of the datasets. In this section, we discuss the results of each model using precision, recall, and F1 measures. Precision is the fraction of predicted instances of a class that actually belong to it; recall is the fraction of actual instances of a class that are identified. Increasing one typically decreases the other, and therefore the F1 measure, the harmonic mean of the two, is also calculated. By evaluating the results against these measures, we want to see which model performed well compared to the others.
A multiclass confusion matrix was created using the sklearn [44], NumPy, and seaborn libraries, shown in Figures 4 and 5 for the ACL-ARC and SciCite datasets. The confusion matrix describes the number of true positive, false positive, true negative, and false negative predictions for each of the classes in the respective datasets. The calculation of precision is based on the true positive and false positive counts: the true positives are divided by the sum of true positives and false positives. Recall, in turn, divides the true positives by the sum of true positives and false negatives.
A multiclass confusion matrix is given in Table 6 for the linear regression classifier on the ACL-ARC dataset. We use this table to present a sample calculation of precision, recall, and F1 score. The precision of a model is the average of the precisions of each of its classes, and the recall is averaged likewise. The average recall of linear regression using ACL-ARC is, thus, 66%.
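The per-class and macro-averaged calculation can be made concrete with a small hypothetical confusion matrix; the counts below are invented for illustration and are not the values from Table 6.

```python
# Rows = actual class, columns = predicted class; the counts are hypothetical.
labels = ["background", "method", "result"]
cm = [
    [50,  5,  5],   # actual background
    [10, 20,  5],   # actual method
    [ 5,  5, 15],   # actual result
]

def per_class_metrics(cm, i):
    """Precision, recall, and F1 for class i of a multiclass confusion matrix."""
    tp = cm[i][i]
    fp = sum(cm[r][i] for r in range(len(cm)) if r != i)  # predicted i, actually other
    fn = sum(cm[i][c] for c in range(len(cm)) if c != i)  # actually i, predicted other
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Macro-average: the model's score is the unweighted mean over its classes.
macro_p = sum(per_class_metrics(cm, i)[0] for i in range(3)) / 3
macro_r = sum(per_class_metrics(cm, i)[1] for i in range(3)) / 3
print(f"macro precision {macro_p:.2f}, macro recall {macro_r:.2f}")
```

This macro-averaging is why a single well-trained class (here "background") cannot by itself lift the model's overall score.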

Precision and recall measures are always in tension: increasing one results in decreasing the other. Therefore, a third measure, the F1 score, is used, which is the harmonic mean of the two previously calculated measures: F1 = 2 × (Precision × Recall) / (Precision + Recall). The average F1 score of linear regression is only 63% using the ACL-ARC dataset. Although the F1 score of some citation intent classes is very high (83% in the case of the background class), the overall F1 score of this classifier is significantly lower. This is because of the unbalanced nature of the ACL-ARC dataset: some of the other classes have minimal records in the dataset, and their training has not been performed very well.
Tables 7 and 8 provide a complete list of the precision, recall, and F1 scores for each classifier. The overall accuracy of the classifiers is shown in Figures 6 and 7. Linear-SVM has the highest accuracy on both datasets, at 78.49% and 77.8%. The background class measures in the ACL-ARC dataset are much higher than those of the rest of the classes, as ACL-ARC is not a balanced dataset and is therefore biased towards the classes with a higher number of training records. The motivation, extension, and future classes have the lowest F1 scores due to their small training data size, with fewer than 100 records in each case. Further validating this conclusion, we observed that, in our balanced SciCite dataset, the F1 scores are very close for each of the classes, with the result class having the highest F1 score.
The SGD classifier has the second-highest accuracy, differing little from the linear regression classifier.

Conclusions
Understanding the reason for a citation in a research article is crucial for finding the essential related documents. Machine learning can perform well in classifying numeric metadata, and advances in natural language processing have made it possible to convert text data into a vector representation. The vectors can then be passed to classification algorithms to annotate the records in a scientific dataset. We have used BERT, a contextualized word representation, for converting text data to vectors. The classifiers were then evaluated on two state-of-the-art datasets, ACL-ARC and SciCite.
The trained models performed well, especially in the case of our balanced version of SciCite. Linear SVM achieved an 86% F1 score on the "background" class, where the number of training records was above 1,000. In the case of citation intent classes with fewer than 100 training records, SVM achieved only a 57 to 64% F1 score. On the balanced dataset, SVM and the other algorithms did not differ much in accuracy. This study has utilized only the text features from the datasets. In a future study, the metadata and the NLP features consisting of text information can be combined to classify the citation intent class.

Table 1 (sample rows): examples of citation intent classes from the respective datasets.
Background (ACL-ARC): (1) "Similar to the work of Li et al., 2013, our summarization system consists of three key components: an initial sentence preselection module to select some important sentence candidates; the abovementioned compression model to generate n-best compressions for each sentence; and then, an ILP summarization method to select the best summary sentences from the multiple compressed sentences." (2) "Tateisi et al. also translated LTAG into HPSG (Tateisi et al., 1998)." (3) "We are going to make such a comparison with the theories proposed by J. Hobbs (1979, 1982) that represent a more computationally oriented approach to coherence and those of T. A. van Dijk and W. Kintch (1983), who are more interested in addressing psychological and cognitive aspects of discourse coherence."
Motivation (ACL-ARC), where the cited paper demonstrates the need for a new method, technique, or dataset: (1) "This idea was inspired by Delisle et al. (1993), who used a list of arguments surrounding the main verb together with the verb's subcategorization information and previously processed examples to analyze semantic roles (case relations)." (2) "Our motivation for generation of material for language education exists in work such as that of Sumita et al. (2005) and Mostow and Jang (2012), which deal with automatic generation of classic fill-in-the-blank questions."
Extension (ACL-ARC), where the citing paper is extending the work or dataset of the referenced research: (1) "We improve a two-dimensional multimodal version of LDA (Andrews et al., 2009)." (2) "Our work builds on earlier research on learning to identify dialogues in which the user experienced poor speech recognizer performance (Litman et al.)." (3) "We perceive that these results can be extended to other language models that properly embed bilexical context-free grammars, such as, for instance, the more general history-based models used in the work of Ratnaparkhi, 1997, and Chelba and Jelinek, 1998." (4) "Such a component would serve as the first stage of a clinical question answering system (Demner-Fushman and Lin, 2005) or summarization system (McKeown et al., 2003)."
Important (Teufel [18]), where the referenced article is an important one and must be counted towards the main contribution or is being extended by the citing article: (1) "We use the nonprojective k-best MST algorithm to generate k-best lists (Hall, 2007), where k = 8 for the experiments in this paper." (2) "For better comparison with the work of others, we adopt the suggestion made by Green and Manning (2010) to evaluate the parsing quality on sentences up to 70 tokens long."

Figure 2: Framework of the proposed study for the citation content classification study.

(2) We trained a model based on the input parameters, adjusting the input weights for the target class of citation intent. (3) The trained model was then used for predicting the remaining 20% of the records. (4) The predicted citation class was checked against the actual class of the inputs.

Prediction-based embeddings (Table 4, continued): for Word2Vec, training becomes difficult with a large vocabulary, and polysemous words receive an aggregated vector representation in CBOW, whereas Skip-gram keeps separate vectors. Contextualized models such as ELMo [33], InferSent [34], and BERT [33] incorporate positional embedding, creating different vectors for the same word depending upon its position and context in a sentence or paragraph, at the cost of substantial computation.

Figure 7: Accuracy of classifiers using the SciCite dataset.

Table 1 :
Description of citation intent classes with examples from respective datasets.

Table 2 :
Distribution of records in citation intent classes in selected datasets.

Table 3 :
e list of features selected for study and their availability in the selected datasets.

Table 4 :
Overview of word embedding techniques with their strengths and weaknesses.

Table 5 :
Classification algorithms used for comparison for citation intent classification.
F1(background) = 2 × Precision(background) × Recall(background) / (Precision(background) + Recall(background)).

Table 7 :
Comparison of precision, recall, and F1 score of various classifiers using the ACL-ARC dataset.

Table 6 :
Multiclass confusion matrix for linear regression using ACL-ARC.

Table 8 :
Comparison of precision, recall, and F1 score of various classifiers using the SciCite dataset.
Figure 6: Accuracy of classifiers using the ACL-ARC dataset.