ClaiRE at SemEval-2018 Task 7: Classification of Relations using Embeddings

In this paper we describe our system for SemEval-2018 Task 7 on classification of semantic relations in scientific literature for clean (subtask 1.1) and noisy data (subtask 1.2). We compare two models for classification, a C-LSTM which utilizes only word embeddings and an SVM that also takes handcrafted features into account. To adapt to the domain of science we train word embeddings on scientific papers collected from arXiv.org. The hand-crafted features consist of lexical features to model the semantic relations as well as the entities between which the relation holds. Classification of Relations using Embeddings (ClaiRE) achieved an F1 score of 74.89% for the first subtask and 78.39% for the second.


Introduction
The goal of SemEval-2018 Task 7 is to extract and classify semantic relations between entities into six categories that are specific to scientific literature (Gábor et al., 2018). In this work, we focus on the subtask of classifying relations between entities in manually annotated data (subtask 1.1) and in automatically annotated, and therefore noisy, data (subtask 1.2). Given a pair of related entities, the task is to classify the type of their relation among the following options: Compare, Model-Feature, Part_Whole, Result, Topic or Usage. Relation types are explained in detail in the task description paper (Gábor et al., 2018). The following sentence shows an example of a Result relation between the two entities combination methods and system performance: Combination methods are an effective way of improving system performance.
This sentence is a good example of two challenges we face in this task. First, almost half of all entities consist of noun phrases, which has to be considered when constructing features. Second, the vocabulary is domain dependent, and therefore background knowledge should be incorporated.
Previous approaches to semantic relation classification mainly employed two strategies. Either they made use of many hand-crafted features, or they utilized a neural network with as little background knowledge as possible. The winning system of an earlier SemEval challenge on relation classification (Hendrickx et al., 2009) adopted the first approach and achieved an F1 score of 82.2% (Rink and Harabagiu, 2010). Later, other works outperformed this approach by using CNNs with and without hand-crafted features (Santos et al., 2015; Xu et al., 2015) as well as RNNs (Miwa and Bansal, 2016).
Approach. We present two approaches that use different levels of preliminary information. Our first approach is inspired by the winning method of the SemEval-2010 challenge (Rink and Harabagiu, 2010). It models semantic relations by describing the two entities between which the semantic relation holds, as well as the words between those entities. We call those in-between words the context of the semantic relation. We classify relations by using an SVM on lexical features, such as part-of-speech tags. Additionally, we make use of semantic background knowledge and add pre-trained word embeddings to the SVM, as word embeddings have been shown to improve performance in a series of NLP tasks, such as sentiment analysis (Kim, 2014), question answering (Chen et al., 2017) or relation extraction (Dligach et al., 2017). Besides using existing word embeddings generated from a general corpus, we also train embeddings on scientific articles that better reflect scientific vocabulary.
In contrast, our second approach relies on word embeddings only, which are fed into a convolutional long short-term memory (C-LSTM) network, a model that combines convolutional and recurrent neural networks (Zhou et al., 2015). No hand-crafted features are used in this approach. Because both CNN and RNN models have shown good performance for this task, we assume that a combination of them will positively impact classification performance compared to the individual models.
By combining lexical information and domain-adapted scientific word embeddings, our system ClaiRE achieved an F1 score of 74.89% for the first subtask with manually annotated data and 78.39% for the second subtask with automatically annotated data.

Features
In this section, we describe the features which are used in our two approaches. All sentences are first preprocessed before constructing boolean lexical features on the one hand and word embedding vectors on the other. Both feature groups are based on the entities of relations as well as the context in which those entities appear.
Apart from the Compare relation, all relation types are asymmetric, and therefore the distinction between start and end entity of a relation is important. If entities appear in reverse order, that is, if the end entity of a relation appears first in the sentence, this is marked by a direction feature which is part of the data set.
In our introductory example, combination methods denotes the start entity, system performance the end entity, and are an effective way of improving the context.

Preprocessing
Early experiments showed that it is beneficial to filter the vocabulary of our data and reduce noise by leaving out infrequent context words. The best setting was found to be a frequency threshold of 5 on lemmatized words. We therefore discard a context word if its lemma appears fewer than 5 times in the corpus of the respective subtask.
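The frequency filter can be sketched as follows. This is a minimal illustration, assuming that contexts are already tokenized and lemmatized; the function name `filter_context` and the toy corpus are hypothetical and not part of our system.

```python
from collections import Counter

def filter_context(contexts, min_count=5):
    """Drop context lemmas that occur fewer than `min_count` times in the corpus.

    `contexts` is a list of contexts, each a list of lemmas (assumed to be
    lemmatized beforehand, e.g. by a tagger).
    """
    counts = Counter(lemma for ctx in contexts for lemma in ctx)
    return [[lemma for lemma in ctx if counts[lemma] >= min_count]
            for ctx in contexts]

# Toy corpus (hypothetical): one context repeated five times plus a rare one.
contexts = [["be", "an", "effective", "way", "of", "improve"]] * 5
contexts = contexts + [["be", "rarely", "use"]]
filtered = filter_context(contexts, min_count=5)
```

In this toy corpus the lemmas rarely and use occur only once and are removed, while be survives because it is counted across all contexts.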

Context features
First we will explain feature construction based on the context of a relation. Abbreviations for feature names are denoted in brackets. Context is defined as the words between two entities. Early tests showed that using those words described the relation better than the words surrounding the relation entities.
Lexical. We construct several lexical boolean features, which are illustrated in Table 1. First we apply a bag of words (bow) approach where each lemmatized word forms one boolean feature, which for example takes 1 as value if the lemma improve is present and 0 if it is not. Second we determine whether the context words contain certain part-of-speech (POS) tags (pos), such as VERB. POS tagging was done with the help of spaCy (v2.0.2). To represent the structure of the context phrase we add a path of POS tags feature, which contains the order in which POS tags appear (pospath). The distance feature depicts whether the POS path, and therefore the context phrase, has a certain length (dist).
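The four context features described above (bow, pos, pospath, dist) can be illustrated with a short sketch. It assumes the context is given as (lemma, POS tag) pairs; the tags below are assigned by hand for illustration rather than produced by a tagger, and the feature-name format (e.g. `bow=improve`) is a hypothetical encoding. Boolean features are represented as the set of active feature names: a name in the set has value 1, any other name has value 0.

```python
def lexical_context_features(tokens):
    """Build the set of active boolean context features.

    `tokens` is a list of (lemma, pos) pairs for the words between the
    two entities.
    """
    lemmas = [lemma for lemma, _ in tokens]
    tags = [pos for _, pos in tokens]
    features = set()
    features.update(f"bow={lemma}" for lemma in lemmas)  # bag of words
    features.update(f"pos={tag}" for tag in tags)        # POS tags present
    features.add("pospath=" + "_".join(tags))            # ordered POS path
    features.add(f"dist={len(tags)}")                    # length of the path
    return features

# Context of the example sentence, with hand-assigned illustrative tags.
context = [("be", "VERB"), ("an", "DET"), ("effective", "ADJ"),
           ("way", "NOUN"), ("of", "ADP"), ("improve", "VERB")]
feats = lexical_context_features(context)
```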
Additionally we add background knowledge by extracting the top-level Levin classes of intermediary verbs from VerbNet (lc), a verb lexicon compatible with WordNet. It contains explicitly stated syntactic and semantic information, using Levin verb classes to systematically construct lexical entries (Schuler, 2005). For example, the verb improve belongs to class 45.4, which is described by Levin as consisting of "alternating change of state" verbs.
Embeddings. Aside from lexical features we also use word embedding vectors to leverage information from the context of entities (c). For each filtered context word we extract its word embedding from a set of pre-trained embeddings, where out-of-vocabulary (OOV) words are represented by the zero vector. The individual word vectors are later used to train a C-LSTM.
In contrast, for use in an SVM we found it beneficial to represent the context embedding features as the average over all context word embeddings.
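A minimal sketch of the averaged context representation, assuming embeddings are available as a dictionary from word to NumPy vector. The function name, the three-dimensional toy vectors, and the choice to return the zero vector for an empty context are illustrative assumptions.

```python
import numpy as np

def average_embedding(words, embeddings, dim):
    """Mean of the word vectors; OOV words contribute the zero vector."""
    if not words:
        # Assumption: an empty context is represented by the zero vector.
        return np.zeros(dim)
    zero = np.zeros(dim)
    return np.mean([embeddings.get(w, zero) for w in words], axis=0)

# Toy 3-dimensional embeddings (hypothetical values).
embeddings = {
    "effective": np.array([1.0, 0.0, 2.0]),
    "way": np.array([3.0, 2.0, 0.0]),
}
# "frobnicate" is out of vocabulary and counts as the zero vector.
context_vec = average_embedding(["effective", "way", "frobnicate"],
                                embeddings, dim=3)
```

Note that OOV words still count towards the denominator here, so many unknown words pull the average towards zero.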

Entity features
In the second set of features, we model the relation entities themselves, as they may be connected to a certain relation class. For example, the token performance or one form of it mostly appears as an end entity of a Result relation, and in the rare case when it represents a start entity, it is almost always part of a Compare relation. Therefore we leverage information about entity position for the creation of lexical and embedding entity features.
Lexical. For the creation of boolean lexical features, we first take the lowercased string of each entity and construct up to three distinct features from it: one feature that marks its general appearance in the corpus regardless of position (ents), and one each if it occurs as the start (startEnt) or end (endEnt) entity of a relation, taking the direction of the relation into account. Additionally, if the entity consists of a nominal phrase, we add its head noun to the respective feature set to create greater overlap between instances. Furthermore we measure the semantic similarity of the relation entities using the cosine of the corresponding word embedding vectors (sim100). While the cosine can in theory take any value in [-1, 1], we cut it off after two digits to reduce the feature space, which yields 99 boolean similarity features for our corpus. To again enable learning across instances, we additionally discretize the similarity range and form another five boolean similarity features (simb) that capture into which of the following buckets the similarity score falls: q0 = [−1, 0), q25 = [0, 0.25), q50 = [0.25, 0.5), q75 = [0.5, 0.75), q100 = [0.75, 1] (values below zero are very rare in this corpus).
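The two similarity feature groups can be sketched as follows. The bucket boundaries are those listed above; the function names, the feature-name format, and the convention of returning a similarity of 0 for a zero vector are assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros (assumption)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def similarity_bucket(sim):
    """Map a similarity score to one of the five simb buckets."""
    if sim < 0:
        return "q0"
    if sim < 0.25:
        return "q25"
    if sim < 0.5:
        return "q50"
    if sim < 0.75:
        return "q75"
    return "q100"

def similarity_features(sim):
    """Active boolean features: truncated score (sim100) and bucket (simb)."""
    truncated = math.trunc(sim * 100) / 100  # cut off after two digits
    return {f"sim100={truncated:.2f}", f"simb={similarity_bucket(sim)}"}

# Toy entity vectors (hypothetical): cosine is 1 / sqrt(2), about 0.707.
sim = cosine([1.0, 0.0], [1.0, 1.0])
features = similarity_features(sim)
```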
Embeddings. Similar to the context features, we also want to add word embeddings of entities to our entity feature set. This is not straightforward, as more than 44% of all entities consist of nominal phrases, while a word embedding usually corresponds to a single word. By way of comparison, the proportion of nominals in the relation classification corpus of the SemEval-2010 challenge was only 5%. Thus we tested different strategies to obtain a word embedding for nominal phrases and found that averaging over the individual word vectors of the phrase yielded the best results for this task. These word embeddings for start (e_s) and end (e_e) entities of relations were then presented to our two classification methods, which will be described in detail in the following section.

Classification Methods
We utilize two different models for classifying semantic relations: an SVM which incorporates both the lexical and embedding features described in Section 2 and a convolutional long short-term memory (C-LSTM) neural network that only uses word embedding vectors.
To fully exploit our hand-crafted lexical features we employ a traditional classifier. In comparison to Naive Bayes, Decision Trees and Random Forests, we found a Support Vector Machine to perform best for this task. Instead of utilizing the decision function of the SVM to predict test labels, we decided to make use of the probability estimates according to Wu et al. (2004), as this proved to be more successful. As mentioned before, the lexical features are fed into the SVM as boolean features, whereas the word embeddings are normalized to the range [0, 1] using min-max scaling to make it easier for the SVM to handle both feature groups.
In contrast to the SVM, neural network models do not necessarily rely on hand-crafted features and are therefore faster to implement. We experiment with a standard C-LSTM (Zhou et al., 2015), which extracts a sentence representation by combining one-dimensional convolution and an LSTM network and uses this representation to perform the classification.

Evaluation
After describing the two models we employ for relation classification, we now describe the data set we use and present results for both SVM and C-LSTM as micro-F1 and macro-F1. The latter is the official evaluation score of the SemEval challenge. We describe the experimental setup for both models and compare different feature sets and pre-trained embeddings.

Data and Background Knowledge
We evaluate our approach on a set of scientific abstracts.
Background Knowledge. In our experiments, we compare different pre-trained word embeddings as a source of background knowledge. As a baseline, we employ a publicly available set of 300-dimensional word embeddings trained with GloVe (Pennington et al., 2014) on the Common Crawl data (CC). To better reflect the semantics of scientific language, we trained our own scientific embeddings using word2vec (Mikolov et al., 2013) on a large corpus of papers collected from arXiv.org (arXiv).
In order to create the scientific embeddings, we downloaded LaTeX sources for all papers published in 2016 on arXiv.org using the provided dumps. After originally trying to extract the plain text from the sources, we found that it was more feasible to first compile the sources to PDF (excluding all graphics etc.) and then use pdftotext to convert the documents to plain text. This resulted in a dataset of about 166,000 papers. Using gensim (Řehůřek and Sojka, 2010), for each document we extracted tokens of minimum length 1 with the wikicorpus tokenizer and used word2vec to train 300-dimensional word embeddings on the data. We kept most hyper-parameters at their default values, but limited the vocabulary to words occurring at least 100 times in the dataset, reducing for example the noise introduced by artifacts from equations.

Parameters and Results
After an extensive grid search with cross-validation, the best parameters for the SVM were found to be an RBF kernel with C = 100 and γ = 0.001 for both tasks.
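The SVM setup can be sketched with scikit-learn as below. Only the kernel, C, γ, the min-max scaling of embeddings to [0, 1] and the use of probability estimates are taken from our description; the choice of scikit-learn, the synthetic data, the feature dimensions and the random seeds are assumptions for illustration. In scikit-learn, `SVC(probability=True)` produces multi-class probability estimates following Wu et al. (2004).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic stand-in data (hypothetical): 60 instances, 3 classes.
rng = np.random.default_rng(0)
n = 60
labels = np.array([0, 1, 2] * 20)
lexical = rng.integers(0, 2, size=(n, 8)).astype(float)  # boolean features
embed = rng.normal(size=(n, 4))                          # embedding features
embed[:, 0] += labels  # make the toy embeddings weakly informative

# Scale only the embedding features to [0, 1]; boolean features stay as is.
scaler = MinMaxScaler(feature_range=(0, 1))
X = np.hstack([lexical, scaler.fit_transform(embed)])

# RBF kernel with the parameters found by grid search.
clf = SVC(kernel="rbf", C=100, gamma=0.001, probability=True, random_state=0)
clf.fit(X, labels)

# Predict from the probability estimates instead of the decision function.
proba = clf.predict_proba(X)
pred = clf.classes_[proba.argmax(axis=1)]
```

In a real setting the scaler would be fitted on the training split only and then applied to the test split.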
Results of the SVM for subtask 1.1 are shown in Table 3. Adding entity features proves to be very beneficial compared to using only context features, as we could improve macro-F1 by 12 points on average. Results are further improved by enlarging the data set with the training samples of subtask 1.2 and by adding word embeddings to the feature set. While adding the CC embeddings enhances the micro-F1 by more than 4 points, our domain-adapted arXiv embeddings prove to perform even better and deliver the best result with a macro-F1 score of 74.89% and a micro-F1 of 76.90%.
Similar observations can be made for subtask 1.2, as shown in Table 4.
Due to space limitations we publish parameter details and elaborate results for the C-LSTM on arXiv.org (Hettinger et al., 2018). In comparison to the SVM, which additionally uses hand-crafted features, the C-LSTM achieves lower scores. For arXiv embeddings it reaches a macro-F1 of 63.3% for the first subtask and 68.0% for the second.

Discussion
We briefly discuss our approach during the training phase of the SemEval challenge and how label distribution and evaluation measure influence our results. Ahead of the final evaluation phase, in which the concealed test data D_test was presented to the participants, we were given a preliminary test partition D_pre as part of the training data D_train. To estimate our performance, we evaluated on D_pre as well as in a 10-fold stratified cross-validation setting. We chose this procedure to be sure to pick the best system for submission to the challenge.
As some classes were strongly underrepresented in the training corpus and D_pre, we assumed that this is also true for the final test set D_test. When in doubt we therefore chose to optimize according to D_pre, as cross-validation is based on a slightly more balanced data set (the training data for subtasks 1.1 and 1.2 combined). The best system we submitted for subtask 1.1 of the challenge achieved a macro-F1 of 75.05% on D_pre during the training phase, which shows that we were able to estimate our final result quite closely.
During training we also noticed that for heavily skewed class distributions, as in this case, macro-F1 as an evaluation measure strongly depends on a good prediction of very small classes. For example, macro-F1 of subtask 1.1 increases by 5 points if we correctly predict one Topic instance out of three instead of none. Thus we pick a configuration that optimizes the small classes.
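The sensitivity of macro-F1 to small classes can be reproduced on a toy example. The sketch below uses only two classes and hypothetical label counts, so the size of the effect differs from the six-class setting of the task; it merely shows that correctly predicting one instance of a three-instance class moves macro-F1 far more than micro-F1.

```python
from collections import Counter

def f1_scores(gold, pred):
    """Return (macro-F1, micro-F1) over the labels occurring in `gold`."""
    labels = sorted(set(gold))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label gets a false positive
            fn[g] += 1  # gold label gets a false negative

    def f1(t, p, n):
        denom = 2 * t + p + n
        return 2 * t / denom if denom else 0.0

    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro

# Toy data (hypothetical): a large class and a class with three instances.
gold = ["Usage"] * 10 + ["Topic"] * 3
pred_none = ["Usage"] * 13                       # no Topic instance found
pred_one = ["Usage"] * 10 + ["Topic"] + ["Usage"] * 2  # one of three found

macro_none, micro_none = f1_scores(gold, pred_none)
macro_one, micro_one = f1_scores(gold, pred_one)
```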
We also omitted some lexical feature sets from our system, as performance on the preliminary and final test sets showed that they did not improve results. These features were hypernyms of context and entity tokens from WordNet and dependency paths between entities. Using tf-idf weighting instead of boolean values for lexical features also worsened our results.

Conclusion
In this paper, we described our SemEval-2018 Task 7 system to classify semantic relations in scientific literature for clean (subtask 1.1) and noisy (subtask 1.2) data. We constructed features based on relation entities and their context by means of hand-crafted lexical features as well as word embeddings. To better adapt to the scientific domain, we trained scientific word embeddings on a large corpus of scientific papers obtained from arXiv.org. We used an SVM to classify relations and additionally contrasted these results with those obtained from training a C-LSTM model on the scientific embeddings. We obtained a macro-F1 score of 74.89% on clean data, ranking 4th out of 28, and 78.39% on noisy data, ranking 6th out of 20.
In future work, we will improve the tokenization of the scientific word embeddings and also take noun compounds into account, as they make up a large part of the scientific vocabulary. We will also investigate more complex neural-network-based models that can leverage additional information, for example relation direction and POS tags. Some minor changes that we applied to the feature generation during the post-evaluation phase, and which further improved our results by more than 2%, are published on arXiv.org together with a more detailed evaluation (Hettinger et al., 2018).

Table 1: Examples of lexical context and entity features. Example sentence: Combination methods are an effective way of improving system performance.

Table 2: Distribution of class labels for training data as absolute and relative values.