Tom Jumbo-Grumbo at SemEval-2019 Task 4: Hyperpartisan News Detection with GloVe vectors and SVM

In this paper, we describe our attempt to learn bias from news articles. From our experiments, it seems that although there is a correlation between publisher bias and article bias, it is challenging to learn bias directly from publisher labels. On the other hand, using a small number of manually-labeled samples increases accuracy from around 60% to nearly 80%. Our system is computationally inexpensive and uses several standard document representations in NLP to train an SVM or LR classifier. The system ranked 4th in the SemEval-2019 task. The code is released for reproducibility.


Introduction
Bias is an inclination or prejudice for or against a person or group. News articles that contain extreme bias fail to provide fair, multifaceted views for readers and can create polarization within society (Bernhardt et al., 2008). A system that can detect bias in news articles is thus relevant, especially at a time when an increasing number of people consume news from online sources that might not be trustworthy.
The SemEval-2019 task aims to detect hyperpartisan news given the text of a news article, where a hyperpartisan article is defined as one that overtly favors a particular side or view. The details of the task can be found in Kiesel et al. (2019). We are provided with a dataset in two parts. The first part is labeled by publisher (i.e., if a publisher is judged to be a hyperpartisan source, all of its articles are labeled as hyperpartisan) and is split into a training and a validation set with no overlapping publishers (which we will refer to as training-1 and validation-1). The second part is crowdsourced and labeled per article (which we will call training-2).
Code: https://github.com/chialun-yeh/SemEval2019

Due to the large number of labeled samples, we adopt a supervised classification approach, where features are extracted from the text and used to train a classifier. Bag-of-words (BoW), TFIDF weighting, and n-grams have been shown to be strong baselines (Hu and Liu, 2004; Wang and Manning, 2012). Other features such as part-of-speech (POS) tags and counts of sentiment and bias words have also been studied (Liu, 2012; Mukherjee and Weikum, 2015). In a similar setting, Potthast et al. (2018) use features such as character n-grams, readability scores, dictionaries, and the ratio of quoted words to separate hyperpartisan news from the mainstream. They train a random forest classifier and achieve an accuracy of 75%. Kulkarni et al. (2018) build a neural network to predict the political ideology of news articles as left, right, or center. They combine information from the headlines, the links within an article, and the content, using a CNN (Kim, 2014) for the headlines, Node2Vec (Grover and Leskovec, 2016) to model the links, and a hierarchical attention network (HAN) (Yang et al., 2016) to extract features from the content. They compare the model with several baselines, including a BoW LR model, a fully-connected feedforward network, and networks with only the individual components; their proposed model performs best. However, their system is trained and evaluated only on data with publisher labels, randomly split into training and testing sets with overlapping publishers.
The main contributions of this paper are two-fold. First, we analyze the problem of using the dataset labeled by publishers, concluding that it is difficult to learn from due to the noisy labels. Second, we train SVM classifiers with different representations: TFIDF, doc2vec, and pre-trained GloVe vectors. The 300-dimensional GloVe vectors obtain the best cross-validation accuracy as well as the best performance metrics on the official test data. This paper is organized as follows. In Section 2, we describe the data pre-processing. In Section 3, we present the two systems that we devise and explain how one motivates the other. In Section 4, we present the performance of the final system. We outline our main conclusions and future work in Section 5.

Pre-processing
Since the articles are collected from online news platforms, they contain text that is irrelevant to the news itself. We use the following three steps to clean the data: (a) remove online-usage artifacts, including links, hashtags, @-tags, and advertisements; (b) remove parentheses, brackets, and curly braces that contain additional information, because their usage is often publisher-specific; (c) remove paragraphs that might reveal publisher information. Some publishers use headers and footers with specific patterns in their articles. We try to remove them by discarding the first and last paragraphs of an article if it has more than two paragraphs, assuming that these two paragraphs have a higher probability of being headers and footers. This is by no means optimal, since the first paragraph often contains important content if it is not a header. Some publishers also insert short text such as "read more here" between paragraphs. To remove such irrelevant text that can reveal publisher patterns, we remove any paragraph with fewer than ten words. Any article with fewer than ten words after cleaning is discarded.
We consider (a) and (b) basic data cleaning and apply them to all data. On the other hand, (c) is a more aggressive cleaning step that is applied only to training-1, because that training set is comparatively large and we can afford to filter out information and even entire articles.
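The cleaning steps above can be sketched as follows. This is a minimal illustration rather than our exact implementation; the regular expressions and the paragraph-splitting convention are assumptions.

```python
import re

def clean_article(text, aggressive=False):
    """Sketch of the cleaning steps: (a) and (b) always, (c) when aggressive."""
    # (a) remove links, hashtags, and @-tags
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)
    text = re.sub(r"[#@]\w+", " ", text)
    # (b) remove parenthesized, bracketed, and braced asides
    text = re.sub(r"\([^)]*\)|\[[^\]]*\]|\{[^}]*\}", " ", text)
    paragraphs = [p.strip() for p in text.split("\n") if p.strip()]
    if aggressive:
        # (c) drop likely headers/footers and publisher-revealing short paragraphs
        if len(paragraphs) > 2:
            paragraphs = paragraphs[1:-1]
        paragraphs = [p for p in paragraphs if len(p.split()) >= 10]
    cleaned = "\n".join(paragraphs)
    # articles with fewer than ten words after aggressive cleaning are discarded
    if aggressive and len(cleaned.split()) < 10:
        return None
    return cleaned
```

With `aggressive=False` only the basic cleaning (a) and (b) is applied, matching how training-2 is handled.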
System Description

System 1
In the first method, we use training-1 to train our models, validation-1 to choose hyperparameters, and training-2 to test the models. As mentioned earlier, training-1 is labeled by publisher. While a biased publisher publishes more biased articles on average, it is unlikely that all of its articles are biased. The labels are therefore noisy, i.e., some labels are flipped. It is, however, difficult to identify the articles with wrong labels without manual inspection. We assume that the publisher labels are correlated with the true bias labels and thus provide information from which to learn bias. To get an idea of the extent to which this assumption holds, we investigate training-2. We select publishers with at least five articles in the dataset and whose media bias rating can be retrieved from Media Bias/Fact Check. This results in a total of 24 publishers. The publisher bias ratings on the website can be roughly mapped to seven categories: extreme-left, left, left-center, center, right-center, right, and extreme-right. In Table 1, we list these publishers along with the percentage of their articles that are rated as hyperpartisan by crowd workers. The number of articles per publisher ranges from 5 to 24.
Figure 1 shows the percentage of hyperpartisan articles in each category. We see that left-center and center publishers indeed have a considerably lower percentage of hyperpartisan articles. However, right-center publishers are almost as biased as right publishers. This observation may be due to the small sample size (the high percentage is caused mainly by the publisher RealClearPolitics). In general, there is a correlation between publisher bias and true hyperpartisanship. We use BoW and n-grams (n=1,2) as features, with different weighting schemes, including raw counts, binary, and TFIDF. For BoW and n-grams, the feature dimensions are 50K and 500K, respectively. We train two classifiers on each representation. The accuracy of the classifiers on validation-1 is listed in Table 2. We include experiments where training-1 is not cleaned with pre-processing step (c) to verify that the step helps the task.
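Baselines of this kind can be assembled with scikit-learn. The sketch below is a toy illustration; the documents and labels are placeholders, not our data, and the vocabulary cap mirrors the 50K BoW dimension mentioned above.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

# toy stand-ins for articles and their hyperpartisan labels (0 = no, 1 = yes)
docs = ["the senator praised the bipartisan bill",
        "a disgraceful radical attack on honest voters",
        "the committee reviewed the annual budget",
        "an outrageous betrayal of the american people"]
labels = [0, 1, 0, 1]

# three weighting schemes over unigrams; max_features caps the vocabulary
vectorizers = {
    "counts": CountVectorizer(max_features=50000),
    "binary": CountVectorizer(max_features=50000, binary=True),
    "tfidf": TfidfVectorizer(max_features=50000),
}
for name, vec in vectorizers.items():
    clf = make_pipeline(vec, LinearSVC()).fit(docs, labels)
    print(name, clf.predict(["a disgraceful budget betrayal"]))
```

Adding bigrams amounts to passing `ngram_range=(1, 2)` to the vectorizers.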
From the results, we observe that adding bigrams does not improve accuracy. We use the best model (BoW with an SVM classifier) to predict the articles in training-2. The accuracy is 56%, which is lower than the majority baseline of 63%.
Although we clean the dataset in an effort to prevent the classifier from overfitting to publishers, the classifier does not generalize to unseen publishers and fails to capture bias. We also experiment with training a CNN (Kim, 2014) on the headlines and a HAN (Yang et al., 2016) on the content, but both models again fail to generalize to new publishers. This observation leads us to believe that the publisher labels are too noisy to be used directly to learn true bias. Another possible explanation is that the publishers have such distinct writing styles that the classifier focuses on stylistic features when learning.

System 2
Based on the observations from system 1, we decide to treat training-1 and validation-1 as unlabeled samples that can be used to train a feature extractor in an unsupervised setting. We then train the classifier using training-2. We use the first part of the data to fit the following two feature extractors.
1. TFIDF: The data is used to build the vocabulary and record inverse document frequencies. All terms that occur in more than 90% of the documents are discarded, and we keep the 50K most frequent terms.
2. Doc2Vec: The data is used to train a PV-DM model as proposed by Le and Mikolov (2014). We discard all terms that occur in fewer than 10 documents or are shorter than two characters. We train the model for 20 epochs using the gensim implementation (Řehůřek and Sojka, 2010). When inferring vectors for new documents, the word vectors are fixed and the model is trained for 100 epochs.
In addition, we experiment with pre-trained word embeddings, since the meaning of a word should not differ significantly across corpora. We use vectors trained with the GloVe algorithm (Pennington et al., 2014) on Wikipedia and Gigaword 5. These vectors are chosen because they are trained on Wikipedia and newswire text, which provides both general knowledge and news-domain usage. We take the vector of each word in the document and average them. Stop words are removed, and if the document has more than 1000 words, we average over only the first 1000 words (we find this to work better empirically in our case).
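The averaging can be sketched as follows; the toy embedding dictionary is a stand-in for vectors loaded from the actual GloVe files.

```python
import numpy as np

def doc_vector(tokens, embeddings, stop_words, dim=300, max_words=1000):
    """Average pre-trained vectors over at most the first `max_words` content words."""
    tokens = [t for t in tokens if t not in stop_words][:max_words]
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    if not vecs:  # no known words: fall back to the zero vector
        return np.zeros(dim)
    return np.mean(vecs, axis=0)

# toy embeddings standing in for the 300-dim Wikipedia+Gigaword GloVe vectors
emb = {"senate": np.ones(300), "bill": 3 * np.ones(300)}
v = doc_vector("the senate bill".split(), emb, stop_words={"the"})
```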
We also experiment with a set of features comprising normalized counts of 5 POS tags, 6 readability scores, 8 normalized sentiment and bias word counts according to the MPQA and bias lexicons (Wilson et al., 2005; Recasens et al., 2013), the numbers of quotes, words, capitalized words, stop words, and sentences, and the average lengths of words and sentences. This results in a total of 27 features, which we call Feat.
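Some of the simple surface counts in Feat can be computed as below. This is a partial sketch: the POS, readability, and lexicon-based features are omitted, and the stop-word list is an illustrative subset.

```python
import re

STOP_WORDS = {"the", "a", "an", "of", "to", "and", "in", "is"}  # illustrative subset

def surface_features(text):
    """A few of the surface counts in the Feat set."""
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n = len(words) or 1  # avoid division by zero on empty input
    return {
        "n_quotes": text.count('"') // 2,  # pairs of quote marks
        "n_words": len(words),
        "n_sentences": len(sentences),
        "frac_capitalized": sum(w[0].isupper() for w in words) / n,
        "frac_stop": sum(w.lower() in STOP_WORDS for w in words) / n,
        "avg_word_len": sum(len(w) for w in words) / n,
    }
```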
For supervised training, we split training-2 into two sets. The first half, with 322 samples, is used to train models and choose hyperparameters in a 10-fold cross-validation setting. The second half, with 323 samples, is used for testing. We train LR and SVM classifiers on the features, experimenting with both a linear SVM and an SVM with an RBF kernel. We also run initial experiments with one- and two-layer neural networks of various hidden-layer sizes, but the small sample size makes it difficult for them to generalize.
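The hyperparameter search can be sketched with scikit-learn as below. The data is a random placeholder shaped like the first half of training-2, and the C and gamma grids are assumptions rather than the values we used.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# placeholder data shaped like 322 samples of 300-dim GloVe document vectors
rng = np.random.default_rng(0)
X_train = rng.normal(size=(322, 300))
y_train = rng.integers(0, 2, size=322)

# 10-fold cross-validation over an RBF-kernel SVM's hyperparameters
search = GridSearchCV(SVC(kernel="rbf"),
                      {"C": [0.1, 1.0, 10.0], "gamma": ["scale", 0.01]},
                      cv=10)
search.fit(X_train, y_train)
best = search.best_estimator_  # refit on all 322 samples with the best setting
```

The held-out second half of training-2 would then be scored with `best.predict`.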

Results
We first train LR and SVM classifiers with different GloVe vector dimensions. Table 3 shows the accuracy on the test set. The SVM with an RBF kernel consistently performs better, and the best vector dimension is 300.
We then compare the different features, including TFIDF, Doc2Vec, GloVe, and the effect of adding Feat. Table 4 shows the accuracy on the test set. SVM performs better than LR, and only in the case of TFIDF does a linear SVM outperform the kernel SVM. The table also shows that the pre-trained GloVe vectors achieve better performance than the vectors trained on our data; this ability to generalize might result from the larger corpus used to train the vectors. Adding the simple lexical and sentiment features hurts performance.

Conclusion and Future Work
In this paper, we present the system we used to compete in the SemEval-2019 hyperpartisan news detection task. Our final model is a kernel SVM trained on pre-trained GloVe vectors. It turns out that the simplest method, which requires the least training time, performs best in this case.
Both system 1 and system 2 offer interesting directions for future work. For system 1, it would be interesting to correct the labels or filter the articles in order to obtain cleaner data to learn from. For system 2, we plan to use contextual embeddings (Peters et al., 2018) or pre-trained language models (Radford, 2018; Devlin et al., 2018) to extract representations that are then fed into downstream classifiers. The strong performance of these models makes them interesting points of comparison.

Table 1 :
Selected publishers with their bias categories and the percentage of biased articles in the dataset.

Table 3 :
Accuracy for different GloVe vector dimensions.

Table 4 :
Accuracy of our models trained on training-2. The majority baseline is 63% accuracy.

The three representations are further evaluated on another test set (the official test set of the task) that is labeled by crowd workers. Since the additional feature set does not improve the performance, it is not evaluated further. In Table 5, we show the accuracy, precision, recall, and F1-score on the held-out test set. Our classifiers tend to have a higher false-negative rate, which may be due to the class imbalance in the training data. Further experiments would be required to see whether resampling to a balanced training set improves this.

Table 5 :
Submission results on the held-out test set, with metrics including accuracy, precision, recall, and F1-score.