Negativity Spreads Faster: A Large-Scale Multilingual Twitter Analysis on the Role of Sentiment in Political Communication

Social media has become extremely influential when it comes to policy making in modern societies, especially in the western world, where platforms such as Twitter allow users to follow politicians, thus making citizens more involved in political discussion. In the same vein, politicians use Twitter to express their opinions, debate current topics with others, and promote their political agendas, aiming to influence voter behaviour. In this paper, we analyse tweets of politicians from three European countries and explore the virality of their tweets. Previous studies have shown that tweets conveying negative sentiment are likely to be retweeted more frequently. By utilising state-of-the-art pre-trained language models, we performed sentiment analysis on hundreds of thousands of tweets collected from members of parliament in Greece, Spain and the United Kingdom, including devolved administrations. We achieved this by systematically exploring and analysing the differences between influential and less popular tweets. Our analysis indicates that politicians' negatively charged tweets spread more widely, especially in more recent times, and highlights interesting differences between political parties as well as between politicians and the general population.


Introduction
In recent years, social media have come to resemble a 'battleground' between politicians who constantly aim to reach out to people and win their votes. This behaviour is not surprising, as the influx of people onto social media aiming to stay updated on the latest news keeps growing, especially in the west, with 48% of Europeans using social media on a regular basis [1]. Politicians, in turn, increasingly use social media to engage with the public through social and political commentary [2], to such an extent that political accounts are often more active than non-political ones [3]. Due to their inherent interest and thanks to the openness of Twitter's API, political tweets have been extensively studied. Using tweets extracted from politicians and their followers, researchers have attempted to answer questions such as: is it possible to predict election results [4]? Or what makes a politician popular on social media [5]?
In general, discovering what makes a message viral has been a popular research topic. Sentiment analysis is one of the tools that have been extensively used to answer this question. Findings in related literature suggest that negatively charged tweets have a bigger network penetration than average [6,7,8]. However, even though there have been studies of sentiment in tweets revolving around politics/elections [9,10,11], to our knowledge there has not been a large-scale multilingual analysis of the relation between sentiment and the propagation of politicians' tweets.
In this paper, we focus on politicians' tweets to understand the relation between their sentiment and virality. By performing a more fine-grained analysis, a distinction is made between politicians from different political parties as we attempt to identify differences in their tweeting activities. At the same time, we investigate whether politicians' behaviour regarding their tweet sentiment is independent of their home country and language, and also whether the behaviour is consistent or evolves over time.
Overall, we bring together and assess the validity of these and other research questions (Section 2) with regards to politicians and Twitter.

Research Questions
The aims of this paper are summarised by the following three research questions.

Related Work
The popularity of social media platforms such as Facebook and Twitter is transforming the way citizens and politicians communicate with one another. Political candidates and voters use Twitter to discuss social and political issues, sharing information and encouraging political participation [12]. Politicians in particular, especially in recent years, have eagerly embraced social media tools to self-promote and communicate with their electorate, seeing in these tools the potential for changing public opinion especially during election campaigns [13]. Given the rapid growth of politicians' engagement through Twitter, there is plenty of research on how the platform is used for political communication.
Many studies focus on the classification of tweets referring to politicians by sentiment -positive, negative, or neutral -to investigate popularity and voting intention, and whether there is a correlation between post sentiment and political results [14,15,16]. Moreover, sentiment is considered to affect message diffusion in social media. Research suggests that the virality of a message -the probability of it being retweeted -is affected by its polarity, as emotionally charged messages tend to be re-posted more rapidly and frequently compared to neutral ones [17]. Negative messages in particular are likely to be retweeted more than positive and neutral ones [18]. However, other studies show that the relationship between sentiment and virality in Twitter is more complex and related to subject area [19]. The literature suggests that sentiment occurring in politically relevant tweets has an effect on their propagation [20,21].
When considering techniques used to extract sentiment from political text in social media, it is common to utilise dictionary-based approaches [22,23,24]; alternatively, where the platform offers the functionality to react to a post (e.g., Facebook's like/angry/happy reactions), an aggregation of such reactions is used to determine the sentiment of the post [25,26]. These approaches have the benefit of not requiring the training of machine learning models for the sentiment analysis task, which can be a time-consuming process that also requires previously annotated data. Other researchers choose to train their own ML models instead, often utilising neural network architectures such as LSTM and GRU networks [27,28]. Despite the variety of methods utilised, there seems to be a lack of usage of state-of-the-art NLP methods such as language models like BERT [29] and RoBERTa [30]. In Section 5, we show how this makes an important difference in practice, given the substantial improvement attained by language models in NLP in recent years. Prior work has also examined political Twitter more broadly [31], including trending topics [22], what drives public engagement [24] and the relation between sentiment and politicians' popularity [32]. Similar research can be found in other countries such as the UK [33], but also in non-English-speaking countries such as Russia [27], Mexico [26], Italy [34], Germany [21] and Austria [35], as well as in cross-European settings [23].

Data Collection
For the purposes of this study, tweets were collected from the MPs (members of parliament) of three European countries: Greece, Spain and the United Kingdom, including its devolved administrations, which we decided to include due to their size, idiosyncrasy and nationalist identity. These countries differ in socioeconomic characteristics [36] and in areas such as LGBTI inclusion [37]. Socioeconomic differences such as the above have been shown to affect public perception of national issues (e.g., sense of solidarity [38]) but also the way politicians engage with the public [39]. For comparison, we also collected tweets from random users and from verified users and influential persons (e.g., athletes and artists) whose Twitter activity can be deemed closer to that of MPs. Each set of random users follows the same distribution of tweets as their respective country, shown in Figure 1. The geolocalisation of tweets was achieved using Twitter's API country filter ('place country'). An assumption was made that tweets belong to users that reside in the country they are posting from. The verified users set was constructed by extracting tweets only from a list of known verified users; a combination of keywords, 5 along with location information from user profile metadata, was applied to ensure the accounts resided in the countries studied.

Sentiment Analysis: Evaluation and Model Selection
Over the years there have been multiple approaches to sentiment analysis in text data, varying from the use of sentiment lexicons [46,47,48] to linear machine learning models [49] and, most recently, transformer models such as BERT [29].
One of the most challenging problems appears when we have to deal with multilingual data. Acquiring a model that is able to perform well in a multilingual setting is a difficult task that often requires large labelled corpora. This is especially true if low-resource languages are taken into consideration [50]. Some cross-lingual approaches, such as language models, deal with this issue by making use of the large amount of training data available in major languages, e.g., English, to essentially transfer sentiment information to low-resource languages, e.g., Greek [51]. It is also important to note that, regardless of the architecture being used, an important factor in achieving accurate sentiment classification is the domain of the training and target corpora [52].
5 https://github.com/cardiffnlp/politics-and-virality-twitter/blob/main/data/keywords.csv
For our purposes, we select a number of pre-trained language models, both monolingual and multilingual, which we further fine-tune and evaluate using a manually labelled tweet dataset (Section 5.1), aiming to find the most suitable classifier for each of the languages (English, Spanish, Greek) studied.
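As a sketch of this fine-tuning setup (the checkpoint name, hyper-parameters and helper names below are illustrative assumptions, not the paper's exact configuration), a Hugging Face sequence-classification model can be fine-tuned on annotated tweets as follows:

```python
# Illustrative sketch: fine-tuning a pre-trained multilingual language
# model for 3-class tweet sentiment with Hugging Face Transformers.
# The checkpoint name and hyper-parameters are assumptions.
import torch
from transformers import (AutoModelForSequenceClassification,
                          AutoTokenizer, Trainer, TrainingArguments)

LABEL2ID = {"negative": 0, "neutral": 1, "positive": 2}

class TweetDataset(torch.utils.data.Dataset):
    """Wraps tokenised tweets and integer labels for the Trainer."""
    def __init__(self, encodings, labels):
        self.encodings, self.labels = encodings, labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item

def finetune(texts, labels,
             checkpoint="cardiffnlp/twitter-xlm-roberta-base"):
    """Fine-tune `checkpoint` on (texts, labels) and return the model."""
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSequenceClassification.from_pretrained(
        checkpoint, num_labels=len(LABEL2ID))
    dataset = TweetDataset(
        tokenizer(texts, truncation=True, padding=True),
        [LABEL2ID[label] for label in labels])
    args = TrainingArguments(output_dir="sentiment-model",
                             num_train_epochs=3,
                             per_device_train_batch_size=16)
    Trainer(model=model, args=args, train_dataset=dataset).train()
    return model
```

Calling `finetune(train_texts, train_labels)` with the annotated tweets of one language would then yield a language-specific classifier; passing mixed-language data yields a multilingual one.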

Sentiment Annotation
Tweets from the 2021 Main Dataset (Section 4.1.1) were sampled from the respective parliaments for each language included in the study. In this way, three datasets were collected and annotated for sentiment, for the English, Spanish and Greek languages (henceforth referred to as the Annotated Set).
In sentiment annotation tasks, annotators are asked either to evaluate the overall polarity of the text on a scale, e.g., 1 to 5 [53], or to assign distinct positive/neutral/negative classes [54]. For simplicity, and to follow current state-of-the-art sentiment analysis models [55,56], in our setting annotators were asked to indicate the sentiment of each tweet by classifying it into one of the following classes:
• Positive: Tweets which express happiness, praise a person, group, country or product, or applaud something.
• Negative: Tweets which attack a person, group, product or country, express disgust or unhappiness towards something, or criticise something.
• Neutral: Tweets which state facts, give news or are advertisements. In general, those which do not fall into the above two categories.
• Indeterminate: Tweets where it is not easy to assess sentiment or sentiments of both polarities of approximately the same strength exist. Tweets annotated with the indeterminate label were discarded from our analysis.
For each set of tweets, three native speakers were assigned as annotators. Initially, 100 tweets were sampled for each language and given to each group of annotators. The annotators were advised to consider only information available in the text, e.g., not to follow links present, and, in cases where a tweet includes only a news title, to assess the sentiment of the news being shared. It is also worth noting that the divergence between positive and negative labels (which would be the most problematic for our subsequent analysis) was extremely low: only 9% (Greece), 3% (Spain) and 7% (UK) of all annotated tweets had contrasting positive/negative labels between any annotator pair.
Finally, in order to consolidate the annotations, the final label of each tweet was decided by the agreement of two annotators in each group; in cases of disagreement, the third annotator acted as a tiebreaker. Additional publicly available Twitter sentiment datasets were also used [60]. These additional sources are used only for training purposes. All the datasets have been constructed for the specific task of Twitter sentiment analysis, where each tweet is classified as either Positive, Negative, or Neutral. All Twitter handles are anonymised by replacing them with '@username'.
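The consolidation step can be sketched as a simple majority vote over the three annotators, which is equivalent to using the third annotator as a tiebreaker whenever the first two disagree (the function name is ours):

```python
# Majority-vote label consolidation over three annotators. With three
# mutually distinct labels no majority exists and None is returned;
# in practice two annotators always agree or the third breaks the tie.
from collections import Counter

def consolidate(labels):
    """Return the majority label among annotators, or None if no majority."""
    (label, count), = Counter(labels).most_common(1)
    return label if count >= 2 else None

assert consolidate(["positive", "positive", "neutral"]) == "positive"
```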

Training/Evaluation
We consider two different approaches to evaluate our models. Firstly, a train/validation/test split is applied, with the testing data being a subset of tweets from the Annotated Set (Section 5.1).

Comparison systems
Multiple transformer-based language models were evaluated, including domain-specific models which are state of the art in Twitter sentiment analysis. All the models are based on the implementations of the uncased versions provided by Hugging Face [67], and are further fine-tuned and tested for each language individually, as well as in a multilingual setting, using the data collected.
In order to assess the difference between these recent transformer models and more traditional approaches, three baseline models were also tested. As our aim was not necessarily to find the best sentiment classifier, but rather to acquire a robust classifier with good overall performance across all the languages studied, no hyper-parameter tuning was performed.

Evaluation metrics
We report results both with the usual macro-average F1 and with the F1 averaged between the positive and negative classes (F1 PN henceforth). For sentiment analysis tasks, the average of the F1 scores of the Positive and Negative classes is often used as an evaluation measure [71] instead of metrics such as Accuracy. This is mainly justified because, firstly, F1 scores are more robust to class imbalance and, secondly, correctly classifying the Positive and Negative classes is more crucial than the Neutral class, especially for our subsequent analysis.
8 Spanish and Greek texts were first translated to English with Google's Translate API prior to using VADER.
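These two metrics can be computed directly with scikit-learn; `f1_pn` simply averages the per-class F1 of Positive and Negative, ignoring Neutral (the label lists in the usage example are hypothetical):

```python
# Macro-average F1 over all three classes, and F1_PN: the average of the
# per-class F1 scores of the Positive and Negative classes only.
from sklearn.metrics import f1_score

def macro_f1(y_true, y_pred):
    return f1_score(y_true, y_pred,
                    labels=["negative", "neutral", "positive"],
                    average="macro")

def f1_pn(y_true, y_pred):
    # Per-class F1 for negative and positive only, then their mean.
    per_class = f1_score(y_true, y_pred,
                         labels=["negative", "positive"], average=None)
    return per_class.mean()
```

For example, with gold labels `["positive", "negative", "neutral", "positive"]` and predictions `["positive", "negative", "neutral", "negative"]`, F1 PN is 2/3 while macro-F1 is 7/9, since the perfectly classified Neutral class lifts only the latter.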

Model selection
Considering the focus of our study, classifying tweets from members of parliament in different countries, we decided against the use of monolingual models such as 'Bertweet-Sent' 9 for two reasons: (1) there is no certainty that a tweet will follow the main language of the parliament, e.g., Welsh tweets in the UK parliament, Catalan tweets in the Spanish parliament, Turkish tweets in the Greek parliament; and (2) using a multilingual model makes the comparison across countries easier. As such, for the purposes of our experiment, the multilingual implementation of 'XLM-T-Sent' is selected as the classifier and applied across all of the data collected. 10 This model (fine-tuned on our in-domain data) is then utilised for the rest of the experiments, and in particular for attempting to answer research questions RQ2 and RQ3.
9 In Section 7.4 we further present a control analysis in which we compare the trends between our selected multilingual model and the best-performing 'Bertweet-Sent' in English.

Analysis
Having acquired a suitable sentiment analysis classifier capable of successfully distinguishing sentiment polarity in MP tweets (see Section 5.4), we applied it to our collected Twitter corpora (see Section 4) and performed an in-depth analysis to explore whether politicians' tweets containing negative sentiment have a bigger network penetration than positive or neutral tweets.
Initially, we attempt to establish what is considered a 'popular' tweet in the context of our analysis. Table 5 displays the percentiles of how many times a tweet has been retweeted for the UK, Spanish and Greek parliaments.
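The 'Head'/'Tail' division used in the control experiments can be sketched as a percentile split over raw retweet counts; the 90th-percentile threshold and the sample counts below are illustrative assumptions, not the paper's exact cut-off:

```python
# Split tweets into a popular 'Head' (above a chosen retweet-count
# percentile) and a 'Tail' (the rest). The 90th percentile is an
# illustrative assumption.
import numpy as np

def head_tail_split(retweet_counts, pct=90):
    threshold = np.percentile(retweet_counts, pct)
    head = [c for c in retweet_counts if c > threshold]
    tail = [c for c in retweet_counts if c <= threshold]
    return head, tail, threshold

# Hypothetical, heavily skewed retweet counts.
counts = [0, 0, 1, 2, 3, 5, 9, 28, 150, 3357]
head, tail, threshold = head_tail_split(counts)
```

The skew visible in the percentile tables (most tweets receive very few retweets while a handful receive thousands) is precisely why a percentile threshold, rather than the mean, is the natural cut-off.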

Sentiment and Virality
To directly answer the research question RQ2, we investigate whether there exists a correlation between retweets count and sentiment based on our collected data (see Section 4) and sentiment analysis classifiers (see Section 5).

Correlation analysis
As an initial step, we ran several correlation experiments based on statistical testing and regression analysis. 11 Even though there is no evidence for a direct correlation, we establish that there is a relation between sentiment and popularity, and that retweet-count distributions differ between sentiment labels.
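The exact tests are not spelled out in this excerpt; as an illustration, a Kruskal-Wallis test (a non-parametric choice appropriate for heavily skewed counts) can check whether retweet-count distributions differ across the three sentiment labels:

```python
# Illustrative non-parametric test: do retweet-count distributions
# differ between negative, neutral and positive tweets?
from scipy.stats import kruskal

def distributions_differ(neg, neu, pos, alpha=0.05):
    """True if the three samples are unlikely to share one distribution."""
    _, p_value = kruskal(neg, neu, pos)
    return p_value < alpha

# Synthetic example where negative tweets are retweeted more.
neg = [50, 60, 80, 90, 120, 200]
neu = [1, 2, 2, 3, 5, 8]
pos = [2, 3, 4, 6, 7, 10]
```

Rejecting the null here says only that the distributions differ, not by how much, which is why the regression analysis below complements the tests.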
Regression Analysis. To complement the statistical tests, multiple regression models are fitted and the significance of sentiment is examined. 12 In our experimental setting, the retweet count is set as the dependent variable, while the existence (or not) of positive and negative sentiment constitutes the independent variables.
Neutral sentiment is not taken into account in order to avoid potential collinearity problems. Furthermore, we utilise four lexical statistics (Table 1) as control variables: specifically, the presence of emojis, URLs, hashtags (#) and mentions (@) in a tweet, encoded as binary variables. 13 We did not consider features such as number of followers, favourite count or time posted, as our main focus was to identify the importance of features extracted from text.
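The four binary control variables can be extracted with simple pattern matching; the emoji check below is a rough approximation over common Unicode emoji ranges, not the paper's exact detector:

```python
# Binary presence indicators for emojis, URLs, hashtags and mentions.
import re

URL = re.compile(r"https?://\S+")
HASHTAG = re.compile(r"#\w+")
MENTION = re.compile(r"@\w+")
# Rough emoji heuristic over common Unicode emoji/symbol blocks.
EMOJI = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")

def lexical_features(tweet):
    """Return the four binary control variables for one tweet."""
    return {
        "has_emoji": int(bool(EMOJI.search(tweet))),
        "has_url": int(bool(URL.search(tweet))),
        "has_hashtag": int(bool(HASHTAG.search(tweet))),
        "has_mention": int(bool(MENTION.search(tweet))),
    }
```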
Due to the nature of the problem at hand, i.e., modelling a count variable, and to the highly skewed distribution of the target (see Table 5), a regression model suited to count data was fitted.

Governing Party vs Opposition
To answer the first part of research question RQ3, we consider the sentiment distribution at the party level for each parliament. This behaviour becomes even more apparent when we take into consideration only the tweets from the party leaders.

Control Experiments
In this section, we present four control experiments to test the robustness of our evaluation.

Popularity metrics
In addition to the raw retweet counts, we also tested additional metrics of popularity to divide the tweets into 'Head' and 'Tail'. As the retweet count is an absolute measure that does not take into account the existing popularity of the user posting a tweet, it may be skewed to favour users with a large number of followers.
We attempt to incorporate the popularity of each user and explore whether there are differences in the sentiment trend when using normalised metrics. To this end, three new metrics are introduced: (1) the ratio between the retweet count and the follower count of the user; (2) the ratio of the retweet count to the average number of retweets of the user, whereby a heavily shared tweet from a user that tends to get only a few retweets is considered more 'popular' than a similar tweet originating from a user that is retweeted often; and (3) a virality metric [80] under which a tweet is considered viral if its retweet count is at least two standard deviations above the mean for its creator. These three metrics offer an alternative, more normalised view of popularity.
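The three normalised metrics can be sketched directly (the function names are ours, and the guard against zero followers is an implementation assumption):

```python
# Three normalised popularity metrics: retweets per follower, retweets
# relative to the user's own average, and the two-standard-deviations
# virality rule.
import numpy as np

def retweets_per_follower(retweets, followers):
    return retweets / max(followers, 1)  # guard against zero followers

def retweets_vs_user_average(retweets, user_retweet_history):
    avg = np.mean(user_retweet_history)
    return retweets / max(avg, 1e-9)

def is_viral(retweets, user_retweet_history):
    """Viral if at least two standard deviations above the user's mean."""
    history = np.asarray(user_retweet_history, dtype=float)
    return retweets >= history.mean() + 2 * history.std()
```

For instance, a user whose tweets typically collect one to three retweets would have a tweet with ten retweets flagged as viral, while the same count would be unremarkable for a heavily followed account.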

Temporal Analysis
Continuing our analysis, we explore whether the tendencies identified above are consistent over time.

Sentiment Analysis Models Consistency
In order to ensure the validity of our results, a comparison is made between our selected model (multilingual 'XLM-T-Sent') and the best-performing model for English, 'Bertweet-Sent' (see Table B in the Appendix).

Research Questions
Having tested several machine learning models and conducted multiple sentiment analysis experiments, we attempt to answer the research questions we set in Section 2.
Sentiment analysis classifier (RQ1). When considering the best sentiment analysis classifier for our particular use case, language models such as 'Bertweet-Sent' and 'XLM-T-Sent' prove to be the most suitable.
As shown in our evaluation (Section 5), language models outperform lexicon-based approaches and traditional machine learning models. Potential temporal changes were also explored (Section 7.2). Finally, our methodology can be easily extended to languages not studied in this work, either by applying the sentiment model provided directly or by gathering a small annotated corpus and further fine-tuning the classifier for more precise results.

Conclusion
We have presented an analysis of the relation between sentiment and virality in politicians' tweets. By performing an exhaustive search for a successful sentiment classifier, we obtained a robust multilingual model capable of accurately identifying sentiment in politicians' tweets. This was achieved by utilising state-of-the-art transformer-based language models, which we also fine-tuned to the domain-specific task at hand. Both the model used in our analysis and the collected dataset of manually annotated tweets used for training and evaluation are made publicly available. 16 Our analysis indicates that there is a strong relationship between the sentiment and the popularity of politicians' tweets, with negatively charged tweets displaying a larger network penetration than tweets conveying positive sentiment. This phenomenon appears consistent across all three sovereign countries analysed. The regression results (i.e., a negative coefficient) likewise indicate that the more negative a tweet, the more it is retweeted.