Political Advertising Dataset: the use case of the Polish 2020 Presidential Elections

Political campaigns are full of political ads posted by candidates on social media. Political advertisements constitute a basic form of campaigning, subject to various social requirements. We present the first publicly available dataset for detecting specific text chunks and categories of political advertising in the Polish language. It contains 1,705 human-annotated tweets tagged with nine categories, which constitute campaigning under Polish electoral law. We achieved an inter-annotator agreement of 0.65 (Cohen's kappa). An additional annotator resolved mismatches between the first two annotators, improving the consistency and comprehensiveness of the annotation. We used the newly created dataset to train a well-established neural tagger (achieving a 70% F1 score). We also present possible use cases for such datasets and models, with an initial analysis of the Polish 2020 Presidential Elections on Twitter.


Introduction
The emergence of social media has changed how political campaigns take place around the world (Kearney, 2013). Political actors (parties, action committees, candidates) utilize social media platforms such as Twitter, Facebook, or Instagram to communicate with and engage voters (Skogerbø and Krumsvik, 2015). Hence, researchers must analyze these campaigns for several reasons, including enforcement of laws on campaign contribution limits, ensuring freedom and fairness in electoral campaigns, and protection against slander, hate speech, or foreign interference. Unlike U.S. federal or state law, European jurisdictions take a rather lukewarm attitude to unrestrained online campaigning. Freedom of expression in Europe has more limits (Rosenfeld, 2003), which can be seen in various European electoral codes (e.g., in France and Poland) and in statutes imposing mandatory notice-and-takedown systems (e.g., the German Network Enforcement Act, alias the Facebook Act). In Poland, agitation (an act of campaigning) may commonly be designated 'political advertisement', corporate jargon originating from platform terms of service such as Twitter's or Facebook's. Primarily, however, it has a normative definition in article 105 of the Electoral Code. It covers any public act by committees or voters of inducement or encouragement to vote for a candidate or in a certain way, regardless of form. Election promises may appear in such activities but are not a necessary component. A verbal expression on Twitter falls into this category.

There exist some natural language resources for the analysis of political content in social media. These include collections related to elections in countries such as Spain (Taulé et al., 2018), France (Lai, 2019), and Italy (Lai et al., 2018). Vamvas and Sennrich (2020) created the X-stance dataset, which consists of German, French, and Italian text, allowing for a cross-lingual evaluation of stance detection.
While datasets on political campaigning are fundamental for studies on social media manipulation (Aral and Eckles, 2019), there are no Polish-language datasets related to either political advertising or stance detection. We want to fill this gap and expand the natural language resources for the analysis of political content in Polish. Our contributions are as follows: (1) a novel, publicly available dataset for detecting specific text chunks and categories of political advertising in the Polish language, (2) a publicly available neural-based model for tagging social media content with political advertising that achieves a 70% F1 score, and (3) an initial analysis of political advertising during the Polish 2020 Presidential Election campaign.

arXiv:2006.10207v1 [cs.CL] 17 Jun 2020
We created nine categories of political advertising (incl. election promises) based on manually extracted claims by candidates (see the examples in Table 1) and the taxonomy proposed by Vamvas and Sennrich (2020). We gathered political advertising topics (incl. election promises) provided by presidential candidates and their committees on websites, Facebook fan pages, and other sites (such as news agencies). We tried to fit them into the Vamvas and Sennrich (2020) taxonomy. However, we found that some categories should be corrected or omitted, such as economy: it was particularly hard to decide whether a given advertisement should fall under welfare or economy. The final version of the categories was created after a couple of iterations with our annotator team, and we chose categories for which we reached a consensus and repeatable annotations. By repeatable, we mean that the same or another annotator will consistently choose the same categories for repeated or very similar examples. To the best of our knowledge, no commonly used set of political advertising categories exists. Moreover, we are aware that the categories may evolve in future election campaigns. We shall be ready to update the dataset according to changing political advertising types, categories, and any concept drift in the data.
We extracted all Polish tweets related to the election based on the appearance in the tweet of (1) specific hashtags such as wybory (elections), wybory2020 (elections2020), wyboryprezydenckie (presidentialelections), and wyboryprezydenckie2020 (presidentialelections2020), and (2) unigram and bigram collocations 1 generated from examples of political advertising across all categories (as in Table 1). Together, we call these sets of hashtags and collocations search keywords or search terms. Using them, we gathered almost 220,000 tweets covering approximately ten weeks, between February 5, 2020 and April 11, 2020. We assigned a sentiment orientation to each tweet using the external Brand24 API 2. Then we sampled tweets using a two-stage procedure: (1) we divided all tweets into three sets based on sentiment orientation (positive, neutral, negative) to have representations of various attitudes, and (2) for each sentiment category we randomly selected log_2(|E_k|) examples, where E_k is the set of all tweets matching a particular search keyword k. We used almost 200 different search keywords and finally obtained 1,705 tweets for the annotation process. The dataset was annotated by two expert native-speaker annotators (linguists by training). The annotation procedure was similar to Named Entity Tagging or Part-of-Speech Tagging: the annotators marked each non-overlapping chunk of text that represents a particular category. They could annotate single-word or multi-word text spans. Table 1 presents examples of political advertising with corresponding categories. These examples (which could be called seed examples) were presented to the annotators as a starting point for annotating tweets. However, the annotators could also mark other chunks of text related to political advertising when a chunk is semantically similar to the examples or clearly represents a political category not present in the seed set.
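The two-stage sampling procedure above can be sketched as follows. This is a minimal illustration, not the authors' code; the field names ('sentiment', 'keyword') and the function name are hypothetical, and the paper does not specify how groups smaller than two tweets were handled (we assume at least one example is drawn).

```python
import math
import random
from collections import defaultdict

def sample_for_annotation(tweets, seed=0):
    """Two-stage sampling sketch: split tweets by sentiment orientation,
    then draw log2(|E_k|) tweets per search keyword k within each split.

    `tweets` is a list of dicts with (hypothetical) keys:
    'text', 'sentiment' ('positive'/'neutral'/'negative'), 'keyword'.
    """
    rng = random.Random(seed)
    # Stage 1: bucket tweets by sentiment orientation and search keyword.
    buckets = defaultdict(list)
    for t in tweets:
        buckets[(t["sentiment"], t["keyword"])].append(t)

    # Stage 2: for each bucket E_k, sample log2(|E_k|) examples
    # (at least one, and never more than the bucket holds).
    sampled = []
    for group in buckets.values():
        n = max(1, int(math.log2(len(group))))
        sampled.extend(rng.sample(group, min(n, len(group))))
    return sampled
```

For example, a keyword with 16 negative tweets contributes log2(16) = 4 tweets to the annotation pool.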
We achieved a 0.48 Cohen's kappa score for exact matches of annotations (even a one-word mismatch was treated as an error) and a 0.65 kappa coefficient when counting partial matches, such as reduce coal usage and reduce coal, as correct agreement. We disambiguated and improved the annotation via an additional pass by a third annotator, who resolved the mismatches between the first two annotators and made the dataset more consistent and comprehensive. According to McHugh (2012), a 0.65 kappa coefficient lies between a moderate and a strong level of agreement. We must remember that we are annotating social media content; hence there is a lot of ambiguity, slang, and very short or fragmented sentences without context, which could influence the outcome. We then trained a Convolutional Neural Network model using the spaCy Named Entity classifier (Honnibal, 2018), achieving a 70% F1 score in 5-fold cross-validation. We used fastText vectors for Polish (Grave et al., 2019) and default spaCy model hyperparameters. The dataset and trained model are publicly available in the GitHub repository 3. Table 3 presents per-category precision, recall, and F1 score, as well as the number of examples for each category in the whole dataset. As we can see, there are 2,507 annotated spans. Interestingly, 631 tweets have been annotated with two or more spans. Finally, 235 tweets do not contain any annotation span, representing 13.8% of the whole dataset.
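For readers unfamiliar with the agreement metric, Cohen's kappa corrects observed agreement for the agreement expected by chance from each annotator's label frequencies. The sketch below computes it over aligned token-level tags; the IOB-style example labels are hypothetical, and the paper's actual scores are computed at the span level with exact and partial matching.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators' aligned label sequences."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of positions where both annotators agree.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement under independence, from marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical token-level tags over one tweet from two annotators:
annotator_a = ["O", "B-WELFARE", "I-WELFARE", "O", "B-HEALTH", "O"]
annotator_b = ["O", "B-WELFARE", "I-WELFARE", "O", "O", "O"]
kappa = cohen_kappa(annotator_a, annotator_b)  # 8/11, roughly 0.73
```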

Polish 2020 Presidential Election: Use Case
We present an analysis of 250,000 tweets related to the Polish 2020 Presidential Elections gathered between February 2, 2020 and April 23, 2020. The data acquisition and sentiment assignment procedures were similar to those described in Section 2. The dataset and model we propose enabled us to analyze sentiment polarity across all election promise categories. Figure 3 shows the average sentiment per category. Sentiment is assigned on a scale from -1 (negative) to 1 (positive). None of the categories was positive on average; hence, for readability, we show only the zoomed part of the graph, with a scale from -0.5 (moderately negative) to 0 (neutral). All categories skew clearly negative on the absolute sentiment scale. As one might expect, most of the tweets we gathered come from potential voters, and there are many more negative than positive messages. Most sentiment analysis tools available today perform only general sentiment detection, reporting only how many positive or negative tweets they identified. Our dataset and model, however, enable a deeper analysis of attitudes towards particular political advertising categories or, at a finer granularity, towards specific election promises.
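The per-category sentiment aggregation behind Figure 3 can be sketched as below. This is an illustrative reconstruction, not the authors' code; the input schema ('sentiment' as a float in [-1, 1], 'categories' as the tagger's detected category names) and the function name are assumptions.

```python
from collections import defaultdict

def average_sentiment_by_category(tagged_tweets):
    """Average sentiment per political-advertising category.

    `tagged_tweets` is a list of (hypothetical) dicts with keys:
    'sentiment' (float in [-1, 1], e.g. from an external sentiment API)
    and 'categories' (category names detected in the tweet by the tagger).
    A tweet tagged with several categories contributes to each of them.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for t in tagged_tweets:
        for cat in t["categories"]:
            totals[cat] += t["sentiment"]
            counts[cat] += 1
    return {cat: totals[cat] / counts[cat] for cat in totals}
```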

Conclusions and Future Work
The new dataset and model enable us to analyze the Polish political scene, counter political misinformation in social media, and evaluate candidates' political advertising. We plan to work on more datasets and models to fight fake news, classify political agitation content, and broaden natural language solutions for elections and political content in Polish social media. Dataset annotation will be very challenging due to the many potential concept drifts between different election types, such as presidential, parliamentary, and European Union elections. We also use the political advertising model to generate presidential candidates' vector representations, which let us compare candidates with each other and determine who is similar to whom.