Dataset of Jaccard similarity indices from 1,597 European political manifestos across 27 countries (1945–2017)

This dataset compiles the results of our computations of Jaccard similarity indices of political manifestos from 27 European countries since 1945. The raw text is accessible through the Manifesto Project's API in R. In total, 1597 political manifestos have been compared on a country level, providing measures of similarity between different political platforms. In addition to the Jaccard similarity indices gathered into 27 csv files (one per country), the R code to download, transform, and compare political manifestos is provided. Moreover, we also provide the R code to execute all the necessary computations for a whole country.


Data
The data files we provide in this article are in csv format. Each of the 27 files concerns a specific country and gathers the Jaccard similarity indices of all manifestos available from the Manifesto Project concerning a specific country. In addition to the 27 csv files, we provide four supplementary files in R format in order to reproduce all computations. Those four files help to (1) access the Manifesto Project's API (asses-sManifestoText.R), (2) transform the raw data (transformRawData.R), (3) compute the Jaccard similarity indice between two sets of text (jaccardSimilarity.R) and (4) produce all data from extracting text files from the API to producing the heatmap and csv file for one country (completeDataForOneCountry.R).
In recent years, the rise of populism has been noticed through some elections. More surprisingly, this dynamics has also occurred in countries with strong democratic institutions. Edgar Morin [1] even speaks of a period of historical regression. From the election of President Trump to Brexit, electoral platforms of traditional political parties (called government parties) became very close to the contesting parties, in so-called populists parties. It has become difficult to characterize far-right parties due to the diversity of movements [2,3]. However, they could be defined by what they oppose [4] and as such, as contesting parties. They share core values, such as radicalism, populism and natalism. Radical right to far-right political parties have evolved during the past decades, with the opportunities of securing government positions in Italy (Lega) or in Austria (Freedom Party) for example.
Our motivation is the following one: how has evolved the political platform of such parties through time and across Europe? In order to answer this question, the use of raw e unedited e political platforms in their original language is particularly relevant. The Manifesto Project (https://manifestoproject.wzb.eu) collects and provides information about electoral programs aiming at comparative politics-based studies [5]. The Manifesto Project is supported by the WZB (Social Science Research Center Berlin) and DFG (German Science Foundation). Elections since 1945 from more than 50 countries are integrated into this database. Machine-readable texts are accessible through several APIs (R, Stata), as well as the manual codification of each political party's propositions. Several tools have been developed to visualise the results of these analyses as for example the evolution of political parties on the left-right axis through time. Since 2000, more than 350 refereed journal articles have been published by scholars using the Manifesto Project's database.
In our case, we use the raw text accessible through the API with the R library ManifestoR. The methodology used to compare two texts together is the Jaccard similarity indice. It is a metric ranging from [0; 1] comparing two sets of elements, such as with set 1 and set 2 representing both sets of text to be compared. This metric has already been used in several fields. Historically, it originates from the botany field when, in the early 1900, Paul Jaccard quantified the number of common floral species across several sets of lands [6,7]. It has then been used in International Business when [8,9] used Jaccard similarity indices to measure the persistency of the TPP across free trade agreements. From a methodological perspectively, they treated text as data by using character N-grams of 5 characters before providing a coefficient of similarity. To be noticed, the use of character N-grams of 5 characters instead of bag-of-words provides the advantage of capturing the order of words in a text. This method is interesting in the sense that it does not require the use of reference lexicons such as in sentiment analyses. This provides a greater latitude in the analysis of different languages. For instance, studies of Arabic texts [10,11] and Thai [12] have been performed using this methodology.
In this data-based paper, we compare e at the country level e each political manifesto available from the Manifesto Project and provide the associated Jaccard indices. These metrics could serve multiple purposes, such as comparing populist manifestos to government parties' manifestos, and evaluate this association through time across Europe.

Experimental design, materials and methods
The experimental design is divided into three main parts: (1) accessing data, (2) transforming of political manifestos, and (3) comparing texts through Jaccard indices. The R code used in this article is available in the supplementary files (assessManifestoText.R, transformRawData.R, jaccardSimilarity.R, completeDataForOneCountry.R).

Accessing data
Political manifestos were obtained with the R library ManifestoR [13]. For each country, a list of available manifestos is computed. The number of political manifestos varies by country, ranging from 9 for the least (Romania) to 175 for the most (Denmark). In total, 1597 manifestos are accessible through the 2018e2 version of the Manifesto Project's database (updated on December 2018) [14].
After obtaining an API key to use the Manifesto Project's API (available at: https://manifesto-project. wzb.eu/), the first step is to gather all available manifestos using the mp_corpus() function, as well as the country name and the date (after 1940) as inputs. The complete list of political parties provided by the API could be obtained using the names() function on the corpus variable previously computed. Finally, for each political manifesto, the unannotated text is considered using the content() function on the corpus of text. Fig. 2 The following table provides the number of texts associated with each country, as well as the total number of words analyzed for each country. The total number of words was assessed with the following line of R code from the stringr library in R [15]: Fig. 3 1 Fig. 2. Evolution of Jaccard indices for the Austrian Freedom Party. Finally, provides an overall view of the distribution of Jaccard indices concerning the overall database (27 countries). All data is provided in the supplementary files in csv files, each file concerning a single country. 1 Fig. 3 provides a comparison on a country level of the distribution of Jaccard similarity indices. As such, countries can be compared one to another. However, due to different languages involved within political manifestos from different countries, two parties from two different countries are not compared.  Overall, the total number of words processed to obtain the different Jaccard indices for this study is 21, 939, 796 (see Table 1).

Data transformation
Each manifesto is then transformed before being compared to each other. First, the data type is changed to be considered as a string variable. Then, by using the tolower() function, each word is uncapitalized in order to have a uniform text. Finally, the text is divided into character N-gram of 5 characters using the substring() function. In order to account for each possible combination, the substring() function is replicated five times with a sliding selection window of one character per iteration.

Jaccard indices
After transforming each political manifesto into several sets of character N-grams of 5 characters, the comparison of all political manifestos for a specific country could then be assessed. In order to do so, the function jaccard_similarity() from the textreuse library in R [16] is used. The result is an indice ranging from [0; 1], with 0 meaning that two sets of text do not share any common elements and 1 meaning that both sets share exactly the same elements. The R code line used is the following one: With unigram5charA and unigram5charB two sets of processed texts to be compared. These measures could illustrate how political parties promotes platform that are similar to each other. For example, concerns Austria, with a total of 60 political manifestos compared to each other. The symmetric heatmap highlights areas with stronger similarities while the lighter areas presents lower value of Jaccard indices.