Data for Sochi 2014 Olympics discussion on social media

The data presented here are related to the research article “Sochi 2014 Olympics on Twitter: Perspectives of Hosts and Guests” [2]. The data were collected through regular Twitter API searches over a five-month window around the 2014 Sochi Olympic Games and were then used for cluster analysis and for analysis of sentiment about the Games. The main dataset contains 616 thousand tweets, rigorously cleaned and filtered to remove irrelevant content. To comply with the Twitter API user agreement, the dataset presented in this article includes only generalized daily data, with all information contained in individual tweets removed. The proposed use of the dataset is academic research on how discussion of topics related to mega-events changes in conjunction with political events.


Value of the data
• Hashtags (keywords) are an emerging linguistic norm; as such, they are widely used in social media research and data mining to discern changes in the main topics and sentiment of public discourse.
• The database represents 616 thousand tweets that were filtered, quality-evaluated, stripped of identifying content, and generalized on a daily basis to comply with the Twitter API data use agreement.
• Additional datasets contain the hashtags found in English and Russian messages, which make up the majority of tweets in the collected data.

Data
Mega-events such as the Olympic Games create new opportunities for communities and businesses and stimulate economic growth [4]. At the same time, the host country's increased international visibility promotes discussion of issues only tangentially related to the main event, such as human rights, politics, and the environment. An emerging method for mining public opinion surrounding international mega-events employs data from online social media such as blogs and photo sharing platforms. After the data are collected, data mining methods are used to discern the main topics of public interest, geographical patterns, changes in positive and negative emotions, and similar derived variables. In this way, social media analytics has been successfully applied in diverse areas such as disaster management, election polls, and the formulation of relevant social policies [1,3].
The published dataset (Supplementary Database 1; see metadata in Table 1) reflects the online discussion of the 2014 Sochi Winter Olympic Games (February 7-23, 2014) and Winter Paralympic Games (March 7-16, 2014) on the Twitter microblogging platform. The data show the most frequent hashtags (keywords) contained in tweets published before and immediately after the Olympic Games, allowing changes in public discourse to be tracked before, during, and after the mega-event. To comply with the Twitter Terms of Service, raw data are not shared; the derived product contains the frequency of the most frequent hashtags on a daily basis, stripped of any geographical or personal identification. The dataset was used in the related research article [2], which contrasted the main themes of discussion in the Russian and English sectors of Twitter and compared pre- and post-Games sentiment expressed by the Olympic Games hosts and guests. Fig. 1 depicts the daily number of collected tweets in different languages.
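The daily generalization described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the function name `daily_hashtag_counts`, the `(date, text)` input format, the regular expression, and the choice to count a hashtag at most once per tweet are all assumptions made for illustration, and the sample tweets are invented.

```python
import re
from collections import Counter, defaultdict

# Assumed hashtag pattern: '#' followed by word characters.
# In Python 3, \w also matches Cyrillic letters in str patterns.
HASHTAG_RE = re.compile(r"#(\w+)")

def daily_hashtag_counts(tweets):
    """Aggregate hashtag frequencies per day.

    tweets: iterable of (date_string, text) pairs (hypothetical format).
    Returns a dict mapping each date to a Counter of lowercased hashtags.
    """
    counts = defaultdict(Counter)
    for day, text in tweets:
        # Assumption: a hashtag is counted once per tweet, case-insensitively.
        tags = {tag.lower() for tag in HASHTAG_RE.findall(text)}
        counts[day].update(tags)
    return counts

# Invented example tweets, for illustration only.
sample = [
    ("2014-02-07", "Opening ceremony tonight! #Sochi2014 #Olympics"),
    ("2014-02-07", "Go team #sochi2014"),
    ("2014-02-08", "Первые медали #Сочи2014"),
]
daily = daily_hashtag_counts(sample)
top_feb7 = daily["2014-02-07"].most_common(1)
```

Only the per-day counters leave this step, so no tweet text, user name, or location is carried into the derived product.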

Experimental design, materials and methods
Raw data were collected through Twitter REST API searches with adaptive frequency (from six times daily up to once every three minutes during the Games) using the keywords sochi, olympics, paralympics and their translations into Russian for one year, starting November 1st, 2013, resulting in 7.8 million tweets. For quality control, stratified random sampling was applied to the collected data, and the selected sample of 600 tweets was manually classified as either related or unrelated to the Sochi Olympics. The manually classified sample was used to extract classification rules, which were applied to filter the collected data based on (1) month of data collection and (2) hashtags, with the goal of minimizing the percentage of tweets unrelated to the Games. This resulted in a retained subset of 616,333 tweets spanning November 1st, 2013 to March 31st, 2014.
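As an illustration of the sampling step, the sketch below draws a stratified random sample from a list of records. The text does not state which variable defined the strata, so stratifying by month of collection is an assumption here, as are the function name `stratified_sample`, the record layout, and the equal allocation per stratum.

```python
import random
from collections import defaultdict

def stratified_sample(records, key, per_stratum, seed=0):
    """Draw up to `per_stratum` records at random from each stratum.

    records: list of dicts (hypothetical layout).
    key: function returning the stratum label of a record.
    """
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    strata = defaultdict(list)
    for rec in records:
        strata[key(rec)].append(rec)
    sample = []
    for name in sorted(strata):
        group = strata[name]
        # Never request more records than the stratum contains.
        k = min(per_stratum, len(group))
        sample.extend(rng.sample(group, k))
    return sample

# Invented records: 100 placeholder tweets for each of three months.
months = ("2013-11", "2013-12", "2014-01")
records = [
    {"id": i, "month": months[i % 3]}
    for i in range(300)
]
picked = stratified_sample(records, key=lambda r: r["month"], per_stratum=5)
```

Sampling each stratum separately guarantees that every month is represented in the manually classified sample, which simple random sampling over the full collection would not.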
For quality assessment, an independent sample of the retained data was evaluated; it was concluded that, for each of the five months of data collection from November 2013 to March 2014, the final dataset contained at least 85% relevant tweets (Table 2). Note that this evaluation was applied only to tweets published in the two most common languages in the dataset, English and Russian; in total, these account for 439,106 of the 616,333 tweets (about 71%). Accordingly, the provided data contain a language attribute.
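The per-month relevance check can be sketched as follows. This is a hedged illustration only: the function name `relevance_by_month`, the `(month, is_relevant)` label format, and the label values are invented and do not reproduce the figures in Table 2; only the 85% threshold comes from the text.

```python
from collections import defaultdict

THRESHOLD = 0.85  # minimum share of relevant tweets reported in the text

def relevance_by_month(labels):
    """Compute the share of relevant tweets per month.

    labels: iterable of (month, is_relevant) pairs from a manually
    evaluated sample (hypothetical format).
    """
    totals = defaultdict(int)
    relevant = defaultdict(int)
    for month, is_relevant in labels:
        totals[month] += 1
        relevant[month] += bool(is_relevant)
    return {month: relevant[month] / totals[month] for month in totals}

# Invented labels, for illustration only.
labels = (
    [("2013-11", True)] * 9 + [("2013-11", False)] * 1
    + [("2013-12", True)] * 18 + [("2013-12", False)] * 2
    + [("2014-01", True)] * 7 + [("2014-01", False)] * 3
)
shares = relevance_by_month(labels)
below = sorted(m for m, share in shares.items() if share < THRESHOLD)
```

Any month listed in `below` would signal that the filtering rules need to be tightened for that period before the data are released.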