Dataset on dynamics of Coronavirus on Twitter

In this data article, we provide a dataset of 8,982,694 Twitter posts around the coronavirus health global crisis. The data were collected through the Twitter REST API search. We used the rtweet R package to download raw data. The term searched was “Coronavirus” which included the word itself and its hashtag version. We collected the data over 23 days, from January 21 to February 12, 2020. The dataset is multilingual, prevailing English, Spanish, and Portuguese. We include a new variable created from other four variables; it is called “type” of tweets, which is useful for showing the diversity of tweets and the dynamics of users on Twitter. The dataset comprises seven databases which can be analysed separately. On the other hand, they can be crossed to set other researches, among them, trends and relevance of different topics, types of tweets, the embeddedness of users and their profiles, the retweets dynamics, hashtag analysis, as well as to perform social network analysis. This dataset can attract the attention of researchers related to different fields on knowledge, such as data science, social science, network science, health informatics, tourism, infodemiology, and others.


a b s t r a c t
In this data article, we provide a dataset of 8,982,694 Twitter posts around the coronavirus health global crisis. The data were collected through the Twitter REST API search. We used the rtweet R package to download raw data. The term searched was "Coronavirus" which included the word itself and its hashtag version. We collected the data over 23 days, from January 21 to February 12, 2020. The dataset is multilingual, prevailing English, Spanish, and Portuguese. We include a new variable created from other four variables; it is called "type" of tweets, which is useful for showing the diversity of tweets and the dynamics of users on Twitter. The dataset comprises seven databases which can be analysed separately. On the other hand, they can be crossed to set other researches, among them, trends and relevance of different topics, types of tweets, the embeddedness of users and their profiles, the retweets dynamics, hashtag analysis, as well as to perform social network analysis. This dataset can attract the attention of researchers related to different fields on knowledge, such as data science, social science, network science, health informatics, tourism, infodemiology, and others.
© 2020 The Author(s

Value of the data
• Different global diseases have attracted the attention in social media before. When an event of this magnitude happens, a massive quantity of people goes to these platforms to inform, exchange, post and look for information. This way, this dataset provides a great perspective on the dynamics of coronavirus on Twitter. • Researchers on different fields of knowledge can use these databases to analyse the Twitter activity in the first stages of the coronavirus outbreak. Also, analysts or data scientists might be interested in the dataset to inform governments and agencies. This way, it would be possible to design effective strategies of communication and insertion on social media to face other potential pandemics. • The dataset contains 8,982,694 Twitter posts around the coronavirus global crisis. Thus, it can be used to analyse different posting patterns and users' dynamics, as well as diverse social networks, among other topics. • The data collection started when there was a boom in the use of hashtags related to the coronavirus outbreak in China. Those hashtags became trending topics very quickly. This way, the dataset covers the initial dynamics of the coronavirus outbreak on Twitter. • Seven filtered and analysed databases are provided, as well as a glance at their composition.
Further analysis can be done by crossing these databases. • A new variable, called "Type" of tweets, was created. This variable categorises each post into (1) tweets without mentions, (2) tweets with mentions, (3) retweets, and (4) replies. Through its use, it is possible to see different interactions and dynamics on Twitter.

Data Description
We conglomerate a dataset of 8,982,694 Twitter posts (tweets). These tweets reflect the early discussion around the coronavirus health global crisis on this social network. All the collected data were searched using the keyword "Coronavirus". This implies that tweets containing this word were retrieved, as well as the posts including its hashtag version (#Coronavirus). The 8.98M were gathered from January 21 to February 12, 2020, i.e. 23 days. We selected those dates since, in the first one (21-Feb-2020), there was a boom around the topic on Twitter, where several hashtags related to the coronavirus and the outbreak in China were trending topics [1] . Then, the data collection was closed (12-February-2020) when the World Health Organization (WHO) changed the official name of this disease from "Coronavirus" to "COVID-19" ( link to the official communication ).
This paper compared to other publications, which also tracked diseases on Twitter, conglomerates a considerable quantity of Twitter posts within 23 days. For instance, Chew and Eysenbach [2] archived over 2 million tweets related to H1N1 or swine flu over eight months; while Stefanidis et al. [3] collected 6.25M tweets regarding the Zika outbreak for 12 weeks (three months). More recently, Cinelli et al. [4] analysed social media infodemic, and gathered around 1.18M posts on Twitter, in 18 days. Another example in another field of knowledge is the data on olympic-themed tweets, with 21.2M tweets collected over four months [5] . Fig. 1 shows the daily distribution of the downloaded Twitter posts. It is possible to appreciate that the process of downloading data retrieved some tweets from 13-Jan-2020. It is worth mentioning that the amount of information that we could gather depended on two factors: (1) The increasing discussion of the topic on Twitter, and (2) The times in which the code for downloading the data was run. When we started to see the quantity of information on Twitter, the code was run as much as possible. This way, the trend in Fig. 1 shows that there was increasing attention on the topic on Twitter.
Since the raw data were vast, we had to filter and create different databases. All of them are available in a Mendeley dataset. This dataset has been created in line with Twitter's Terms & Conditions [6] ; none of the databases contains the text of the collected tweets. Table 1 shows the name of each database and a brief description of its variables. It is worth mentioning that all the databases contain the variable "status_id", which can be used to join them. This way, the Mendeley dataset can be further applied to cross databases and, thus, analyse different topics. It will depend on the objective of the analysis, or the researchers' interests. If you have further questions about the dataset in Table 1 or you need more information about the raw variable, please contact us. Furthermore, through the "status_id" variable, it is possible to hydrated Twitter content using the Twitter APIs, and thus to get all the information related to the IDs. In the database "02.Type.Tweets", a new variable was created based on other four variables which are not included in the dataset; but in the downloaded data they were: "is_retweet", "re-ply_to_status_id", "mentions_screen_name", and "reply_to_user_id". Since a user can tweet different messages with different characteristics, each one of the posts was classified in one of four categories, based on the following: 1. Tw, this was used for original tweets, when the post did not include any mention; 2. MT, this was created when the user mentioned other users within the tweet; 3. RT: Retweets, this category was used for those tweets which retweet one post and; 4. Re: Replies, this type of tweets were created when one user replied to another one.
By doing the above, in Fig. 2 it is possible to see that a lot of the interaction and discussion on Twitter around the coronavirus topic had been done by the retweets (category 3. RT). 56.3 % of the 8.98M Twitter posts were retweets. Fig. 2 also shows that users had been publishing original tweets without any mention within them (1. Tw, without mentions, 31.2%). The other databases and analyses provided in this paper are based on the differentiation of tweets by their "Type". By including this variable in different databases (see Table 1 ), we want to prove that these categories influence a lot the interaction dynamics on Twitter. This, at the same time, enriches the applications and uses that the databases can have beyond this paper.

Data collection
The raw data of the 8.98M Twitter posts were downloaded from Twitter REST API search using the "rtweet (version 0.7.0)" R package, which was designed to simplify the interaction with Twitter's APIs and make it accessible to a wider range of users [7] . The collection process comprised from January 21 to February 12, 2020, i.e. 23 days. We searched for the term "Coronavirus" which included the word itself and its hashtag version (#Coronavirus). It is worth mentioning that the procedure to search tweets only returns data from the past 6-9 days, and it typically can return up to 18,0 0 0 Twitter statuses in a single call (see this link ). Based on this, we run the process several times, creating a total of 424 files, which summed 17,974,805 records. We eliminated the duplicated ones, and the final number of retained tweets was 8,982,694, which are unique. All the collected data is multilingual since we did not use any kind of restriction about it. The most used languages on the tweets were English (en = 56.8%), Spanish (es = 19.4%), Portuguese (pt = 5.1%), French (fr = 4.7%) and, Italian (it = 2.19); but in total, the tweets were written up in 65 different languages ( Table 2 ).
Since the raw data contained 90 variables, we had to filter and create different databases regarding the purpose of each analysis (see Table 1 ). This also enabled us to handle the data easier. We did this by taking into account Twitter's Terms & Conditions [6] . This way, in the next subsection, the results of analyses are shown related to each database explained above.

Links and media
Two of the main features of Twitter posts are if they have links and media into the text. This is because tweets with these characteristics have more probability of being retweeted [8] . In order to explore that, we created a database called "03.Links.Media.Tweets" (see Table 1 ). This way, first, we excluded the posts whose type was "3. RT" (see Fig. 2 ); thus, 3,928,380 tweets (43.73%) were retained. We found that almost 60% of the Twitter posts contained a link; the tweets that quote other tweets are also included. Links were more used in the types 1. Tw, and 2. MT ( Fig. 3 ). In the case of media, we conversely found that a considerable proportion (79%) of the tweets did not contain any kind of media (pictures, gifs, video, etc.). Twitter posts with media were more used in type 1. Tw ( Fig. 4 ). Further information can be found in this database, which includes: the number of retweets received in each tweet; in this case, 75.7% of the 3,928,380 tweets did not receive any retweet. Also, it includes the complete URL to each post, the URL for the external links, as well as the URL for media.

Users and types of tweets
In order to know how many users had been involved so far in the Twitter discussion around the coronavirus topic, we created a database called "05.Users.And.Type" (see Table 1 ). After analysing it, we found that 3,320,754 unique users had tweeted the 8.98M tweets. Interestingly, Fig. 6 shows that a considerable proportion of users tweeted only once (67%). But also, surprisingly, there was only one user that published more than 10,0 0 0 tweets.
Going more in-deep, when we included the types of tweets ( Fig. 7 ), the analysis got enriched, and it was possible to see that more than 50% of the users who tweeted only once, they did it through the retweet (3. RT). 28% of the accounts that also posted once did it by using the tweet without mentions (1. Tw). Meanwhile, 11% replied to another tweet, and they did it in only one occasion. The last proportion was the type "2. MT", with 5.6%. For the other categories based on the number of Twitter posts, the interpretation goes in the same direction. All this means that people on Twitter had been interacting by using different types of interaction (Tw, MT, RT, and Re); people were posting, others replying, and many others retweeting. These types of patterns can attract the attention of other researchers.
Following Fig. 6 , it is possible to identify that there were 61 users more dynamic than the rest; they tweeted more than 1,0 0 0 tweets. But, when the variable of the types of tweets was introduced, in Fig. 7 we could appreciate that there were only 49 users. It shows that people  who tweeted a lot was because they were mixing the types of interactions. Thus, they were classified into different categories in Fig. 7 . To exemplify this type of mixing interaction, in Fig.  8 we present the 20 most active accounts by its total number of posts and the types of tweets. It is worth noting that neither account in Fig. 8 appeared in Fig. 5 and vice versa. This again highlights the different patterns of interactions on Twitter around the coronavirus topic.

Hashtags
The use of hashtags on Twitter posts is one of the primary mechanisms to get inserted into different trends and topics [ 9 , 10 ]. Thus, we created a database called "06.Hashtags" (see Table 1 ). First, we excluded the "3. RT" tweets type; then, we worked with 3,928,380 tweets. In this set of data, 1,289,328 (32.8%) tweets contained one or more hashtags into the text; the rest of the tweets (2,639,052 -67.2%) did not include any kind of hashtag. Within these 1.28M tweets, we found 205,890 hashtags terms. But, 61.8% of those terms (127,270) were used only once. This way, we filtered the hashtags that were used more than 100 times; 2,189 terms were retained. Finally, we screened the hashtags that we considered as rare because they were based on numbers, symbols and were too long; 2,101 terms were obtained.
By using a wordcloud, we show the diversity of multilingual hashtags used around the coronavirus topic ( Fig. 9 ). Only for visual reasons, we removed four hashtags that were used more than 50,0 0 0 times because they distorted the wordcloud. They were: coronavirus, china, wuhan and, coronavirusoutbreak. In Table 3 , the 50 hashtags most used are presented. We are sure that some researchers can be interested in the use of hashtags on the tweets; for instance, this can be applied in the infodemiology research topic (see Eysenbach [11] ). Also, concerning the terms used on tweets (co-mentions), and how this affects the trends on Twitter, as well as the embeddedness of users.

Social network analysis
A major contribution of this paper is that, from the 8.98M collected tweets, we constructed two databases for social network analysis (SNA); they were called "07a.Tw.edges" and "07b.Tw.nodes" (see Table 1 ). Both together will let other researchers study the network structures that emerge from the interaction among the Twitter users involved in the discussion of the coronavirus topic. SNA has been applied in other Twitter data-based papers, for instance [ 3 , 12 , 13 ]. SNA let us tackle and understand complex networks shaped by different actors, as well as visualise their links, and measure their characteristics [ 14 , 15 ].
Based on the two databases mentioned above, we got a network shaped by 3,565,497 nodes (Twitter users) and 7,296,841 links (without differentiating between mentions, retweets, and replies; but, in the "07a.Tw.edges" database this variable was included for further analysis). In this case, we used the "igraph (version 1.2.4.1)" R package [16] . Then, we simplified the links summing their recurrence and removing the loops. 3,565,4 97 nodes and 5,909,6 81 links formed the resulting network. The links are weighted since pairs of nodes could be linked in several times through different tweets. Also, we computed the number of components in the global network, and we found that 750,108 subnetworks constituted it. But, there were 645,305 components (86.0%) of size one; it means, those nodes did get any kind of interaction. The biggest component comprised 2,668,807 nodes (74.9% of the total) and 5,757,797 links (97.4% of the total). A diversity of different researchers could be interested in these two databases since they can analyse different types of interactions, networks structures, and node properties.
In order to provide an example of the relevance of these databases and the use of an SNA approach, we analysed the network of the account: @realDonaldTrump. We filtered the database and got the users who interacted with this account, as well as how they interacted among them. Fig. 10 shows a network shaped by 30,197 nodes and 65,088 links. Interestingly, @re-alDonaldTrump tweeted only four tweets without any mention; but he was mentioned 3,719 times, retweeted 14,260 times and, replied 3,549 times. Fig. 10 also shows other prominent nodes based on the degree centrality.