Introduction

In 2020, Americans returned to cast their vote for the next president of the US: incumbent Republican Donald J. Trump or the Democratic challenger, and former Vice-President Joseph R. Biden. We began collecting tweets in May 2019 in an effort to capture online chatter surrounding this defining democratic process and to make this collection available to the research community.

Historically, the incumbent president is favored to win their party’s nomination for president;Footnote 1 although Trump did face a few challengers from the Republican party, it became increasingly clear that he would gain the Republican party’s nomination.

Joe Biden officially accepted the Democratic nomination during the Democratic National Convention.Footnote 2 Donald Trump officially accepted his nomination on August 27, 2020, during the Republican National Convention.Footnote 3

As the final sprint to election day on November 3, 2020 began, Americans took to online social platforms to voice their opinions and engage in conversation surrounding the elections. Twitter has historically been a platform used by politicians to reach their base [10], and has recently begun more aggressive efforts to tag posts as misleading and potentially incorrect in order to mitigate the spread of misinformation that had already been prevalent on the platform [4].Footnote 4 On election day, many again used social media to express their thoughts on the unfolding elections. News outlets were unable to call the elections for several days after election day, as many key states were still counting ballots; social media was used as a means to spread information (both factual and misleading) and to both protest and advocate for controversies surrounding ballots and the influx of mail-in ballots caused by COVID-19.Footnote 5, Footnote 6

On November 7, the media was finally able to call the election and named Biden as the president-elect, and Kamala Harris as the vice-president-elect.Footnote 7 Yet, in the aftermath of this pronouncement and in the current polarized nature of the United States political landscape, social media has become an environment where misinformation and disinformation can flourish and spread. President Trump refused to concede the election, and continued to promote the claim that the election had been stolen.Footnote 8, Footnote 9 These claims from Trump bolstered the basis for the “stop the steal” campaign, and eventually culminated in a riot at the United States Capitol on January 6, 2021.Footnote 10, Footnote 11 This led Twitter and other social media platforms to either semi-permanently or permanently suspend President Trump’s accounts from their services, citing the riot and the potential for further incitement of violence as grounds for the bans.Footnote 12 Many vendors began to cut ties with right-wing social media platform Parler due to the role it played in coordinating the January 6 riot.Footnote 13 President Biden was inaugurated into office on January 20, 2021, along with Vice President Harris.Footnote 14

Inspired by the positive impact that our similar initiative to share a COVID-19 Twitter dataset has had on the research community [3], in this paper, we document the release of our 2020 US Presidential election-related dataset that we have been collecting for over one year, a period covering all the events described above and more. We hope that, in releasing this dataset, the research community can leverage its content to study and understand the dynamics in a highly contentious election held during a pandemic. This dataset enables researchers to directly study the impact that the pandemic has had not only on the political landscape, but also on misinformation, disinformation and coordinated actors, with reports of confirmed foreign interference attempts already surfacing [7].Footnote 15

Data collection

Data collection method

We have been collecting election-related tweets since May 20, 2019, and collection is ongoing. We use Twitter’s streaming API through the Tweepy library and follow specific mentions and accounts related to candidates who were seeking their party’s nomination for president of the United States, in addition to a manually compiled list of general election-related keywords and hashtags.Footnote 16 As candidates officially announced the suspension of their campaigns, their respective accounts and mentions were removed from our real-time tracking list. In response to real-world events, we decided to restart tracking for a subset of these accounts, in addition to adding supplemental keywords and accounts to our tracking list. This is documented in Table 1.

We will continue to collect election-related tweets at least through the first six months of the Biden administration, so as to capture the nation’s post-election and post-transition activity. In total, our dataset comprises well over 1 billion tweets. Release v1.12 contains 1,258,209,617 tweets, spanning from 12/01/2019 through 1/22/2021. In our latest (v1.16) and future releases, we will continue processing and adding data we collected prior to 12/01/2019 and after 1/22/2021.

Note: Twitter’s Developer Agreement & Policy stipulates that we are unable to share any data specific to individual tweets except for a tweet’s Tweet ID. As a result, we are releasing a collection of Tweet IDs that researchers are then able to use in tandem with Twitter’s API to retrieve the full tweet payload. We recommend using tools such as DocNow’s HydratorFootnote 17 or TwarcFootnote 18; if tweets have been deleted from Twitter’s platform, researchers will be unable to retrieve the payloads for those tweets. We provide ready-to-use Python scripts to perform all the operations described above in our repository.
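As a rough illustration of the preparation step that precedes hydration, the sketch below cleans a list of Tweet IDs and groups them into batches. The one-ID-per-line layout and the batch size of 100 are assumptions made for this example, and the function names are hypothetical; tools such as Hydrator or Twarc, and the scripts in our repository, handle this step on their own.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def read_tweet_ids(lines: Iterable[str]) -> Iterator[str]:
    """Yield cleaned Tweet IDs, skipping blank or non-numeric lines.

    IDs are kept as strings so that large values are never rounded.
    """
    for line in lines:
        tweet_id = line.strip()
        if tweet_id.isdigit():
            yield tweet_id


def batch_ids(ids: Iterable[str], batch_size: int = 100) -> Iterator[List[str]]:
    """Group IDs into lists of at most batch_size items.

    The default of 100 is an assumed per-request size, not a value
    taken from the dataset documentation.
    """
    iterator = iter(ids)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch
```

Each batch can then be passed to whichever hydration tool is in use. IDs that belong to deleted tweets will simply return no payload.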

Tracked keywords and accounts

In order to capture the chatter surrounding the 2020 US presidential elections, we followed specific user mentions and accounts that were and are tied to the official and personal accounts of candidates who ran for president. Twitter’s streaming API gives us access to an approximately 1% sample of all tweets in real time, and takes in a list of keywords, returning any tweet within that sample stream that contains any of the keywords in the metadata and text of the tweet payload.Footnote 19 Thus it is unnecessary to track every permutation of each keyword. We list a sample of the mentions and accounts that we tracked in release v1.12 in Table 1 and a sample of the keywords we tracked in Table 2. A full list can be found in the accounts.txt file and keywords.txt file in our data repository.

Table 1 A sample of the mentions and accounts that we actively tracked (v1.12 — January 25, 2021)
Table 2 A sample of keywords that we actively tracked in our Twitter collection (v1.12 — January 25, 2021)
Table 3 Top 40 hashtags (v1.12 — January 25, 2021)

Data and access modalities

We upgraded our data collection pipeline on June 20, 2020 for data collection reliability purposes. Data prior to June 20, 2020 experienced higher rates of technical collection issues. While our most recent release is Release v1.16, containing 1,355,356,627 tweets from December 1, 2019 through February 19, 2021, we focus on and detail release v1.12 throughout this study.

Release v1.12 (January 25, 2021)

Release v1.12 includes tweets collected from December 1, 2019 through January 22, 2021, containing 1,258,209,617 tweets in all. We are still continuing our computational efforts to pre-process and clean the rest of our existing dataset, and will be uploading batches of past and future data as they become available. A sample of the mentions/accounts and keywords that we followed can be found in Tables 1 and 2, respectively, with full lists of both available on our GitHub repository. Furthermore, Table 3 shows the top 40 most popular hashtags, grouped by general categories. We can clearly see that most of the hashtags are directly related to party campaigns and conspiracy theories surrounding the elections. Others are related to political events, social movements and the COVID-19 pandemic.

Table 4 Top 10 language breakdown for release v1.12. Languages were automatically tagged by Twitter and returned in a tweet’s metadata

As this dataset was curated for the 2020 US Presidential election cycle, it is unsurprising that the majority of these tweets are in English (see Table 4 for a breakdown of the languages in release v1.12).
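A language breakdown of this kind can be computed from hydrated tweets with a simple tally. The sketch below assumes each tweet is a dictionary with a `lang` field holding Twitter's automatic language tag, and it falls back to `und` when the field is missing; the function name and the fallback are choices made for this example.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def top_languages(tweets: Iterable[Dict], n: int = 10) -> List[Tuple[str, float]]:
    """Return the n most common language tags with their share of all tweets (in %)."""
    # Tweets without a language tag are counted under "und" (assumed fallback)
    counts = Counter(tweet.get("lang", "und") for tweet in tweets)
    total = sum(counts.values())
    if total == 0:
        return []
    return [(lang, 100.0 * count / total) for lang, count in counts.most_common(n)]
```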

Data access

The dataset is publicly available and continuously maintained on GitHub at this address: https://github.com/echen102/us-pres-elections-2020.

The dataset is released in compliance with Twitter’s Terms & Conditions and the Developer’s Agreement and Policies.Footnote 20 This dataset is still presently being collected and will be periodically updated on our GitHub repository. Researchers who wish to use this dataset must agree to abide by the stipulations stated in the associated license and conform to Twitter’s policies and regulations.

Data analysis

Although we are continuing to collect tweets to add to our data collection as we follow the transition to the Biden-Harris administration, we first present an analysis on tweets from our dataset from January 2020 through the end of December 2020. This enables us to examine political discourse on Twitter through the Presidential primaries, debates and election. Highly political divisions have emerged in COVID-19 discourse [9], alongside conspiracy theories [6] and public health related trends that have emerged due to COVID-19 [3]. Our recent work on this dataset has also shown that partisan trends drive the discourse on Twitter, with conservative users posting at much higher volumes compared to their liberal counterparts. Conservative users also tended to share more known conspiracy-related narratives [7]. We have also observed that there are highly connected conservative users that are more prone to spread public health and voting misinformation [2].

During the 2020 Presidential election, the incumbent, former President Trump, faced little difficulty in securing the Republican nomination.Footnote 21 Although Trump did face three Republican challengers (Mark Sanford, Joe Walsh and Bill Weld), Trump earned 2395 delegate votes, an overwhelming majority.Footnote 22

The Democratic primaries were more competitive, with a historic 28 candidates vying for the nomination.Footnote 23 However, as national poll results began to roll in and initial primary results were tallied, candidates began to drop out of the race (see Table 5 for dates candidates from both parties suspended their campaigns). The advent of COVID-19 in the United States in March 2020, and the ensuing regulations to encourage social distancing, forced the remaining campaigns to shift to virtual models. The race narrowed down to two candidates: Vermont senator Bernie Sanders and former Vice President Joe Biden. As more primaries took place and results were reported, it became clear that Biden would win the 1991 delegates needed to become the presumptive Democratic nominee.Footnote 24 Sanders conceded to Biden on April 8, 2020 and endorsed Biden.Footnote 25, Footnote 26

Table 5 This table lists each of the 2020 US Presidential candidates’ names, party affiliation and campaign suspension date.

Overview of presidential candidate Twitter discourse

Our dataset specifically tracked 2020 US Presidential elections-related keywords and accounts. As a result, we expect to see that the captured discourse reflects major events that took place throughout our collection period. We limit our analysis to tweets from our dataset that were collected from January 2020 through December 2020.

The fight for the Democratic Presidential Nomination

Fig. 1

The above figure shows a time series analysis of tweets that mention keywords related to a Democratic nominee’s campaign from January 2020 through May 8, 2020. Sanders announced the suspension of his presidential campaign on April 8, 2020, so we capture all discourse through a month after Biden was declared the presumptive Democratic Presidential nominee. We measure the percentage of total tweets collected on a particular day that mention the candidate on a rolling 7-day average. The keywords we use for each candidate can be found in Table 6 and descriptions of the noted dates in the table below the time series. We also include the raw volume of all tweets collected on a particular day on a rolling 7-day average above the time series

We first investigate the chatter surrounding the Democratic primaries, as the race to win the nomination was competitive and multiple candidates emerged as favorites. While Biden may have held an early lead, Sanders, Elizabeth Warren and Pete Buttigieg were also serious contenders.Footnote 27 In Fig. 1, we tracked mentions of each of the Democratic presidential candidates’ names and Twitter handles who were still campaigning in March 2020, and found the 7-day daily rolling average percentage of all collected tweets that mentioned each candidate. This particular time series ends on May 8, 2020, which is one month after Sanders conceded to Biden, and Biden became the presumptive Democratic presidential candidate.
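The smoothing described above can be sketched as follows. This is a minimal illustration, assuming a trailing window and an average taken over the daily percentages; the function names are hypothetical, and the handling of the first few days (averaging over however many days are available) is a choice made for this example.

```python
from collections import deque
from typing import List, Sequence


def daily_mention_percentages(mentions: Sequence[int], totals: Sequence[int]) -> List[float]:
    """Percentage of each day's collected tweets that mention a candidate."""
    # Days with no collected tweets are reported as 0.0 to avoid division by zero
    return [100.0 * m / t if t else 0.0 for m, t in zip(mentions, totals)]


def rolling_mean(values: Sequence[float], window: int = 7) -> List[float]:
    """Trailing rolling mean; early days average over the days seen so far."""
    recent = deque(maxlen=window)
    smoothed = []
    for value in values:
        recent.append(value)
        smoothed.append(sum(recent) / len(recent))
    return smoothed
```

Applying `rolling_mean` to the output of `daily_mention_percentages` for each candidate yields one smoothed series per candidate, which can then be plotted against the date.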

Throughout the Democratic primary timeline in Fig. 1, we can see that the attention that specific candidates attract on Twitter fluctuates greatly. We can clearly see that Sanders and Warren initially led most of the discourse on Twitter in January 2020, but that Sanders would eventually dominate Twitter chatter throughout most of the primaries. This dominance continued until February 25, 2020, when James Clyburn, a prominent South Carolina African American Representative, endorsed Biden. From there, we see a sharp increase in Biden mentions, and Biden quickly overtook Sanders not only in polls, but also in Twitter discourse.Footnote 28 Biden continued to hold a majority in Twitter mentions throughout the rest of the primaries, through Sanders’ concession on April 8, 2020. All other candidates saw a brief increase in mention percentage when they announced the suspension of their presidential campaigns, followed by a general decrease.

While most of the mention percentages generally followed the popularity of certain candidates, in particular Biden, Sanders, Warren and Buttigieg, we find an increase in mentions surrounding Michael Bloomberg during the 9th Democratic debate.Footnote 29 The 9th Democratic debate was the first debate that Bloomberg was able to qualify for, but his performance was widely criticized.Footnote 30 He also attracted social media attention after having heavily funded his campaign’s ads with his personal money.Footnote 31

Table 6 Keywords for each Democratic candidate that had not suspended their campaign by March 2020, and for Republican candidate Trump. We used these keywords to identify whether or not a candidate was mentioned in a tweet. We note that one tweet can be counted towards multiple candidates, if multiple candidates are mentioned in a tweet
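The tagging rule described in this caption can be sketched as below. The keyword lists are illustrative placeholders rather than the contents of Table 6, and plain case-insensitive substring matching is a simplification assumed for this example.

```python
from typing import Dict, List, Set

# Placeholder keyword lists for illustration only; see Table 6 for the real ones
CANDIDATE_KEYWORDS: Dict[str, List[str]] = {
    "biden": ["biden", "joebiden"],
    "trump": ["trump", "realdonaldtrump"],
}


def candidates_mentioned(
    text: str, keywords: Dict[str, List[str]] = CANDIDATE_KEYWORDS
) -> Set[str]:
    """Return every candidate whose keywords appear in the tweet text."""
    lowered = text.lower()
    # A tweet can count towards several candidates, so a set is returned
    return {
        candidate
        for candidate, terms in keywords.items()
        if any(term in lowered for term in terms)
    }
```

Because the function returns a set, a tweet that names both candidates contributes to both daily counts, which matches the note in the caption.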

Chatter during the Presidential elections: Biden versus Trump

Fig. 2

The above figure shows a time series analysis of tweets that mention keywords related to either Trump or Biden from January 2020 through December 2020. We measure the percentage of total tweets collected on a particular day that mention the candidate on a rolling 7-day average. The keywords we use for each candidate can be found in Table 6 and descriptions of the noted dates in the table below the time series. We also include the raw volume of all tweets collected on a particular day on a rolling 7-day average above the time series

We now turn to the final race in the 2020 U.S. Presidential election between Biden and Trump. As shown in Fig. 2, the percentage of all tweets that mention Trump is significantly greater than the percentage of tweets that mention Biden (see Table 6 for keywords associated with each candidate). This gap in mentions is not unexpected, as Trump was the incumbent President and thus already had a significant presence on Twitter. While our current analysis is based on percentage of mentions in the tweets collected, our prior work in clustering users by political affiliation based on shared media found that conservative users have a more vocal presence on the political Twitter scene [7]. Despite Trump’s general dominance in the chatter, we see that as major events occur, such as when Democratic primaries began to be called for Biden and during the Presidential debates, Biden began to see an increase in mentions. While a tweet may be counted as mentioning both Trump and Biden, we still see a corresponding decrease in percentage of Trump’s mentions when Biden’s mentions increase. This suggests that the discourse shifted away from Trump and towards Biden, particularly as election day neared, culminating in a similar percentage of tweets mentioning either Biden and/or Trump.

It appears that the tweets we collected in our dataset track real-world events well. However, the sheer percentage of our collected tweets that mention a particular candidate does not necessarily represent the sentiment and popularity of those candidates at the time. As Twitter has evolved as a platform, its user base has also changed [11]. This disparity between Twitter attention and real-world popularity was highlighted during the Democratic primaries. Sanders held the majority of percentage of tweet mentions from early January through the end of February. It was not until the initial primary results began to be tallied and reported that it became clear that Biden had actually won the Democrats’ vote.Footnote 32 Sanders’ dominance in Twitter discourse underscored how Biden’s eventual momentum took much of the Democratic party by surprise.Footnote 33 This can give us insight into how news and public discourse on social media platforms can misrepresent or give a false impression of the nation’s sentiment.

Twitter Location Engagement

Every tweet we collect is returned with metadata describing the tweet itself, including Twitter’s automatic language tag and post date. Each tweet also includes information about the author, and if the tweet was a response (reply, retweet or quote) to another tweet, the tweet’s metadata also contains information on the original poster. This metadata can sometimes include a user’s location data; however, we found that less than 1% of our tweets actually contained this information [9]. Because of this, we leverage the included “location” field that a user manually populates as a part of their profile. We tag each tweet with its country of origin and, if the tweet originates from the United States, the detected state [9]. While some users may list locations that are not accurate, do not exist or are unable to be identified through our algorithm, we leverage this as a proxy for tweet location.
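To make the idea of tagging from the free-text “location” field concrete, the sketch below shows one very simple heuristic. It is not the algorithm from [9]; the lookup table is deliberately partial, and the function name and matching rules are assumptions made for this example.

```python
import re
from typing import Optional

# Partial lookup table for illustration; a real one would cover every state
STATE_ABBREVIATIONS = {
    "CA": "California",
    "TX": "Texas",
    "FL": "Florida",
    "NY": "New York",
    "DC": "District of Columbia",
}
STATE_NAMES = {name.lower(): name for name in STATE_ABBREVIATIONS.values()}


def detect_state(location: Optional[str]) -> Optional[str]:
    """Guess a US state from a user-entered profile location, or return None."""
    if not location:
        return None
    cleaned = location.strip()
    # Pattern such as "Los Angeles, CA": two-letter abbreviation after a comma
    match = re.search(r",\s*([A-Za-z]{2})\.?$", cleaned)
    if match:
        state = STATE_ABBREVIATIONS.get(match.group(1).upper())
        if state:
            return state
    # Otherwise look for a full state name anywhere in the string
    lowered = cleaned.lower()
    for name_lower, name in STATE_NAMES.items():
        if re.search(r"\b" + re.escape(name_lower) + r"\b", lowered):
            return name
    return None
```

Locations that are fictional, ambiguous or outside the lookup table return `None`, which corresponds to the tweets that cannot be assigned a state.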

We examine the domestic geographical flow of information within the United States. In isolating only retweets and quoted tweets (retweets with a comment), we find tweets that directly represent one user re-posting the tweet of another. Retweets and quoted tweets also return the user-specified location data for both the user who retweeted or quoted the tweet and the original poster. The user who retweeted or quoted the tweet will be referred to as the retweeter for clarity. Then, we retain all tweets within our dataset where we are able to identify a state for both the retweeter and the original poster, which directly implies that both the retweeter and original poster are also located in the United States. Figure 3 illustrates the flow of the top 200 most frequent state-to-state engagements, with the flow following retweets and quoted tweets from the original poster’s state to the retweeter’s state.
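The filtering and counting steps in this paragraph can be sketched as follows. The input representation, a pair of (original poster state, retweeter state) with `None` where no state was detected, and the function name are assumptions made for this example.

```python
from collections import Counter
from typing import Iterable, List, Optional, Tuple

# (original poster's state, retweeter's state); None means no state was detected
Engagement = Tuple[Optional[str], Optional[str]]


def top_state_flows(
    engagements: Iterable[Engagement], n: int = 200
) -> List[Tuple[Tuple[str, str], int]]:
    """Count state-to-state engagements and return the n most frequent flows."""
    flows = Counter(
        (source, target)
        for source, target in engagements
        # Keep only engagements where both states were identified
        if source is not None and target is not None
    )
    return flows.most_common(n)
```

Each returned entry pairs a directed flow, from the original poster’s state to the retweeter’s state, with the number of retweets and quoted tweets along it.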

The states from which the most tweets originate generally coincide with the most populous states in the United States. The US Census Bureau lists California, Texas, Florida and New York as the most populous states in their 2019 estimate.Footnote 34 However, most tweets actually originate from the District of Columbia area, which is both the political center and the capital of the United States. This is consistent with the nature of the political landscape, as many politicians are located in the D.C. area. In general, Fig. 3 suggests that while there exists a substantial amount of intra-state tweet engagement, states with larger populations account for larger proportions of the measured intra-state engagement activity.

Fig. 3

We remove all tweets without an identifiable state, and visualize the intra-state tweet engagement activity within the United States in our dataset. For each retweet or quoted tweet, we visualize the geographic flow from the original poster’s state to the retweeter’s state. The line color corresponds to the location of the original poster

Discussion

Limitations

While this dataset gives us a glimpse of the political chatter on Twitter, there are still limitations to this dataset that warrant discussion. Due to the nature of the keywords we were tracking, the tweets in our dataset are highly skewed towards English and tweets that originate from the United States. Another limitation of the dataset is that the users on Twitter do not necessarily represent the collective sentiment of the United States. The audience that uses Twitter, according to a 2019 study conducted by Pew Research Center, skews younger and more Democratic than the general population; the most vocal on Twitter also tend to engage in political discourse.Footnote 35

Twitter also significantly rate limits the number of tweets that one can rehydrate, and tweets that have either been removed by the user or removed because a user was banned or suspended can no longer be retrieved through Twitter’s API. Our collection was also highly contingent upon the stability of our network and hardware, which means that there may be gaps in our data collection, particularly prior to our migration to AWS. Twitter has recently released an Academic Research track that enables researchers and academics to access the full-archival search; however, this still imposes rate limits that unfortunately make filling these gaps in time hard.Footnote 36
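A back-of-the-envelope calculation gives a sense of what these rate limits mean in practice. The limits used below (100 IDs per request and 900 requests per 15-minute window) are assumptions for illustration and should be checked against Twitter’s current documentation; the function name is hypothetical.

```python
def hydration_days(
    num_tweets: int,
    ids_per_request: int = 100,      # assumed IDs accepted per lookup request
    requests_per_window: int = 900,  # assumed requests allowed per window
    window_minutes: int = 15,        # assumed length of a rate-limit window
) -> float:
    """Estimate the days needed to rehydrate num_tweets with a single credential."""
    tweets_per_window = ids_per_request * requests_per_window
    # Round up: a partially used window still has to be waited out
    windows = (num_tweets + tweets_per_window - 1) // tweets_per_window
    return windows * window_minutes / (60 * 24)


# Release v1.12 contains 1,258,209,617 Tweet IDs
print(round(hydration_days(1_258_209_617), 1))
```

Under these assumed limits, rehydrating all of release v1.12 with one credential would take roughly 146 days of uninterrupted requests, before accounting for deleted tweets or network failures.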

Potential research avenues

There are many potential areas that can be explored using our dataset.

Recent work using our dataset has already begun to explore the prevalence of bots and misinformation within the 2020 political landscape [6, 7]. Luceri et al. also scrutinized bot engagement in political discourse in 2018 and found that many of these bots remained active during the 2020 election cycle [12]. Our previous work has found that out of all major conspiracy theories that had taken root during the election, QAnon supporters were the most vocal and active. We also found that, when grouping users by their political affiliation, tweets from accounts most likely to be bots outnumber tweets from accounts that are most likely human for both the Republican and Democratic parties. Conservative accounts that are the most likely to be bots also have higher bot scores, suggesting that these accounts are more likely to be automated compared to their left-leaning counterparts [7]. We used Indiana University’s Botometer, a tool that assigns a bot-score to a Twitter account based on an account’s activity [14, 15]. Others have also leveraged the polarized nature of the 2020 elections to model and estimate echo chambers based on a user’s political stance [13].

While this is just a sampling of current literature, there are many areas that are also being explored, including the presence, effect and detection of trolls [8] and foreign influence during the elections [7]. Many new and promising questions are also emerging in the wake of the elections, particularly as the COVID-19 pandemic has forced individuals to physically distance and, consequently, seek community online.

After aggressive action to mitigate misinformation and the incitement of violence on major social network platforms, many flocked to alternative social network platforms that have espoused their support for freedom of speech, such as Parler and Gab.Footnote 37 While there has been much prior work in leveraging these alternative right-wing platforms to understand fringe views in conjunction with more mainstream platforms [16,17,18], the recent high-profile suspensions of major political figures’ accounts led to increased public awareness of, and an exodus to, these platforms. Before Parler went offline, researchers even scraped post data.Footnote 38 Data collected across multiple platforms have the potential to give insight into how fringe communities not only survive these rebuffs by the community but also thrive in the controversy.

Another interesting question that arises is how the pandemic and the resulting shift to online platforms changed the nature and effectiveness of political campaigns. As some politicians quickly cancelled in-person events as the severity of COVID-19 rose, others chose to continue in-person rallies [1].Footnote 39, Footnote 40 Social media became an integral part of the campaign process, more so than before, as events such as the Democratic National Convention were held virtually.Footnote 41 Cross-platform studies will be essential in beginning to understand the full scope of how and to what extent COVID-19 has fundamentally altered our elections system.

Conclusion

The 2020 US Presidential election cycle has been marred by both the COVID-19 pandemic and controversy. In this paper, we presented a Twitter dataset that we have collected from May 5, 2019 through the months after the transition to the Biden administration. Twitter is by no means the only platform that campaigns leveraged to reach their base or where the public discussed their opinions. However, there has already been evidence that misinformation still persists on Twitter and other platforms, even as social media companies are making efforts to address this problem [5,6,7]. Having access to this curated dataset will allow researchers to delve into how a contentious election unfolded and its surrounding chatter, as traditionally offline events transitioned online.

Inquiries

If you have technical questions about the data collection, please contact Emily Chen at echen920@usc.edu.

If you have any further questions about this dataset, please contact Dr. Emilio Ferrara at emiliofe@usc.edu.