Using COVID-19 Vaccine Attitudes on Twitter to Improve Vaccine Uptake Forecast Models in the United States: Infodemiology Study of Tweets

Background: Since the onset of the COVID-19 pandemic, there has been a global effort to develop vaccines that protect against COVID-19. Individuals who are fully vaccinated are far less likely to contract and therefore transmit the virus to others. Researchers have found that the internet and social media both play a role in shaping personal choices about vaccinations. Objective: This study aims to determine whether supplementing COVID-19 vaccine uptake forecast models with the attitudes found in tweets improves over baseline models that only use historical vaccination data. Methods: Daily COVID-19 vaccination data at the county level were collected for the January 2021 to May 2021 study period. Twitter’s streaming application programming interface was used to collect COVID-19 vaccine tweets during this same period. Several autoregressive integrated moving average models were executed to predict the vaccine uptake rate using only historical data (baseline autoregressive integrated moving average) and individual Twitter-derived features (autoregressive integrated moving average exogenous variable model). Results: In this study, we found that supplementing baseline forecast models with both historical vaccination data and COVID-19 vaccine attitudes found in tweets reduced root mean square error by as much as 83%. Conclusions: Developing a predictive tool for vaccination uptake in the United States will empower public health researchers and decision makers to design targeted vaccination campaigns in hopes of achieving the vaccination threshold required for the United States to reach widespread population protection.


Background
Since the onset of the COVID-19 pandemic, there has been a global effort to develop vaccines that protect against COVID-19. Individuals who are fully vaccinated are far less likely to contract and therefore transmit the virus to others [1]. Up until recently, public health experts have stressed the importance of achieving a numerical threshold of herd immunity, but this is only possible if a significant proportion of the population is fully vaccinated. More recent research suggests that the traditional concept of herd immunity may not apply to COVID-19 [2]. Instead, the goal is to increase vaccination uptake to optimize population protection without prohibitive restrictions on our daily lives [3]. Accurately forecasting vaccination uptake allows policymakers and researchers to evaluate how close we are to achieving normalcy again.
Researchers have turned to traditional methods for forecasting COVID-19 infection and vaccination rates [4][5][6]. For example, one of the most common forecasting methods used, univariate time series, involves predicting future vaccination rates using historical vaccination rates. While this method can be useful in many cases, it fails to account for other time-dependent factors that may also influence vaccinations. For example, the COVID-19 vaccine conversation on social media has been deemed an infodemic, with antivaccination misinformation spreading across social media platforms [7]. Researchers have found that the internet and social media both play a role in shaping personal or parental choices about vaccinations [8,9]. Additionally, previous research showed a positive relationship between positive sentiment scores in COVID-19 vaccine-related tweets and an increase in vaccination rates [10]. These findings suggest it is important to consider the daily conversations on social media when developing vaccine uptake forecast models.

Forecasting COVID-19-Related Measures Using Social Media
There is no shortage of studies that sought to forecast COVID-19-related measures using information from social media. Researchers Yousefinaghani et al [11] conducted a study using COVID-19-related terms mentioned in tweets and Google searches to predict COVID-19 waves in the United States. Researchers found that tweets that mentioned COVID-19 symptoms predicted 100% of first waves of COVID-19 days sooner than other data sources. Another study used data from Google searches, tweets, and Wikipedia page views to predict COVID-19 cases and deaths in the United States [12]. Researchers found models that included features from all 3 sources performed better than baseline models that did not include these features. Researchers also found that Google searches were a leading indicator of the number of cases and deaths across the United States. Another study [13] examined the relationship between daily COVID-19 cases and COVID-19-related tweets and Google Trends. In a study conducted by Shen et al [14], researchers used reports of symptoms and diagnoses on Weibo, a popular social media platform in China, in order to predict COVID-19 case counts in mainland China. Researchers found reports of symptoms and diagnoses on the social media platform to be highly predictive of daily case counts. Although each of these studies forecast COVID-19 cases and deaths, none of these studies forecast COVID-19 vaccination rates.

Forecasting Vaccinations
Very few studies have conducted time series forecasting of the COVID-19 vaccinated population in the United States. In a study conducted by Sattar and Arifuzzaman [15], researchers developed a time series model to predict the percentage of the US population that would get at least 1 dose of the COVID-19 vaccine or be fully vaccinated. Researchers projected that by the end of July 2021, 62.44% and 48% of the US population would get at least 1 dose of the COVID-19 vaccine or be fully vaccinated, respectively. Although this paper also included a separate tweet sentiment analysis, researchers did not include Twitter-related features in the forecast model. Additionally, researchers used aggregated vaccination data for the entire United States, rather than a more granular geographic level.
Another study aimed to evaluate if and when the world would reach a vaccination rate sufficient enough for herd immunity by forecasting the number of people fully vaccinated against COVID-19 in various countries, including the United States [16]. In this study, researchers used a common univariate time series forecasting method, autoregressive integrated moving average (ARIMA), to forecast the future number of fully vaccinated people using only historical vaccination data. Based on the resulting projections, researchers concluded that countries were nowhere near the necessary herd immunity threshold needed to end the COVID-19 pandemic.
A study conducted by Cheong et al [17] sought to predict COVID-19 vaccine uptake using various sociodemographic factors. Although not a time series forecasting model, the results of this study showed that geographic location, education level, and online access were highly predictive of vaccination uptake in the United States. The model predicted vaccine uptake with 62% accuracy.
Although there are very few studies related to COVID-19 vaccination forecasting, other studies have been conducted to predict immunizations for other illnesses. For example, 1 study analyzed electronic medical records of a cohort of 250,000 individuals over the course of 10 years [18]. Researchers developed a model to predict vaccination uptake of individuals in the upcoming influenza season based on previous personal and social behavioral patterns. Another study developed a tool for leveraging immunization-related content from Twitter and Google Trends to develop a model for predicting whether a child would receive immunizations [19]. Researchers were able to predict child immunization statuses with 76% accuracy.

Study Objectives
Although previous studies have developed forecast models for COVID-19 vaccination rates in the United States, to our knowledge, there are no studies that aim to factor in the real-time vaccination attitudes present on Twitter. The vaccine attitudes on Twitter change daily, as do vaccination rates, so analyzing vaccine attitudes on social media might contribute to the performance of vaccine forecast models. Additionally, previous studies developed forecast models that focused on the United States as a whole. These forecast models fail to appreciate the differences in vaccination rollout, behaviors, and attitudes across different geographic regions. This study seeks to fill this gap by examining vaccine uptake at the metropolitan level.
The purpose of this study is to develop a time series forecasting algorithm that can predict future vaccination rates across US metropolitan areas. Specifically, this study aims to determine whether supplementing forecast models with real-time vaccine attitudes found in tweets (measured via sentiments and emotions) improves over baseline models that only use historical vaccination data. Developing a predictive tool for vaccination uptake in the United States will empower public health researchers and decision makers to design targeted vaccination campaigns in hopes of achieving the vaccination threshold required for us to reach herd immunity.

Twitter Data
The Twitter streaming application programming interface, which provides access to a random sample of 1% of publicly available tweets, was used to collect tweets from 8 of the most populated metropolitan areas in the United States from January 2021 to May 2021 (Textbox 1) [20]. We chose to focus on large metropolitan areas to gather a sufficient number of tweets for the analysis. Additionally, larger metropolitan areas also tend to have users who enable the location feature when tweeting [21,22]. All tweets had "place" information (usually city and state). The place information found in tweets was used to determine the metropolitan area associated with each tweet. Next, to extract tweets related to COVID-19 vaccines, tweets were further filtered by matching variations of vaccine-related keywords, such as vaccine, pfizer, moderna, johnson & johnson, and dose. Additional vaccine keywords can be found in Multimedia Appendix 1. A language filter was then applied to identify tweets written in the English language. The tweets sample was further preprocessed to minimize "noise" resulting from tweets that matched our vaccine-related keywords but did not necessarily reflect the thoughts and opinions of individual Twitter users. For example, companies often promote job postings and advertisements on Twitter using targeted hashtags in hopes of reaching their target audience. To prevent these tweets from adding noise to the sample, tweets related to job postings and advertisements were removed by excluding tweets with hashtags and keywords, including "jobs," "hiring," "advertisement," "apply," and "ad."
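The filtering steps described above can be illustrated with a minimal sketch. It is not the study's actual pipeline: the keyword lists below contain only the examples named in the text (the full list is in Multimedia Appendix 1), and the function name `keep_tweet`, the word-boundary matching, and the `lang` argument are assumptions made for illustration.

```python
import re

# Illustrative keyword lists only; the study's full list is in its Multimedia Appendix 1.
VACCINE_TERMS = ["vaccine", "pfizer", "moderna", "johnson & johnson", "dose"]
EXCLUDE_TERMS = ["jobs", "hiring", "advertisement", "apply", "ad"]


def _word_pattern(terms):
    # \b keeps short terms such as "ad" from matching inside "dad" or "made";
    # the optional trailing "s" catches simple plurals ("vaccines", "doses").
    alternatives = "|".join(re.escape(t) for t in terms)
    return re.compile(rf"\b(?:{alternatives})s?\b", re.IGNORECASE)


VACCINE_RE = _word_pattern(VACCINE_TERMS)
EXCLUDE_RE = _word_pattern(EXCLUDE_TERMS)


def keep_tweet(text, lang):
    """Return True for English vaccine tweets that do not look like job posts or ads."""
    if lang != "en":
        return False
    if not VACCINE_RE.search(text):
        return False
    return EXCLUDE_RE.search(text) is None
```

A simple exclusion list of this kind trades recall for precision: a genuine personal tweet that happens to contain a word such as "apply" would also be dropped.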

COVID-19 Vaccination Data
Daily COVID-19 vaccination data at the county level was collected for the January 2021 to May 2021 study period from the Centers for Disease Control and Prevention's publicly available vaccination data set [23]. This data set includes daily vaccination data from clinics, pharmacies, long-term care facilities, dialysis centers, Federal Emergency Management Agency and Health Resources and Services Administration partner sites, and federal entity facilities. Vaccination administration data are reported to the Centers for Disease Control and Prevention via immunization information systems, the vaccine administration management system, and data submissions directly to the COVID-19 Data Clearinghouse [23]. Each county was linked to its respective metropolitan area according to the US Census delineation file [24]. Next, the data were aggregated to the daily-metropolitan level and the 7-day rolling average of the percentage of individuals who have been administered at least 1 vaccine dose was calculated.
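As a rough sketch of the aggregation and smoothing step above, the code below sums county-level counts to a metropolitan percentage and then takes a 7-day rolling mean. The input layout, the function names, and the choice of a trailing (rather than centered) window are assumptions; the text does not specify them.

```python
from collections import deque


def metro_percent_first_dose(county_counts, county_pops):
    """Aggregate county first-dose counts to one metro-level percentage per day.

    county_counts: {county: [people with at least 1 dose, one value per day]}
    county_pops:   {county: population}
    (Both layouts are hypothetical, chosen for illustration.)
    """
    total_pop = sum(county_pops.values())
    n_days = len(next(iter(county_counts.values())))
    return [
        100.0 * sum(series[d] for series in county_counts.values()) / total_pop
        for d in range(n_days)
    ]


def rolling_mean(values, window=7):
    """Trailing rolling mean; positions without a full window yield None."""
    buf = deque(maxlen=window)
    out = []
    for v in values:
        buf.append(v)
        out.append(sum(buf) / window if len(buf) == window else None)
    return out
```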

Sentiment and Emotion Analysis of Tweets
For the purposes of this study, we measure COVID-19 vaccine attitudes via sentiment and emotion analyses of tweets. We evaluated both sentiments and emotions because both methods offer different levels of granularity. Sentiment analysis focuses on determining the overall sentiment or polarity of a text, such as positive, negative, or neutral. It provides a high-level understanding of the sentiment expressed. Emotion analysis, on the other hand, aims to identify specific emotions within the text, such as joy, anger, and sadness. It offers a more detailed and nuanced understanding of the emotional states. By utilizing both sentiment and emotion analysis, we gain a comprehensive understanding of the text, covering both the overall sentiment and the specific emotions expressed.
To capture the sentiments and emotions found in COVID-19 vaccine-related tweets, a sentiment and emotion analysis of all tweets was conducted using bidirectional encoder representation from transformer (BERT) [25], a pretrained language model trained using bidirectional (left to right and right to left) context training to learn joint probability distributions of text. We leveraged the fine-tuned BERT models in the TweetNLP package in Python (Python Software Foundation) [26] to calculate the valence of 8 different emotions (fear, joy, anticipation, anger, disgust, sadness, surprise, trust), along with overall neutral, positive, and negative sentiment of tweets in our analysis sample. The sentiment analysis and emotion recognition BERT models were fine-tuned with the TweetEval benchmark [27].
The outputs from BERT are softmax-normalized logits, one corresponding to each of the emotions or sentiments. For each tweet, we took the argmax over this probability distribution to get the most likely emotion and sentiment.
Next, we found the percentage of tweets classified as each of the emotions and sentiments for each day and metro area combination. For example, the count of anger tweets on January 1 for the New York-Newark-Jersey City, NY-NJ-PA metropolitan area divided by the total number of tweets on January 1 for the New York-Newark-Jersey City, NY-NJ-PA metropolitan area gives the percentage of anger tweets for January 1 in the New York-Newark-Jersey City, NY-NJ-PA metropolitan area.
The total number of COVID-19 vaccine-related tweets and users per 100,000 population was also calculated for each day of data collection, at the metropolitan level. Finally, user engagement metrics, including the average number of retweets and favorites, were calculated for each day of data collection, at the metropolitan level. Retweets and favorites suggest, after processing the information, that a user resonates with an idea expressed in a tweet [28,29]. Therefore, we believe these engagement metrics might also reflect vaccine attitudes.
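The softmax and argmax step can be sketched in a few lines. This is a generic illustration rather than the TweetNLP implementation; the logits and label names in the usage note are made up.

```python
import math


def softmax(logits):
    """Convert raw logits into a probability distribution."""
    # Subtracting the max improves numerical stability and leaves the result unchanged.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_label(logits, labels):
    """Return the most likely label and its probability (argmax over softmax)."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]
```

For example, `top_label([0.1, 2.0, -1.0], ["negative", "neutral", "positive"])` selects "neutral", because the second logit is the largest.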
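The daily percentage calculation can be sketched as a grouped count. The tuple layout `(date, metro, label)` and the function name are hypothetical, chosen only to illustrate the division described above.

```python
from collections import Counter, defaultdict


def daily_label_share(rows):
    """Percentage of tweets per label for each (date, metro) combination.

    rows: iterable of (date, metro, label) tuples, one per classified tweet.
    Returns {(date, metro): {label: percentage of that day's tweets}}.
    """
    counts = defaultdict(Counter)
    for date, metro, label in rows:
        counts[(date, metro)][label] += 1
    return {
        key: {label: 100.0 * n / sum(c.values()) for label, n in c.items()}
        for key, c in counts.items()
    }
```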

Time Series Model
The data were divided into training and test data sets: the time series models were trained on data from January 1, 2021, to April 12, 2021, and tested on data from April 13, 2021, to May 20, 2021. ARIMA models were executed to forecast the proportion of individuals who have been administered at least 1 vaccine dose. ARIMA with exogenous variables (ARIMAX) models, which extend ARIMA models with independent predictors called exogenous variables, were also executed. The ARIMA method has been widely used in time series forecasting and public health surveillance [30][31][32]. An ARIMA model typically consists of three components: (1) autoregression, notated in the model as p; (2) differencing, notated as d; and (3) moving average, notated as q [33]. In an autoregressive model, the present value of the time series is a linear function of its previous values and random noise; in a moving average model, the present value is a linear function of the present and past values of the residuals; and the autoregressive moving average model combines the two, so that the present value depends on both the historical values of the time series and its residuals [30].
Stationarity of a time series is a key assumption when making predictions based on past observations of a variable [34]. Stationarity requires the properties (mean and variance) of a time series to remain constant over time, thus making future values easier to predict [35]. Otherwise, the results are spurious and analyses are not valid [30]. The stationarity of all variables included in the time series was assessed using the Dickey-Fuller (dfuller) test. If the null hypothesis is rejected, stationarity is satisfied. If stationarity is not satisfied, variables must undergo differencing, a process that removes any trend in the time series that is not of interest [35]. All differencing and model selection were performed by the auto_arima function from the pmdarima package in Python [36], which selects the optimal order of the model based on the Hyndman-Khandakar algorithm for automatic ARIMA modeling [37]. A combination of unit root tests and minimization of the Akaike information criterion and Bayesian information criterion allows this algorithm to select the best-performing model order by fitting several variations of model components p, d, and q [38]. By including a penalty that is an increasing function of the number of estimated parameters, the information criteria scores maximize the goodness of fit while minimizing the number of model parameters, effectively dealing with both the risk of overfitting and the risk of underfitting [39,40].
For each metropolitan area, a baseline ARIMA model with no exogenous variables was constructed to forecast the 7-day rolling average of the number of individuals who have been administered at least 1 vaccine dose, using only past values of this outcome. To assess the ability of vaccine attitudes on Twitter to improve COVID-19 vaccination forecasts, multiple ARIMAX models were executed, each with individual Twitter-derived features included as exogenous variables. Additionally, we executed a multivariate ARIMAX model that included those Twitter attitudes that showed improvement over the ARIMA baseline across all metro areas. A final ARIMAX model that contained all Twitter features regardless of performance was attempted but did not converge. A complete list of the constructed time series models can be found in Table 1.
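For readers who prefer notation, one common way to write these models is given below. The symbols are assumptions introduced here, not taken from the study: y_t is the vaccination outcome, x_t the vector of exogenous Twitter-derived features, B the backshift operator, and ε_t white noise. The form shown is regression with ARIMA errors, which, to our understanding, is how the statsmodels SARIMAX estimator underlying pmdarima treats exogenous variables; other ARIMAX formulations exist, and the text does not state which one was intended.

```latex
% Regression with ARIMA(p, d, q) errors; B is the backshift operator, B y_t = y_{t-1}.
\[
  y_t = \beta^{\top} x_t + u_t, \qquad
  \phi(B)\,(1 - B)^{d}\, u_t = \theta(B)\, \varepsilon_t ,
\]
\[
  \phi(B) = 1 - \phi_1 B - \dots - \phi_p B^{p}, \qquad
  \theta(B) = 1 + \theta_1 B + \dots + \theta_q B^{q} .
\]
```

Setting β = 0 removes the exogenous term and recovers the baseline ARIMA(p, d, q) model on y_t alone.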
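The two ingredients described above, differencing and penalized information criteria, can be sketched directly. This is a hand-written illustration of the standard definitions, not the pmdarima implementation; the function names are hypothetical.

```python
import math


def difference(series, d=1):
    """Apply first differencing d times: y'_t = y_t - y_{t-1}."""
    out = list(series)
    for _ in range(d):
        out = [b - a for a, b in zip(out, out[1:])]
    return out


def aic(log_likelihood, k):
    """Akaike information criterion for a model with k estimated parameters."""
    return 2 * k - 2 * log_likelihood


def bic(log_likelihood, k, n):
    """Bayesian information criterion for k parameters and n observations."""
    return k * math.log(n) - 2 * log_likelihood
```

Lower scores are preferred for both criteria. Because the penalty terms grow with k, an extra parameter is only worthwhile if it raises the log-likelihood enough to offset the penalty.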

Twitter Data
A total of 59,687 COVID-19 vaccine-related tweets were collected during the data collection period, across 23,878 users (Table 2). The temporal trends for the number of COVID-19 vaccine-related tweets from January to May 2021 are presented in Figure 1. The number of COVID-19 vaccine-related tweets fluctuated over time; however, a peak in the number of tweets was observed during the week of April 5, 2021, to April 11, 2021. This was the week that President Joe Biden announced that every adult in the United States would be eligible to receive a COVID-19 vaccine starting April 19, 2021 [42].

Sentiment and Emotion Analysis
A sentiment analysis classified most tweets across all metropolitan areas as having neutral sentiment, with joy as the predominantly expressed emotion (Table 3).

Time Series Forecast
Multiple time series models were constructed to forecast the vaccine uptake rate (7-day rolling average). The results of the Dickey-Fuller (dfuller) test for stationarity revealed that across all metropolitan areas, stationarity did not hold for several of the variables (Tables 4 and 5). However, the necessary differencing was automatically applied via the auto_arima function.
The performance of the optimal models across all regions, as determined by the auto_arima function, can be found in Tables 6 and 7. The best-performing model for each metropolitan area is marked by an asterisk. Models that performed better than the baseline model are bolded. Model performance for the "out-sample" forecasts was evaluated using the root mean square error (RMSE) instead of Akaike information criterion because RMSE measures how close the data are around the line of best fit [43]. This measure is commonly used in time series forecasting to evaluate how close the forecasted values are to the actual values [44]. When evaluating model performance using RMSE, across all metropolitan areas, the addition of a Twitter-derived feature related to COVID-19 vaccination attitudes improved model performance by up to 83%. For example, across all metropolitan areas, adding the percentage of vaccine tweets expressing joy, negative sentiment, surprise, or trust individually as exogenous variables resulted in a lower RMSE compared to the baseline ARIMA model. Additionally, across all metropolitan areas, most of the ARIMAX models, which each had 1 Twitter-derived feature related to COVID-19 vaccination attitudes, showed improvement over the baseline ARIMA model that did not factor in Twitter-derived features.
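The evaluation metric and the percentage improvement quoted above follow from two short formulas, sketched here. The function names are hypothetical, and any numbers passed to them in examples are illustrative rather than values from Tables 6 and 7.

```python
import math


def rmse(actual, predicted):
    """Root mean square error between observed and forecasted values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def pct_reduction(baseline_rmse, model_rmse):
    """Percentage by which a model lowers RMSE relative to the baseline."""
    return 100.0 * (baseline_rmse - model_rmse) / baseline_rmse
```

Under this definition, a model whose RMSE is 0.17 against a baseline RMSE of 1.0 corresponds to a reduction of about 83%; a negative value would mean the model performed worse than the baseline.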

Effect of Models on Performance
To understand the effect of modeling choices on the usefulness of Twitter-derived features to improve COVID-19 vaccination rate predictions, we evaluated 2 additional models: one that used the Syuzhet package [45], instead of BERT, to extract the same set of sentiments and emotions from tweets and then ARIMA/ARIMAX to predict COVID-19 vaccination rates; and another that used BERT to extract sentiments and emotions from tweets and deep learning (a Temporal Fusion Transformer model [46]), instead of ARIMA/ARIMAX, to predict COVID-19 vaccination rates. The same findings hold regardless of the model selected: adding Twitter-based features to COVID-19 vaccination rates in predictive models improves most baselines, independently of the model and the city, albeit with higher RMSE than the values shown in Tables 6 and 7. We have included descriptions, results, and a discussion of these other 2 models in Multimedia Appendix 2.
Figure 2 illustrates the performance of the baseline ARIMA models and the best-performing ARIMAX models, compared to the observed values of the outcome variable during the "out-sample" forecasting period (April 13, 2021, to May 20, 2021). Across all metropolitan areas, the ARIMAX time series models with Twitter-derived features aligned more closely with the actual values of the vaccination rates compared to the baseline ARIMA model that relied on past historical vaccination data alone.

Principal Findings
In this study, we sought to determine whether supplementing forecast models with COVID-19 vaccine attitudes found in tweets (modeled via sentiments and emotions) improves over baseline models that only use historical vaccination data. When evaluating model performance across all metropolitan areas, the addition of COVID-19 vaccine attitudes found in tweets resulted in improved model performance, as reflected by RMSE, when compared to baseline forecast models that did not include these features. Specifically, compared with the traditional ARIMA model with vaccination data alone, ARIMAX models that used both historical vaccination data and COVID-19 vaccine attitudes found in tweets as predictors reduced RMSE by as much as 83%. We were able to replicate similar findings across various modeling choices, including the Syuzhet package to extract sentiments and emotions, instead of BERT, and deep learning (temporal fusion transformer model) to predict COVID-19 vaccination rates, instead of ARIMA/ARIMAX.

Study Findings in Context
The ongoing COVID-19 pandemic emphasizes the need for innovative approaches to public health surveillance. The global public health community has monitored the COVID-19 pandemic by tracking case counts, hospitalizations, deaths, and vaccinations. For the United States, these data sets are publicly available. Forecasting case counts and vaccination rates using existing historical data has been a key approach in COVID-19 surveillance efforts [47]. Previous forecast models for predicting vaccine uptake rate relied on traditional ARIMA methods, where historical data were used to predict future rates [48]. However, social media data sources, such as Twitter, reveal society's attitudes toward the pandemic and current vaccination efforts on a real-time basis. This provides an opportunity for a large volume of raw and uncensored data related to vaccine attitudes, across various geographic locations, to be leveraged for disease surveillance, which can subsequently be used to supplement and improve existing models.
The findings of this study suggest that attitudes extracted from Twitter data can be added to existing forecast models for monitoring vaccination uptake across various metropolitan areas. In certain metropolitan areas, the mere volume of tweets and users engaged in vaccine-related conversations improved model performance when compared to baseline models. These results echo the findings in the study by Maugeri et al [33], which revealed that another online data source, Google Trends data, improved the prediction of COVID-19 vaccination uptake in Italy when compared to baseline models. In that study, Google Trends data were represented as the relative search volume for each vaccine-related keyword. Another similar study developed a framework for predicting vaccination rates in the United States based on traditional clinical data and web search queries [49]. The results of that study also revealed the ability of online networks to predict societal willingness to receive vaccinations. Specifically, the authors likewise found an improvement in model performance, with a reduction in RMSE of 9.1%.
Although a few studies have sought to supplement current vaccine models with social media data, to our knowledge, there are no studies that go beyond the mere volume of relevant Twitter data and factor in the sentiment and emotion of vaccine-related conversations. Over the course of the pandemic, some states experienced low vaccination rates despite comprehensive vaccine rollout programs. In these cases, it is important to consider the public's emotions and sentiments toward vaccines. This study contributes to the literature by evaluating the ability of sentiments and emotions related to the COVID-19 vaccine to predict vaccine uptake. Specifically, the results show an improvement in model performance across metropolitan areas when models were supplemented with the percentage of tweets expressing anger, fear, joy, positive sentiment, or neutral sentiment. A study conducted by Alegado and Tumibay [48] examined the association between sentiments and emotions found in tweets and vaccine uptake via regression coefficient analysis. That study showed similar insights: tweets expressing fear, sadness, and anger appeared to be significantly associated with vaccination rates.
The results of this study have several implications for the present COVID-19 response. Public health experts now argue that the traditional concept of herd immunity may not apply to COVID-19 [2]. Instead, the focus is to increase vaccination uptake to substantially control community spread, without the societal disruptions caused by the virus [3]. Accurately forecasting vaccination uptake allows policy makers and researchers to evaluate how close we are to achieving normalcy again. Additionally, similar algorithms allow public health practitioners to better anticipate vaccine uptake behaviors and therefore develop targeted policies. As the global community builds toward achieving herd immunity, researchers should also "listen" to the vaccine conversation on social media, monitoring misconceptions and misinformation and implementing targeted vaccine education campaigns that address these misconceptions. Although the COVID-19 pandemic appears to be improving, the present framework can also be used to improve vaccine forecast models for future pandemics.

Limitations and Future Work
It is important to note that this study has some limitations. The study period was limited to the first half of 2021, and vaccines were not available to most of the US adult population until April 2021. Therefore, the study period did not capture the height of vaccination efforts. Another limitation is that as the COVID-19 pandemic evolves, vaccine-related keywords may change, requiring frequent updating of the model. Future work may involve the use of topic modeling to capture the general themes surrounding the COVID-19 pandemic.
Another limitation is related to the geographic scope of this study. This study only focused on forecasting vaccine uptake in the United States. However, it is important to note that vaccination efforts must be addressed on a global scale, not just domestically, for normalcy to be attained. Future work should consider collecting tweets and vaccination data from other countries to see if similar models improve vaccine forecasts globally. Additionally, this study only examined tweets posted in the English language, which may overlook valuable insights and perspectives expressed in other languages. This exclusion could lead to a biased understanding of sentiments and emotions, potentially missing out on crucial data from non-English-speaking populations. Language barriers may hinder the study's generalizability and restrict the representation of diverse cultural contexts. Future work should involve the use of sentiment and emotion classifiers that include lexicons in other languages.

Conclusions
Researchers have found that the internet and social media both play a role in shaping personal or parental choices about vaccinations. Although a few previous studies have developed forecast models for COVID-19 vaccination rates in the United States, to our knowledge, there are no studies that aim to factor in the real-time vaccination attitudes present on Twitter. This study suggests the benefits of using the linguistic constructs found in tweets to improve predictions of the COVID-19 vaccination rate. In this study, we found that supplementing baseline forecast models with both historical vaccination data and COVID-19 vaccine attitudes found in tweets reduced RMSE by as much as 83%. Developing a predictive tool for vaccination uptake in the United States will empower public health researchers and decision makers to design targeted vaccination campaigns in hopes of achieving the vaccination threshold required for widespread population protection.

Conflicts of Interest
None declared.