Social media mining under the COVID-19 context: Progress, challenges, and opportunities

Social media platforms allow users worldwide to create and share information, forging vast sensing networks that allow information on certain topics to be collected, stored, mined, and analyzed in a rapid manner. During the COVID-19 pandemic, extensive social media mining efforts have been undertaken to tackle COVID-19 challenges from various perspectives. This review summarizes the progress of social media data mining studies in the COVID-19 contexts and categorizes them into six major domains, including early warning and detection, human mobility monitoring, communication and information conveying, public attitudes and emotions, infodemic and misinformation, and hatred and violence. We further document essential features of publicly available COVID-19 related social media data archives that will benefit research communities in conducting replicable and reproducible studies. In addition, we discuss seven challenges in social media analytics associated with their potential impacts on derived COVID-19 findings, followed by our visions for the possible paths forward in regard to social media-based COVID-19 investigations. This review serves as a valuable reference that recaps social media mining efforts in COVID-19 related studies and provides future directions along which the information harnessed from social media can be used to address public health emergencies.


Introduction
The COVID-19 pandemic has posed a global crisis, causing serious social, economic, and health challenges. Due to social media's interactive nature and popularity amongst users throughout the pandemic, there is rising democratization of health communications, which poses a sharp contrast to decades ago, when communications were predominantly controlled by individuals and entities endowed with the power, money, public trust, or platforms required to drive the conversation (Schillinger et al., 2020). The emerging concepts of "Web 2.0" (Murugesan, 2007), "Big Data" (Yang et al., 2017), and "Citizen as Sensors" (Goodchild, 2007) have greatly promoted social media as the platforms and virtual communities where users worldwide can create and share information, forming vast sensing networks that allow information in certain topics to be collected, stored, mined, and analyzed in a rapid manner (Li et al., 2021a;Gong and Yang, 2020).
Since the early stage of the COVID-19 pandemic, governments, local authorities/agencies, and organizations have started to disseminate crucial information to the public via social media platforms. In addition, we have seen a massive influx of opinions, perceptions, and attitudes towards COVID-19 related events and/or public health policies from regular users on social media platforms. If appropriately utilized, the vast amount of minable information in the social media space would allow scholars to address various aspects of COVID-19 challenges. Extensive social media mining efforts have been made to tackle COVID-19 issues from various perspectives, including but not limited to case hotspot prediction, policy compliance monitoring (Huang et al., 2020), misinformation modeling, and sentiment analysis (Nemes and Kiss, 2021). Despite the existing studies, there is a lack of review work that cohesively summarizes the current findings in the COVID-19 context. Tsao et al. (2021) examined 81 peer-reviewed empirical studies relating to COVID-19 and social media published between November 2019 and November 2020. Their review predominantly targeted the early stage of the pandemic; therefore, it did not capture the major milestones in the middle and later stages of the pandemic after the mass vaccination. Other reviews related to social media and public health more broadly merely describe the general functionality and utility of social media in public health applications but lack a focus on social media analytics derived via data mining efforts (Giustini et al., 2018;Grajales III et al., 2014;Moorhead et al., 2013). Additionally, these reviews gave scarce attention to the challenges present in different domains of COVID-19 related studies (e.g., Asian hate, lockdown debate, and vaccination preferences), which are prevalently discussed on social media platforms.
To address these knowledge deficits, we review existing social media mining efforts related to the COVID-19 crisis, document publicly available data archives, summarize social media mining challenges as well as their potential impacts on derived findings, and envision the future directions of social media-based COVID-19 and public health investigations. Due to the strong interdisciplinarity and the fact that we specifically target data mining efforts, a systematic reviewing workflow using keywords and databases for queries fails to provide a satisfactory article pool without intensive post-selection trimming. Thus, we organize this review in a narrative manner. The narrative review has been widely used to obtain a broad perspective on topics of interest. Instead of systematically searching for all relevant literature, it specifically focuses on pivotal papers known to the subject expert. The articles reviewed in this effort are purposively selected by the authors with rich experience in social media mining and who have conducted interdisciplinary COVID-19 investigations using social media data.
In the following sections, we group the progress of social media data mining studies to address COVID-19 challenges into six major categories and summarize notable efforts in each category (Section 2). These six categories include 1) early warning and detection; 2) human mobility monitoring; 3) communication and information conveying; 4) public attitudes and emotions; 5) infodemic and misinformation; 6) hatred and violence. Note that the authors' expertise and research experience cover the six identified categories well. We further document essential features of publicly available COVID-19 social media data archives that will benefit research communities in conducting replicable and reproducible studies (Section 3). In addition, we discuss seven challenges in social media analytics associated with their potential impacts on derived COVID-19 findings, followed by our visions for the possible paths forward in regard to social media-based COVID-19 investigations (Section 4). These challenges include 1) biased population spectrum; 2) multilingual investigations; 3) posting incentives; 4) positioning accuracy; 5) uncertainties in sentiment and emotion acquisition; 6) bots, retweets, and skewed posting behaviors; 7) data sharing. The structure of this review is presented in Fig. 1. We believe that this review can serve as a valuable reference that recaps COVID-19 related social media mining efforts and provides potential future directions for better employing the information harnessed from social media to address future public health emergencies.

Early warning and detection
Public health surveillance is critical for monitoring the spread of infectious diseases, rapidly detecting outbreaks, and proposing effective countermeasures. With the support of early warning signs, governments are able to better prepare for public health emergencies such as the COVID-19 pandemic. The initial hotspot of COVID-19 was reported in China before cases were reported in European countries and in the United States (U.S.), which became the new epicenter of the disease as its number of confirmed cases surpassed that of Italy on March 26, 2020. The rapid viral spread on a global scale required public health authorities in many countries to develop mitigation strategies within a short timespan. Social media has played a crucial role in supporting traditional surveillance systems for tracking the progress of the COVID-19 pandemic and informing the judgments and decisions of public health officials and experts (Samaras et al., 2020). The real-time information from a massive sensor network consisting of millions of social media users provides timely situational awareness that uncovers early warning signs of an upcoming hotspot of cases, greatly facilitating the estimation of disease prevalence in (near) real time.
For example, Kogan et al. (2021) found that digital data sources may provide an earlier indication of the epidemic spread than traditional COVID-19 metrics, such as confirmed cases or deaths. By proposing a metric that combines six digital sources, including COVID-19 related Twitter activity, into a multiproxy estimator, their study demonstrated the potential of situational awareness that is derived from digital sources in estimating the probability of an impending COVID-19 outbreak (Kogan et al., 2021). By analyzing a multilingual dataset of tweets (i.e., English, German, French, Italian, Spanish, Polish, and Dutch posts that contain the keyword "pneumonia"), Lopreite et al. (2021) uncovered early-warning signals of the COVID-19 outbreaks in Europe during the winter season 2019-2020, before receiving the first public announcements of local sources of infection. This evidence suggests that European countries saw unexpected levels of concerns regarding COVID-19 cases, and whistleblowing came primarily from the geographical regions that eventually turned out to be the new COVID-19 hotspot (Lopreite et al., 2021). Qin et al. (2020) predicted the number of newly suspected or confirmed COVID-19 cases by analyzing social media search indexes for symptoms, coronavirus, and pneumonia. Via a series of analytical approaches, e.g., lasso regression, ridge regression, and elastic net, their study proved the feasibility of social media search indexes in predicting new suspected COVID-19 cases 6-9 days in advance (Qin et al., 2020). Similarly, Li et al. (2020a) analyzed COVID-19 related internet searches (Google and Baidu) and social media data (i.e., Sina Weibo) and demonstrated that for trend data and the number of cases, the highest correlation between these two variables occurred 8-12 days before an increase in confirmed COVID-19 cases, and the highest correlation between trend data and the newly suspected cases occurred 6-8 days before the increase in newly suspected cases. 
The above studies, as well as other social media based early warning and detection efforts (Lu and Zhang, 2020;Mackey et al., 2020), highlight the necessity of establishing social media surveillance systems that facilitate the identification of disease communication.
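The lead-time findings above (for example, search and social media trends peaking in correlation several days before case counts) rest on a lagged correlation between a digital signal and an epidemiological series. The following is a minimal sketch of that idea, not the method of any cited study: the function names are hypothetical, the two series are synthetic values constructed for illustration (the case series is simply the signal delayed by 7 days), and both series are assumed to be daily and of equal length.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def lagged_correlations(signal, cases, max_lag):
    """Correlate signal[t] with cases[t + lag] for each lag in 0..max_lag (days)."""
    result = {}
    for lag in range(max_lag + 1):
        x = signal[: len(signal) - lag]  # signal values that still have a partner
        y = cases[lag:]                  # cases observed `lag` days later
        result[lag] = pearson(x, y)
    return result


def best_lead_time(signal, cases, max_lag):
    """Return the lag (in days) at which the correlation is highest."""
    corr = lagged_correlations(signal, cases, max_lag)
    return max(corr, key=corr.get)


# Synthetic illustration only: a made-up search-index curve and a case curve
# that repeats it 7 days later. These are NOT data from any study.
search_index = [1, 1, 2, 4, 8, 15, 24, 30, 27, 19, 12, 7, 4, 3, 2,
                2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
daily_cases = [0] * 7 + search_index[:-7]

print(best_lead_time(search_index, daily_cases, max_lag=14))
```

In practice, a high lagged correlation is only suggestive: real series are noisy and autocorrelated, so the cited studies combine such screening with regression models and out-of-sample validation.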

Human mobility monitoring
The COVID-19 pandemic highlights the importance of rapid human mobility monitoring. User-generated information from social media platforms (e.g., Twitter, Facebook, Sina Weibo, and Instagram), when coupled with geo-information (i.e., geographic coordinates and information on place names), allows human-human, human-place, and place-place interactions to be monitored in an active and less privacy-concerning manner (Huang et al., 2020;Li et al., 2021a), thus serving as an important venue where timely human mobility dynamics can be collected and analyzed to assist with decision making. Despite the existence of many social media platforms, only a small proportion of them permit information mining or openly share aggregated mobility records with researchers and the public, while for some social media platforms (e.g., Facebook and Sina Weibo), certain agreements have to be met to access the records. Below, we review notable efforts that address COVID-19 challenges by monitoring human mobility dynamics via geotagged social media data.
With several categories of publicly available application programming interfaces (APIs), Twitter has become the most popular social media platform that allows geographic data mining. These APIs return certain percentages of their total content, with some of them containing geo-information at various levels. Studies have found that the Twitter-derived mobility patterns can approximate commuting patterns (Petutschnig et al., 2021) as well as mobility records released by Apple, Google, and Descartes Labs. Using 580 million geotagged tweets collected worldwide, Huang et al. (2020) measured human mobility by proposing the concepts of single-day distance and cross-day distance, which highlight the users' daily travel behavior and the users' displacement between two consecutive days, respectively. Their investigations, conducted at various scales (i.e., global, country, and U.S. states), suggest that Twitter-derived mobility dynamics are able to reflect the geographical differences in policy implementations and discrepancies in policy compliance. Notably, Xu et al. (2020) proposed and utilized a Twitter Social Mobility Index, which measures social distancing and travel derived from geotagged Twitter posts, to analyze U.S. weekly travel patterns. Similar efforts were made to monitor global human mobility dynamics (Bisanzio et al., 2020;Lai et al., 2021;Li et al., 2021c;Li et al., 2021d) as well as country/region-specific dynamics where Twitter is widely used, such as the U.S. (Zeng et al., 2021) and Australia (Nguyen et al., 2020a).
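To make the two distance notions concrete, the sketch below computes them for one user's geotagged posts. It is an illustration under stated assumptions rather than the definitions used by Huang et al. (2020): here single-day distance is assumed to be the largest great-circle distance between any two of a user's posts on the same day, and cross-day distance is assumed to be the great-circle distance between the mean centers of posts on two consecutive days. All function names are hypothetical.

```python
from itertools import combinations
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1 = map(radians, p)
    lat2, lon2 = map(radians, q)
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def single_day_distance(points):
    """Assumed definition: maximum pairwise distance among one day's posts."""
    if len(points) < 2:
        return 0.0
    return max(haversine_km(p, q) for p, q in combinations(points, 2))


def mean_center(points):
    """Arithmetic mean of latitudes and longitudes.

    This is a simplification that is reasonable for nearby points and
    ignores wrap-around at the antimeridian.
    """
    lat = sum(p[0] for p in points) / len(points)
    lon = sum(p[1] for p in points) / len(points)
    return (lat, lon)


def cross_day_distance(day_a, day_b):
    """Assumed definition: distance between mean centers of two consecutive days."""
    return haversine_km(mean_center(day_a), mean_center(day_b))


# Hypothetical coordinates for one user on two consecutive days.
monday = [(34.00, -81.03), (34.05, -81.00)]
tuesday = [(33.75, -84.39)]
print(round(single_day_distance(monday), 1), round(cross_day_distance(monday, tuesday), 1))
```

Aggregating such per-user values by day and region (for example, taking a median across users) would yield a mobility index comparable across time, although the choice of aggregation is itself an assumption.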
Facebook is another popular social media platform with a large global user base. Beginning in the initial phases of the COVID-19 pandemic, Facebook Data for Good began to provide human mobility information to assist with pandemic mitigation. For example, Chang et al. (2021) explored Facebook-derived movement patterns and used meta-population models to assess the potential effects of local travel restrictions imposed within Taiwan. Zachreson et al. (2021) used Facebook mobility data to estimate future spatial patterns of relative transmission risk and examine the degree to which these estimates correlate with observed cases in Australia. Besides these two efforts, Facebook mobility records were employed for mobility monitoring at a continental scale, as well as at a country/sub-country scale, e.g., the U.S. (Holtz et al., 2020;Ilin et al., 2021), the U.K. (Shepherd et al., 2021), Italy (Beria and Lunkar, 2020;Bonaccorsi et al., 2020), Spain (Pérez-Arnal et al., 2021), Japan (Fraser and Aldrich, 2020), Germany (Fritz and Kauermann, 2020), Denmark (Edsberg Møllgaard et al., 2022), and Australia (Zachreson et al., 2021).
Several other social media platforms were harnessed to address COVID-19 challenges as well, such as Instagram in the U.S. and Tencent and Sina Weibo in China. Zarei et al. (2020) constructed the first Instagram dataset, which featured COVID-19 related posts with locational information. Using mobility data derived from Tencent's various media platforms, Li et al. (2020d) revealed daily human movement patterns in Sichuan, China (which covers the mobility of 90% of Sichuan citizens) during the initial stages of the COVID-19 outbreak, and Wei et al. (2021) evaluated how people in Wuhan, China reduced their mobility in response to city lockdowns. Another Chinese social media platform, Sina Weibo, also renders geotagged posts that allow researchers to mine the spatiotemporal patterns of human interactions and place visitations (Peng et al., 2020).
Social media platforms have proven to be one of the most vital sources of mobility data, enabling researchers to obtain critical insight into human mobility amidst COVID-19. Because they depend on users actively sharing content, social media mobility records are less abundant than passively collected records (e.g., mobile phone data, smart cards, or wireless networks), though they are less intrusive, more accessible, and more harmonized (Li et al., 2021c). However, limitations such as the need for users to have an incentive to post and the varying positioning accuracy need to be recognized (discussed in Section 4.1).

Communication and information conveying
Social media platforms are not only popular among individual users for user/news following, microblogging, and content sharing (Gong and Yang, 2020;Kietzmann et al., 2011) but have also become crucial tools for institutions (such as governments, organizations, and universities, etc.) to disseminate information, foster connections, and even manage crises (Gong and Lane, 2020;Kelly, 2013;Kostkova et al., 2014). Crisis communication refers to the sharing of information among individuals and institutions to improve crisis management and understanding (National Research Council, 1989). Crisis communication has been reshaped by social media in numerous ways, including raising public awareness through collaboration and participation, distributing information and instructions in real time, and monitoring and managing risks with greater efficiency (Olteanu et al., 2015;Reuter et al., 2016;Yoo, 2019). In spite of the virtual nature of social media interactions, the spatial social networks they have formed still reflect the geography of communication (Ye and Andris, 2021). Human interactions in real life and in cyberspace are similar in terms of their social, economic, cultural, and linguistic constraints; thus, spatial social networks tend to mimic real-life patterns (Bild et al., 2015;Stephens and Poorthuis, 2015). Many studies have used social media data to examine crisis communication under the COVID-19 context from a geographic and social network perspective. The majority of the COVID-19 crisis communication research focused on governmental agencies, but some also examined other public health stakeholders, such as non-governmental organizations (NGOs), educational organizations, and the public.
As the COVID-19 crisis unfolds, government organizations at different levels must act quickly to communicate crisis information to the public in an efficient and effective manner; failure to do so could lead to an increase in fear, uncertainty, and anxiety among the public (Chen et al., 2020b). Based on spatial-temporal analyses, network analyses, and text mining of the U.S. state governors' crisis communication on Twitter during the pandemic, Gong and Ye (2021) found that the current usage patterns are generally consistent with effective crisis communication principles (listening, informing, providing feedback, and establishing connections) and provided some concrete recommendations for improving the process. One qualitative analysis of how world leaders of the Group of Seven (G7) communicated about the COVID-19 pandemic indicated that 82% of their tweets were informative; many of them dealt with government resources, morale boosting, and political issues (Rufai and Bunce, 2020). According to Zhu et al. (2020), the analysis of Sina Weibo posts related to COVID-19 confirmed that early warnings of crises are vital because public attention to COVID-19 was relatively limited until the Chinese government acknowledged that the novel coronavirus could be transmitted between humans and designated control of the outbreak as a high priority on January 20, 2020. Through analyzing tweets from 292 federal members of the Canadian parliament, Merkley et al. (2020) reported a moment of cross-party consensus on COVID-19 communication. No matter which party the members were from, they emphasized social distancing and proper hand hygiene as a necessity for combatting the COVID-19 pandemic (Merkley et al., 2020). Wang, Hao, and Platt (2021) analyzed 13,598 COVID-19-relevant tweets from 67 U.S. federal and state-level government agencies from January to April 2020. They identified inconsistencies and incongruities in four crucial prevention topics and found that communications coordination increased over time. Using tweets from Texas-based public health agencies, Liu, Xu, and John (2021) examined interagency coordination at different stages of the pandemic. In addition to stage-specific variations in peer-to-peer and federal-to-local coordination, they also observed consistency in content across stages, i.e., state and federal agencies acting as agenda setters. Studying 138,546 tweets from 696 public health agency accounts from February 1 to March 31, 2020, Sutton, Renshaw, and Butts (2020) observed that longitudinal COVID-19 risk communication shifted as secondary threats emerged. In addition, there are studies addressing the best practices in COVID-19 crisis communication on social media. Government agencies can improve public engagement and crisis communication efficiency on social media by leveraging narrative evidence (Gesser-Edelsburg, 2021;Ngai et al., 2020), adopting an empathic communication style (Liao et al., 2020), actively using the dialogic loop rather than media richness (Chen et al., 2020b), and joining forces with leading scientists from various domains (Tsoy et al., 2021) to generate persuasive and potent content. These findings may help government agencies to create communication plans for future crises and assist the public in understanding, preparing for, and predicting governments' response strategies.
The pandemic has spawned a wealth of complex problems, such as healthcare resource shortages, economic recession, mental health issues, and other social problems, all of which are difficult to resolve by governments alone. Therefore, it is imperative for public health stakeholders, NGOs, education institutions, and the general public to collaborate with government agencies within and across boundaries to address problems collectively (Head and Alford, 2015;Li et al., 2021b;Roberts, 2000;Weber and Khademian, 2008). After examining the evolution of Twitter-based networks and discourse across 2,588 U.S. NGOs in the first five months of the COVID-19 outbreak, Li et al. (2021b) discovered that social media usage helped NGOs to connect with each other by removing geographical barriers and specialty constraints. Over time, distinct organizational communities emerged around different topics, mostly reflecting theoretical predictions based on Issue Niche Theory (Yang, 2020). The interactions and connections among NGOs and government agencies during the COVID-19 pandemic are well captured on social media platforms, reflecting their goals to share information, build communities, and take action for disaster response. The government agencies played a leading role in the NGO-government collaborations, while NGOs from the Human Services, International and Foreign Affairs, and Public and Societal Benefit sectors, especially the American Red Cross, played a more central role in the NGO collaboration network. The study of social media usage by 189 Greek libraries during the pandemic revealed that although libraries embraced social media quickly as a channel for communication, only a few highlighted their roles in the promotion of public health by providing timely and reliable information (Koulouris et al., 2020). The COVID-19 pandemic has forced all educational institutions to move from face-to-face to online instruction. Students in higher education use social media primarily to build an online community and to support one another, whereas faculty members use it exclusively for teaching and learning (Sobaih et al., 2020). Based on analyses of tweets from 492 U.S. K-12 school districts in March-April 2020, Michela et al. (2022) found that these districts followed recommendations for social media crisis communication by posting more announcements and engaging more collaboratively during the early pandemic phases, and by sharing more community-building content later. Crisis communication among the general public is also crucial to disaster response. Yu et al. (2021) analyzed 10,132 COVID-19 related online comments on TripAdvisor and discovered a dynamic shift in risk perceptions and communication intensities among the general public as a result of the pandemic's rapid and unpredictable spread. During the COVID-19 pandemic, many instances of stereotyping and discrimination toward Asian Americans and the elderly population have been posted on social media, many of which are associated with stigmatizing and blaming these populations (Croucher et al., 2020;Meisner, 2021). Meisner (2021) urged the public to be aware of and to resist ageism that devalues later life in crisis communication. All of these findings provide unprecedented insight into how different public health stakeholders are working collaboratively to combat the pandemic, which can help the entire society prepare for the implementation of crisis communication strategies in anticipation of future global hazards.

Public attitudes and emotions
The COVID-19 pandemic has led to a major rise in studies that apply sentiment analysis to text-based social media content in order to gauge public attitudes and sentiments regarding both the pandemic and related topics, such as public health policies (e.g., mask-wearing) and/or events (e.g., vaccination) (Ewing and Vu, 2021;Kwok et al., 2021;Manguri et al., 2020). Sentiment analysis is thought to enable the derivation of the users' emotional response to a particular event or phenomenon via the text-based contents that they post (e.g., words, expressions, languages, and syntaxes) (Agarwal et al., 2011;Kouloumpis et al., 2011). Moreover, social media-based sentiment studies have also been used to indicate the public's awareness, opinions, or mental health signals based on the polarity and intensity of sentiments (e.g., positive vs. negative, or optimistic vs. pessimistic) and the type of emotions (e.g., fear, sadness, joy, and surprise) (Coppersmith et al., 2014). Such studies are further able to supplement survey-based mental health assessments, enabling researchers to mitigate issues such as a limited data pool (e.g., limited spatial and temporal data coverage, data under-representativeness) (Balcombe and De Leo, 2020). Sentiment research that seeks to quantify sentiments via social media data typically relies on advanced measuring techniques, including artificial intelligence (AI) models and machine and/or deep learning algorithms (Ewing and Vu, 2021;Hu et al., 2021;Kwok et al., 2021;Wang et al., 2020a;Wang et al., 2022). Commonly used tools include the lexicon- and rule-based Valence Aware Dictionary for sEntiment Reasoning (VADER) (Wang et al., 2022) and the lexicon-based National Research Council Canada Lexicon model (NRCLex) (Hu et al., 2021), both of which target English-based contents, as well as the XLM-R and XLM-T models (Conneau et al., 2019;Imran et al., 2022) and models distributed via Hugging Face (Barbieri et al., 2021), which target multilingual contents. More nuanced reviews and surveys of the models and algorithms used in sentiment analysis can be found in Alsaeedi and Khan (2019) and Medhat et al. (2014).
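As a rough illustration of how lexicon-based tools turn text into a polarity score, the sketch below sums word valences from a small dictionary, flips the sign of a word that directly follows a negation, and squashes the total into the open interval (-1, 1). It is a toy under stated assumptions, not VADER or NRCLex: the lexicon entries, their values, the negation list, the normalization constant, and the 0.05 threshold are all hypothetical choices made for this example.

```python
import re
from math import sqrt

# Hypothetical toy lexicon: word -> valence. Real lexicons contain
# thousands of human-rated entries.
TOY_LEXICON = {
    "good": 2.0, "great": 3.0, "safe": 1.0, "hope": 2.0,
    "bad": -2.0, "fear": -2.0, "sad": -2.0, "angry": -3.0,
}
NEGATIONS = {"not", "no", "never"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def sentiment_score(text, alpha=15.0):
    """Return a polarity score in (-1, 1); 0.0 means no lexicon word was found."""
    tokens = tokenize(text)
    total = 0.0
    for i, tok in enumerate(tokens):
        valence = TOY_LEXICON.get(tok)
        if valence is None:
            continue
        # Flip the sign when the previous token is a simple negation word.
        if i > 0 and tokens[i - 1] in NEGATIONS:
            valence = -valence
        total += valence
    # Squash the unbounded sum into (-1, 1); alpha controls how fast it saturates.
    return total / sqrt(total * total + alpha)


def label(score, threshold=0.05):
    """Map a score to a coarse class using a symmetric threshold."""
    if score >= threshold:
        return "positive"
    if score <= -threshold:
        return "negative"
    return "neutral"


for post in ["Vaccines give me hope", "I am not safe", "The meeting is on Monday"]:
    print(post, "->", label(sentiment_score(post)))
```

This toy ignores contractions such as "isn't", emojis, sarcasm, and non-English text, which are among the uncertainties in sentiment acquisition that motivate the transformer-based multilingual models mentioned above.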

Infodemic and misinformation
With the rapid dispersion of COVID-19, a tsunami of related information rushed across the internet. Yet, such information remains unfiltered, and much of it contains misinformation, rumors, and conspiracy theories. On this influx of information, the World Health Organization's (WHO) Director-General Tedros proclaimed, "We're not just fighting an epidemic; we're fighting an infodemic". On February 15, 2020, at the Munich Security Conference, Tedros officially labeled this phenomenon the "Infodemic", which describes a situation where, during a period of disease outbreak, there exists a vast amount of information that is false or misleading in nature and is present in physical and digital environments. During the COVID-19 crisis, misinformation can spread faster and farther on social media platforms than the virus itself. According to a Reuters report, the number of English-language fact-checks rose more than 900% from January to March 2020 (Brennen et al., 2020). The misinformation covers a wide range of topics, e.g., "5G virus is true", "eating garlic can prevent coronavirus", and "Bill Gates is planning to microchip the world through a COVID-19 vaccine". Such misinformation and the resulting risk-taking behaviors can lead to mistrust in health authorities and undermine public health response. In light of this context, scholars around the world have started to investigate misinformation spread on social media platforms.
Social media generates a massive amount of information related to the COVID-19 pandemic every day, and manual identification of misinformation is time- and labor-consuming. As an alternative solution, advanced machine learning techniques have been deployed to detect misinformation automatically. The accuracy of misinformation detection models relies on sufficient and reliable datasets. Thus, many efforts have been made to provide high-quality social media misinformation datasets. Researchers collected ground-truth data from fact-checking websites (Ceron et al., 2021;Saakyan et al., 2021;Shahi and Nandini, 2020) and reliable websites (Cui and Lee, 2020;Zhou et al., 2020) for the misinformation detection task. Specifically, FakeCovid (Shahi and Nandini, 2020) is a multilingual cross-domain fact-check news dataset that contains 5,182 articles circulated in 105 countries (40 languages) from 92 fact-checkers. CoAid is a healthcare domain dataset containing 4,521 true news articles and claims from reliable media outlets (e.g., Healthline, ScienceDaily, and WHO) (Cui and Lee, 2020). CoAid collects fake news by retrieving URLs from multiple fact-checking websites such as LeadStories, PolitiFact, and FactCheck. ReCOVery is a multimodal repository for COVID-19 news credibility research, which contains 1,364 news articles from 22 reliable websites (e.g., National Public Radio and Reuters) and 665 news articles from 38 unreliable websites (e.g., Humans Are Free and Natural News).
Another research direction is to collect misinformation-related posts from social media users to explore user engagement (Cui and Lee, 2020;Kim et al., 2021;Li et al., 2020c) and public opinion (Gupta et al., 2020;Wang et al., 2020b;Xue et al., 2020;Yin et al., 2020). Misinformation detection is essentially a classification task. The common workflow is to develop a dataset with true and false labels for model training, adjust the model based on the results of the test set, and apply it to unknown data in order to generate predictions. Machine learning (ML) and deep learning (DL) models have been widely used for misinformation detection (Alenezi and Alqenaei, 2021;Elhadad et al., 2020;Gundapu and Mamidi, 2021;Kar et al., 2020;Koirala, 2020). Traditional ML models, such as decision trees, support vector machines, and logistic regression, usually serve as baseline models in fake news detection model experiments. Al-Rakhami and Al-Amri (2020) proposed an ensemble framework for misinformation detection by using traditional ML and conducting extensive experiments on a self-collected Twitter dataset. Their work demonstrates that a combination of models outperforms a single model. For DL models, a variety of models have been used to address the COVID-19 misinformation identification challenge, including Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Bidirectional Encoder Representations from Transformers (BERT) (Al-Rakhami and Al-Amri, 2020), to list a few. Alkhalifa et al. (2020) introduced a CNN-based classification system with different preprocessing and embedding methods to classify COVID-19 rumors. An ensemble deep learning technique implemented to detect misleading information about COVID-19 achieved satisfactory performance (Elhadad et al., 2020). Two other advanced models, i.e., BiLSTM (Boukouvalas et al., 2020;Dharawat et al., 2020;Hossain et al., 2020;Kumar et al., 2021) and BiGRU (Cui and Lee, 2020;Elhadad et al., 2020), have also been widely adopted in recognizing misinformation on social media. The recent development of BERT has pushed natural language processing to a new level, thanks to its capability in capturing both left and right contexts, given its bidirectional design. Some BERT variants were adopted for COVID-19 misinformation detection (Alkhalifa et al., 2020;Glazkova et al., 2021;Heidari et al., 2021;Perrio and Madabushi, 2020;Tziafas et al., 2021). The multilingual BERT (mBERT) is a notable variant that is trained on Wikipedia and covers a total of 104 languages. COVID-Twitter-BERT was trained on a corpus of 160 million COVID-19 related tweets on the Crowdbreaks platform and performed very well on many textual representations related to COVID-19 (Müller et al., 2020).
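The classification workflow described above (label, train, evaluate on held-out data, then predict on unseen posts) can be sketched with one of the baseline models the text names, logistic regression, over bag-of-words features. This is a minimal sketch in plain Python, not the pipeline of any cited study: the function names are hypothetical, and the handful of labeled sentences are invented toy examples (1 = misinformation, 0 = reliable) that loosely echo the rumor themes mentioned earlier. Real studies train on thousands of annotated items such as those in the datasets listed above.

```python
import re
from math import exp


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))


def predict_proba(weights, bias, text):
    """Probability that a text is misinformation under the current model."""
    z = bias + sum(weights.get(tok, 0.0) for tok in tokenize(text))
    return sigmoid(z)


def train(examples, epochs=300, lr=0.2):
    """Fit bag-of-words logistic regression with plain stochastic gradient descent."""
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for text, label in examples:
            error = predict_proba(weights, bias, text) - label  # gradient of log-loss
            for tok in tokenize(text):
                weights[tok] = weights.get(tok, 0.0) - lr * error
            bias -= lr * error
    return weights, bias


def accuracy(weights, bias, examples):
    """Share of examples whose thresholded prediction matches the label."""
    hits = sum((predict_proba(weights, bias, t) >= 0.5) == bool(y) for t, y in examples)
    return hits / len(examples)


# Invented toy data for illustration only (1 = misinformation, 0 = reliable).
TRAIN = [
    ("garlic cures coronavirus", 1),
    ("5g towers spread the virus", 1),
    ("vaccine contains a microchip", 1),
    ("drinking bleach kills the virus", 1),
    ("wash hands to reduce spread", 0),
    ("vaccine trials report results", 0),
    ("health agency updates guidance", 0),
    ("masks reduce transmission indoors", 0),
]
HELD_OUT = [
    ("garlic kills coronavirus", 1),
    ("agency guidance on masks", 0),
]

weights, bias = train(TRAIN)
print("held-out accuracy:", accuracy(weights, bias, HELD_OUT))
print("P(misinformation):", round(predict_proba(weights, bias, "garlic cures the virus"), 3))
```

A model this small only memorizes which words co-occur with each label, which is why the cited work moves to ensembles, recurrent networks, and BERT variants that capture context rather than isolated tokens.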

Hatred and violence
Since the first confirmed case of COVID-19 in the U.S. on January 19, 2020 (Hossain et al., 2020), hateful and xenophobic language has surged on social media. This was quickly followed by prejudice and discriminatory acts against minorities, particularly the Asian and Asian American population (Croucher et al., 2020;Fan et al., 2020). Notably, during the period of March 19, 2020, to September 30, 2021, the Stop AAPI (Asian American and Pacific Islander) Hate reporting center recorded a total of 10,370 hate incidents against Asians and Asian Americans (Horse et al., 2021). Today, racism is recognized as a public health threat by the American Medical Association, as the connection between hateful social media posts and offline racially and religiously aggravated crime has previously been documented (Williams et al., 2020). Moreover, for both traditional media and social media, the spread of hate during COVID-19 on such platforms results in potentially negative effects on population health, and such an observation has also been previously recorded (Gao et al., 2020a;Quintero Johnson et al., 2021). Malicious content like racism and disinformation is spreading quickly beyond the control of individual social media platforms, thereby subverting their efforts to moderate content (Velasquez et al., 2021).
The scientific community has responded to the rapid dissemination of such malicious content with similar fervor. Annotated tweet datasets for the detection of racism and sexism were readily available before the pandemic (Davidson et al., 2017;Waseem and Hovy, 2016), and datasets exist in many languages other than English, e.g., a dataset of Spanish tweets for misogyny detection (Fersini et al., 2018), an Arabic dataset for detection of hate speech and fake news (Ameur and Aliane, 2021), and an annotated dataset of abusive language in German (Wich et al., 2021). Early work on social media mining during COVID-19 established the theoretical groundwork for detecting hate speech on social media by using keyword-based classifiers. Nguyen et al. (2020b) found evidence of increased negative sentiment towards Asians associated with the "#chinesevirus" hashtag in early 2020. Anti-Chinese and anti-Asian attacks on social media platforms mainly targeted eating habits, hygiene, and culture in general (Stechemesser et al., 2020). Lastly, exploratory work during the early stages of the pandemic includes the application of space-time scan statistics (Kulldorff, 1997) to assess the spatiotemporal distribution of geotagged tweets regarding Asian hate (Hohl et al., 2022).
Other studies, though aspatial, have focused on identifying anti-Asian hate and counterspeech on social media using BERT. BERT was used to identify hate-related keywords that targeted older people during the pandemic (Vishwamitra et al., 2020) and fine-tuned to analyze COVID-19 content on Twitter (Müller et al., 2020). This approach utilizes word embeddings in conjunction with machine learning to classify tweets, thereby offering an advantage over keyword-based classifiers by incorporating word context. Such a method was used for analyzing the dehumanization of LGBTQ people in articles in the New York Times (Mendelsohn et al., 2020). Further, crowdsourcing and ensemble learning algorithms were utilized to detect hate on social media in Germany (Garland et al., 2020, 2022). Amidst sudden changes during the early stages of the pandemic, the difficulty of obtaining costly training data for hate speech detection (e.g., labeled social media posts) was circumvented through the usage of unsupervised progressive domain adaptation based on a deep-learning language model (Bashar et al., 2021). Lastly, efforts to analyze the effects of content moderation policies on the propagation of malicious posts (within social media platforms) using mathematical models produced encouraging results, accompanied by actionable suggestions towards slowing the spread of online hate (Velasquez et al., 2021).

Available sources and application examples
In the past two years, we have witnessed a tremendous surge in the use of social media platforms during the COVID-19 pandemic to study misinformation, public opinion, human behavior, infodemic, and more. Social media platforms can be categorized as social networks (e.g., Twitter, Sina Weibo, and Facebook), media sharing networks (e.g., YouTube), and discussion forums (e.g., Reddit). However, limited social media datasets are shared with the public, hindering collaborative research and increasing the crisis of research reproducibility and replicability. As a result, this section summarizes and compares the most popular and publicly available social media datasets in terms of geolocation, content, advanced data analytics, geographic coverage, time coverage, and selected citations, as shown in Table 1.

Social networks
Twitter plays a significant role in COVID-19 related research, and its original data include user information, content, post time, and more. However, due to privacy concerns, the publication of personal Twitter data is not permissible. Therefore, after collecting COVID-19 related tweets, many researchers publicly share tweet IDs to allow ease of access. With these tweet IDs, users can hydrate tweets and access the original information via the Twitter API. For example, Chen et al. (2020a) used Twitter's search API to gather historical COVID-19 related tweets worldwide based on keywords (e.g., Coronavirus, Koronavirus, Corona, covid-19, and N95), dating back to January 21, 2020. Their team has shared their repository, which contains an ongoing collection of tweet IDs. So far, it is the most popular Twitter dataset cited by researchers across the world.
Although the sharing of raw Twitter data is restricted, many researchers share derived products with the public, such as the sentiment, emotions, and topics extracted from users' tweets. For example, Lopez and Gallemore (2021) collected over 2.2 billion tweets across the globe in multiple languages. Additionally, they employed state-of-the-art algorithms to analyze sentiment and recognize named entities in Twitter content. Such aggregated information facilitates researchers' exploration and hypothesis testing on social discourse regarding the COVID-19 pandemic (Lopez and Gallemore, 2021).
Locations and medical emergencies are intrinsically linked. Geotagged tweets are able to provide real-time information about human activities at a low cost and at high spatial and temporal resolutions. Using geography as a common variable, they also enable researchers to join attributes across various datasets (e.g., sociodemographic) (Hu and Wang, 2020). Given this advantage, Qazi et al. (2020) released the GeoCoV19 dataset, which contains around 378,000 geotagged tweets and 5.4 million tweets with locational information at the country, state, and city levels. Further, Lamsal (2021) published tweet IDs that, once hydrated, give access to geotagged datasets.
Sina Weibo, commonly referred to as the "Chinese Twitter", is the leading social media platform in China, with 497 million active monthly users in 2019. Given that China was the earliest country to report COVID-19 outbreaks, many researchers have shared and utilized datasets from Sina Weibo to analyze misinformation. For example, Leng et al. (2020) crawled Sina Weibo posts via Weibo API from December 7, 2019, to April 4, 2020, and shared the datasets on Harvard Dataverse. Fu and Zhu (2020) collected 11,362,502 posts between December 1, 2019, and February 27, 2020, each containing at least one outbreak-related keyword (e.g., mask, virus, or coronavirus).

Media sharing networks (YouTube and Instagram)
Media sharing networks such as YouTube and Instagram have also been important channels through which people receive COVID-19 information in the form of videos and photos. The YouTube Data API can be used to find videos through search queries. Video metadata, including titles, descriptions, tags, video statistics, comments, as well as the recommended videos, can be collected through the YouTube Data API. For example, Papadamou et al. (2020) collected COVID-related videos and recommendations through the API to analyze the effect of a user's watch history on video recommendations. Notably, YouTube has also served as a source of COVID-19 misinformation (Allington et al., 2021). These videos are often linked by content on other social media sites, including Reddit, Twitter, and Facebook. Knuutila et al. (2021) presented a dataset of COVID-related video identifiers that were removed by YouTube, though the videos' metadata were recovered through archive.org's Wayback Machine. Researchers have also actively studied the content of YouTube videos, as it can both be useful as a source of information (D'Souza et al., 2020) and play a role in spreading misinformation (Zarei et al., 2020). A common approach is to select the most viewed videos by search queries and analyze the video content along with its metadata. Basch et al. (2020) identified the 100 most widely viewed YouTube videos in January 2020, using the search term "Coronavirus". Their analysis revealed that only one-third of the videos covered key prevention behaviors. Instagram data, including post comments, geotags, and captions, can be retrieved with open-source tools such as Instaloader. Researchers are able to share data through the post IDs. For example, Zarei et al. (2020) used the Instagram Hashtag search API to retrieve public posts with a set of COVID-19 hashtags and crawl the reactions (comments or likes) for further analysis.
Researchers have previously applied Natural Language Processing and deep learning techniques to Instagram posts as well. For example, Mackey et al. (2020) analyzed illicit COVID-19 product sales from Twitter and Instagram posts using unsupervised topic modeling and a recurrent neural network with long short-term memory (LSTM) unit to identify online sellers.

Discussion forums
Discussion forums, e.g., Quora, Yahoo Answers (shut down on May 4, 2021), Infobot, and Reddit, provide users with online spaces to discuss news and answer questions, where public comments and statements can be collected by researchers to study the influences of COVID-19. Among the above-mentioned discussion forums, Quora and Reddit are the two most commonly used and have enabled many COVID-19 studies.
Quora is a popular question-and-answer (Q&A) website where users are allowed to ask questions and connect with people who contribute unique insights and quality answers. The COVID-19 pandemic has greatly stimulated people's interest in asking and answering COVID-19 related questions, and a large amount of content can be harnessed for COVID-19 studies. George et al. (2020), for example, analyzed the content, type, and quality of Q&As in Quora regarding the pandemic and compared the information with that on the WHO website by manually categorizing the tone of the question as either positive, negative or ambivalent and grading questions for accuracy, authority, popularity, readability, and relevancy. Another notable effort is by McCreery et al. (2020), who designed a fine-tuning neural network approach trained by Quora question pairs to identify similar posted questions.
Reddit, one of the most widely used discussion forums, allows registered users to submit content to the site, such as links, text posts, images, and videos, among which lots of content are related to current events. A popular method to collect Reddit data is through the PushShift API, which serves as a copy of Reddit objects. For example, Low et al. (2020) introduced the Reddit Mental Health Dataset that contains posts from 28 subreddits from 2018 to 2020. Reddit data provide a new lens for researchers to study emotion, gender differences, and mental health during the COVID-19 pandemic. Text mining and natural language processing techniques are the major analytical tools employed in research. Naseem et al. (2020) leveraged Non-negative Matrix Factorization (NMF) topic modeling on Reddit posts to study life during the pandemic and the effects of social distancing. Aggarwal et al. (2020) analyzed emotions through the Valence-Arousal-Dominance (VAD) affect representation. Word embeddings of Reddit data were used to train beta regression models in order to predict VAD scores. The results revealed considerable differences between male and female authors across all three emotional dimensions.

Biased population spectrum
In 2020, social media platforms were used by over 3.6 billion people worldwide, and this number is projected to increase to almost 4.4 billion in 2025 (Tankovska, 2021). Despite this growing trend, however, it has been argued that the current demographics of active social media users are unrepresentative of the entire population across the world in terms of age, gender, race, education, or socioeconomic status. Jiang et al. (2019) found that Twitter users in the entire U.S. are biased towards certain age groups (18-29 and 30-39), females, and people with Bachelor's and Graduate degrees. They also discovered that U.S. Twitter users' spectrum presents strong spatial non-stationarity, suggesting that the biases of Twitter users vary by geographical location (Jiang et al., 2019). Facebook users are most represented by individuals between the ages of 25 and 35 years (Barnhart, 2022). One of China's largest social media platforms, Sina Weibo, also has a user demographic that is considerably different from the national population statistics, with males composing 56.3% of users, 20-35-year-olds comprising 82% of users, and 91% of users holding Bachelor's degrees (Weibo-Sina, 2017). Such biases are also observed in other social media platforms, including WeChat and Instagram. Thus, it remains debatable whether place visitations, mobility patterns, sentiment, or emotions captured from social media space are representative of those of the entire population. Applying findings derived from such a small minority to the general public is cautioned against, unless they are statistically compared with and supported by other means of data collection that are less biased, such as questionnaires and surveys.

Multilingual investigations
Social media data present several advantageous characteristics: their breadth of languages facilitates intra- and inter-continental investigations, and they allow comparisons between data derived from different regions, where cellphone records from certain providers can differ geographically. Despite these advantages, multilingual posts in the social media space pose challenges towards contextual interpretation. For example, every month, there are over 330 million active Twitter users across the world, using tens of languages, with English (31.8%), Japanese (18.8%), and Spanish (8.46%) as the three most popular languages (VICINITAS, 2018). Current studies that extract situational awareness and perform sentiment/emotion analysis on COVID-19 related posts tend to focus on monolingual posts (Griffith et al., 2021;Mansoor et al., 2020;Shofiya and Abidi, 2021) or multilingual posts with naïve translating approaches (Lin et al., 2021;Zhang et al., 2021). When applied to study areas with two or more dominant languages, though, such investigative procedures ultimately ignore specific groups of people and introduce uncertainties when summarizing emotions and sentimental preferences across different languages. Despite the development in multilingual translation, which is supported by advances in natural language processing techniques, the potential biases in extracting and quantifying sentiment and emotions across different languages still deserve further exploration.

Posting incentives
For geotagged social media posts, the active sharing characteristics of social media data inevitably lead to a "warped reality" when compared to actual human-to-human interactions and place visitations. That is to say, human mobility patterns extracted from the social media space are a biased representation of actual human mobility. For example, geotagged social media posts derived from check-in records generally have to satisfy two requirements: 1) users are geographically close to the check-in locations (or at least they claim to be); 2) the check-in locations are worth posting (i.e., "interesting" enough for them to create a post). In comparison, geolocations obtained via passive collecting means (e.g., Wi-Fi, Call Detail Records, and GPS signals) only need to satisfy the former requirement. Such a biased and inevitably generalized representation may lead to uncertainties or even mistakes when applied to the decision-making process for COVID-19 mitigation. For sentiment and emotion mining, studies have shown that bursts of posting tend to occur following major events (Pohl et al., 2012;Zhou and Chen, 2014). In other words, a considerable number of social media posts are event/news-driven. Therefore, the question as to whether emotions and sentiments from event-triggered posts largely reflect opinions towards the event itself or the general topic remains to be explored. Unfortunately, it remains a challenge to grasp the contextual meaning behind sentiments and emotions using current natural language processing techniques.

Positioning accuracy
The levels to which social media data are geotagged vary greatly (depending on the social media platforms' terms of use and users' specific settings), posing challenges to studies that prefer certain geolocational accuracy for social media posts. In general, the geotagging levels include country, first-level subdivision, second-level subdivision, city, neighborhood/point of interest (POI), and exact coordinates. A study conducted by Li et al. summarized the positioning levels of 1.4 billion geotagged tweets worldwide: 1.1 billion (79%) at the city level, 138.1 million (9.8%) at the first-level subdivision (state or province), 90.4 million (6.4%) with exact coordinates, 46.2 million (3.3%) at country level, and 21.4 million (1.5%) at neighborhood/point of interest (Li et al., 2021a). Certainly, different social media platforms have varying preferences towards certain positioning levels. For example, Sina Weibo check-in data returned from Sina Weibo API are mostly positioned at the POI level (Hu et al., 2019), whereas Facebook Data for Good only provides re-aggregated data at certain administrative levels due to privacy concerns (Edsberg Møllgaard et al., 2022). The varying positioning levels of social media posts impose a great influence on the statistical findings of studies that summarize statistics within certain geographic units due to the modifiable areal unit problem (MAUP). For applications that demand accurate human moving patterns, integrating social media posts with mixed positioning levels produces significant uncertainties that should not be overlooked.

Uncertainties in sentiment and emotion
Uncertainties in sentiment analysis and the emotions that it extracts from social media posts have been widely acknowledged. Despite the fact that advanced natural language processing techniques, when applied to multilingual posts, enable reliable translation for certain languages, they still have relatively less consistency and lower performances for those that are less spoken (Balahur and Jacquet, 2015). This leads to increased uncertainties in the results of sentiment and emotion analysis when they are applied to multilingual regions, especially in those with less spoken languages. Certain social media platforms, such as Twitter and Weibo, have character limits, creating oddities (e.g., the usage of abbreviations and acronyms) found in posts that would otherwise not be present in normal language. Furthermore, the unique character-limit restrictions imposed on posts made on certain social media platforms demand the application of word vectors trained specifically from short-text documents instead of the ones that are from popular word representation models, such as Global Vectors for Word Representation (GloVe) (Pennington et al., 2014) and Embeddings from Language Models (ELMo) (Peng et al., 2019). In addition, within the context of COVID-19, we should note that certain words have sentimental tendencies that are opposite to their original meaning. For example, the sentence "I have been tested positive" has a negative sentiment polarity, despite the fact that the word "positive" presents a strong positive polarity in many sentiment analysis models. Another challenge is the treatment of neutral reporting of valenced information, e.g., "the daily death toll dropped to 1,000" and "100 more have been tested positive today". It is unclear whether these statements should be considered as neutral unemotional reporting of developments or assumed that users are in negative/positive emotional states.
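The "tested positive" pitfall can be illustrated with a minimal lexicon-based scorer. The tiny lexicon, the phrase-override table, and the function names below are hypothetical and chosen only for illustration; they do not reproduce any specific sentiment model cited above, and phrase overrides are just one possible, assumed mitigation.

```python
import re

# Hypothetical general-purpose polarity lexicon (word -> polarity).
LEXICON = {"positive": 1.0, "good": 1.0, "happy": 1.0,
           "negative": -1.0, "sad": -1.0}

# Hypothetical COVID-19 phrases whose polarity is opposite to the lexicon entry.
DOMAIN_PHRASES = {"tested positive": -1.0, "tested negative": 1.0}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def naive_score(text):
    """Sum word polarities with no awareness of the pandemic context."""
    return sum(LEXICON.get(tok, 0.0) for tok in tokenize(text))


def domain_aware_score(text):
    """Score domain phrases first, then score the remaining words as usual."""
    remaining = " ".join(tokenize(text))
    score = 0.0
    for phrase, polarity in DOMAIN_PHRASES.items():
        pattern = r"\b" + re.escape(phrase) + r"\b"
        # Remove matched phrases so their words are not counted a second time.
        remaining, n_hits = re.subn(pattern, " ", remaining)
        score += polarity * n_hits
    return score + sum(LEXICON.get(tok, 0.0) for tok in remaining.split())


sentence = "I have been tested positive"
naive = naive_score(sentence)            # driven by the word "positive"
adjusted = domain_aware_score(sentence)  # driven by the phrase "tested positive"
```

Under these assumptions, the naive scorer assigns the sentence a positive polarity while the domain-aware scorer assigns a negative one. Neither scorer resolves the separate question of whether neutral reporting of valenced information should be treated as emotional at all.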

Bots, retweets, and skewed posting behaviors
From 50 million tweets, Al-Rawi & Shukla (2020) identified the top 1,000 most active accounts that mention COVID. Within these accounts, 127 (12.7%) were identified as highly likely to be bots. In an early study involving Weibo, a random sample of roughly 30,000 users was found to contain 57% of either inactive users or "zombie accounts" due to these accounts' lack of consistent postings over time (Fu and Chau, 2013). The method by which social media bots are handled in social media analytics is important for studies that address COVID-19 challenges by mining information from authentic human users. Despite the bots composing a smaller population than that of authentic human users, relatively high posting volumes can greatly contaminate researchers' data. However, we have yet to find an automatic and correct approach to identifying bots. Hence, this issue remains a challenge. In addition, we must acknowledge the skewed posting behaviors of social media users, given that a majority of social media posts come from a minority of users. For example, 80% of tweets come from the top 10% of the most active users (Wojcik and Hughes, 2019), which means that our analysis of a collection of social media posts is likely to be skewed towards a small subset of users. Optimized weighting mechanisms based on posting frequencies and user ID indexed analytical workflows can be adopted to address this issue. However, our literature review suggests that few efforts have considered such a skewed user representation when performing social media mining in the context of COVID-19. The question of how reposting behaviors should be managed is yet another challenge because there is a multitude of methods to account for re-posting, which may alter the analytical results. Despite the fact that many studies have designated re-posts as an agreement to the original post, Metaxas et al. (2014) found that, on many occasions, this assumption may actually not be the case.
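One simple form of the user-indexed weighting idea mentioned above is to average scores within each user first and then across users, so that every user contributes equally regardless of posting volume. The sketch below is an illustration under stated assumptions (hypothetical user IDs, made-up sentiment scores, and equal weight per user); it is not a mechanism prescribed by the cited studies.

```python
from collections import defaultdict


def post_level_mean(posts):
    """Average over posts: prolific users dominate the result."""
    return sum(score for _, score in posts) / len(posts)


def user_level_mean(posts):
    """Average within each user first, then across users (equal user weights)."""
    scores_by_user = defaultdict(list)
    for user_id, score in posts:
        scores_by_user[user_id].append(score)
    user_means = [sum(s) / len(s) for s in scores_by_user.values()]
    return sum(user_means) / len(user_means)


# Hypothetical (user_id, sentiment) records: one user writes 8 of the 10 posts.
POSTS = [("u1", -1.0)] * 8 + [("u2", 1.0), ("u3", 1.0)]

by_post = post_level_mean(POSTS)   # pulled towards the most active user
by_user = user_level_mean(POSTS)   # each of the three users counts once
```

With this toy input, the post-level mean is negative while the user-level mean is positive, which shows how a small subset of highly active accounts can reverse an aggregate finding.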

Data sharing
Data sharing in social media, especially during the COVID-19 pandemic, has become a crucial driving force in motivating social media studies to address COVID-19 challenges. Properly shared social media data archives support validity by advancing reproducibility, replicability, and comparability, addressing the 'digital divides' in data accessibility and saving efforts in the data collection processes (Weller and Kinder-Kurlanda, 2016). However, existing efforts often fail to be grounded in the general principles that underlie institutionalized data archiving and sharing, as the lack of standardized metadata, consistent documentation, and sustainable claim in current COVID-19 social media sharing efforts can be clearly observed. The lack of guidance for social media data sharing, especially during the COVID-19 pandemic, leads to sharing procedures that vary by different social media research communities. Thus, the current practices of social media data sharing need to be coherent and universally agreed upon in order to benefit not only future COVID-19 studies but also investigations on other public health emergencies.

Future directions
Based on the aforementioned challenges, we propose a number of research directions along which future efforts can be made to broaden and deepen the current research paradigm. These future directions are discussed in the context of the quantity and quality of social media data, the techniques used to process social media data, its application across multi-disciplines, and data archiving and sharing.
First, future efforts should be made to have a better understanding of the nature of social media data and to improve the quantity and quality of social media data. In particular, efforts towards enriching social media data with the demographic attributes of social media users under the protection of data privacy are much needed. Social media users' demographic attributes affect their participation in the social network and further influence their behaviors (e.g., mental health status) (Sinnenberg et al., 2017). Obtaining users' demographic attributes can be difficult because they cannot be directly collected from social media platforms. However, such demographic information, including age, gender, socioeconomic status, religion, and personality type, can be extrapolated from a user's tweets via machine learning, with an accuracy ranging from 60% to 90% (Bi et al., 2013;Burger et al., 2011;Ikeda et al., 2013;Pennacchiotti and Popescu, 2011;Rao et al., 2010). This is an underutilized resource for studies using social media data and can be applied towards understanding the subjects of investigation and reducing sampling biases. More specifically, studies based on individual tweets rather than individual users face the issue of skewed data problems, given that one user may post multiple tweets in a certain period of time. It can be addressed by the user indexed analytics, which is based on users' ID (e.g., as individuals or organizations). Such demographic information would also enable us to calibrate and justify the representativeness and reliability of social media data by cross-data validation based on other data sources (e.g., survey or census data).
Second, future work could explore new approaches and techniques in data retrieval, processing, and analytics to provide potential solutions that overcome the constraints inherent in social media data-based studies. For data retrieval, using Twitter APIs has been the most common approach to retrieve tweets that target certain topics. In early 2021, a new academic-oriented API was released by Twitter, which grants researchers free access to full-archive search to obtain more precise, complete, and unbiased data, greatly benefitting future Twitter-based analytics thanks to its increased data representativeness (Twitter, 2021). Facebook posts can be retrieved via the CrowdTangle API, a public tool owned and operated by Facebook (CrowdTangle, 2016). However, the representativeness of retrieved Facebook posts deserves further investigation. Posts in discussion forums, such as Reddit and Quora, are valuable sources to gauge public attention. Additional caution is needed when retrieving topic-relevant questions and answers. In addition, further efforts are encouraged to conduct cross-comparison of analytical results from different social media platforms, given that social media platforms can have user bases that vary in population spectrum. For data processing and analytics, technical solutions lie in the rapid development of computational skills and platforms (e.g., artificial intelligence, digital twins, and crowdsourcing) as they are more effective and efficient ways to quantify human behaviors (e.g., fuzzy logic lexical metrics and multilingual sentiment analysis). Comparison studies across different methods for data pre-processing are also needed. Taking Twitter as an example, analytical results might differ when using tweets vs. retweets, tweets with or without URLs, tweets including emoticons or not, and tweets generated by robots or not. 
We also need to establish standards for social media data reporting and a generic metadata architecture to better compare the scalability, replication, and reliability of social media data-based studies.
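To make the pre-processing comparison called for above more concrete, the sketch below applies several filters (excluding retweets, posts with URLs, or accounts flagged as bots) to the same toy collection and reports how an aggregate changes. The record fields (`is_retweet`, `is_bot`, `sentiment`) and all values are hypothetical; in particular, `is_bot` assumes that an upstream bot-detection step already exists, which, as noted earlier, remains an open problem.

```python
import re

URL_PATTERN = re.compile(r"https?://\S+")

# Hypothetical post records; the field names are assumptions, not a platform schema.
POSTS = [
    {"text": "Stay home, stay safe",
     "is_retweet": False, "is_bot": False, "sentiment": 0.5},
    {"text": "RT @user: new cases rising https://example.org/a",
     "is_retweet": True, "is_bot": False, "sentiment": -0.5},
    {"text": "Buy masks now https://example.org/b",
     "is_retweet": False, "is_bot": True, "sentiment": 1.0},
    {"text": "Missing my friends during lockdown",
     "is_retweet": False, "is_bot": False, "sentiment": -1.0},
]

# Each pre-processing variant is a predicate deciding whether a post is kept.
FILTERS = {
    "all posts": lambda p: True,
    "no retweets": lambda p: not p["is_retweet"],
    "no URLs": lambda p: URL_PATTERN.search(p["text"]) is None,
    "no bots": lambda p: not p["is_bot"],
}


def mean_sentiment(posts, keep):
    """Mean sentiment of the posts that pass the filter (None if none pass)."""
    kept = [p["sentiment"] for p in posts if keep(p)]
    return sum(kept) / len(kept) if kept else None


def compare_variants(posts):
    """Compute the same aggregate under every pre-processing variant."""
    return {name: mean_sentiment(posts, keep) for name, keep in FILTERS.items()}


results = compare_variants(POSTS)
```

Reporting such a table of variants alongside the main result would be one way to document how sensitive a finding is to pre-processing choices.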
Third, we call for investigations on social media bi-directional communications (e.g., organization-to-individual, individual-to-individual, and individual-to-organization) before, during, and after the COVID-19 pandemic as well as during other disruptive events. We also call for broader use of social media data in multidisciplinary studies across social, geographic, environmental, and computational sciences to better understand the impact of COVID-19 on human-environment interaction. In addition to the popular domains mentioned in Section 2, social media data can be applied to a wider network of fields under the context of the COVID-19 pandemic, such as commercial industry, transportation, security and information management, and social psychology. For example, social media data have great potential for understanding and addressing COVID-19 related cyber-bullying (Das et al., 2020), detecting suicide (Morese et al., 2022) or mental disorders within a certain population (Sher, 2020), and evaluating the recovery of restaurants (Laguna et al., 2020) and the hospitality industry (Park et al., 2020). Social media data provide a unique opportunity to support governments and public/private sectors by monitoring the recovery of human society in the later stages of the pandemic and preparing for future public health emergencies and crises.
Finally, there is a need for universal guidelines that address the ethics of social media research, with a focus on maintaining the privacy and anonymity of social media users. Sharing social media data via public repositories and platforms, even when the data are claimed to be 'anonymized', should be supported by discussions of obtaining consent and/or ethical approval for research purposes. This is particularly important for datasets containing information from users' profiles, because such datasets risk being identifiable via cross-referencing data attributes (Sinnenberg et al., 2017). Under the protection of data sharing regulations and in the spirit of reproducibility, we should endeavor to facilitate the sharing of processed social media data via public repositories and platforms and establish reproducible workflows that can be employed by end-users without a coding background.

Conclusion
Social media have been widely used as platforms and virtual communities where users worldwide can create and share information. The vast sensing network constituted by millions of active users and billions of posts allows information on certain topics to be mined and analyzed in a rapid manner. During the COVID-19 pandemic, we have witnessed extensive social media data mining efforts with diversified data mining techniques. These efforts address COVID-19 challenges from various perspectives, including early warning and detection, human mobility monitoring, communication and information conveying, gauging public attitudes and emotions, monitoring infodemic and misinformation, and mitigating hatred and violence. We also notice that an increasing number of COVID-19 related social media datasets have been made publicly available to benefit research communities by promoting replicability and reproducibility. Despite the remaining challenges (e.g., biased population spectrum and difficulty in multilingual investigations), we believe the future is bright for social media analytics to address future public health emergencies.