Using geospatial social media data for infectious disease studies: a systematic review

ABSTRACT Geospatial social media (GSM) data has been increasingly used in public health due to its rich, timely, and accessible spatial information, particularly in infectious disease research. This review synthesized 86 research articles that use GSM data in infectious diseases published between December 2013 and March 2022. These articles cover 12 infectious disease types ranging from respiratory infectious diseases to sexually transmitted diseases, at spatial levels ranging from the neighborhood and county to the state and country. We categorized these studies into three major infectious disease research domains: surveillance, explanation, and prediction. With the assistance of advanced computing, statistical, and spatial methods, GSM data has been widely and deeply applied to these domains, particularly surveillance and explanation. We further identified four knowledge gaps in terms of contextual information use, application scopes, spatiotemporal dimension, and data limitations, and proposed innovation opportunities for future research. Our findings will contribute to a better understanding of using GSM data in infectious disease studies and provide insights into strategies for using GSM data more effectively in future research.


Introduction
As a significant threat to public health, infectious diseases are distinguished from many other types of diseases by characteristics such as unpredictability, transmissibility, and preventability. Global pandemics and local epidemics in history could even influence the course of a war, determine the fate of nations, and shape the progress of civilization (Fauci and Morens 2012). Infectious disease research has undoubtedly been a priority for the scientific community.
Traditionally, survey questionnaires, census data, and medical records have been widely applied to diverse aspects of infectious disease research, such as monitoring the prevalence and incidence (Khanal, Adhikari, and Karkee 2013) and forecasting the epidemic trends (Inampudi et al. 2020). However, the utilization of these traditional data continues to face challenges, including the long update periods of the dataset, data scope restricted to a certain geographic area, and relatively small sample size (Kitchin 2013; Jing et al. 2021; Sha et al. 2021). As a result, new data sources and methodologies have been introduced to address these issues and inform policy decision-making.
Social media platforms such as Twitter (Li, Erfani, et al. 2021; Li, Huang, et al. 2021), Facebook (Ascani et al. 2021), and Instagram (Puspitasari, Ariful, and Nuqoba 2021) have been recognized as unique and powerful data sources for studying infectious diseases over the last decade due to their accessibility, timeliness, and richness. Social media are internet-based applications that enable communication and resource sharing, where users post and share their opinions, experiences, and emotions, including texts, images, and videos (Kaplan and Haenlein 2010). In some cases, social media data are referred to as crowdsourced data (Chunara, Smolinski, and Brownstein 2013). Among these various types of social media data, geospatial social media (GSM) data (i.e. social media data with geolocation/spatial information) provide rich information about geolocation and mobility in addition to individual behaviors and user characteristics (Sun et al. 2019; Li, Wachowicz, and Fan 2021). Such timely spatial data enable researchers to rapidly obtain a wealth of useful information for a variety of infectious disease studies that inherently include a spatial component, including surveillance (Chang et al. 2021), prediction (Fakhry, Asfoura, and Kassam 2020), and response (Chang et al. 2021).
Despite several studies reviewing the application of social media data in public health (Tang et al. 2018; Edo-Osagie et al. 2020), no systematic review of the use of GSM data in infectious diseases exists to the best of our knowledge. To bridge the gap, this article aims to examine the use of GSM data on infectious diseases in terms of data type, research topics, research methods, and spatial levels. Specifically, this review expands the previous literature reviews in three ways. First, it includes empirical studies in common infectious diseases with a broader coverage from respiratory infectious diseases to sexually transmitted diseases. Second, it employs a spatial perspective (i.e. utilizing spatial data and spatial analytical techniques to disentangle the mechanisms that drive the spatial and temporal transmission of infectious diseases) in extracting and synthesizing information from the included papers, allowing for a comprehensive examination of how spatial information can contribute to a deeper understanding of outbreaks and transmission of infectious diseases and their related issues (including individual attitudes, government policies, and vaccine efficacy in response to public health emergencies caused by infectious diseases). Third, it identifies new opportunities in infectious disease research from a spatial perspective.

Eligibility criteria
This review follows the guidelines of the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) (Moher et al. 2009). Only original peer-reviewed journal articles reporting empirical studies on GSM and infectious diseases were included. To be eligible for inclusion, a paper should (1) be an original study and peer-reviewed; (2) report empirical studies (i.e. not literature reviews or methodology papers); (3) be published in a journal or full-text conference proceeding (i.e. not book reviews, book chapters, or others); (4) be published in the English language prior to March 1, 2022; and (5) base its analysis on user-generated data on the social media platforms rather than survey data collected via social media platforms.
The keywords used in the searches included terms for social media, health issues, and spatial dimensions. Our keywords for social media included mainstream social media platforms that are widely used in public health research, including Twitter, Facebook, Weibo, and Instagram, as well as the term 'social media' to cover other social media platforms. The keyword list for infectious diseases was compiled as follows. First, since there is no official list of infectious diseases, we referred to the list maintained by the University of Utah (UniversityOfUTAH 2022), which includes 27 different types of infectious diseases. Second, we referred to other review articles on social media and infectious diseases (Tang et al. 2018). Third, based on the prior two steps, the authors discussed and identified the main infectious disease keywords and used 'infectious disease' to include other infectious diseases, totaling 37 keyword terms. To our knowledge, this is one of the most extensive keyword lists in the reviews of infectious diseases and social media to date. Lastly, five spatially related keywords were determined through discussion among the authors and referenced from relevant geospatial social media review literature (Barros, Gutiérrez, and García-Palomares 2022). See section 2.2 for a complete list of keywords.

Data sources and search strategies
Our review topics involve computer science, social sciences, and life sciences, which led to the selection of two databases. First, Web of Science is one of the most well-known databases for indexing academic literature in various fields worldwide, and it is widely used in the field of academic reviews (Li, Dong, and Liu 2020; Mollalo et al. 2021), so it was chosen as the first database. Second, PubMed is a well-known database of biomedical and life sciences literature, and it is frequently used in public health academic reviews (Edo-Osagie et al. 2020; Mollalo et al. 2021), hence it was chosen as the second database to avoid missing relevant articles. Third, we looked at other databases and found that the search results were generally consistent with the first two databases. As a result, the Web of Science and PubMed search results met our review database criteria.
We first conducted a literature search on Web of Science based on combinations of keywords, restricted to articles published before March 1, 2022. In Web of Science, we used the 'topic' search, which searches titles, abstracts, author keywords, and Keywords Plus. Other search options were left at their defaults. We then searched PubMed using the 'Title/Abstract' option.
The search rule and the selected keywords are shown as follows, taking the Web of Science rule as an example. The initial search yielded 685 articles from Web of Science and 322 articles from PubMed, with 712 non-duplicate articles in total. We then screened the abstracts of these articles using the criteria listed above. For those articles that could not be judged as eligible based on the abstract information, the full texts were examined to determine eligibility. These articles were discussed by the authors to determine whether they should be included in the review. The exclusion criteria were as follows: (1) exclusion of non-research papers, such as literature reviews; (2) exclusion of non-infectious disease research papers; (3) exclusion of methodology papers for data mining; (4) exclusion of research papers that mentioned 'spatial' in the article but did not use geospatial social data; (5) exclusion of research papers that mentioned social media platforms but did not actually use social media data. Following the screening, 86 articles were finally included in this review (Figure 1).
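The merging of the two search exports into a set of non-duplicate records can be sketched as follows. The review does not state how duplicates were identified, so the matching rule here (same DOI, or same title after normalization) is an assumption, and the record fields `title` and `doi` as well as the sample records are hypothetical.

```python
import re

def normalize_title(title):
    """Lowercase a title and collapse punctuation and whitespace so trivial variants match."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()

def record_keys(record):
    """Return every identity key for a record: its normalized title and, if present, its DOI."""
    keys = {("title", normalize_title(record["title"]))}
    doi = (record.get("doi") or "").strip().lower()
    if doi:
        keys.add(("doi", doi))
    return keys

def merge_unique(*result_sets):
    """Merge several search exports, keeping the first occurrence of each record."""
    seen, unique = set(), []
    for records in result_sets:
        for record in records:
            keys = record_keys(record)
            if seen.isdisjoint(keys):  # no key seen before, so this is a new record
                unique.append(record)
            seen |= keys
    return unique

# Hypothetical exports; the titles and DOIs are placeholders, not real records.
web_of_science = [
    {"title": "Twitter and Dengue: A Study", "doi": "10.1000/ABC"},
    {"title": "Mobility and flu", "doi": ""},
]
pubmed = [
    {"title": "Twitter and dengue - a study", "doi": None},
    {"title": "Another paper", "doi": "10.1000/xyz"},
]
merged = merge_unique(web_of_science, pubmed)
print(len(merged))
```

Automatic matching of this kind is only a first pass; records whose titles differ by more than punctuation or case would still need manual checking.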

Publication characteristics
The number of empirical studies using GSM data in infectious diseases has increased rapidly in recent years. As illustrated in Figure 2, the earliest article was published in December 2013, and the latest in February 2022 (as of March 1, 2022). Before 2020, fewer than 10 relevant papers were published each year; however, the number has increased significantly after the COVID-19 outbreak, and most of them are COVID-19-related studies (n = 50). These papers are widely dispersed across various subjects, with the majority in journals related to multidisciplinary sciences, medicine, environment and geography, and computer science. Thus, GSM data is expected to remain a focus in geography, public health, and computer science in the coming years. In addition, most studies were undertaken in the United States, Europe, and East Asia, with relatively few studies conducted in other regions of the world.

Research characteristics
Figure 3 summarizes the characteristics of included papers in terms of information type, infectious disease type, and application domains. The information extracted from GSM data in these papers can be classified into three categories: contextual information, movement information, and spatial-social network information. Specifically, contextual information includes content-based information such as texts, images, and videos posted on social media platforms; movement information refers to information about human mobility that is reflected in GSM data; and network information includes spatial connectivity or social connectivity extracted from GSM data (e.g. social connectedness between people from Facebook (Holtz et al. 2020) and place connectivity between places from Twitter (Li, Huang, Ye, et al. 2021)). The three types of information were used in these studies across 12 different infectious diseases. As illustrated in Figure 4, the majority of studies focused on COVID-19 (n = 50), followed by influenza (n = 12) and HIV (n = 9), as well as other STIs (e.g. syphilis and gonorrhea) and other emerging infectious diseases (e.g. Zika, MERS, and Ebola). The distribution of papers on infectious diseases is generally determined by the scale and severity of the epidemic. The number of publications may also be affected by the epidemic cycle. For example, although the mortality of Ebola is high, the number of publications on Ebola is not very high since this epidemic was short-lived. More scientific evidence is required to explain this correlation.
From the perspective of public health implications, these studies are grouped into three independent domains (Figure 3), including surveillance (n = 36), explanation (n = 35), and prediction (n = 21). Studies belonging to each of these domains were subsequently reviewed in depth, including themes, data, methods, and spatial levels.
Several pieces of scientific evidence support our classification. Surveillance is the monitoring of multiple statuses and is a popular domain in public health, with many review studies focusing on the surveillance of infectious diseases (Lee et al. 2016) and public health (Sinnenberg et al. 2017; Jordan et al. 2018; Edo-Osagie et al. 2020). In the current review, surveillance includes publications that use GSM data to monitor the spatiotemporal distribution patterns of infectious diseases, evaluate the effectiveness of various government practices to combat infectious diseases, and assess public attitudes and reactions to infectious diseases.
In terms of explanation, several review papers on geography (e.g. Li, Dong, and Liu 2020), social determinants of infectious diseases (Bishwajit, Ide, and Ghosh 2014; Duarte et al. 2021), and health (Palmer et al. 2019) have involved mechanism/correlation explanations. Explanation studies in the current review are those that investigate the factors that correlate with the prevalence and death of infectious diseases, and the factors that correlate with the response to infectious diseases.
The prediction domain has also been highlighted in several health-related studies (Lee et al. 2016; Edo-Osagie et al. 2020). For instance, Edo-Osagie et al. (2020) defined a domain of forecasting, which includes articles on predicting trends in health-related events. Predictions of infectious disease outbreaks/deaths, as well as predictions on topics related to infectious disease response, are included in the prediction domain of this review.

Surveillance
Infectious disease surveillance is essential for identifying public health threats and developing prevention and control strategies. Key aspects of surveillance are the assessment of spatiotemporal patterns of specific disease transmission and the assessment of responses to public health emergencies caused by infectious diseases, such as individual attitudes, government policies, and vaccine efficacy. In this section, we summarized the efforts in assessing spatiotemporal patterns of and responses to infectious diseases using GSM data in Table 1.
Ebola transmission and people's reactions to Ebola in China (Liu et al. 2016) and across the world (Tran and Lee 2016) were also studied using contextual information. Tran and Lee (2016) analyzed spatiotemporal and social characteristics of Ebola-related tweets (e.g. the distance between Ebola-related tweets; the similarities between the two events; the role of social ties in spreading Ebola-related tweets). Liu et al. (2016) collected two keyword indices from Chinese social media platforms, the Ebola-related Baidu Index (BDI; Baidu is a search company like Google) and the Sina Micro Index (SMI; Sina Micro is a platform like Twitter), and evaluated the public perception of Ebola in China. Their results showed that Ebola-related BDI and SMI trends were highly correlated, but these two indices were not significantly correlated with Ebola disease-related indicators (e.g. case rates and death rates). Other studies assessed the transmission of dengue fever (Nsoesie et al. 2016; Ye et al. 2016), HIV (Cai et al. 2020; van Heerden and Young 2020), and Zika virus (Pruss et al. 2019). Among them, Nsoesie et al. (2016) used machine learning methods to investigate spatiotemporal trends of dengue fever events on Twitter. Cai et al. (2020) collected geolocated tweets related to the HIV outbreak in Indiana to analyze geospatial characteristics of the epidemic. Van Heerden and Young (2020) used a combination of data from Twitter, Instagram, and YouTube to map and examine variations in posts related to HIV across regions of South Africa. Pruss et al. (2019) examined temporal and spatial variation in the tweets related to Zika virus.
Human mobility data derived from social media platforms also contribute to assessing the spread of infectious diseases, including the transmission of COVID-19 and dengue (Kraemer et al. 2018; Chang et al. 2021; Cowley et al. 2021; Shepherd et al. 2021) and the effects of governmental lockdown in response to the pandemic (Huang et al. 2020; Gibbs et al. 2021; Gulnerman 2021; Iranmanesh and Alpar Atun 2021). For example, leveraging Facebook data, genomics, and mobile phone data, Cowley et al. (2021) analyzed the pandemic trajectory and the emergence of variants. Using geolocated tweets from Lahore, Pakistan, Kraemer et al. (2018) explored spatiotemporal variation in dengue transmission and found that the variation was related to patterns of intra-urban human mobility.
Only one study in this domain used social network data (Holtz et al. 2020). Holtz et al. (2020) analyzed the effects of policies in one area on human movement and social distancing in other areas, and the effects of adopting disjointed local policies in the context of such spillover effects, utilizing mobile phone data from tens of millions of devices and social network connections from millions of Facebook users. The result revealed that policies in one area could affect residents' contact patterns in other areas.
Using multiple methods, researchers extracted various types of information from social media data to assess the patterns of the epidemics and their related issues over time, at various spatial levels, such as neighborhood (Legeby et al. 2022), county (Guntuku et al. 2021), state (Hu et al. 2021), and national levels (Tran and Lee 2016). These methods include emotion analysis (Hu et al. 2021), content analysis (Nsoesie et al. 2016), topic modeling (Tran and Lee 2016), mobility and network analysis such as social network analysis (Hung et al. 2020), and spatial analysis such as geospatial clustering analysis (Cuomo et al. 2021) and kernel density analysis (Peng et al. 2020). These studies provide essential scientific support for the government's prompt response to infectious disease outbreaks.

Explanation
To predict and prevent the spread of infectious diseases, it is necessary to understand the spatiotemporal mechanisms of disease transmission and to examine the correlations between disease-related social media posts and infection cases. In this section, we reviewed studies that explored the correlations between infectious disease-related social media posts and infectious disease outbreaks/deaths, and the associations between socioeconomic and environmental factors and diseases (Table 2).
Many studies used contextual information to investigate the relationships between disease-related posts on social media and disease cases/deaths. The words and behaviors of users on social media platforms may be related to individual or regional physical/mental illness and attitude. For infectious diseases, the number of social media posts related to COVID-19 cases is associated with actual COVID-19 cases (Shen et al. 2020; Cuomo et al. 2021); tweets related to COVID-19 deaths are associated with actual COVID-19 deaths (Dahal et al. 2021; Turiel, Fernandez-Reyes, and Aste 2021); and COVID-19 vaccine tweets are linked to the actual COVID-19 vaccination rate (Chan, Jamieson, and Albarracin 2020). Research on other infectious diseases focused on the associations between cases/new diagnoses and related social media posts for HIV (Young, Rivers, and Lewis 2014; Ireland et al. 2016; Nielsen et al. 2017; Li, Qiao, et al. 2021), dengue fever (Puspitasari, Ariful, and Nuqoba 2021), syphilis (Young et al. 2018), and influenza (Broniatowski, Paul, and Dredze 2013; Allen et al. 2016; Huang et al. 2019). These studies identified a strong correlation between disease-related social media posts and actual outbreaks of these diseases by analyzing data at different spatial scales (i.e. county and state). The social media data used were dominated by Twitter and Chinese Sina Weibo.
Using contextual information, researchers also investigated the socioeconomic and demographic determinants of COVID-19-related issues, such as public concerns about the pandemic (Scotti et al. 2020; Su et al. 2021), pandemic severity (Li et al. 2020) [...] (2020) investigated whether socioeconomic and epidemiological variables influenced feelings on Twitter during the pandemic and revealed that areas with a higher mortality rate and a higher poverty level are associated with more negative emotions. In China, researchers investigated the impact of urban environment on pandemic severity in Wuhan, using Weibo-based COVID-19 case data as the outcome variable (Li et al. 2020).
Aside from COVID-19, socioeconomic and demographic factors correlated with the prevalence of other infectious diseases were investigated using contextual information. For example, scholars explored spatiotemporal trends in dengue events reported on Twitter compared to confirmed cases and analyzed the associations with sociodemographic factors at the municipality level (Nsoesie et al. 2016). Using influenza-related tweets, researchers found more tweets about the influenza vaccine among women than men (Huang et al. 2019). Using Zika-related tweets in the travel context, researchers compared the differences in the network, demographic, and language characteristics between users who modified their behavior and the control group (Daughton and Paul 2019). HIV has been a prominent topic in research on sexually transmitted diseases. Twitter activity among young men was found to be associated with overall HIV prevalence (Stevens et al. 2020). In another study (Chan et al. 2021), the number of men who have sex with men (MSM) appeared to be linked to the HIV-related tweets in a region, the in-person communications about preexposure prophylaxis (PrEP), and, ultimately, actual PrEP use.
Indicators derived from the movement information were also used as social and environmental factors to assess their impacts on infectious diseases. For example, a range of mobility pattern indicators derived from Twitter, Facebook, and Google Trends have been developed, which serve as proxies for human movement that drives the transmission of infectious diseases. Using Twitter-based population mobility data, Zeng et al. (2021) illustrated a significant link between population movement and COVID-19 outbreaks. Ascani et al. (2021) found a positive relationship between mobility within labor market areas and COVID-19 excess mortality in Italy using Facebook data as a proxy of individual mobility. Using daily Community Mobility Reports data from Google as a proxy for social distancing and geotagged messages from Twitter to capture beliefs at the state level, Porcher and Renault (2021) observed that an increase in the Twitter index of social distancing on one day was associated with a decrease in mobility on the day after.
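A next-day association of the kind described above can be probed with a lagged least-squares fit, in which the index on day t is paired with mobility on day t + 1. The sketch below is only an illustration of that idea: it is not the model used in the cited study, and the function names and the toy series are hypothetical.

```python
def lagged_pairs(leading, following, lag=1):
    """Pair leading[t] with following[t + lag] for every t where both exist."""
    if len(leading) != len(following):
        raise ValueError("series must have the same length")
    if lag < 1 or lag >= len(leading):
        raise ValueError("lag must be between 1 and len(series) - 1")
    return list(zip(leading[:-lag], following[lag:]))

def ols_slope_intercept(pairs):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    if sxx == 0:
        raise ValueError("x values are constant; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Toy daily series (made up): mobility on each day equals 100 - 2 * index of the day before.
distancing_index = [1, 2, 3, 4, 5, 4, 3]
mobility = [100, 98, 96, 94, 92, 90, 92]

slope, intercept = ols_slope_intercept(lagged_pairs(distancing_index, mobility, lag=1))
print(slope, intercept)  # a negative slope means higher index, lower next-day mobility
```

A real analysis would also control for confounders such as day of week and policy changes, which this two-variable fit ignores.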
Network information derived from social media data was used to explain infectious diseases and related topics. The social connectivity index derived from Facebook (Fritz and Kauermann 2022) and the place connectivity index derived from Twitter (Li, Huang, Ye, et al. 2021) and Facebook (Fan et al. 2021) have been used as environmental factors that influence disease transmission. For example, using data from Facebook activities, researchers explored the impact of human movement and social connectivity on the new COVID-19 infections in Germany (Fritz and Kauermann 2022). Also using Facebook data, Fan et al. (2021) examined the impact of population co-location reduction on COVID-19 cross-county transmission risk using cross-county human co-location data. Using the place connectivity index derived from billions of geotagged tweets, Li, Huang, Ye, et al. (2021) revealed a significant impact of place connectivity on COVID-19 case infections.
Studies involving explanations have employed spatial and non-spatial statistical methods such as geographically weighted regression (Forati and Ghose 2021), negative binomial models (Stevens et al. 2020), OLS regression (Turiel, Fernandez-Reyes, and Aste 2021), and correlation analysis (Dahal et al. 2021), as well as machine learning methods such as random forest (RF) (Li, Yang, et al. 2021). The majority of studies were conducted at the county level, with a few at the state and other geographic levels. These studies have investigated some key research questions, such as the relationship between users' social media posts and actual relevant cases based on different spatial scales and the environmental and socioeconomic determinants of infectious diseases. These studies have deepened our understanding of infectious disease transmission processes and their risk factors, as well as the connection between social media use and public health problems.

Prediction
Infectious disease prediction is a crucial part of infectious disease prevention. Depending on the prediction results, countermeasures can be taken in advance for prevention. Scholars have used GSM data to forecast infectious disease outbreaks and the related issues caused by epidemics/pandemics (such as the pressure on the medical system and the consequences of travel restrictions), with various spatiotemporal resolutions (Table 3).
Using contextual information, predictions of COVID-19 outbreaks (Fakhry, Asfoura, and Kassam 2020) and related health system overloads (Rivieccio et al. 2021) are active research topics. Following state-level Twitter-based COVID-19 information, Fakhry, Asfoura, and Kassam (2020) predicted current and future COVID-19 cases based on a dual machine learning approach. To identify predictors of possible new health system overloads, Rivieccio et al. (2021) analyzed data from Twitter and emergency services. GSM data were also widely utilized to predict influenza-like illness (ILI) (Nagar et al. 2014; Wang et al. 2016; Elkin, Topal, and Bebek 2017; Kandula, Hsu, and Shaman 2017; Lu et al. 2018; Wakamiya, Kawai, and Aramaki 2018; Chen et al. 2019; Wang et al. 2020). For instance, using Twitter data and the partial differential equation (PDE) model, Wang et al. (2020) predicted the regional influenza outbreak. Using Google search trends, Kandula, Hsu, and Shaman (2017) predicted the subregional nowcasts of seasonal influenza. Using H7N9-related Baidu Search Index (BSI) and Weibo Posting Index (WPI) data, Chen et al. (2019) predicted H7N9 cases in China based on seasonal autoregressive integrated moving average models. Scholars have also used sexually transmitted disease (STD)-related tweets to predict diagnoses of STDs such as HIV, gonorrhea, and chlamydia (Chan et al. 2018; Adnan et al. 2020). Adnan et al. (2020) evaluated the temporal predictive strength for Campylobacter from five data sources including consumer helpline, general practice consultations, Google Trends, tweets, and school absenteeism. Their results showed that models using tweets and Google Trends can provide better prediction performance in early outbreaks compared to conventional data sources.
Using movement information, Twitter-based population mobility data were used to predict dengue (Ramadona et al. 2019) and COVID-19 cases (Bisanzio et al. 2020; Cowley et al. 2021; Zeng et al. 2021; Lucas, Vahedi, and Karimzadeh 2022). For example, Ramadona et al. (2019) combined human mobility derived from neighborhood-level geotagged tweets and incidence data to predict intra-urban dengue transmission in Indonesia. Zeng et al. (2021) forecasted daily new cases of COVID-19 at state and county levels in South Carolina, US, using a Twitter-derived mobility index. Similarly, Bisanzio et al. (2020) predicted the global spread of COVID-19 during the early outbreak period using Twitter activity as a proxy indicator of human mobility. Regarding network information, the outbreak of dengue activity was predicted by integrating satellite imagery, weather data, clinical data, Twitter-based connectivity, and census data (Castro et al. 2021).
Several studies have integrated human movement and network information (Chang et al. 2021; Vahedi, Karimzadeh, and Zoraghein 2021; Lucas, Vahedi, and Karimzadeh 2022). Utilizing connectivity and human movement information derived from Facebook and cellphone data, as well as infection rates and socioeconomic compositions of counties as predictive features, Vahedi, Karimzadeh, and Zoraghein (2021) predicted the spatial and temporal patterns of COVID-19 cases in the contiguous US. Lucas, Vahedi, and Karimzadeh (2022) used weekly new cases as temporal features and Facebook movement and connectedness as spatial features to predict the county-level spread of COVID-19 cases in the US. Using Facebook movement data and colocation data, Chang et al. (2021) assessed the potential impact of COVID-19 intracity/intercity travel restrictions at multiple geographic levels in Taiwan and found intercity travel restrictions could reduce the outbreak's scope.
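One simple way to combine mobility and incidence data of this kind is to weight the incidence of each connected area by the share of observed trips going there. The sketch below illustrates that general idea only; the formula, the function name, and the toy numbers are assumptions and should not be read as the index defined in any of the cited studies.

```python
def mobility_weighted_incidence(flows, incidence):
    """Average the incidence of visited areas, weighted by the share of trips from each origin.

    flows[i][j] is the number of trips observed from area i to area j
    (for example, counted from consecutive geotagged posts by the same user);
    incidence[j] is the reported incidence in area j.
    """
    index = {}
    for origin, destinations in flows.items():
        total_trips = sum(destinations.values())
        if total_trips == 0:
            index[origin] = 0.0  # no observed trips, so no imported risk is estimated
            continue
        index[origin] = sum(
            trips / total_trips * incidence[dest]
            for dest, trips in destinations.items()
        )
    return index

# Toy inputs (made up): three areas with trip counts and incidence per 10,000 residents.
flows = {"A": {"B": 30, "C": 10}, "B": {"A": 5, "C": 5}, "C": {}}
incidence = {"A": 2.0, "B": 8.0, "C": 4.0}
print(mobility_weighted_incidence(flows, incidence))
```

In this toy case, area A scores higher than area B because most trips from A go to the area with the highest incidence.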
GSM data could aid in achieving acceptable infectious disease predictions and enhance our knowledge of the disease transmission direction and risk. Several studies using these alternative data sources outperformed traditional data in terms of prediction performance. For example, Ramadona et al. (2019) found that a mobility-weighted incidence index derived from Twitter outperformed conventional mobility and neighborhood centrality in predicting dengue risk; Vahedi, Karimzadeh, and Zoraghein (2021) showed that the model using the social connectedness index outperformed the base model without that social media-based index, with a 6.46% improvement in the mean absolute error (MAE) over the two-week prediction. Other studies also reported improved accuracy, including COVID-19 prediction (Lucas, Vahedi, and Karimzadeh 2022), influenza prediction (Elkin, Topal, and Bebek 2017; Lu et al. 2018), and Campylobacter prediction (Adnan et al. 2020).
Validation is an essential component of prediction. Many studies validated their model performance by comparing predicted outcomes with actual disease cases (Wang et al. 2016; Kandula, Hsu, and Shaman 2017; Chan et al. 2018; Lu et al. 2018; Wakamiya, Kawai, and Aramaki 2018; Adnan et al. 2020; Vahedi, Karimzadeh, and Zoraghein 2021), using a range of validation metrics such as Pearson correlation coefficient (COR), root mean square error (RMSE), mean absolute error (MAE), and mean absolute percentage error (MAPE). After evaluating model performance using out-of-sample data from four influenza seasons between 2012 and 2016, Lu et al. (2018) validated their model performance using real data from 2016 to 2017. The COR between the out-of-sample influenza activity estimates and officially reported influenza activity was 0.98. Kandula, Hsu, and Shaman (2017) validated their model estimates using state-level ILI counts from the 2005 to 2010 seasons provided by CDC, achieving good predictive performance with a COR of 0.84, RMSE of 1.01, and MAPE of 0.83.
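The four validation metrics named above follow standard definitions and can be computed as in the minimal sketch below. The function names and the toy series are illustrative; MAPE is returned here as a percentage, which is an assumption, since the reviewed studies may report it on a different scale.

```python
from math import sqrt

def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean square error."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def mape(actual, predicted):
    """Mean absolute percentage error, in percent; undefined when an actual value is zero."""
    if any(a == 0 for a in actual):
        raise ValueError("MAPE is undefined when an actual value is zero")
    return 100 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)

def cor(actual, predicted):
    """Pearson correlation coefficient between observed and predicted values."""
    n = len(actual)
    mean_a, mean_p = sum(actual) / n, sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    if var_a == 0 or var_p == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_a * var_p)

def relative_improvement(baseline_error, model_error):
    """Percentage reduction in an error metric relative to a baseline model."""
    return 100 * (baseline_error - model_error) / baseline_error

# Toy weekly case counts (made up) and the corresponding model predictions.
observed = [10, 20, 30, 40]
forecast = [12, 18, 33, 37]
print(mae(observed, forecast), rmse(observed, forecast),
      mape(observed, forecast), cor(observed, forecast))
```

`relative_improvement` expresses a comparison between two models, such as a model with and without a social media-based feature, as a percentage reduction in error.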

Discussion
Massive social media data provide unprecedented opportunities for health research to gain insights that were previously unavailable or hard to obtain from traditional data sources. This review systematically examines the use of geospatial social media (GSM) data in infectious disease research. We found that the number of relevant publications has increased over time, covering a wide range of topics involving 12 infectious diseases. We classified these studies into three domains, namely surveillance, explanation, and prediction of infectious diseases, and thoroughly reviewed all articles in each domain. We observed that GSM data has been widely and deeply applied to these domains, particularly in surveillance and explanation, using various statistical and spatial methods at diverse geographic levels. Following these results, this section discusses the knowledge gaps in this area in terms of data extraction, research topic, spatial analysis scope, and data quality, and proposes innovation opportunities for future research.

Data extraction: underuse of social media contextual information
While many studies used social media data with contents and geographic location, none of the articles in our review addressed the demographic information of social media users. Users' demographic information is essential for determining the impact of individual-level factors on infectious disease cases and deaths, but it is not directly available from the data. One approach is to use machine learning techniques to extract such information. Studies have demonstrated that demographic information such as age, sex, income, education level, marital status, and religious status can be inferred from users' posts using machine learning algorithms (Preoţiuc-Pietro et al. 2015; Poulston, Stevenson, and Bontcheva 2016). With demographic information, a range of new research can be conducted, such as analyzing the racial and socioeconomic disparities in disease cases and deaths and conducting more precise disease prediction.
Images and videos are another type of underutilized contextual information. Some scholars collected images from social media such as Twitter (Chen and Dredze 2018) and Instagram (Seltzer et al. 2017) to investigate public health issues; however, our review found no papers that used images from mainstream social media platforms to study infectious diseases. Aside from texts, images and videos from social media platforms can be utilized to extract socioeconomic and environmental determinants of epidemics. With image analysis, for example, elements of the built environment can be measured via deep learning algorithms such as Fully Convolutional Networks (FCN) and Convolutional Neural Networks (CNN) (Gebru et al. 2017; Helbich et al. 2019), and perceptions of the built environment can be evaluated.
According to our review, the underutilization of contextual information is also manifested by the fact that the indicators derived from contextual information may not be accurate. First, in some studies, the keywords of the HIV risk behavior index do not include slang terms depicting sex-related and drug-related behaviors (Li, Qiao, et al. 2021). Second, in most cases, emoji and function words in tweets are unprocessed (Chen et al. 2021). Due to the word limit of posts on Twitter and Sina Weibo, abbreviated words and acronyms appear frequently (Huang et al. 2022), and the majority of these words are also left unprocessed. Third, although English is commonly used on Twitter, not all conversations about infectious diseases use English. A 2018 report showed that the top three languages in monthly tweets were English (31.8%), Japanese (18.8%), and Spanish (8.46%) (Vicinitas 2022). Most studies excluded non-English tweets, restricting the ability to capture the Internet discourse of people who speak other languages (Stevens et al. 2020). In the few studies dealing with multilingualism, the corresponding tweets were analyzed using naïve translation methods (Zhang et al. 2021). Sentiment and emotion differ across languages. Direct translation of tweets may overlook such differences, and current sentiment and emotion analysis techniques do not adequately address this issue (Huang et al. 2022). In light of this, more work could focus on improving text classification models (e.g. using more advanced classifiers), exploring how to extract slang, emoji, abbreviations, and function words, and analyzing multiple languages.
In addition, the sample size and methodological restrictions in sentiment analysis may hinder the extraction of useful contextual information. Many articles have conducted sentiment analysis, but they are incapable of fully capturing nuances between similar tweets (Surano, Porfiri, and Rizzo 2022). Furthermore, many common action words (e.g. 'work' or 'exercise') tend to evoke mixed feelings (Ireland et al. 2016), but existing emotion analysis cannot extract these feelings accurately. More precise sentiment analysis can help assess a wider spectrum of sentiments and provide a clearer picture of how the general public perceives infectious diseases. Future research should expand the sample size to reduce sentiment uncertainty and explore more diverse sentiment dimensions (Hu et al. 2021) and more precise topic modeling methods using advanced natural language processing technologies such as deep learning or artificial intelligence.

Research topic: inadequate breadth and depth
From COVID-19 for respiratory diseases to gonorrhea for sexually transmitted diseases, infectious diseases encompass a wide range of illnesses. While the reviewed publications cover 12 infectious diseases (Figure 4), some infectious diseases with significant public health implications were overlooked. For example, with the prevalence of COVID-19, many disciplines have been dedicated to studying COVID-19 (n = 50), whereas other infectious diseases with high mortality rates, such as tuberculosis and malaria, and those that affect small populations or areas, such as Hepatitis B virus (HBV), were less frequently linked to social media data.
In the following, we discuss the current application limitations and future directions of each of the three domains. For surveillance research, a major limitation is the use of a single social media data source such as Facebook or Twitter, which represents only a portion of the population and may not reflect the spatiotemporal patterns of the general population. People's reactions and attitudes to the same epidemic may differ depending on their social backgrounds (Fan et al. 2021). To reduce bias and improve the accuracy of the assessment, more diverse datasets such as Electronic Health Record (EHR) data, mobile phone data, or survey data need to be used.
For explanation research, some critical topics were not discussed in the selected papers, such as the relationship between Internet language use, individual behavior, and infectious diseases (e.g. the relationship between condom-related language use, protective sex behavior, and HIV-positive rates). The relationship between connectivity and COVID-19 cases was presented in some articles (Li, Huang, Ye, et al. 2021), but the differences in connectivity within or between states in relation to COVID-19 cases, and how these relationships vary by socioeconomic characteristics (e.g. concentrated disadvantage and urbanicity), remain to be examined. Future studies could further explore the potential mechanisms on these topics, for example, by exploring various mediation and moderation effects of social media-based environmental factors on infectious disease outbreaks. Moreover, there are several other issues worth investigating. People often choose Twitter to get in touch with their friends and acquaintances, and cross-regional impact may also be important and worth considering in the future (Chan et al. 2021). Infectious diseases can be linked to other health issues, such as physical and mental health, and further research is required to better understand the association between infectious diseases (and the associated emotions) and mental health. Specifically, potential research topics include the influence of COVID-19 on tweet-based mental health, the relationship between epidemic risk perception and people's mental health, the relationship between risk perception, social support (or other community factors such as social capital and social disorder) and mental health, and the relationships between epidemics, epidemic risk perception, and depression. Lastly, we would like to note that most studies in the explanation domain attempted to identify correlations. Without individual-level and longitudinal data, the potential causal relationships cannot be revealed.
Compared with the other two domains, prediction using social media data has received the least attention, with only 21 publications. One limitation of prediction is that model performance on past data cannot guarantee future performance. Future research could use updated data to train the model and incorporate other key factors such as weather patterns and traffic systems.

Spatiotemporal analysis: insufficient attention to the spatiotemporal dimension
The majority of the reviewed literature is at spatial levels such as county (n = 27) and state/provincial level (n = 28), rather than neighborhood/community level (n = 7) and country level (n = 3). The lack of neighborhood-level studies is likely due to the scarcity of data at such levels. A recent study by Li, Huang, Ye, et al. (2021) reveals that a majority of the worldwide geotagged tweets (79%) are geotagged at the city level, and tweets geotagged at the neighborhood level or lower only account for 7.9%. Since neighborhood-level analysis could help reduce bias caused by the aggregation of spatial scales, future work could explore ways to increase sample size using machine/deep learning and natural language processing methods to infer geolocations from texts by geoparsing place names (Gelernter and Balaji 2013; Wang and Hu 2019). In addition, future research could also leverage the rich county level data to explore innovative and significant research topics, such as the association between global climate change, political environment, and infectious diseases, since infectious diseases often spread across countries. Regarding the study regions, most of the studies are in North America, Europe, and East Asia. Future research should pay more attention to Africa and South America because these regions are home to low-income countries, which are more vulnerable to infectious disease outbreaks.
The use of social media platforms is also not uniform across time, and related indicators such as sentiment scores extracted from social media data are thus not uniform across time (Padilla et al. 2018). Ignoring the time factor limits the interpretability of some research findings. For instance, many studies in explanation research use cross-sectional data, which limits the ability to provide theoretical explanations and infer causal mechanisms. More longitudinal data is necessary to investigate potential causal pathways, such as clarifying the link between the use of risk-behavior words on Twitter and changes in HIV transmission across time.
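The idea of inferring locations from post text by geoparsing place names can be illustrated with a deliberately simple sketch. The cited studies rely on machine learning and natural language processing; the example below swaps in a plain gazetteer lookup with word-boundary matching, which is only a baseline. The `GAZETTEER` dictionary, its three entries, and the function name `geoparse` are hypothetical choices made for illustration.

```python
import re

# Hypothetical toy gazetteer: lower-cased place name -> (latitude, longitude).
# A real application would load a full gazetteer and handle ambiguous names.
GAZETTEER = {
    "new york": (40.7128, -74.0060),
    "columbia": (34.0007, -81.0348),
    "wuhan": (30.5928, 114.3055),
}

def geoparse(text, gazetteer=GAZETTEER):
    """Return (place, lat, lon) for each gazetteer name found in the text.

    Results are ordered by where the name first appears in the text.
    """
    lowered = text.lower()
    matches = []
    for name, (lat, lon) in gazetteer.items():
        # Word boundaries avoid matching a place name inside a longer word.
        found = re.search(r"\b" + re.escape(name) + r"\b", lowered)
        if found:
            matches.append((found.start(), name, lat, lon))
    matches.sort()
    return [(name, lat, lon) for _, name, lat, lon in matches]

if __name__ == "__main__":
    print(geoparse("Flu cases rising in Wuhan and New York"))
```

A lookup of this kind cannot resolve ambiguous toponyms (many places share the name Columbia) or informal spellings, which is one reason learned geoparsers are used in practice.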

Data quality: inherent flaws of social media data

Population bias
Social media data suffers from a number of inherent flaws. First, social media users are not strictly representative of the general population (with regard to their socioeconomic and demographic status) or of people at risk of pathogen exposure (Adnan et al. 2020; Guntuku et al. 2021). For example, the majority of Twitter users are in specific age ranges (18-29 and 30-39) and are technologically savvy (Huang and Wong 2016; Jiang, Li, and Ye 2019). Facebook users are also mostly between the ages of 25 and 34 (Statista 2022a), and users of the Chinese social media platform Sina Weibo are young and well-educated (Statista 2022b). The use of geotagged social media data may exacerbate the population bias. For example, only around 1% to 2% of tweets are geotagged (Hong et al. 2012; Cuomo et al. 2021), and those geotagged tweets tend to be concentrated in certain populations, such as younger people and females who like to tweet with geographic location (Feng and Kirkley 2021). In addition, a small group of users may produce the majority of posts, as 80% of tweets are posted by the top 10% of most active users (Wojcik and Hughes 2019). With such a potentially uneven sampling distribution and nonrepresentative demographic distribution of the population, the relevant indices and findings may be biased. For instance, we cannot collect topic content and sentiment from those who do not use social media applications to express their opinions and comments, so we must be more cautious about generalizing the results. Furthermore, geotagged tweets generally over-represent the population in urban areas relative to rural areas. Since epidemic response differs between urban and rural areas, many research findings based on social media data may be restricted to urban populations.
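The concentration of posts among a few very active users can be checked directly on a collected dataset before any index is computed. The sketch below is a minimal illustration, assuming the dataset provides one user identifier per post; the function name `top_user_share` and the toy data are hypothetical.

```python
from collections import Counter

def top_user_share(user_ids, top_fraction=0.10):
    """Share of posts written by the most active `top_fraction` of users.

    `user_ids` is a list with one user identifier per post.
    """
    counts = Counter(user_ids)
    if not counts:
        raise ValueError("no posts supplied")
    # Number of top users, rounded to the nearest whole user, at least one.
    n_top = max(1, round(len(counts) * top_fraction))
    top_posts = sum(count for _, count in counts.most_common(n_top))
    return top_posts / len(user_ids)

if __name__ == "__main__":
    # Toy data: one user writes 11 posts, nine other users write 1 post each.
    posts = ["u0"] * 11 + [f"u{i}" for i in range(1, 10)]
    print(top_user_share(posts))
```

A high share suggests that aggregate indicators may reflect a handful of accounts rather than the wider user base; one possible response, stated here only as an option, is to aggregate per user before aggregating per region.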

Digital divide
The digital divide refers to inequality in access to the Internet and Information and Communications Technology (ICT). This disparity grows when regions are considered. According to the Statista report (Statista 2022c), by the end of 2021, only 40% of Africans had Internet access, compared to more than 80% of Europeans and people from the U.S. The following two factors may be at the root of this divide: (1) inequalities in access to the Internet and ICT, as the cost of Internet access varies greatly among socioeconomic groups and countries; and (2) inequalities in people's digital skills. Immigrant populations, for example, face barriers to telemedicine adoption that are not always related to Internet access (Wang, Do, and Wilson 2018), and may be due to a lack of digital skills, which impedes the smooth use of various social media software (Ramsetty and Adams 2020). For infectious disease research, the digital divide makes it more difficult to track data on vulnerable groups, and relevant research is limited or ignored. The digital divide also makes it challenging to investigate some of the factors associated with infectious diseases. For example, during the pandemic, the opinions of vulnerable groups in some areas may not be sufficiently expressed on social media, resulting in a low number of relevant posts in that area being associated with high infection rates.

Misinformation
Social media narratives may contain misinformation about disease transmission, influencing the validity of findings. Misinformation has a significant impact on public awareness. For example, social media users can falsify flu-related posts to attract more attention from authorities and obtain additional assistance such as vaccines and medical supplies (Allen et al. 2016). Google search patterns and social media-based medical visits may reflect media reports and the perception of the situation instead of the actual influence of the epidemic. Most studies did not engage with social media users to confirm whether they were actually taking a specific medicine or to confirm their self-reported risk behaviors or health status, so social media data may not accurately reflect the dynamics of infectious diseases (Li, Qiao, et al. 2021). In addition, in the event of a large-scale outbreak, the response of the local media may differ from that in areas with a small number of cases. Thus, media attention is likely to significantly influence predictions (Adnan et al. 2020). Censorship may also have an impact on the research findings. Some studies suggest that COVID-19 cases reported by the Chinese Center for Disease Control and Prevention may be underestimated due to limited testing capacity, the presence of asymptomatic carriers, and Internet censorship (Shen et al. 2020). Also, a small proportion of geotagged tweets reported fake geolocation information (Xu, Dredze, and Broniatowski 2020). To obtain more authentic data, fact-checkers, social media companies, news media, professional organizations, and authorities need to coordinate efforts to control the spread of misinformation about infectious diseases, so that scientific information can be released and public awareness of the pandemic can be improved (Forati and Ghose 2021).

Limited data sharing
Sharing social media research datasets enables reproducibility, replicability, and comparability (Kinder-Kurlanda et al. 2017), and helps provide a direct data source for scientific studies. However, most studies in our review did not share datasets due to legal restrictions, data sharing policies, format, storage issues, and conflicts of interest, even though most of their data originate from open-access social media platforms. From the data provider perspective, data sharing policies of social media platforms are constantly changing. For example, access to large-scale geospatial data from Chinese Sina Weibo is becoming increasingly difficult. With the acquisition of Twitter by Elon Musk, the future of Twitter's data sharing policies is unclear.
To overcome these limitations, it is critical to explore, assess, and integrate new types of data sources in infectious disease studies. The advancement of information and communication technologies has enabled a plethora of new types of data, including EHR data, mobile phone data, street view data, credit card data, and traffic data (Jing et al. 2022). For instance, mobile phone data can be used to measure population movement more accurately by covering larger populations with finer geographic resolutions. In addition, major social media platforms have undergone rapid evolution in recent years. For example, young people have shifted their preference from traditional platforms such as Facebook and Twitter to newer apps like Snapchat and TikTok (Mittmann et al. 2022). The incorporation of these new data sources has the potential to reduce data bias as well as broaden the scope of infectious disease research.

Limitations of the current study
The current review has several limitations in terms of article selection. First, the inclusion/exclusion criteria influence the result of article selection, which may lead to selection bias in the articles included in the review. Second, we searched only two main databases, leaving out other related databases such as Scopus, PsycInfo, and IEEE Xplore. Third, while we compiled an extensive keyword list for the major infectious diseases, some keywords for uncommon infectious diseases may not be included. Fourth, we did not review non-English publications, which may have resulted in the omission of important papers published in other languages, such as Chinese, German, and Japanese. Lastly, publications in this field increased rapidly during the COVID-19 outbreak, and our review did not include relevant papers published after the time of our data collection.

Conclusions
Social media data has become increasingly important in the field of health and healthcare in recent years. This article conducted a systematic review of the use of GSM data in infectious diseases. We began by providing an overview of current publications and research characteristics, and then synthesized the use of GSM data in three domains: surveillance, explanation, and prediction. We further discussed the research gaps and proposed new research opportunities regarding information extraction, research topic, spatial analysis, and data sources. With the increasing availability of social media data, as well as the advancement of machine learning and artificial intelligence, future research can expand current applications to advance our understanding of infectious diseases and human health.

Figure 1. Flow diagram for article identification and selection.

Figure 2. Number of publications on the use of GSM data on infectious diseases (as of March 1, 2022).

Figure 3. Summary of the characteristics of the included papers.

Figure 4. Number of publications by infectious diseases.

Table 1. Related articles involving the surveillance.

Table 2. Related articles involving the explanation.

Table 3. Related articles involving the prediction.