Revealing the linguistic and geographical disparities of public awareness to Covid-19 outbreak through social media

ABSTRACT Covid-19 has presented an unprecedented challenge to public health worldwide. However, residents in different countries showed diverse levels of Covid-19 awareness during the outbreak and suffered from uneven health impacts. This study analyzed the global Twitter data from January 1st to June 30th, 2020, to answer two research questions. What are the linguistic and geographical disparities of public awareness in the Covid-19 outbreak period reflected on social media? Does a significant association exist between the changing Covid-19 awareness and the pandemic outbreak? We established a Twitter data mining framework calculating the Ratio index to quantify and track awareness. The lag correlations between awareness and health impacts were examined at global and country levels. Results show that users presenting the highest Covid-19 awareness were mainly those tweeting in the official languages of India and Bangladesh. Asian countries showed more disparities in awareness than European countries, and awareness in Eastern Europe was higher than in central Europe. Finally, the Ratio index had high correlations with global mortality rate, global case fatality ratio, and country-level mortality rate, with 21-31, 35-42, and 13-18 leading days, respectively. This study yields timely insights into social media use in understanding human behaviors for public health research.


Introduction
The outbreak of the novel coronavirus, known as Covid-19, has profoundly impacted human society. In 2020, Covid-19 had infected more than 83.48 million people and caused nearly 1.82 million deaths in 191 countries and regions (Dong, Du, and Gardner 2020). Since the Covid-19 outbreak, governments worldwide have implemented several measures requiring or suggesting residents to wear masks, keep social distancing, or stay at home to control the spread of the coronavirus. However, residents in different countries showed diverse levels of awareness of Covid-19 and relevant policies during the outbreak, and suffered from uneven health impacts, including unequal infection, fatality, and recovery rates (Gollust et al. 2020;McCaffery et al. 2020;Hu et al. 2020;Saad, Hassan, and Zaffar 2020). Whether the changes and disparities in public awareness led to different responding behaviors and thus affected the pandemic's health impacts is unknown and needs to be investigated.
However, continuous long-term and near-real-time data describing disparities in public awareness and responses cannot be obtained through traditional survey methods, especially during the pandemic. With the development of Web 2.0 and GNSS-enabled portable devices, social media platforms, e.g. Twitter, Facebook, and Instagram have become increasingly popular worldwide for sharing feelings and discussing 'what's happening'. Data collected from such platforms provide an emerging channel to observe real-time human responses to different topics and events (Zou et al. 2018). During Covid-19, many people's social lives have shifted from in-person to online to stay connected while maintaining social distancing, and they spent more time sharing their experiences, concerns, and feelings toward topics relevant to Covid-19 on social media (Alqurashi, Alhindi, and Alanazi 2020;Chen, Lerman, and Ferrara 2020;Lopez, Vasu, and Gallemore 2020). As a result, the extensive social media data generated during the pandemic offer an innovative opportunity to observe the public reactions to Covid-19 in near-real-time.
Nevertheless, studying human behaviors at different locations during the pandemic and drawing scientific conclusions from social media data are challenging. First, social media data contain a large amount of noisy information irrelevant to Covid-19. It is difficult to accurately collect pandemic-related messages from the big social media database. Second, the age distribution of social media users is biased towards younger people, which is different from the actual population. User behaviors reflected on social media may not represent behaviors of the general population (Sloan et al. 2015). Third, many social media platforms support multiple languages, and identifying and analyzing coronavirus-related messages of different languages necessitates advanced natural language processing (NLP) models (Lopez, Vasu, and Gallemore 2020). Finally, social media users are unevenly distributed across space, and only a small proportion of the generated data (around 1-2%) contains precise locations (Graham, Hale, and Gaffney 2014). Social media data need to be associated with geographic contexts and normalized through preprocessing to enable spatial and temporal analytics.
This study analyzed the global Twitter data, referred to as tweets, from January 1st to June 30th, 2020, when Covid-19 developed from a regional epidemic disease to a pandemic causing a global health crisis. The overarching research questions are: What are the linguistic and geographical disparities in public awareness of the Covid-19 outbreak on social media? Is the changing public awareness on social media significantly associated with the pandemic outbreak? To address the research questions, three objectives are proposed and achieved: (1) to establish a social media data mining framework tracking the public awareness of Covid-19 by languages and regions; (2) to quantify disparities of awareness toward the pandemic at multiple spatial and temporal scales; and (3) to examine the lag correlations between Covid-19 awareness and health impacts globally and regionally. One hypothesis is tested: social media-derived public awareness changes are significantly associated with the pandemic outbreak at both global and regional scales. The results can inform governments to mitigate risks from current and future epidemics through social media data analysis.

Social media data mining for Public Health
Social media data contain abundant attributes, e.g. time, geographical information, contents, and relationships among users, providing a brand-new perspective for understanding human behaviors in spatial, temporal, contextual, and network dimensions. Since the emergence of social media, researchers have applied data collected from such platforms to address multiple health-related issues, e.g. obesity, depression, and insomnia (Choudhury et al. 2013;Gore, Diallo, and Padilla 2015;McIver et al. 2015;Sun et al. 2018;Gao et al. 2020;Li, Chaudhary, and Zhang 2020). Sun et al. (2018) collected the obesity data from the Gallup Healthways Wellbeing Survey and the U.S. Centers for Disease Control and Prevention (CDC) and 41 million tweets from 110 major cities in the USA during 2012-2013. They proposed an obesity estimation method via monitoring users' dietary habits, physical activities, emotions, and self-consciousness on Twitter, demonstrating that user activities on online social networks could help evaluate the obesity rate in urban areas. McIver and others (2015) identified a list of insomniacs based on their Twitter messages and investigated their Twitter use behaviors. The results show that insomniacs had fewer followers, expressed lower sentiments, and were less active on social networks compared with other users. Social media data mining also has great potential in addressing mental health issues. A study finds that integrating social media users' activities, e.g. social engagement, linguistic styles, networks, and emotions, could characterize and forecast individual-level depression (Choudhury et al. 2013).
In addition to the above case studies focusing on specific diseases, Paul and Dredze (2014) built a topic model called Ailment Topic Aspect Model (ATAM) to automatically discover health-relevant topics on Twitter without human supervision or a priori knowledge. Culotta (2014) created 160 indexes based on Twitter data and performed regression analysis to predict the county-level statistics of 27 types of health conditions (e.g. obesity, teen births, and diabetes) in 100 counties in the United States. The results show that the predictive models incorporating Twitter-derived variables have higher accuracy for surveying county-level health conditions compared to models based on traditional questionnaires.

Social media and Covid-19
Since the beginning of 2020, researchers from different fields have made enormous efforts to collect and mine social media data during Covid-19 to understand human behaviors during the pandemic and combat it. For instance, several studies collected pandemic-related social media data and shared them through open-source archives (Alqurashi, Alhindi, and Alanazi 2020;Chen, Lerman, and Ferrara 2020;Lopez, Vasu, and Gallemore 2020). Rufai and Bunce (2020) pointed out that Twitter empowered world leaders to exchange Covid-19 information with citizens rapidly. La et al. (2020) analyzed the official Covid-19 news from online newspapers and concluded that immediate policies and supportive public responses are the key epidemic control measures. Sentiment analysis and epidemic-related topics surveillance were essential for understanding social activities during the pandemic (Lwin et al. 2020;Nemes and Kiss 2020;Zhu et al. 2020). Leveraging social media data, worldwide trends of four emotions, fear, anger, sadness, and joy, were examined in Lwin et al. (2020). Their results showed that public feelings shifted strongly from fear to anger from January 28th to April 9th, 2020. Predicting the regional epidemic outbreak from social media activities has also been shown to be feasible in several investigations (Jahanbin and Rahmanian 2020;Qin et al. 2020). Specifically, one study found that the Covid-19 discussion peak on social media occurred 10-14 days earlier than the peak of daily incidences in China. Online social networks were also effective platforms for disseminating rumors or conspiracy theories (Allington et al. 2020;Gruzd and Mai 2020;Tasnim, Hossain, and Mazumder 2020). Shahsavari et al. (2020) investigated the narrative frameworks of fabricating and broadcasting Covid-19 conspiracies to monitor those messages in near real-time.
Although the existing studies generate innovative methods and valuable knowledge in using social media for Covid-19 research, some limitations exist. For example, Covid-19 related social media data analysis across regions is needed to address the inequalities of public awareness and health impacts. Examining the mid-term and long-term Covid-19 social media activities is also critical because Covid-19 is a long-standing epidemic, and short-term analysis is unable to capture how the public perception of Covid-19 changes temporally.

Twitter data
This research collected social media data from Twitter, one of the most popular social media and network platforms with about 340 million users in more than 150 countries. Approximately 500 million tweets were published per day in various languages in 2020, enabling the tracking of global and regional awareness of Covid-19. Since the pandemic outbreak, many Covid-19 Twitter databases became available. For example, the Covid-19 tweets dataset from IEEE DataPort (https://ieee-dataport.org/open-access/coronavirus-covid-19-tweets-dataset) provides about 2 billion Covid-19 tweets in English along with their sentiment scores from January 27th, 2020 to February 23rd, 2021 (Lamsal 2021).
We collected Twitter data from January 1st to June 30th, 2020 from Internet Archive (https://archive.org/), an online free data library that stores around 1 percent of the whole Twitter database, including tweets relevant and irrelevant to Covid-19. Tweets are encoded in JavaScript Object Notation (JSON) format and represented as a list of name-value pairs. Figure 1 shows the Twitter data collection and preprocessing workflow. Five attributes were used in the subsequent analysis, created_at, text, lang, source, and verified, representing each tweet's timestamp, content, language, source such as 'Twitter for iPhone', and user's verified status, respectively.
The first step of Twitter data preprocessing is data cleaning. We utilized the 'source' attribute in Twitter data to remove tweets posted by bots. Specifically, Li et al. (2021) created a list of sources (e.g. sent from iPhone) to help identify human-generated tweets. We checked the 'source' value of each tweet and excluded tweets whose sources were not listed in Li et al. (2021). Tweets from organizational accounts are also irrelevant to public awareness of Covid-19 and should be removed, which can be accomplished by examining the 'verified' attribute in Twitter data. 'Verified' Twitter users are accounts operated by non-general-public, e.g. governments, news organizations, companies, brands, non-profit organizations, sports and gaming leagues, individuals in those organizations and leagues, activists and organizers, content creators, and other influential individuals. Tweets sent from verified users were removed in the data cleaning step.
The second step is identifying Covid-19-related tweets. We selected seven keywords, i.e. covid, virus, ncov, n95, pandemic, pneumonia, and quarantine, based on an overview of previous literature to search for the Covid-19 relevant tweets (Alqurashi, Alhindi, and Alanazi 2020;Chen, Lerman, and Ferrara 2020;Qin et al. 2020). Considering that Twitter supports multiple languages, we translated the seven keywords into different languages by the Google Translate Application Programming Interface (API) to collect the global Covid-19 discussion on Twitter. A total of 65 distinct languages (including English) were detected from the collected Twitter data, and the Google Translate API supports 61 of them. We manually validated the Google API-translated keywords by searching news using each translated keyword and examining if the returned results were relevant to Covid-19. The manual verification confirmed that using selected keywords and their translations could accurately identify Covid-19-related messages on social media.
Any tweet containing at least one of the seven keywords in one of the 61 languages was identified as Covid-19-related. Reposting or quoting Covid-19-related messages on Twitter is a convenient, rapid way to disseminate timely information and express awareness during the pandemic. Therefore, reposted tweets (retweets) and quoted tweets containing Covid-19 keywords were included in Covid-19-related tweets for the subsequent analysis.
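The cleaning and keyword-matching steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the `HUMAN_SOURCES` whitelist is a hypothetical stand-in for the source list of Li et al. (2021), only the seven English keywords are included (the translated keywords would be appended), and the function names are invented for this example. It also assumes the 'source' value may be wrapped in an HTML anchor and that the verified flag may sit either at the top level or inside a nested user object.

```python
import json

# Hypothetical whitelist of human-operated clients (stand-in for Li et al. 2021).
HUMAN_SOURCES = {"Twitter for iPhone", "Twitter for Android", "Twitter Web App"}

# The seven English keywords named in the text; translations would be added here.
KEYWORDS = ["covid", "virus", "ncov", "n95", "pandemic", "pneumonia", "quarantine"]


def source_name(raw_source):
    """Return the client name, stripping an HTML anchor wrapper if present."""
    if ">" in raw_source and "</" in raw_source:
        return raw_source.split(">", 1)[1].rsplit("</", 1)[0]
    return raw_source


def is_clean(tweet):
    """Keep tweets sent from whitelisted clients by non-verified users."""
    verified = tweet.get("verified", tweet.get("user", {}).get("verified", False))
    return source_name(tweet.get("source", "")) in HUMAN_SOURCES and not verified


def is_covid_related(tweet, keywords=KEYWORDS):
    """Case-insensitive substring match; note this also matches e.g. 'antivirus'."""
    text = tweet.get("text", "").lower()
    return any(keyword in text for keyword in keywords)


def ratio_counts(json_lines):
    """Return (number of cleaned tweets, number of Covid-19-related cleaned tweets)."""
    cleaned = covid = 0
    for line in json_lines:
        tweet = json.loads(line)
        if not is_clean(tweet):
            continue
        cleaned += 1
        covid += is_covid_related(tweet)
    return cleaned, covid
```

Retweets and quoted tweets need no special handling in this sketch, since they are counted whenever their text contains a keyword.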
The third step is to determine the location where each tweet was sent from. There are three common metadata sources to geo-reference tweets: geotagged location (tweet from), user profile address (user from), and content mentioned place (tweet about). However, each source has its limitations for this study. First, less than 2% of the Twitter data contain the geotagged locations (Zou et al. 2018), and the proportion of geotagged Covid-19 tweets could be even less, which makes over 98% of tweets unusable. Due to the low use of precision geotagging and the increasing concern for users' privacy, Twitter has gradually removed the function of attaching point coordinates to tweets since June 2019. Second, a high percentage of tweets (40% to 60%) can be linked to county-level locations using user profile addresses. However, it requires the use of geocoding services (e.g. Google or OpenStreetMap Geocoding API) for toponym resolution, which is a time-consuming and computationally intensive task when processing a large amount of data and introduces geocoding uncertainty (Zou et al. 2019;Wang et al. 2021). Third, deriving locations from places mentioned in tweet contents is unreliable because those places do not necessarily reflect the actual locations of Twitter users. Therefore, this study leveraged an alternative metadata source to associate each tweet with a country or a region: tweet language, which Twitter automatically detected and provided in the collected tweets. Users in each country are more likely to post messages on Twitter using their country's official/primary language (Mocanu et al. 2013;Zola, Cortez, and Carpita 2019). Thus, matching the tweet language and the official/primary language of each country could indicate Twitter users' locations and cultural backgrounds. For instance, a tweet written in Thai is considered to be sent from Thailand since it is the only country that uses Thai as its official language.
Among the 61 selected languages, 29 can be paired to a single country (Figure 2) based on the language-country matching list in Table A1. This language-country matching method has been applied in a previous investigation with an overall accuracy of 53% (Zubiaga et al. 2017). We validated the language-country matching approach by calculating the accuracy of language-detected countries in geotagged tweets. The results are summarized in section 4.1.
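As an illustration of the language-country matching and its validation against geotagged tweets, the sketch below maps a language code to a single country and measures agreement with the geotag-derived country. The mapping is a small illustrative subset chosen for this example (the study's full list is in its Table A1), the two-letter codes are an assumption about how languages and countries are encoded, and the accuracy definition (share of matched-language geotagged tweets whose inferred country equals the geotag country) is one plausible reading of the validation, not a confirmed reproduction.

```python
# Illustrative subset only; codes and entries are assumptions for this example.
LANG_TO_COUNTRY = {
    "th": "TH",  # Thai -> Thailand
    "ja": "JP",  # Japanese -> Japan
    "is": "IS",  # Icelandic -> Iceland
    "ne": "NP",  # Nepali -> Nepal
    "hu": "HU",  # Hungarian -> Hungary
}


def infer_country(lang):
    """Return the country paired with a language, or None if it has no unique match."""
    return LANG_TO_COUNTRY.get(lang)


def matching_accuracy(pairs):
    """Share of geotagged tweets whose language-inferred country equals the geotag.

    pairs: iterable of (lang, geotag_country) tuples. Tweets in languages
    without a unique country are skipped.
    """
    total = hits = 0
    for lang, geotag_country in pairs:
        inferred = infer_country(lang)
        if inferred is None:
            continue
        total += 1
        hits += inferred == geotag_country
    return hits / total if total else float("nan")
```

For example, `matching_accuracy([("th", "TH"), ("th", "US"), ("ja", "JP"), ("en", "US")])` skips the English tweet and scores two matches out of three.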
Finally, we computed the Ratio index using Equation 1 by languages and at multiple spatial-temporal scales to compare the levels of public awareness to Covid-19 in different linguistic cultures, geographical regions, and pandemic phases (Zou et al. 2018). Three indicators, the case fatality ratio, case rate, and mortality rate, were used to describe the health impacts of Covid-19 (Dong, Du, and Gardner 2020). The population data were collected from the United Nations' 2020 mid-year population estimation (http://data.un.org/). The values of case fatality ratio, case rate, and mortality rate range from 0-100%, 0-$10^7$, and 0-$10^7$, respectively, and a larger value of each index indicates a more severe health impact of Covid-19. These three indicators can be computed or aggregated by different areas (e.g. global, continental, and country) within a defined time period (e.g. daily, monthly, and yearly).
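The equations referenced in this section are not reproduced in this copy. A plausible form, inferred from the stated value ranges (percentages, and rates per ten million population) rather than taken from the source, is sketched below. The symbols $N_{covid}$, $N_{all}$, $C$, $D$, and $P$ (Covid-19-related tweets, all cleaned tweets, confirmed cases, deaths, and population for a given area and period) are notation introduced here for illustration, and the equation numbering is assumed.

```latex
\begin{align}
\text{Ratio} &= \frac{N_{covid}}{N_{all}} \times 100\% \\
\text{Case fatality ratio} &= \frac{D}{C} \times 100\% \\
\text{Case rate} &= \frac{C}{P} \times 10^{7} \\
\text{Mortality rate} &= \frac{D}{P} \times 10^{7}
\end{align}
```

Under this reading, the overall figures reported in the Results are consistent: 9,339,176 Covid-19-related tweets out of 464,167,385 cleaned tweets gives a Ratio of about 2.01%.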

Methods of analysis
The public awareness was aggregated and evaluated by different languages and at multiple spatial-temporal scales to reveal its linguistic-geographical disparities and examine its relationship with the Covid-19's health impact indicators. First, we calculated the daily Ratio index of tweets in all languages globally to indicate the general trend of public awareness toward Covid-19 on Twitter. The global seven-day moving average case rate, mortality rate, and daily case fatality ratio were calculated to represent the general trend of health impacts caused by Covid-19 while removing the weekly pattern caused by the weekday/weekend reporting difference. To examine if significant associations and lag effects exist between public awareness and the Covid-19 outbreak worldwide, three types of lag correlations, i.e. Pearson, Spearman, and Kendall rank, between global Ratio index and the three Covid-19 health impact indicators were calculated using Equations 5-9.
In Equations 5-9, $X_t$ denotes the time series of a health impact indicator shifted $t$ days earlier, and $Y$ represents the time series of the Ratio index. $\mathrm{Cov}(X_t, Y)$ is the covariance of $X_t$ and $Y$, calculated according to Equation 6. $\mathrm{Var}[X_t]$ and $\mathrm{Var}[Y]$ are the variances of $X_t$ and $Y$, and their product is calculated using Equation 7. In Equation 8, $R(X_t)$ and $R(Y)$ are the ranks of $X_t$ and $Y$. In Equation 9, for two sets $\{X_{ti}\}_{i=1}^{n}$ and $\{Y_i\}_{i=1}^{n}$, a pair of observations $(X_{ti}, Y_i)$ and $(X_{tj}, Y_j)$ is considered concordant if $X_{ti} < X_{tj}$ and $Y_i < Y_j$, or if $X_{ti} > X_{tj}$ and $Y_i > Y_j$. Otherwise, the pair is discordant.
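Equations 5-9 themselves are not reproduced in this copy. The standard textbook definitions that match the description above are given below as a sketch. The assignment of equation numbers, the $1/n$ normalization (which cancels in the Pearson coefficient), and the tau-a form of Kendall's coefficient are assumptions, and $n_c$ and $n_d$ (the numbers of concordant and discordant pairs) are symbols introduced here.

```latex
\begin{align}
r(X_t, Y) &= \frac{\mathrm{Cov}(X_t, Y)}{\sqrt{\mathrm{Var}[X_t]\,\mathrm{Var}[Y]}} \tag{5} \\
\mathrm{Cov}(X_t, Y) &= \frac{1}{n}\sum_{i=1}^{n}\left(X_{ti} - \bar{X}_t\right)\left(Y_i - \bar{Y}\right) \tag{6} \\
\mathrm{Var}[X_t]\,\mathrm{Var}[Y] &= \frac{1}{n}\sum_{i=1}^{n}\left(X_{ti} - \bar{X}_t\right)^2 \cdot \frac{1}{n}\sum_{i=1}^{n}\left(Y_i - \bar{Y}\right)^2 \tag{7} \\
\rho(X_t, Y) &= \frac{\mathrm{Cov}\left(R(X_t), R(Y)\right)}{\sqrt{\mathrm{Var}[R(X_t)]\,\mathrm{Var}[R(Y)]}} \tag{8} \\
\tau(X_t, Y) &= \frac{n_c - n_d}{n(n-1)/2} \tag{9}
\end{align}
```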
Second, the half-year and the monthly Ratio indexes by language were calculated to show the general differences and temporal variations of Covid-19 awareness among Twitter users with diverse cultural backgrounds. Third, we focused on countries having unique official/primary languages and generated maps of half-year and monthly Ratio indexes by country to reveal the spatial-temporal disparities of Covid-19 awareness. Finally, the lag correlations between the Ratio index and Covid-19 health impact indicators were tested at the country level to examine the associations between public awareness changes and the regional Covid-19 outbreak.
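The lag correlation analysis described in this section can be sketched in plain Python as follows. This is an illustrative implementation under assumptions, not the authors' code: the function names are invented, the moving average is a trailing window, Kendall's coefficient uses the tau-a form with tied pairs counted as neither concordant nor discordant, and a lead of `t` days pairs the Ratio index on day `i` with the health impact indicator on day `i + t`.

```python
from math import sqrt


def moving_average(values, window=7):
    """Trailing moving average; early entries average the data available so far."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out


def pearson(x, y):
    """Pearson correlation; returns NaN if either series is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / sqrt(vx * vy)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def kendall(x, y):
    """Kendall tau-a: (concordant - discordant) / total number of pairs."""
    n = len(x)
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            product = (x[i] - x[j]) * (y[i] - y[j])
            score += (product > 0) - (product < 0)
    return score / (n * (n - 1) / 2)


def lag_correlations(ratio, impact, max_lead=60, corr=pearson):
    """Correlate ratio[i] with impact[i + t] for each lead t from 0 to max_lead."""
    scores = {}
    for t in range(max_lead + 1):
        n = len(ratio) - t
        if n < 3:
            break
        scores[t] = corr(ratio[:n], impact[t:t + n])
    return scores


def best_lead(ratio, impact, max_lead=60, corr=pearson):
    """Lead (in days) with the largest correlation, ignoring NaN scores."""
    scores = lag_correlations(ratio, impact, max_lead, corr)
    valid = {t: s for t, s in scores.items() if s == s}
    return max(valid, key=valid.get)
```

A typical call would smooth the health indicator first, for example `best_lead(daily_ratio, moving_average(daily_mortality_rate), max_lead=60, corr=spearman)`, where both input lists are hypothetical daily series of equal length.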

Results
This study collected 464,167,385 cleaned tweets out of 488,486,181 original tweets from the Internet Archive database. A total of 9,339,176 (2.01%) of the cleaned tweets were identified as Covid-19-related (Table 1). Among all cleaned tweets, 185,377 (0.04%) have geotags. We compared the agreement between geotag-derived and language-inferred country information in the collected tweets to validate the language-country matching method. The overall accuracy is 77%, indicating that this method is reliable for geo-locating tweets in country-level analysis.
Figure 3 shows the global temporal patterns of the daily Ratio index and three health impact indicators (i.e. seven-day moving average case rate and mortality rate, and daily case fatality ratio) from January 1st to June 30th, 2020, which reflect the general trend of worldwide public awareness and health impact of Covid-19 during its outbreak. Using the date when the World Health Organization (WHO) declared Covid-19 a pandemic as the breaking point, we defined January 1st to March 11th as the pre-pandemic period and March 12th to June 30th as the pandemic-outbreak period. The daily Ratio index (black line) ranged from 0.19% to 8.37%, with an average value of 2.60%. The high values were found from March 12th to 23rd, and the maximum value was on March 18th, a week after entering the pandemic-outbreak period. From January 1st to 20th, very few tweets discussed Covid-19, even though it had been reported to spread in Wuhan, China. Since January 21st, when WHO stated that the novel coronavirus pneumonia could spread from person to person, the Ratio index followed a fluctuating ascending trend with two peaks in the pre-pandemic period. One peak was on January 25th, when the Covid-19 risk level was upgraded to 'medium' by WHO. The other was on February 28th, when WHO raised the Covid-19 risk level to 'very high'. After March 23rd, the Covid-19 discussion intensity on Twitter dissipated gradually, and the Ratio index lowered to around 2% by the end of June 2020.

Global temporal trends
The global seven-day moving average case rate (blue line) and mortality rate (red line) also changed over the six months, ranging from 0.28 to 224.23 and from 0.08 to 9.05 per ten million population, respectively. Their average values were 84.02 and 4.09. Both the moving average case rate and mortality rate were low in January and February 2020, when Covid-19 cases were reported in only a few countries. Since the beginning of the pandemic-outbreak period, both indicators started to increase rapidly. The global daily case rate reached around 100 per ten million in late March and through April. The maximum global daily mortality rate was observed on April 14th, 2020, about one month after the Twitter-derived public awareness reached its highest level. The daily case rate increased gradually and peaked at the end of June, while the daily mortality rate declined progressively in May and stayed at around 6 per ten million in June 2020.
Figure 3. Global temporal trends of the Ratio Index and health impact indicators of Covid-19 (seven-day moving average case rate and mortality rate, and daily case fatality ratio).
The global case fatality ratio (green line) showed a relatively smoother trend compared to the daily case and mortality rates, ranging from 2.04% to 7.22%, with an average value of 4.95%. In late January, the case fatality ratio displayed a downward trend and reached the lowest value on February 5th. Then the case fatality ratio showed an upward trend and reached the maximum value on April 29th, 2020, around 37-48 days after the public awareness reached the highest on Twitter. Finally, the case fatality ratio gradually decreased to 4.87% in May and June.
Figure 4 describes the three types of lag correlations (Pearson, Spearman, and Kendall rank) between the global daily Ratio index and the seven-day moving average case rate (a) and mortality rate (b), and case fatality ratio (c). We set the maximum number of leading days to 60. The largest Pearson's correlation coefficients between the Ratio and the Covid-19 mortality rate and case fatality ratio were 0.83 and 0.86, and the leading days were 25 and 39 days, respectively. The largest Spearman's correlations were 0.88 and 0.97, while the largest Kendall's correlations were 0.71 and 0.86, and the leading days were 23 and 35 days, respectively. No turning point (the point where the lag correlation coefficient starts to decline) was detected in the lag correlation between Ratio and case rate, and the maximum correlations were 0.53 (Pearson), 0.69 (Spearman), and 0.42 (Kendall).
Three phenomena were observed from the global temporal trends. First, public awareness of Covid-19 reflected on Twitter significantly correlated with the daily mortality rate and case fatality ratio. The result is consistent with another analysis exploring the possibility of forecasting the Covid-19 outbreak in China from social media data, which found significant correlations between Sina Weibo (a Chinese social media platform for microblogging) index and confirmed/suspected Covid-19 cases. It confirms that social media activities are highly correlated with the global outbreak and progression of infectious diseases like Covid-19. We also found that the global public awareness peak on social media was 21-31 days (correlation > 0.70) and 35-42 days (correlation > 0.85) earlier than the maximum daily mortality rate and case fatality ratio, respectively. The lagged days between global public awareness changes and Covid-19 health impacts are longer than the lag effect estimations in other research. For example, Chunara, Andrews, and Brownstein (2012) concluded that trends in the volume of informal sources related to Haitian Cholera, including the news media reports and Twitter postings, were two weeks earlier than official case data. The difference has three possible reasons. First, Covid-19 is a more destructive epidemic that affects the entire human population and lasts for a more extended period than other infectious diseases. Second, the indicators selected to represent Covid-19 health impacts are inconsistent among various investigations. The previous research chose infections to describe the health impact, whereas the lag effects detected here were on mortality and case fatality ratio, whose trends are usually delayed compared with the infection trend. Third, the keywords used for tweet collection in this study are different from other studies. Culotta (2010) and Iso, Wakamiya, and Aramaki (2016) have shown that choosing different keywords to collect disease-related information on social media could result in uneven leading effects.
Furthermore, Twitter users' awareness in the pre-pandemic period was mainly media-driven. The discussion intensity peaks corresponded with WHO announcements, meaning people tweeted about Covid-19 when official agencies reported it on social media platforms. This demonstrates that providing official updates and reports is an effective approach to enhance public awareness in the pre-pandemic period. Finally, although Covid-19 caused proliferating morbidity and mortality worldwide after becoming a pandemic and mass media and agencies continued to report this event, such social attention and health impacts were not entirely reflected in public discussions of Twitter users globally. The global Ratio index decreased gradually after March 18th, with no significant discussion peaks by the end of June 2020.
The most frequent tweeting language was English, followed by Japanese, Spanish, Portuguese, and Thai, accounting for 33.59%, 18.22%, 8.53%, 7.72%, and 6.30% of tweets, respectively (Appendix Table A2). Other common languages, each accounting for over 2% of global tweets, were Arabic, Korean, Indonesian, Turkish, and French.

Global linguistic disparities
The Ratio index by languages ranged from 0.25% to 11.01%, with an average value of 2.50% and a standard deviation of 2.07%. The highest Ratio index was in Gujarati language tweets. Users tweeting in Bengali, Marathi, Punjabi, and Amharic languages also showed high public awareness with Ratio index values of 8.61%, 7.84%, 6.73%, and 5.15%, respectively. The lowest Ratio index values were found in users tweeting in Icelandic, Haitian Creole, Welsh, Hungarian, and Khmer, and their values were 0.25%, 0.33%, 0.38%, 0.46%, and 0.50%, respectively.
The top four languages presenting the highest public awareness toward Covid-19 on Twitter are Gujarati, Bengali, Marathi, and Punjabi, all official languages in India. Bengali is also the most widely spoken language in Bangladesh. The governments of both India and Bangladesh imposed a nationwide lockdown at the end of March 2020 and extended it several times in April and May, resulting in a two-month lockdown. Meanwhile, the new confirmed cases in Bangladesh grew by 1,155% in early April, the highest in Asia. Such societal and health impacts caused by the pandemic aroused public concerns and awareness about Covid-19 in India and Bangladesh, leading to more intense discussions on social media platforms like Twitter.
Figure 6 shows monthly variations of the Ratio index by the 20 most common languages on Twitter, indicating the temporal linguistic disparities of public awareness toward Covid-19. In each subplot, the bold green/blue line represents the monthly Ratio index in each language, while the monthly Ratio indexes in the other 19 languages are gray lines in the background. The number at the top left of each subplot is the highest Ratio value in that language. The maximum monthly Ratio index was found in Hindi in April, and the value was 8.87%, followed by Thai with a value of 8.67% in March. Tweets in Filipino and Korean displayed the lowest maximum monthly Ratio index values of 1.19% and 2.07%.
Most users grouped by tweeting languages showed the highest monthly Covid-19 awareness in March and April, except for users tweeting in Chinese, who expressed the highest awareness in February 2020 with a value of 7.77%. This phenomenon matched the expectation, as the Covid-19 outbreak was reported first in China in January 2020. The Chinese government quickly responded and issued a series of policies and strategies to prevent the spread of the virus in January and February, such as the Wuhan lockdown, isolation measures, and the construction of Fire God Mountain hospital. Although Twitter is not the most popular microblogging platform in China, many Twitter users are Chinese speakers residing in different regions of the world and post messages in Chinese. Our analysis results indicate that the Covid-19 outbreak in China raised the public awareness of Covid-19 among Chinese Twitter users worldwide. As Covid-19 developed into a pandemic from February 2020 onward, the awareness among users speaking different languages began to increase. It is worth mentioning that users tweeting in Thai also expressed a high level of Covid-19 awareness in the pre-pandemic period. The following reasons could explain this observation. Thailand was the most popular traveling destination for residents of Wuhan. Thus, people living in Thailand were concerned that the frequent travel between Thailand and Wuhan might spread the virus from the reported outbreak center to Thailand. Meanwhile, Thailand confirmed its first Covid-19 case on January 13th, 2020, making it the first country other than China with confirmed Covid-19 patients. Since then, the number of confirmed cases in Thailand grew to 25 on February 4th, more than in any country other than China. Further, the government in Thailand was one of the very few governments that issued an early warning of Covid-19 in January 2020. On January 28th, their health minister claimed that the spread of the coronavirus could not be stopped. Both the health impacts caused by the pandemic and the government's immediate responses stimulated residents to discuss it on social media.

Country-level spatial-temporal patterns
The 29 countries with unique official/primary languages were included in the country-level spatial-temporal analysis. Figure 7 shows the spatial patterns of the Ratio index at the country level from January 1st to June 30th, 2020. The value range was from 0.25% to 4.40% and the average was 1.66%.
Figure 7. Global spatial patterns of the Ratio index at the country level from January 1st to June 30th, 2020.
In general, Nepal and Thailand, in South and Southeast Asia respectively, had the highest Twitter-derived public awareness of Covid-19. Public awareness on Twitter in the eastern part of Europe was also higher than in other countries. Iceland showed the lowest Ratio index. The Ratio index in the selected Asian countries ranged from 0.50% to 4.40% with a standard deviation of 1.19%, showing a larger disparity than the country-level Ratio index values in Europe (range: 0.25∼3.70%, standard deviation: 0.98%).
Countries in Europe showed significant geographical discrepancies in Covid-19 awareness. Public awareness in the eastern part of Europe, including Russia, Armenia, Ukraine, Romania, Bulgaria, and Serbia, was generally higher than in the countries of central Europe, including Norway, Denmark, Sweden, Estonia, Latvia, Lithuania, Poland, Czech Republic, Hungary, and Slovenia. The high public awareness on Twitter in the eastern part could have two explanations. First, those countries were particularly vulnerable to the pandemic because their economies depended heavily on Western Europe, which was severely affected by disruptions in international production and transportation due to Covid-19 (European University Institute 2022). Second, Covid-19 caused more severe losses of working hours in the eastern part of Europe than in central Europe in 2020, according to the 'Covid-19 and the world of work' report (International Labour Organization 2020). Residents in the eastern part of Europe were therefore more concerned about Covid-19 and posted more discussions on social media platforms.
The monthly country-level Ratio index reveals in detail how public awareness toward Covid-19 on Twitter changed through different phases of the pandemic (Figure 8). The monthly Ratio index ranged from 0% to 13.85%. Similar to the observed global and by-language patterns, Twitter users in most countries showed the highest awareness in March and April 2020, and expressed low awareness in the pre-pandemic period and in May and June 2020.
Figure 8. Spatial-temporal patterns of the monthly Ratio index at the country level.
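The monthly country-level aggregation described here can be illustrated with a minimal sketch. It assumes that the Ratio index is the percentage of Covid-19-related tweets among all tweets in a given country and month; the formal definition is given in the methods section and is not restated here, so this reading is an assumption. The function name `monthly_ratio`, the record layout, and the toy tweets are hypothetical and do not come from the study's data.

```python
from collections import defaultdict


def monthly_ratio(records):
    """Aggregate tweets into a monthly Ratio index per country.

    records: iterable of (country, 'YYYY-MM-DD', is_covid_related) tuples,
    one tuple per tweet (hypothetical layout).
    Returns {(country, 'YYYY-MM'): percentage of Covid-19-related tweets}.
    ASSUMPTION: Ratio = 100 * Covid-19-related tweets / all tweets.
    """
    total = defaultdict(int)
    covid = defaultdict(int)
    for country, date, is_covid in records:
        key = (country, date[:7])  # keep only the 'YYYY-MM' part
        total[key] += 1
        if is_covid:
            covid[key] += 1
    return {key: 100.0 * covid[key] / total[key] for key in total}


# Toy, made-up tweets used only to exercise the function.
tweets = [
    ("Lithuania", "2020-03-12", True),
    ("Lithuania", "2020-03-13", False),
    ("Lithuania", "2020-03-20", False),
    ("Lithuania", "2020-03-21", False),
    ("Iceland", "2020-03-02", False),
    ("Iceland", "2020-04-02", True),
]
ratios = monthly_ratio(tweets)
for (country, month), value in sorted(ratios.items()):
    print(f"{country} {month}: {value:.2f}%")
```

Country-months without any tweets simply do not appear in the result, which avoids a division by zero for sparsely covered countries.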
As mentioned in section 4.2, public awareness in Thailand was significantly higher than in other countries in January 2020. Lithuania, Thailand, Bulgaria, Nepal, and Romania showed higher values of the monthly Ratio index in February than other countries, with values of 5.25%, 4.49%, 2.47%, 2.40%, and 2.34%, respectively. The highest monthly awareness on Twitter was in Lithuania in March. Although Lithuania had only three confirmed cases by March 12th, the Lithuanian government canceled all public indoor events of more than 100 attendees and closed all educational institutions (i.e. kindergartens, public schools, and universities) and entertainment places (i.e. museums, cinemas, and gyms) starting that day. The shutdown had a massive impact on Lithuanians and intensified Covid-19-related discussion on Twitter. Israel presented the fastest growth of public awareness during the pandemic outbreak, from 0.22% in January to 4.41% in April. The other countries with rapid awareness growth were Iran, Serbia, Nepal, Sweden, and Hungary. In May and June 2020, only Nepal maintained a relatively high Covid-19 awareness.

Covid-19 public awareness and health impacts
Table 2 summarizes the maximum lag Pearson, Spearman, and Kendall rank correlation coefficients (max r) and the corresponding leading days between the daily Ratio index and three health impact indicators (seven-day moving average case rate, daily mortality rate, and daily case fatality ratio) in the 29 selected countries, within a lag window of 60 days. The daily mortality rate and case fatality ratio in Vietnam, Cambodia, and Laos were close to zero. Therefore, the lag correlations in these three countries could not be calculated and were labeled as 'null.' The 'nd' in Table 2 indicates that a correlation turning point was not detected.
Significant turning points were detected in 22, 20, and 11 countries regarding the lag Pearson correlations between the daily Ratio index and case rate, mortality rate, and case fatality ratio, respectively. The largest Pearson max r between the daily Ratio index and the case rate was found in Thailand (0.84) with 9 leading days. The largest Pearson max r values between the daily Ratio index and mortality rate and case fatality ratio were 0.89 in Andorra and 0.72 in Sweden, with 16-day and 40-day lag effects, respectively. Lag Pearson correlation coefficients between the daily Ratio index and case rate, mortality rate, and case fatality ratio were least significant in Bulgaria, Myanmar, and Iceland. The largest Spearman max r values between the daily Ratio index and case rate, mortality rate, and case fatality ratio were 0.83 in Estonia and Andorra, 0.89 in Estonia, and 0.92 in Poland, with 7-day, 21-day, and 52-day lag effects, respectively. The most significant Kendall rank lag correlations between the Ratio index and case rate, mortality rate, and case fatality ratio were observed in Russia (0.81), Estonia (0.74), and Poland (0.76), with lags of 50, 21, and 51 days, respectively.
The daily Ratio index showed uneven association with case rate, mortality rate, and case fatality ratio (Table 2). The average max r between Ratio and mortality rate (Pearson: 0.61, Spearman: 0.70, and Kendall rank: 0.54) was higher than the averaged correlation coefficients between Ratio and case rate (Pearson: 0.53, Spearman: 0.54, and Kendall rank: 0.42) and case fatality ratio (Pearson: 0.43, Spearman: 0.61, and Kendall rank: 0.46). This indicates that the Ratio index has an overall higher correlation with mortality rate in different countries during the Covid-19 outbreak. The average leading days between Ratio and mortality rate, case rate, and case fatality ratio were 17.68, 23.45, and 37.00 in Pearson correlation; 13.30, 19.30, and 33.07 in Spearman correlation; and 17.73, 22.27, and 29.93 in Kendall rank correlation.
The correlation coefficients between the daily Ratio index and the three health impact indicators varied from country to country. Using Pearson correlation as an example, the daily Ratio index showed the highest correlation with Covid-19 outbreaks in nine countries: Japan, Thailand, South Korea, Poland, Estonia, Andorra, Sweden, the Czech Republic, and Israel. The Pearson max r between the Ratio index and at least two health impact indicators exceeded 0.6 in these nine countries. In contrast, none of the Pearson max r values between Ratio and the three health impact indicators was greater than 0.5 in Iran, Vietnam, Norway, Hungary, Latvia, Iceland, Myanmar, Bulgaria, and Armenia. No turning points in the three types of lag correlations were detected in Nepal. These weak correlations are likely due to the small number of Covid-19 cases in these countries during the study period, which makes outbreak detection difficult, and to insufficient Twitter data for evaluating public awareness levels over time.
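The lag correlations reported in this section can be illustrated with a simplified sketch. It is not the procedure used in the study: it covers only the Pearson coefficient, and it takes the largest coefficient within the 60-day window instead of applying a turning-point test, so it never returns the 'nd' outcome of Table 2. The function names, the `min_overlap` parameter, and the synthetic series are assumptions made for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences, or None if undefined."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None  # constant series, e.g. a mortality rate that stays at zero
    return sxy / math.sqrt(sxx * syy)


def max_lag_correlation(ratio, indicator, max_lag=60, min_overlap=10):
    """Return (max r, leading days) with the Ratio index leading the indicator.

    For each lag, ratio[t] is paired with indicator[t + lag].
    ASSUMPTION: the lag with the largest r is reported; the study's
    turning-point detection is not reproduced here.
    """
    best_r, best_lag = None, None
    for lag in range(max_lag + 1):
        x = ratio[: len(ratio) - lag]
        y = indicator[lag:]
        if len(x) < min_overlap:
            break
        r = pearson(x, y)
        if r is not None and (best_r is None or r > best_r):
            best_r, best_lag = r, lag
    return best_r, best_lag


# Synthetic series: the indicator is the awareness curve delayed by 5 days.
ratio = [math.exp(-(((t - 30) / 8) ** 2)) for t in range(120)]
indicator = [math.exp(-(((t - 35) / 8) ** 2)) for t in range(120)]
best_r, best_lag = max_lag_correlation(ratio, indicator)
print(best_lag, round(best_r, 3))
```

A constant indicator series yields `(None, None)`, which corresponds to the 'null' entries for countries whose mortality rate and case fatality ratio stayed close to zero. Spearman coefficients could be obtained from the same loop by replacing both series with their ranks before calling `pearson`.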

Significant implications
This research has several significant implications. First, it demonstrates that the language-country matching method can overcome the shortage of social media data with geotagged locations and efficiently geo-reference big Twitter data for large-scale analysis with an accuracy of 77%. The method works best for countries with unique official languages, offering a novel and rapid channel for observing and comparing social media activities, e.g. public awareness toward different topics or events, among those countries. The generated datasets provide baseline information on Covid-19 awareness globally and by language and country. Second, this study found that the daily changes of public awareness reflected on Twitter had strong correlations with the daily mortality rate and case fatality ratio. This finding is consistent with prior investigations that applied social media to detect and forecast the outbreak of other infectious diseases, such as influenza (Signorini, Segre, and Polgreen 2011;Hassan Zadeh et al. 2019) and cholera (Chunara, Andrews, and Brownstein 2012). Therefore, continuous monitoring of social media activities could provide valuable information for future outbreaks of Covid-19 and its variants, and inform governments and communities to plan accordingly.
Finally, this study further showed that the associations between social media data and infectious disease outbreaks differ across countries. This is in line with the analysis in Allen et al. (2016), which measured the correlations between Twitter rates and the official reports of influenza-like illness (ILI) in 31 major cities in the United States during the 2013-2014 flu season. Our results confirm that the spatial variability of disease detection performance based on social media data needs to be considered and mitigated in future work to assist the regional outbreak prediction of infectious diseases and the development of response strategies.

Limitations and future research
A few limitations exist in this investigation and necessitate further research. First, the language-country matching method is unable to capture social media activities in countries speaking leading languages of international discourse, such as English and Spanish, or in the multilingual countries listed in Table A3. This limitation can be addressed in future research by incorporating the locations obtained from geotags (tweet from), users' profiles (user from), and location information in historical tweet contents (tweet about).
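One possible way of combining these location sources is sketched below. The priority order (geotag, then profile, then tweet content, then language) is an assumption rather than a design proposed in this study, and the field names `geotag_country`, `profile_country`, `mentioned_country`, and `lang` are hypothetical placeholders, not fields of the Twitter API.

```python
# Hypothetical excerpt of a language-to-country table for languages that are
# official in a single country; English and Spanish are deliberately absent.
LANGUAGE_TO_COUNTRY = {"th": "Thailand", "lt": "Lithuania"}

# Explicit location signals, checked in an ASSUMED order of reliability.
LOCATION_FIELDS = (
    ("geotag_country", "tweet from"),
    ("profile_country", "user from"),
    ("mentioned_country", "tweet about"),
)


def resolve_country(tweet, language_to_country=LANGUAGE_TO_COUNTRY):
    """Return (country, source) from the first available location signal.

    tweet: dict with optional location fields and a 'lang' code.
    Falls back to language-country matching, then to (None, 'unresolved').
    """
    for field, source in LOCATION_FIELDS:
        value = tweet.get(field)
        if value:
            return value, source
    country = language_to_country.get(tweet.get("lang"))
    if country:
        return country, "language"
    return None, "unresolved"


print(resolve_country({"lang": "th"}))
print(resolve_country({"lang": "en", "geotag_country": "Canada"}))
print(resolve_country({"lang": "en"}))
```

Returning the source alongside the country would allow later analyses to weight or filter tweets by how their location was inferred.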
Second, health impact indicators were derived from the cumulative Covid-19 cases and deaths dataset from CSSE. However, these data have inherent limitations due to the under-testing or under-reporting of cases. For example, inadequate diagnostic facilities, especially in the initial outbreak stage of the pandemic, and delayed diagnosis could lead to underestimated case and death numbers. Such uncertainty in the original health impact data can subject the association assessment of the Ratio index to unavoidable errors.
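To make the dependence on the cumulative counts concrete, the following sketch derives the three indicators from cumulative cases and deaths. The exact definitions are assumptions, since they are given in the methods section rather than here: rates are expressed per 100,000 residents, the seven-day moving average is a trailing window, and the case fatality ratio is cumulative deaths divided by cumulative cases. All function names and numbers are illustrative.

```python
def daily_new(cumulative):
    """Convert cumulative counts to daily increments.

    Negative increments (downward corrections in the reported data) are
    clamped to zero; this handling is an assumption.
    """
    increments, previous = [], 0
    for count in cumulative:
        increments.append(max(count - previous, 0))
        previous = count
    return increments


def moving_average(values, window=7):
    """Trailing moving average; early days use the shorter available window."""
    averaged = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        averaged.append(sum(chunk) / len(chunk))
    return averaged


def health_indicators(cum_cases, cum_deaths, population, per=100_000):
    """Return (case rate, mortality rate, case fatality ratio) as daily lists.

    ASSUMED definitions: rates per `per` residents; CFR in percent of
    cumulative cases, None while no case has been reported.
    """
    case_rate = [per * v / population for v in moving_average(daily_new(cum_cases))]
    mortality_rate = [per * d / population for d in daily_new(cum_deaths)]
    cfr = [100 * d / c if c else None for c, d in zip(cum_cases, cum_deaths)]
    return case_rate, mortality_rate, cfr


# Made-up cumulative counts for a population of one million.
case_rate, mortality_rate, cfr = health_indicators(
    [0, 10, 30, 60, 100], [0, 0, 1, 2, 5], population=1_000_000
)
print(case_rate[-1], mortality_rate[-1], cfr[-1])
```

Because every indicator is computed from the reported counts, any under-testing or delayed diagnosis propagates directly into the series that the Ratio index is compared against.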
Third, many researchers have leveraged multiple big data sources, such as Facebook, Twitter, and Google Trends, to track Covid-19 awareness. Data from social media platforms reveal more about users' experiences and feelings about the epidemic, while keyword analysis of search engine queries makes it possible to distinguish what people want to know about Covid-19, such as infection symptoms or prevention methods. More studies assessing and comparing various aspects of pandemic awareness from multi-sourced big data should be considered.
Furthermore, although this study successfully tracked the spatial-temporal disparities of Covid-19 awareness, other non-pharmacological interventions (NPIs), e.g. social distancing policies and human mobility behaviors, were not considered. Governments in different regions released disparate closure policies, which might affect public awareness and movement and shape the spread of Covid-19. Human mobility also directly influences virus diffusion. Future research should include Covid-19 relevant policies and human mobility to further elucidate the interactions of various NPI factors and reveal their influences on Covid-19 health impacts.
Finally, future investigations could categorize Twitter data by different user groups, such as citizens, celebrities, scientists, health agencies, and governments, and by different pandemic phases, e.g. pre-pandemic preparedness, pandemic-outbreak response, and post-pandemic recovery. Examining the public awareness of the pandemic among diverse social groups in each phase could yield valuable information on how different opinion leaders on social media affect citizens' Covid-19 awareness in different countries and communities and how changing public awareness impacts pandemic recovery.

Conclusion
This study analyzed the global Twitter data from January 1st to June 30th, 2020, aiming to answer two questions: what are the linguistic and geographical disparities of public awareness to Covid-19 on social media, and is the changing awareness associated with the Covid-19 outbreak? The study established a social media data mining framework, which uses the Twitter-calculated Ratio index to quantify the disparities of pandemic awareness by language at multiple spatial and temporal scales. The lag correlations between the Ratio index and Covid-19 health impact indicators were calculated at the global scale and in selected countries, confirming significant associations and lag effects between public awareness and the Covid-19 outbreak.
This research produced several valuable findings. First, global public awareness of Covid-19 changed over the six months and peaked in the middle of March 2020. The major official pandemic-related announcements triggered rising public awareness in the pre-pandemic period. The global public awareness dynamics on Twitter have high correlations with the mortality rate and case fatality ratio, leading them by 21-31 and 35-42 days, respectively. Second, linguistic and geographical disparities of public awareness existed during the pre-pandemic and pandemic-outbreak periods. Users tweeting in India and Bangladesh's official or regional languages, e.g. Gujarati, Bengali, Marathi, and Punjabi, had higher public awareness toward Covid-19 due to the impacts caused by the nationwide lockdown in both countries. Users tweeting in Chinese were most concerned about the pandemic in February 2020, while the Covid-19 awareness of users tweeting in other languages was the highest in March and April 2020. Asian countries had greater disparities in Covid-19 awareness than European countries, and Eastern Europe's public awareness was generally higher than that of central Europe. Third, the Ratio index had the highest correlation with mortality rate, with an average of 13-18 leading days at the country level. The association between the Ratio index and Covid-19 health impacts varies from country to country, suggesting that in-depth research is needed on how discrepancies in public awareness, together with diversified socioeconomic conditions, medical resources, and public attitudes in different countries, affect Covid-19 morbidity and mortality.
The produced dataset describing public awareness of Covid-19 will offer valuable insights for relevant studies. Researchers can leverage the public Covid-19 awareness to further investigate the consequences of awareness changes on human behaviors, economy, tourism, and Covid-19 health impacts in different countries. For instance, public awareness can be combined with other NPIs, including Covid-19 policies (e.g. policy scores from the Oxford Covid-19 Government Response Tracker (Hale et al. 2021)) and human mobility measurements (e.g. SafeGraph mobility datasets), to explore their interplay. In addition, the developed framework can be used to track long-term Covid-19 awareness by language and country. Advanced time-series statistical models, e.g. the vector autoregression (VAR) model, can be applied to reveal the causal relationships between various NPIs (policy, awareness, and behaviors) and their compounding effects on Covid-19 spread.