High variability in model performance of Google relative search volumes in spatially clustered COVID-19 areas of the USA

Objective: Incorporating spatial analyses and online health information queries may be beneficial in understanding the role of Google relative search volume (RSV) data as a secondary public health surveillance tool during pandemics. This study identified coronavirus disease 2019 (COVID-19) clustering and defined the predictability performance of Google RSV models in clustered and non-clustered areas of the USA. Methods: Getis-Ord General and local G statistics were used to identify monthly clustering patterns. Monthly country- and state-level correlations between new daily COVID-19 cases and Google RSVs were assessed using Spearman's rank correlation coefficients and Poisson regression models for January–December 2020. Results: Huge clusters involving multiple states were found, which resulted from various control measures in each state. This demonstrates the importance of state-to-state coordination in implementing control measures to tackle the spread of outbreaks. Variability in Google RSV model performance was found among states and time periods, possibly suggesting the need to use different frameworks for Google RSV data in each state. Moreover, the sign of correlation can be utilized to understand public responses to control and preventive measures, as well as in communicating risk. Conclusion: COVID-19 Google RSV model accuracy in the USA may be influenced by COVID-19 transmission dynamics, policy-driven community awareness and past outbreak experiences.


Introduction
Spatial spread is one of the most important aspects in understanding disease epidemics (Franch-Pardo et al., 2020), including the coronavirus disease 2019 (COVID-19) pandemic. During the outbreak, multiple studies have discussed COVID-19 spatial patterns in the USA using both state-level (Cordes and Castro, 2020; Maroko et al., 2020; Ramírez and Lee, 2020) and county-level analyses (CDC COVID-19 Response Team, 2020; Dasgupta et al., 2020; Desjardins et al., 2020; Mollalo et al., 2020; Oster et al., 2020a,b; Snyder and Parks, 2020; Wang et al., 2020; Andersen et al., 2021). ... year of COVID-19 spatiotemporal patterns along with temporal predictability performances of Google relative search volume (RSV) models in clustered and non-clustered areas. Google RSVs are emerging digital data that are being used as a secondary public health surveillance tool during the COVID-19 pandemic. These data are collected from information-seeking activities on Google search engines and are normalized over a specified period (Google, 2020). These online search data potentially depict patterns of information-seeking behaviours that represent the public's concerns, awareness or restlessness (Ayyoubzadeh et al., 2020; Husnayain et al., 2020a). This approach forms part of infodemiology, which examines the determinants and distributions of health information for public health purposes (Eysenbach, 2006). It may capture wider population events than conventional surveillance systems (Milinovich et al., 2014), as people who are ill may not contact local healthcare facilities, but they may still search for online health information.
In the case of COVID-19, various studies in the early phase of the outbreak suggested that Google searches peaked earlier than newly confirmed cases (Effenberger et al., 2020; Strzelecki, 2020) and correlated well with the rise of COVID-19-related data (Husnayain et al., 2020a,b; Li et al., 2020; Ortiz-Martínez et al., 2020). Similar results were also reported by several studies in the USA (Bento et al., 2020; Panuganti et al., 2020; Yuan et al., 2020). Certain studies also assessed the predictability performance of Google RSVs at national and regional levels, which resulted in high correlations (the highest correlation coefficients were 0.71 and 0.88) (Kurian et al., 2020; Mavragani and Gkillas, 2020). Moreover, a high accuracy of Google search models was also found in an earlier state-level analysis (Cousins et al., 2020). However, all of these studies were undertaken in the first 3 months of the outbreak, which potentially resulted in high performance of the models. Thus, an extensive study covering a longer-term assessment of the predictability of the Google RSV model, specifically in clustered areas, is needed urgently. Such a study is necessary to understand the role of Google RSV data as a secondary public health surveillance tool during a pandemic, and to be better prepared for future outbreaks. Therefore, this study aimed to identify COVID-19 hot and cold spots of disease clustering, and define the predictability performance of the Google RSV model in clustered and non-clustered areas of the USA.

Study area and data acquisition
County-level data of cumulative daily COVID-19 cases from the 48 contiguous US states (i.e. excluding Alaska and Hawaii) and the District of Columbia were collected from Johns Hopkins University's Center for Systems Science and Engineering GIS dashboard (Dong et al., 2020), along with new state-level daily COVID-19 cases from the COVID Tracking Project (The Atlantic, 2021). Data from 20 January to 31 December 2020 were used. Google RSV data were retrieved from the Google Trends website (Google Trends, 2020) for the USA at country and subregional level for health categories and web search type. Data were queried for COVID-19-related terms, topics and disease; the top related queries; and most-searched COVID-19 terms in 2020 with a lag of 7 days. This dataset gives the number of search activities made through Google search engines. Data were retrieved for the overall time period (on a weekly basis) and in monthly periods (on a daily basis) for the time frame of the entire study. The daily data were adjusted with weekly-based data to obtain adjusted daily data for the overall study period, as used in previous approaches (Bewerunge, 2018; Rengasamy et al., 2019). In addition, Google mobility data were used in constructing Google RSV models. These mobility data represent changes in time spent in categorized places. Data were queried with a lag of 7 days from COVID-19 Community Mobility Reports (Google, 2021). The datasets used for this analysis are listed in Table 1. All datasets were aggregated into monthly subsets to describe epidemic progression patterns over time.
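As an illustration of the preprocessing described above, the following minimal sketch converts a cumulative case series into new daily counts and aggregates them into monthly subsets. The function names, the example numbers and the choice to clip negative differences (downward revisions in the cumulative count) to zero are assumptions made for this sketch, not details reported in the study.

```python
from collections import defaultdict
from datetime import date, timedelta


def new_daily_cases(cumulative):
    """Difference a cumulative series into new daily counts.

    Assumption for this sketch: negative differences (downward
    revisions of the cumulative count) are clipped to zero.
    """
    daily = [cumulative[0]]
    for previous, current in zip(cumulative, cumulative[1:]):
        daily.append(max(current - previous, 0))
    return daily


def monthly_totals(start, daily):
    """Sum daily counts into (year, month) buckets, starting at `start`."""
    totals = defaultdict(int)
    for offset, count in enumerate(daily):
        day = start + timedelta(days=offset)
        totals[(day.year, day.month)] += count
    return dict(totals)


# Hypothetical cumulative counts for one county, 29 January to 4 February 2020.
cumulative = [0, 1, 1, 4, 9, 8, 15]
start = date(2020, 1, 29)

daily = new_daily_cases(cumulative)
monthly = monthly_totals(start, daily)
print(daily)
print(monthly)
```

The same monthly bucketing could be applied to the RSV and mobility series so that all datasets share one time index.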

Data analysis
Getis-Ord General G and local G statistics were utilized to identify monthly hot and cold spots for COVID-19 incidence rate clustering patterns. G statistics are a distance-based approach (Ord and Getis, 1995) that estimate a z-score from observed and expected spatial clustering patterns. The general G statistic was calculated as follows:

$$G = \frac{\sum_{i=1}^{n}\sum_{j=1}^{n} w_{i,j}\, x_i x_j}{\sum_{i=1}^{n}\sum_{j=1}^{n} x_i x_j}, \quad \forall j \neq i$$

where $x_i$ and $x_j$ are attribute values for features $i$ and $j$, $w_{i,j}$ is the spatial weight between features $i$ and $j$, $n$ is the number of features in the dataset, and $\forall j \neq i$ indicates that features $i$ and $j$ cannot be the same feature (Esri, 2021).
A positive z-score indicates spatial clustering of high values in the dataset, whereas a negative z-score indicates clustering of low values. In addition, a z-score close to zero may represent a random spatial pattern in the observation (Getis and Ord, 1992). In this study, the monthly COVID-19 incidence rate was used as an input feature, and spatial relationships between features were defined by contiguity of edges and corners (features sharing an edge or a corner were treated as neighbours). Furthermore, an optimized hot spot analysis of local G values was used to identify distributions of monthly COVID-19 hot and cold spots. P < 0.05 was considered to indicate statistical significance. A clustered state was defined as the presence of hot spot counties, cold spot counties or both.
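To make the General G statistic concrete, the sketch below computes the observed G and its expected value under spatial randomness, E[G] = Σw / (n(n − 1)), for a toy example. The four-area chain layout, the binary contiguity weights and the incidence values are hypothetical; the study itself used ArcGIS Pro, and the variance term needed for the z-score is omitted here for brevity.

```python
def general_g(x, w):
    """Observed Getis-Ord General G for values x and weight matrix w (j != i)."""
    n = len(x)
    numerator = sum(
        w[i][j] * x[i] * x[j] for i in range(n) for j in range(n) if i != j
    )
    denominator = sum(
        x[i] * x[j] for i in range(n) for j in range(n) if i != j
    )
    return numerator / denominator


def expected_g(w):
    """Expected General G under spatial randomness: sum of weights / (n(n-1))."""
    n = len(w)
    total_weight = sum(w[i][j] for i in range(n) for j in range(n) if i != j)
    return total_weight / (n * (n - 1))


# Hypothetical layout: four areas in a row, neighbours share a border.
w = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
# Hypothetical monthly incidence rates; the two high values are adjacent.
x = [10.0, 8.0, 1.0, 1.0]

g_obs = general_g(x, w)
g_exp = expected_g(w)
print(g_obs, g_exp)
```

An observed G above its expectation, as in this toy example, points towards clustering of high values; whether the difference is significant depends on the z-score, which also requires the variance of G.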
Monthly country-level correlations between new daily COVID-19 cases and Google RSVs were assessed using Spearman's rank correlation coefficients due to the small numbers of observations and non-normal distributions of the response variables. P < 0.05 was considered to indicate statistical significance. A moderate correlation was determined as Spearman's rank correlation coefficient of ≥0.5, with ≥0.7 considered a strong correlation. The term 'COVID testing' (search term) was chosen to assess monthly state-level correlations. This term was used as it may reflect the important issue of COVID-19 testing during the research period.
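The correlation step can be sketched as follows: Spearman's coefficient is the Pearson correlation of the ranks (with average ranks for ties), and the result is labelled using the thresholds given above. Applying the thresholds to the absolute value of the coefficient, so that negative correlations are also graded, is an assumption of this sketch, as are the example series; significance testing is not included.

```python
def average_ranks(values):
    """Return 1-based ranks, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


def strength(rho):
    """Label |rho| with the thresholds used in this study (0.5 and 0.7)."""
    magnitude = abs(rho)
    if magnitude >= 0.7:
        return "strong"
    if magnitude >= 0.5:
        return "moderate"
    return "weak"


# Hypothetical daily RSVs and new daily cases for one state in one month.
rsv = [12, 30, 45, 60, 80, 100]
cases = [5, 9, 20, 18, 40, 75]

rho = spearman(rsv, cases)
print(round(rho, 3), strength(rho))
```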
Moreover, Google RSV models employing highly correlated search data with a lag time of 7 days were calculated using Poisson regressions in a generalized linear model to predict current state-level new daily COVID-19 cases. Poisson regression was used because the response variable was count data that did not follow a normal distribution (Johnston, 1993). Models were constructed using Google RSVs and mobility data (with the highest correlation coefficient with case data). In-sample model performance was assessed using root mean squared error (RMSE) values, and the Akaike information criterion (AIC) and Bayesian information criterion (BIC) were used to compare performance between models. Multi-layer maps were also created to define monthly predictability performances of Google RSV models in clustered and non-clustered areas of the USA for a 1-year analysis. All spatial analyses and visualizations were conducted using ArcGIS Pro Version 2.6.1 (ESRI, Redlands, CA, USA), and statistical analyses were performed using SAS Version 9.4 (SAS Institute, Cary, NC, USA).
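The modelling step can be illustrated with a minimal sketch: a single predictor is lagged by 7 days, a Poisson regression with a log link is fitted by Newton-Raphson, and RMSE, AIC and BIC are computed in-sample. The study used SAS with RSV and mobility predictors; the single-predictor model, the function names and the data below are simplifying assumptions for illustration only.

```python
import math


def lag_pairs(predictor, cases, lag=7):
    """Pair the predictor on day t - lag with new cases on day t."""
    n = len(predictor) - lag
    return predictor[:n], cases[lag:]


def fit_poisson(x, y, max_iter=50, tol=1e-10):
    """Fit log(mu) = b0 + b1 * x by Newton-Raphson on the Poisson likelihood."""
    b0, b1 = math.log(sum(y) / len(y)), 0.0
    for _ in range(max_iter):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        # Score vector X^T (y - mu).
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum(xi * (yi - mi) for xi, yi, mi in zip(x, y, mu))
        # Fisher information X^T diag(mu) X (symmetric 2 x 2 matrix).
        h00 = sum(mu)
        h01 = sum(xi * mi for xi, mi in zip(x, mu))
        h11 = sum(xi * xi * mi for xi, mi in zip(x, mu))
        det = h00 * h11 - h01 * h01
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0, b1 = b0 + d0, b1 + d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1


def fit_metrics(x, y, b0, b1, k=2):
    """In-sample RMSE, AIC and BIC for a fitted model with k parameters."""
    n = len(y)
    mu = [math.exp(b0 + b1 * xi) for xi in x]
    rmse = math.sqrt(sum((yi - mi) ** 2 for yi, mi in zip(y, mu)) / n)
    loglik = sum(
        yi * math.log(mi) - mi - math.lgamma(yi + 1) for yi, mi in zip(y, mu)
    )
    aic = 2 * k - 2 * loglik
    bic = k * math.log(n) - 2 * loglik
    return rmse, aic, bic


# Hypothetical 15-day series for one state (RSV on a 0-100 scale).
rsv = [10, 20, 30, 40, 50, 60, 70, 80, 85, 90, 92, 95, 97, 99, 100]
cases = [0, 0, 1, 1, 2, 2, 3, 3, 4, 6, 9, 13, 18, 27, 40]

x, y = lag_pairs(rsv, cases, lag=7)
b0, b1 = fit_poisson(x, y)
rmse, aic, bic = fit_metrics(x, y, b0, b1)
print(b0, b1, rmse, aic, bic)
```

Lower RMSE, AIC and BIC values indicate a better in-sample fit; AIC and BIC additionally penalize the number of parameters, which matters when models with different sets of mobility variables are compared.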

COVID-19 spatial clusters in the USA
In the early stage of the disease outbreak, country-level incidence rates were 0.002 to 0.005 per 100,000 population, with higher incidence rates in county-level data, which ranged from 0.129 to 3.370 per 100,000 population, as shown in Table 2. However, starting in March 2020, huge increases in cases raised the country-wide incidence rate to 57.110 per 100,000 population and the county-level incidence rate to 1011.124 per 100,000 population. This increasing trend led to a massive national incidence rate, reaching more than 1000 cases in November 2020. Furthermore, counties with the highest incidence rates differed from month to month, indicating the rapid spread of the disease throughout the country.
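For reference, an incidence rate per 100,000 population is the number of new cases in the period divided by the population at risk and multiplied by 100,000. The short sketch below shows the calculation with hypothetical figures; the function name and the numbers are not taken from Table 2.

```python
def incidence_per_100k(new_cases, population):
    """New cases in a period per 100,000 population."""
    if population <= 0:
        raise ValueError("population must be positive")
    return new_cases * 100_000 / population


# Hypothetical county: 250 new cases in a month, 50,000 residents.
rate = incidence_per_100k(250, 50_000)
print(rate)
```

Because the denominator is the local population, a small county with few cases can show a higher rate than the country as a whole, which is consistent with county-level maxima exceeding the national rate.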
The Getis-Ord General G test (Table 3) showed clustered patterns in all months during the study period, except in January 2020 due to the limited case count and distribution. Local G exhibited the first cluster identified in California in February 2020 (Figure 1). Afterwards, clusters appeared in neighbouring states, including Washington, Idaho and Colorado, as well as a cluster in the eastern part of the country that grew until May 2020. During this period, two clusters were also found in counties in the southern USA that expanded into large clusters from April to August 2020. However, beginning in September 2020, clusters were circulating in counties in the central USA, and then progressed into more northerly parts of the country. In contrast, cold spots formed constantly in eastern counties from June to December 2020.

Predictability performance of Google RSV models
During the study period, low to high significant correlations between new daily COVID-19 cases and Google RSVs were found in country-level data (Table 4). Correlation strength increased to its highest point in June 2020 and then decreased as the outbreak progressed. For the state-level analysis (Table 5), significant correlations began to emerge in March 2020 (38.78%), which was also the highest percentage during the study period. Percentages of significant correlations then fluctuated, with increases in June 2020 (22.45%) and in November 2020 (26.53%). While the number of states with clustered areas increased, significant correlations were only found in low percentages of these states, ranging from 4.08% to 24.49%.
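The reported percentages are consistent with a denominator of 49 jurisdictions (48 contiguous states plus the District of Columbia). The counts used in the sketch below are inferred from the percentages and are not stated in the text, so they should be read as a plausibility check rather than as reported results.

```python
N_JURISDICTIONS = 49  # 48 contiguous states + District of Columbia


def percentage(count, total=N_JURISDICTIONS):
    """Share of jurisdictions, as a percentage rounded to two decimals."""
    return round(100 * count / total, 2)


# Inferred counts (assumption) mapped to the percentages reported in the text.
inferred = {19: 38.78, 11: 22.45, 13: 26.53, 2: 4.08, 12: 24.49}

for count, reported in inferred.items():
    print(count, percentage(count), reported)
```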
Strong significant correlations were found in several states with clustered and non-clustered counties during the research period, including California, Florida, Illinois, New York and Texas in March 2020, and Texas and South Carolina in June 2020 (Figure 2). These findings suggest that strong correlations were rarely found in clustered areas in the USA during the COVID-19 outbreak. Moreover, the strength of the correlations tended to decrease as the outbreak progressed.
In terms of correlation signs (positive or negative), weak negative correlations were found in several clustered areas, as shown in Table 5. A negative correlation in this study illustrates a declining trend in information searches as the number of cases increased. Furthermore, to understand the pattern of correlations over time and time series of cases, data from three states are presented in Figure 3 as examples. This figure shows time series patterns of new daily COVID-19 cases per 100,000 population in Florida, Illinois and Maryland, along with their monthly correlations with Google search volumes during the study period. Their cluster characteristics as a hot spot, cold spot or non-significant area were determined based on Table 5. Figure 3 demonstrates that linearity between the strength of the correlation and the increase in cases and cluster characteristics differed between states. Significant correlations only tended to be found in the early stages of the outbreak. This finding suggests diverse performance of Google RSV data among states and outbreak periods.
Furthermore, the performance of the Google RSV models in strongly correlated areas (Table 6) resulted in RMSE values in unclustered areas ranging from 81.94 to 95.87, while in clustered areas (hot spots, cold spots and both), RMSE values ranged from 61.92 to 1629.92. These findings suggest that Google RSV models may have performed slightly better in some clustered areas, but model performances tended to be unstable, as illustrated by the large RMSE range. In addition, mobility variables, particularly transit stations, workplaces and parks, were identified as important variables in model development. However, huge RMSE values may suggest the absence of other important explanatory variables in the models.

Spatial heterogeneity of COVID-19 cases at state level
As of 27 December 2020, new cases of COVID-19 in the USA accounted for 68% of all new cases in the Americas, placing the USA as the country with the highest number of new cases and deaths (World Health Organization, 2020). The rapid spread of this disease was observed from geographic variations of the most affected counties in Table 2, which is in line with a previous report (Oster et al., 2020b). In addition, COVID-19 spatial clusters in the USA began to emerge in March 2020 (Figure 1) as a national emergency was declared and widespread testing was implemented (Taylor, 2020). However, some clusters continued to expand with the rise of protests, social distancing restrictions, and the re-opening of public facilities in April 2020 (Hauck et al., 2020; Taylor, 2020). Conditions worsened with the end of national social distancing guidelines on 30 April 2020, which led to the implementation of re-opening policies in various states in May 2020, but conditions varied between counties and cities (Hauck et al., 2020). As a consequence, multiple new clusters began to arise in June 2020, as the highest numbers of new daily cases occurred in the south, west and midwest regions of the country (Taylor, 2020). The US Government also loosened travel restrictions at the end of June 2020 (US Department of Defense, 2021). During this period, clusters were found in southern and western counties, as reported previously (Oster et al., 2020b). Massive clusters continued to grow in those areas as positive tests increased in older age groups, leading to higher numbers of hospitalizations, severe outcomes and fatal cases (Oster et al., 2020a). The high COVID-19 incidence rate continued to cause huge clusters in southern counties, which then circulated into central US counties and progressed into northern parts of the country. In addition, better control measures implemented in eastern counties may have been responsible for cold spots arising in those areas.
Research findings showed that small clusters in one or several neighbouring states in the early stage of the outbreak began to develop into larger clusters, involving multiple states, as the outbreak progressed. These results demonstrate the importance of state-to-state coordination in implementing control measures to tackle the spread of new infectious disease outbreaks. Having various preventive policies in neighbouring areas may have promoted the massive growth of clusters. As control measures at state and local levels directly influence the disease incidence and cluster magnitude (CDC COVID-19 Response Team, 2020; Desjardins et al., 2020), coordinated responses are needed urgently. Moreover, this study illustrates that spatial analyses provided clear spatial patterns of disease spread, which could lead to the timely implementation of control measures before high-level community transmission has occurred. Therefore, this type of analysis should be considered as a crucial approach in public health surveillance during outbreak situations to implement focused public health actions. However, spatial clusters may not be induced by the time variable alone, and incorporating other explanatory variables would be beneficial in understanding differences in spatial patterns.

Factors that may affect the predictability performance of Google RSV models
Furthermore, as described in the Results section above, correlations between RSVs and COVID-19 varied in space and time, and the strength of the correlations also tended to decrease as the outbreak progressed. Similar results were found in a previous study, which reported that COVID-19 Google searches did not correspond with actual disease dynamics in 40 European countries (Szmuda et al., 2020). Diverse performances of Google RSV models found in this study suggest that the model performance in predicting new cases can be affected by several aspects, including COVID-19 transmission dynamics, policy-driven community awareness, and past outbreak experiences.
COVID-19 transmission dynamics may affect how the accuracy of the Google RSV model differed month to month as the outbreak progressed. In the early phase, high correlations may have appeared as a result of massive searches from affected communities and groups of people who were concerned about the emerging outbreak. However, with the extensive spread of the disease, people may have been overwhelmed by the enormous volumes of circulating information, and stopped searching for COVID-19-related issues. This may have decreased the volume of information searches and the correlation strength, as observed in earlier studies (Husnayain et al., 2020a,b). At this point, the Google RSV model should have been built based on specific terms rather than using general keywords. This study showed that the use of general COVID-19 terms may have been robust only in the first 5 months after the outbreak began (February-June 2020), as shown in Table 4. Beginning in July 2020, the more specific term of 'covid testing' (search term) had an increasing correlation coefficient. This possibly illustrates that more specific terms, such as vaccines, current control measures and preventive measures, should be used to better represent the public's current concerns, awareness or restlessness. Consequently, routine keyword identification is important to ensure precise analyses when utilizing Google RSV data.
The performance of the Google RSV model may also have been affected by policy-driven community awareness. This means that policies implemented in response to COVID-19 may have influenced public awareness towards the growing outbreak. As state-level policies are primarily affected by governors' decisions, governors' perceptions will contribute directly to the formation of community perceptions and reactions. However, these may also be influenced by the governor's political affiliation, which has been discussed in several previous articles (Green and Tyson, 2020; Jiang et al., 2020; Adolph et al., 2021). Hence, public perceptions and reactions may have altered COVID-19 online information searches to a certain degree. A previous study showed that COVID-19 queries in the USA increased more slowly than they did in other countries (Husain et al., 2020), which may also describe how the public responded to the degree of the emergency.
Finally, past experience with an outbreak may affect the robustness of the Google RSV model. As COVID-19 was a new outbreak that had global impacts, the public may have responded in diverse manners. Countries which were highly affected by the previous severe acute respiratory syndrome and Middle East respiratory syndrome outbreaks may have exhibited high numbers of searches and strong predictability performance of Google RSV models, particularly China, Taiwan (Husnayain et al., 2020a) and South Korea (Husnayain et al., 2020b). In brief, as the accuracy of the COVID-19 Google RSV model may be influenced by these three major aspects, the Google RSV model derived from general terms in the USA was only valid for use in the first 5 months of the outbreak. More specific keywords should be used in later stages of the outbreak. Moreover, because of the limited strong correlations found in clustered areas, the Google RSV model in the USA may be better utilized for designing risk communication rather than for predictive purposes. The sign (positive or negative) of correlations can be utilized to understand public responses to control and preventive measures, as well as for communicating risk. Negative correlations could be used as an alert, indicating the need for intensive risk communication and a campaign of preventive measures.
In addition, this study may be subject to several limitations resulting from errors in reporting case data and limited terms used for the data query. This study only used English terms, and did not consider Spanish or other indigenous languages which are also used in the USA. Future studies could incorporate spatial modelling tasks for predicting active clusters that combine distributions of Google RSVs with other significant explanatory variables. Such variables might include income inequalities, median household incomes, the proportion of black females, the proportion of nurse practitioners (Mollalo et al., 2020), age, disability, language, race, occupation, urban status (Andersen et al., 2021) and crowded housing conditions (Dasgupta et al., 2020). However, more dynamic variables may be required to increase the performance of the model. This study found that mobility variables are important variables in model development. Transit stations, workplaces and parks became the most common variables included in the models for a few months in the early stage of the outbreak, as working from home was widely implemented. However, the model should be constructed carefully to prevent the introduction of biases when designing the models. Furthermore, this study only used Google RSVs and mobility data with a lag of 7 days for analysis. This period was chosen to prevent a mass media reporting effect on Google searches over longer lag periods. Further analysis in defining the best lag period is needed to increase the accuracy of the study.

Several considerations when utilizing Google search data as a public health surveillance tool
Utilizing Google RSV data as a secondary public health surveillance tool is promising for the future. Google search data are publicly available at low cost, and potentially cover online information-seeking behaviours of the majority of people as most people use the Internet to search for specific terms in search engines (Mavragani, 2020; Schneider et al., 2020). Therefore, internet search data could potentially provide patterns unreported by traditional surveillance measures, such as the number of ill people who did not seek medical treatment but searched for health-related information (Barros et al., 2020). This method can potentially be used as an online surveillance tool in countries with limited resources (Schneider et al., 2020). Online queries also offer anonymous data that can potentially assess a large population (Mavragani, 2020). These opportunities make this infodemiological method a valuable approach in understanding the occurrence of illnesses circulating in the general population that can be inspected promptly. However, the findings of this study suggest the variability of Google RSV model performance between states and time periods (Figures 2 and 3; Tables 4-6). Different states may utilize Google RSVs in different frameworks. In highly correlated states, Google searches may be used for prediction tasks, while other states may use them to understand public responses and design risk communication.
Although promising, some issues need to be considered when employing information search data. Changes in online information and communication patterns that reflect user-generated data in infodemiology need to be validated to distinguish a true epidemic from an epidemic of fear (Eysenbach, 2011). People searching for flu information do not always reflect people suffering from flu, and can be affected by sudden incidents or events (Barros et al., 2020; Eysenbach, 2011; Mavragani, 2020). Recent studies have shown that Google Trends data cannot distinguish whether searches represent public concern or interest (Springer et al., 2020a,b), and the surge in online information searches related to coronaviruses for particular terms was irrespective of the time occurrence of the outbreak, which indicates that Google Trends data were closely affected by media coverage (Sousa-Pinto et al., 2020). Therefore, this proxy should be used with caution because it could be affected by false-positive events, such as in the case of an infodemic where Google searches may more closely represent the public's fear instead of disease dynamics. Regular updates of the keywords used in search query monitoring are necessary to maintain validity as trends emerge and a population's health information-seeking behaviours change. Other issues in the infodemiological approach are related to internet penetration and access problems, preferences used by certain age groups, and transparency in how internet search data are collected (Barros et al., 2020). In addition, information search data may leak from future to past observations in the case of retrospective analyses. Thus, future research should consider weekly data retrieval during the season to prevent information leaks from future to past observations (Schneider et al., 2020). Other emerging data sources, including Twitter, websites/platforms, blogs/forums, Facebook, reviews, mobile apps, Instagram, news/media, Wikipedia, health records and online surveys, are also important in conducting digital surveillance.

Table 6 Performance of the Google relative search volume (RSV) models in strongly correlated areas (with Spearman's rank correlation coefficients of ≥0.7).

Conclusions
Small clusters in one or several neighbouring states in the early stage of the outbreak triggered larger clusters involving multiple states as the outbreak progressed. In the later phase of the outbreak, clusters circulated in counties located in the middle of the country, and progressed into northern parts. These results demonstrate the importance of state-to-state coordination in implementing control measures to tackle the spread of new infectious disease outbreaks. In addition, better control measures may have been performed in eastern counties based on the rise of cold spots in those areas.
Variabilities in Google RSV model performances were found among states and time periods. This suggests that different frameworks need to be implemented in each state when utilizing Google RSV data. In addition, mobility variables were identified as important variables in predicting new daily COVID-19 cases. Google searches may be used in prediction tasks in highly correlated states, while they can be used in other areas to understand public responses and design risk communication. Moreover, the sign (positive or negative) of the correlation can be utilized to understand public responses to control and preventive measures, as well as in communicating risk.

Declaration of Competing Interest
None declared.

Ethical approval
Not required.