Data-related and methodological obstacles to determining associations between temperature and COVID-19 transmission

A growing number of studies have evaluated the associations between ambient temperature and coronavirus disease 2019 (COVID-19). However, most of these studies were rushed to completion, rendering the quality of their findings questionable. We systematically evaluated 70 relevant peer-reviewed studies published on or before 21 September 2020 that had been implemented from community to global level. Approximately 35 of these reports indicated that temperature was significantly and negatively associated with COVID-19 spread, whereas 12 reports demonstrated a significantly positive association. The remaining studies found no association or merely a piecewise association. Correlation and regression analyses were the most commonly utilized statistical models. The main shortcomings of these studies included uncertainties in COVID-19 infection counts, problems with data processing for temperature, inappropriate control of confounding parameters, weaknesses in the evaluation of effect modification, inadequate statistical models, short research periods, and the choice of research areal units. It is our viewpoint that most of the 70 identified publications have significant flaws that prevent them from providing a robust scientific basis for the association between temperature and COVID-19.


Introduction
The coronavirus disease 2019 (COVID-19) pandemic, which is ongoing at the time of writing, has attracted increasing research interest (Gong et al 2020). An understanding of the driving factors of COVID-19 transmission is urgently needed owing to the extensive public health implications (Kraemer et al 2020). Whether warm temperatures suppress the spread of COVID-19 has become a hot topic of discussion that has attracted considerable social media and political attention worldwide, since preliminary laboratory studies indicated that high temperatures can lower the survival of the COVID-19 virus (Baker et al 2020, NAS 2020). Inputting the keywords 'temperature' and 'COVID-19' into the Web of Science yielded hundreds of results (as of 21 September 2020), but the main findings of these publications were not consistent (Fang et al 2020, Jüni et al 2020, Pan et al 2020). As a large proportion of this research had been conducted in a rush (Glasziou et al 2020, Heederik et al 2020), its findings may be more likely to generate public confusion than to contribute to scientific knowledge (Zeka et al 2020). A recent study criticized all of the studies associated with ambient air pollution and COVID-19 incidence and mortality, arguing that they were susceptible to significant sources of bias (Villeneuve and Goldberg 2020). Compared with studies on air pollution associated with the COVID-19 pandemic, more research has been conducted on the correlations between temperature and COVID-19 transmission. Data-related and methodological concerns are particularly prominent in the latter studies, inhibiting their efforts to explicitly elucidate the complexity of the role of temperature in COVID-19 spread. In this study, we first identified relevant reports and then explored the adequacy of the data and methods used, rather than attempting to conclude whether temperature influences COVID-19 transmission.

Methods
To identify articles associated with temperature and COVID-19 spread, we searched ScienceDirect (www.sciencedirect.com/search), PubMed (https://pubmed.ncbi.nlm.nih.gov/), and Web of Science (www.webofknowledge.com) using the search terms 'COVID-19' or 'SARS-CoV-2' and 'temperature' and 'association' through 21 September 2020. After examination of the titles, abstracts, and full text, 70 studies remained, as illustrated in figure 1. Since we excluded papers without peer review, we did not use other search engines to examine preprints posted on the Internet.

Research status
The details of the 70 retrieved articles, including their locations, study design, adopted models, study period, confounding variables, and main findings, are presented in supplementary material table S1 (available online at stacks.iop.org/ERL/16/034016/mmedia). Approximately 35 reports indicated a negative association between temperature and COVID-19 transmission (table 1), whereas 9 studies suggested a positive association. Some researchers demonstrated that such associations were piecewise, or found no clear link between temperature and COVID-19 spread. Regarding location, approximately 73% of the studies (10 in one city and 41 in multiple regions) had been conducted within one country. Of these 51 studies, 15 had been conducted in China; this is unsurprising, because COVID-19 was first detected in Wuhan, China. Seven studies had been conducted in the U.S. and India, followed by four in Spain, three in Brazil, and three in Japan (figure 2).

COVID-19 infections
As shown in table 1, the daily new or cumulative COVID-19 counts were the most commonly adopted dependent variables, of which most were from official health departments. During the early stage of the COVID-19 outbreak, the underreporting of COVID-19 infections and deaths due to the lack of adequate testing in most countries might have influenced the determined temperature-associated effects (Chatterjee 2020). Furthermore, testing ability commonly increases as a pandemic evolves (Tromberg et al 2020), thereby inducing bias in the time-series analysis. Nonetheless, few of the reports we retrieved considered the effects of testing ability in their analyses (Pan et al 2020).
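The bias introduced by growing testing capacity can be illustrated with a minimal sketch. All numbers below are invented, and dividing cases by tests is only one crude adjustment, assumed here for illustration; it is not a method taken from the reviewed studies.

```python
# Hypothetical 30-day series: testing capacity grows while the share of
# positive tests stays fixed at 5%, i.e. the underlying situation is stable.
days = list(range(30))
tests = [1000 + 200 * d for d in days]   # tests performed per day (invented)
cases = [t // 20 for t in tests]         # confirmed cases: 5% of tests

# Raw counts appear to grow strongly although nothing but testing changed.
raw_growth = cases[-1] / cases[0]

# Cases per test (positivity) stays flat, so the apparent trend disappears.
positivity = [c / t for c, t in zip(cases, tests)]
positivity_range = max(positivity) - min(positivity)
```

Any covariate that also trends over the same 30 days, such as seasonal temperature, would correlate with the raw counts in this example even though it has no effect by construction.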
There are marked discrepancies in testing ability between regions worldwide (https://ourworldindata.org/coronavirus-testing#testing-for-covid-19-background-the-our-world-in-data-covid-19-testing-dataset). Testing coverage is particularly low in some developing countries. Such inequalities should be inspected carefully because they may cause considerable estimation errors in ecological studies (Iqbal et al 2020, Pan et al 2020). In addition, uncertainties associated with asymptomatic COVID-19 infections or variations in silent transmission between regions can significantly modify the estimation of the associations between temperature and COVID-19 spread (Jia et al 2020).
The changing definitions or misclassification of COVID-19 during the pandemic also affected the COVID-19 counts. Using China as an example, the case definition was initially narrow and was broadened later to include more infection cases as knowledge increased (Tsang et al 2020). However, most authors did not consider the effects of changing the case definition in their statistical analyses.

Study design
Of the 70 identified publications, there are 24 ecological studies and 45 time-series studies (table 1). The time-series studies can be further divided into two types: temporal (31) and spatio-temporal (14) studies. Each study type has inherent possible biases (Villeneuve and Goldberg 2020), e.g. the ecological fallacy or cross-level bias in the ecological studies. The study design is particularly crucial in relation to the statistical models and confounding variables. For example, in most temporal studies, correlation analysis was adopted without any confounding variables. Both the ecological and time-series studies can be analyzed by regression and correlation analysis. Some statistical models, including the (S)ARIMA approach, are widely used in time-series analysis.

Statistical model
Correlation analysis was conducted in more than 30% of the reports. In particular, of the 21 studies that used correlation analysis, 13 implied a negative association, whereas 9 exhibited a positive association (table 1). The conclusions of the correlation analyses were not always solid because they did not control for any confounding factors, which might have masked the true effect. Over the last 6 months, the temperature has increased or decreased owing to seasonal changes. Meanwhile, the spread of COVID-19 has in some cases been strongly suppressed by strict policy interventions. Thus, although most of the reviewed authors declared that their correlation analysis results did not indicate causality, these publications may still confuse public opinion regarding driving factors. Regression models were also widely used in the retrieved studies. Most of the researchers had conducted time-series analysis, although some did not follow the accepted methods of time-series analysis. We noted that multiple linear regression was utilized in some studies (Haque and Rahman 2020, Ladha et al 2020), implying that the error in daily new cases was assumed to have a normal distribution. For count data (such as infection cases), Poisson, negative binomial, and zero-inflated regression models are more suitable, with the negative binomial model in particular accommodating overdispersion (Villeneuve and Goldberg 2020).
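As a rough illustration of why the distributional assumption matters, the following sketch computes the variance-to-mean ratio of a count series. The counts and the 1.5 cut-off are invented for illustration; the ratio is a simple descriptive screen, not a formal test and not a procedure reported by the reviewed studies.

```python
def dispersion_index(counts):
    """Sample variance divided by sample mean.

    A Poisson model implies a ratio close to 1; much larger values suggest
    overdispersion, for which a negative binomial model is often preferred.
    """
    n = len(counts)
    mean = sum(counts) / n
    variance = sum((c - mean) ** 2 for c in counts) / (n - 1)
    return variance / mean


# Hypothetical daily new-case counts with occasional surges.
daily_cases = [2, 0, 5, 1, 40, 3, 0, 7, 65, 4]

index = dispersion_index(daily_cases)
overdispersed = index > 1.5  # illustrative threshold only
```

For these invented counts the variance is far larger than the mean, so neither a normal error model nor a plain Poisson model would describe the data well.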
Besides correlation and regression analyses, some of the researchers used machine learning techniques (Malki et al 2020, Pramanik et al 2020). However, we found that the methodologies of these studies were not easy to follow (Malki et al 2020, Pramanik et al 2020), and their conclusions evinced insufficient understanding of the mechanisms involved.

The factor of temperature
Another concern is how to choose a sound factor to represent temperature. In the identified studies, the authors used the maximum, average, or minimum daily temperature (Goswami et al 2020), diurnal temperature range, moving average (Xie and Zhu 2020, Qi et al 2020a), lagged effect (Briz-Redón and Serrano-Aroca 2020) and cross-basis of temperature (Runkle et al 2020, Shi et al 2020), and yearly or monthly average temperature (Mandal and Panwar 2020, Wei et al 2020). However, at this stage, the differences in model performance between these approaches remain unclear. Furthermore, as a large proportion of the publications did not include sensitivity analysis or explain the reasons for their choices, we cannot determine whether these choices were based on statistical significance, scientific evidence, or other factors. In addition, the median incubation period for COVID-19 is estimated to be 4-5 d, and incubation can extend to 14 d (Bi et al 2020). Together with the additional days required for laboratory confirmation, this delay means that using the temperature on the day of case confirmation is not appropriate.
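One way to avoid same-day temperature is to average it over a lagged exposure window. The sketch below is a minimal illustration; the function name, the temperature series, and the 4 to 14 d window (chosen only to echo the incubation figures above) are assumptions, not a recommendation from the reviewed studies.

```python
def lagged_mean_temperature(temps, day, lag_start, lag_end):
    """Mean temperature over days [day - lag_end, day - lag_start], inclusive.

    Returns None when the window is invalid or extends outside the series.
    """
    first = day - lag_end
    last = day - lag_start
    if lag_start > lag_end or first < 0 or last >= len(temps):
        return None
    window = temps[first:last + 1]
    return sum(window) / len(window)


# Hypothetical daily mean temperatures (degrees Celsius), indexed by day.
temps = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0,
         18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0]

# Temperature assigned to cases confirmed on day 14.
same_day = temps[14]
exposure = lagged_mean_temperature(temps, day=14, lag_start=4, lag_end=14)
```

With a steadily warming series, the same-day value (24.0) is well above the lagged average (15.0), which shows how much the choice of temperature metric can shift the exposure assigned to each case. Repeating the analysis over several windows is one form of the sensitivity analysis discussed above.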

Meteorological factors and air pollutants
Approximately 25 studies did not include any confounding variables, and most of these studies adopted correlation analyses. Most confounding variables fall into two types: time-varying factors (meteorological factors, air pollutants, policy intervention, and others) and location-varying factors (e.g. demography, socioeconomic status, and population). Among the 70 identified studies, different confounding factors pose threats to different types of studies. In particular, time-varying risk factors are threats to both types of time-series studies, whereas location-dependent factors are threats to ecological and spatio-temporal but not purely temporal time-series studies. With respect to time-varying factors, we noted that approximately half of the retrieved reports controlled for meteorological factors, particularly humidity, wind speed, and visibility (table 1). However, similar to the measurement for temperature, the lagged effects of meteorological factors should be considered. Some studies conducted at the country or global scale simply averaged the temperature, the humidity, or other meteorological factors (Kumar and Kumar 2020, Sarmadi et al 2020), even though the weather conditions in some countries, such as the U.S., Russia, India, and China, vary considerably. In contrast, some authors incorporated regional measures for nationwide COVID-19 counts (Iqbal et al 2020, Sarkodie and Owusu 2020, Sarmadi et al 2020), because COVID-19 is prone to outbreaks in mega-cities, particularly those with more people traveling to and from international locations (Dong et al 2020a). Thus, appropriately weighting the corresponding meteorological factors between regions is crucial to disentangle the temperature-related correlations.
Note to table 1: a Some reports did not have detailed information on the study period. b Some reports used multiple methodologies or confounding variables. c Some publications used multiple dependent variables.
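The difference between a simple average and a weighted average of regional temperatures can be sketched as follows. Weighting by population is an assumption made here for illustration (case counts or other weights are equally conceivable), and the regions and values are invented.

```python
def weighted_mean(values, weights):
    """Weighted average; weights must have a positive sum."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(v * w for v, w in zip(values, weights)) / total


# Hypothetical regions: (mean temperature in deg C, population in millions).
regions = {"north": (2.0, 1.0), "centre": (12.0, 4.0), "south": (25.0, 15.0)}
temps = [t for t, _ in regions.values()]
pops = [p for _, p in regions.values()]

simple = sum(temps) / len(temps)      # treats every region equally
weighted = weighted_mean(temps, pops)  # reflects where people actually live
```

In this example the simple national average is 13.0 deg C, whereas the population-weighted value is 21.25 deg C, because most of the population lives in the warm region.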
Some of the retrieved studies also used air pollutants as covariates, such as particulate matter, sulfur dioxide, and nitrogen dioxide (NO2) (Adhikari and Yin 2020, Azuma et al 2020, Jiang et al 2020). The major objective of these studies was to explore the correlations between exposure to air pollutants and COVID-19 transmission, considering that air pollutants are widely associated with human health. Some scientists have argued that such analyses add incremental value during an active pandemic (Heederik et al 2020, Villeneuve and Goldberg 2020).

Policy interventions
Prior studies have demonstrated that strong policy interventions, including face masks, social distancing, hand hygiene, travel or work restrictions, and community isolation, can greatly lower the transmission of COVID-19 (Chu et al 2020, Zhou et al 2020). However, only four of the retrieved studies controlled for social distancing (Rubin et al 2020), non-pharmaceutical interventions (Fang et al 2020), or strict COVID-19 measures (Ozyigit 2020) in their analyses. In a time-series analysis, policy intervention would bend the growth curve in the later period of COVID-19 spread and also decrease the reproduction number or reduce the number of positive counts (Davies et al 2020). It is questionable whether robust conclusions can be generated by models that omit policy interventions. Existing studies have already determined that the stringency indexes for governments' responses (e.g. social distancing, school closing, and public event cancellation) vary substantially between regions (Ashraf 2020, Hale et al 2020). This spatial inequality could reshape the curve between temperature and COVID-19 spread. However, none of the studies we reviewed evaluated how spatial variations in government responses influenced the associations between temperature and COVID-19, especially in ecological studies.

Location-varying factors
Approximately 50% of the publications included the effects of location-varying factors, such as demographic factors, socioeconomic factors (e.g. race, occupation, education, income, age structure, number of hospital beds, and life expectancy), and spatiotemporal factors (e.g. number of days since the first confirmed case), especially in the ecological and spatio-temporal studies. These time-fixed factors that vary over locations may modify the association of COVID-19 with temperature in multi-location temporal studies. Research has shown that the age structures of North Americans and Europeans increase their vulnerability to COVID-19 mortality (Esteve et al 2020), which may be attributable to the relatively high proportions of older people in these regions. Positive correlations were also demonstrated (figure S1) between the proportion of older people, testing number, life expectancy, and gross domestic product per capita worldwide. Thus, researchers need to carefully investigate the potential collinearities between the confounding variables before data analysis. Some data processing techniques, such as principal component analysis and stratified analysis, may be required prior to further analysis.
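A simple first screen for collinearity is to inspect pairwise correlations between candidate confounders. The sketch below uses invented country-level values and an illustrative threshold of 0.8; the function names are hypothetical, and pairwise screening is only a starting point that does not replace diagnostics such as variance inflation factors.

```python
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def collinear_pairs(covariates, threshold=0.8):
    """Return (name_a, name_b, r) for covariate pairs with |r| >= threshold."""
    flagged = []
    for a, b in combinations(sorted(covariates), 2):
        r = pearson(covariates[a], covariates[b])
        if abs(r) >= threshold:
            flagged.append((a, b, round(r, 3)))
    return flagged


# Hypothetical country-level covariates (all values invented).
covariates = {
    "share_over_65": [5.0, 8.0, 12.0, 17.0, 21.0],
    "gdp_per_capita": [2.0, 6.0, 15.0, 38.0, 45.0],
    "hospital_beds": [3.1, 1.2, 4.0, 2.2, 3.0],
}

flagged = collinear_pairs(covariates)
```

Here only the pair of age structure and gross domestic product per capita is flagged, which would suggest combining them (for example through principal component analysis) or stratifying before they enter a regression together.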

Study period and duration
Some ecological studies utilized the confirmed or cumulative COVID-19 counts on a specific day as the dependent variable (Gupta et al 2020, Sarmadi et al 2020). However, these COVID-19 data on a specific day may be greatly influenced by the initial status, growth rate, and calendar date of the first case. Furthermore, the exposure duration of more than 50% of the studies was in the range of 1-3 months or less than 1 month (table 1). Some studies may have selected only a short study period before the implementation of policy interventions, and this short study period raises another issue: are data from a short study period sufficient? Although there is no uniform criterion to determine the minimum length for time-series studies, it is questionable whether a study period of 1-3 months is sufficient. For example, the determination of exposure to air pollution and mortality generally requires a study period of multiple years to control for the long-term trend of adverse health effects and address the seasonality of temperature (Dong et al 2020b).
To some extent, this poses a dilemma for researchers. At the early stage of the pandemic, a number of countries or regions were still in the stage of epidemic growth, and the growth curve may have been less influenced by policy intervention. However, the data from such a short period may not be sufficient to account for temporal trends. Conversely, if a longer study period is adopted, the estimated parameters might be determined more heavily by policy intervention, demographic factors, and socioeconomic factors than by temperature.

Research areal unit
The authors of the retrieved studies investigated temperature and COVID-19 transmission at the community, city, provincial or state, country, and global scales. One study using the daily number of new cases nationwide in India revealed a positive association (Kumar 2020), whereas provincial data in India suggested that temperature was negatively associated with the number of COVID-19 cases (Goswami et al 2020). This difference may have been due to the modifiable areal unit problem (MAUP), which is a form of statistical bias that arises when point-based measurements are aggregated into areal units such as districts. A recent study also found that the correlations between COVID-19 mortality and NO2 were contradictory when aggregated at different levels, indicating that the MAUP should be investigated when exploring the environmental determinants of the COVID-19 pandemic (Wang and Di 2020).
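The following sketch shows how the sign of a correlation can flip with the level of aggregation. The city records are invented solely to reproduce the effect and do not represent the Indian data cited above; aggregating by the unweighted mean is likewise an assumption.

```python
def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def mean(values):
    return sum(values) / len(values)


# Hypothetical city records: (province, temperature in deg C, cases per 100 000).
cities = [
    ("A", 10, 60), ("A", 14, 40), ("A", 18, 20),
    ("B", 12, 65), ("B", 16, 45), ("B", 20, 25),
    ("C", 14, 70), ("C", 18, 50), ("C", 22, 30),
]

# Correlation computed on the city-level records.
city_r = pearson([t for _, t, _ in cities], [c for _, _, c in cities])

# Correlation computed after averaging the same records within each province.
provinces = sorted({p for p, _, _ in cities})
prov_temp = [mean([t for p, t, _ in cities if p == name]) for name in provinces]
prov_cases = [mean([c for p, _, c in cities if p == name]) for name in provinces]
province_r = pearson(prov_temp, prov_cases)
```

At the city level the association is negative, whereas the province averages of the very same records are positively associated, so conclusions drawn at one areal unit need not hold at another.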

Other issues
Other limitations were also noted. First, none of the existing studies considered how the infectivity of the virus changed during the COVID-19 outbreak, although this is an important time-varying factor. In addition, geographical variations in viral strains with distinct infection capabilities may trigger biases in ecological studies. Second, some of the authors adjusted the new/cumulative COVID-19 cases using the baseline on previous days (Zhu and Xie 2020), whereas others did not (Runkle et al 2020, Qi et al 2020b). Similarly, not all of the studies adopted the population as an offset (Shi et al 2020, Qi et al 2020b). These variations in data processing may have hampered conclusions as to how temperature affects the spread of COVID-19. Meanwhile, in some cases, COVID-19 infections stem from clusters (for example, workers in the food/meat processing industry or in markets) rather than the whole population; such cases should be excluded or specified in statistical analysis.
Investigating the role of temperature in the COVID-19 pandemic is important but challenging. Laboratory studies have observed that high temperatures may reduce the survival of the COVID-19 virus (Baker et al 2020, NAS 2020), while field studies did not consistently validate this conclusion. Our suggestion is that the study period should be chosen before the implementation of policy interventions, since policy interventions could strongly bend the growth rate of COVID-19. In addition, compared with ecological or time-series studies, a longitudinal study with individual data at the global scale promises to better address the association between temperature and COVID-19 transmission. Meanwhile, researchers also need to carefully examine the influence of all potential confounding variables.
We also recommend that the influence of temperature on COVID-19 transmission be comprehensively evaluated after the end of this global pandemic. To date, the second wave of COVID-19 is still developing rapidly in some countries, implying that temperature may be unable to significantly suppress COVID-19 transmission. A very recent study concluded that the weather contributed to 17% of the variation in the maximum COVID-19 growth rate, and that UV light, rather than temperature, was most strongly associated with lower COVID-19 growth (Merow and Urban 2020). However, the authors also pointed out that the uncertainty remains high and that aggressive policy interventions are likely to be needed (Merow and Urban 2020). Prior studies indicated that variation in population susceptibility is the driving factor of the COVID-19 pandemic, and that warm temperatures are not anticipated to substantially limit COVID-19 growth (Baker et al 2020, Su et al 2020).
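For reference, a population offset in a count regression is commonly written as below. The notation is introduced here for illustration and is not taken from the reviewed studies: Y_{it} is the case count in region i on day t, P_i is the population of region i, T_{it} is the temperature metric, and z_{it} collects the confounders.

```latex
\log \mathbb{E}[Y_{it}] = \log(P_i) + \beta_0 + \beta_1 T_{it} + \boldsymbol{\gamma}^{\top}\mathbf{z}_{it}
\quad\Longleftrightarrow\quad
\log \frac{\mathbb{E}[Y_{it}]}{P_i} = \beta_0 + \beta_1 T_{it} + \boldsymbol{\gamma}^{\top}\mathbf{z}_{it}
```

Because the coefficient of \log(P_i) is fixed at 1, the model describes the case rate per person rather than the raw count, so regions with different population sizes become comparable.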

Conclusion
This study revealed that data-related and methodological issues mainly concerned data reliability and processing, and the inherent uncertainties in the data decreased the reliability of the statistical analyses. Since the COVID-19 pandemic began, the enormous quantity of manuscript submissions from researchers in different countries and regions has often required reviews to be performed in a rush. This may also be responsible for some data and methodological flaws, since many details might have been overlooked in these review processes in order to provide the newest conclusions regarding the transmission and control of COVID-19. From our point of view, most of the 70 peer-reviewed studies had significant flaws in their methodologies or data design, and greater epidemiological rigor is required to yield robust conclusions. We also encourage authors, reviewers, and editors to work together to scrutinize relevant research more closely, with the aim of producing high-quality studies. With respect to COVID-19 transmission, focusing more on the effectiveness and optimal range of interventions, optimal strategies for reopening the economy and outdoor events, protective materials, and tracing the sources of COVID-19 may better assist in the global fight against the COVID-19 pandemic.

Data availability statement
All data that support the findings of this study are included within the article (and any supplementary files).

Acknowledgments
This study was financially supported by grants (Nos. GWTX05 and SWJC05) from the National Institute of Environmental Health (NIEH), Chinese Center for Disease Control and Prevention (China CDC). We thank Professor Xiaoming Shi at NIEH, China CDC for his valuable guidance and tremendous help with this study. We thank the anonymous reviewers for their insightful comments and constructive suggestions.