Impact analysis of environmental and social factors on early-stage COVID-19 transmission in China by machine learning

As a highly contagious disease, COVID-19 caused a worldwide pandemic and it is still ongoing. However, the infection in China has been successfully controlled although its initial transmission was also nationwide and has caused a serious public health crisis. The analysis on the early-stage COVID-19 transmission in China is worth investigating for its guiding significance on prevention to other countries and regions. In this study, we conducted the experiments from the perspectives of COVID-19 occurrence and intensity. We eliminated unimportant factors from 113 variables and applied four machine learning-based classification and regression models to predict COVID-19 occurrence and intensity, respectively. The influence of each important factor was analysed when applicable. Our optimal model on COVID-19 occurrence prediction presented an accuracy of 91.91% and the best R2 of intensity prediction reached 0.778. Linear regression-based model was identified as unable to fit and predict the intensity, and thus only the variable influence on COVID-19 occurrence can be explained. We found that (1) CO VID-19 was more likely to occur in prosperous cities closer to the epicentre and located on higher altitudes, (2) and the occurrence was higher under extreme weather and high minimum relative humidity. (3) Most air pollutants increased the risk of COVID-19 occurrence except NO2 and O3, and there existed a lag effect of 6–7 days. (4) NPIs (non-pharmaceutical interventions) did not show apparent effect until two weeks after.


Introduction
The novel coronavirus disease 2019 (COVID-19) caused by severe acute respiratory syndrome coronavirus-2 (SARS CoV-2) has become a global pandemic. As of June 15, 2021, there have been over 175 million confirmed patients and over 3.8 million deaths due to COVID-19 (WHO, 2021). On the initial stage of COVID-19 transmission, WHO confirmed on January 12, 2020, that the infection in Wuhan belongs to coronaviruses (Bashir et al., 2020). The virus spread to surrounding provinces in China rapidly during January, and thus Wuhan, Hubei province restricted personnel movement on Jan 23 to prevent more severe transmission (Yu et al., 2020;Andersen et al., 2021). All provinces of mainland China were on high alert and responded swiftly . Comprehensive and stringent epidemic prevention and control measures were taken nationwide (The State Council Information Office of the People's Republic of China, 2020). It was proved that the feedback of these measures was encouraging, only sporadic cases have been reported on mainland China since April 29, 2020 (The State Council Information Office of the People's Republic of China, 2020).
The recovery of COVID-19 in mainland China cost less than half a year and has provided a valuable background for scholars to investigate the correlation between the COVID-19 transmission and meteorological, social, demographic and geographical factors (Liu et al., 2020;Sun et al., 2020b;Wang et al., 2021;Xiao et al., 2021). Currently, it is a popular methodology to statistically analyse how various parameters influence the spreading of COVID-19 from both global and regional perspectives. For example, related studies showed that some environmental and economic factors such as PM 2.5 and GDP have a positive effect on preventing COVID-19 infection (Ahmed et al., 2021). A study in the United States verified that meteorological parameters (higher temperature) and social background (physical distancing intervention) together contribute to a lower risk of COVID-19 transmission (Guo et al., 2021b). However, most studies on COVID-19 transmission in early 2020 were focused on the impact of environmental quality rather than socio-economic, governance, transportation and other elements (Sharifi and Khavarian-Garmsir, 2020).
Previous studies on COVID-19 transmission factors were usually conducted via statistical methods. Varies of models have been proved to be significantly effective in revealing the impact of different factors and thus provided recommendations in prevention (Babuna et al., 2021;Coşkun et al., 2021;Lorenzo et al., 2021;Xiao et al., 2021). Apart from conventional statistical analysis, spatial statistics has also shown its reliability and priority in digging out the complex correlation from geographical data (Sun et al., 2020a;Andersen et al., 2021;Dupre et al., 2021). Furthermore, some of these statistical models showed the potential to forecast the trend of the COVID-19 pandemic (Guo et al., 2021a).
As a popular technology, machine learning has also been widely applied and thus brought significant findings when it comes to the analysis of the COVID-19 pandemic. Kuo and Fu (2021) integrated 8 types of basic machine learning techniques to build up a robust model to predict the number of patients based on environmental, demographic and mobility data. Li et al. (2021) built a ridge logistic model and found out how numerous factors may affect COVID-19 transmission and fatality. Nevertheless, most of these machine learning-based studies focused on the influence factors of worldwide transmission instead of the early-stage pandemic in China.
Therefore, we applied several machine learning methods on a wide range of environmental, social, economic, meteorological and geographical data during the early-stage COVID-19 pandemic in China, and thus revealed the impact of all these factors. Meanwhile, we evaluated and compared the machine learning-based models to work out an optimal solution for estimating the transmission of COVID-19.

Data source
In this study, we collected prefecture-level data in mainland China and all the data obtained were daily data between Jan 10 and Jun 11 when the number of daily new patients decreased to less than 300 (Johns Hopkins University, 2021). Each record represented the parameters of each prefecture per day.
The data of confirmed and newly confirmed patients were obtained from the National Health Commission of the People's Republic of China (2021) and provincial or prefectural health commissions. 13,332 newly confirmed cases of novel coronavirus pneumonia (NCP) were officially released on Feb 12, such a numerous increase was due to the change of diagnosis standard, which was previously relying on nucleic acid testing but then changed to clinical diagnosis (Chinadaily, 2020). We evenly distributed the increment of patients on Feb 12 to each of the previous days to assure our data reflected the true transmission.
Since the relation between an independent factor and disease incidence counts may fail to indicate the association with disease transmission (Zhao, 2020). It should be noticed that COVID-19 was not balanced distributed since the COVID-19 transmission varied a lot among different prefectures. Therefore, our study was conducted from two separate perspectives of COVID-19 occurrence (COV_O) and COVID-19 intensity (COV_I). COV_O were binarized to 1 and 0 depending on whether COVID-19 is over 0. COV_I inherited the natural logarithm of COVID-19 to smooth the unbalanced distribution.
The meteorological data we collected included daily minimum temperature (MinT), maximum temperature (MaxT), mean temperature (MeanT), relative humidity (Rh) and minimum relative humidity (MinRh). These data were downloaded from the China Meteorological Data Service Center (2021).
The atmospheric environmental quality data, including daily CO, NO 2 , O 3 , PM 2.5 , PM 10 , SO 2 and AQI (air quality index), were downloaded from The Data Center of The Ministry of Ecology And Environment of The People's Republic of China (2021). We applied linear interpolation to replace missing values. Meanwhile, considering the delay effect of atmospheric environmental indicators, we defined CO_1 as the CO content of one day before, and CO_9 was the CO content of 9 days before. The same naming rule was used on all the other atmospheric environmental indicators.
The geographical data used in this study was a 30m digital elevation model (DEM) product called 'ASTER GDEM', which was downloaded from Geospatial Data Cloud (2021). The mean DEM (MeanDEM) of each prefecture was calculated.
The socio-economic data we used consisted of three parts. The first part is human-related economic data. We collected the household population (Pop), gross domestic product (GDP), and population density (PD) of the year 2019, which were the latest accessible version. The first two were collected from all provincial statistical yearbooks of China. The population density was downloaded from Ministry Of Housing and Urban-Rural Development of the People's Republic of China (2021).
The second part is migration-related data. We collected destination proportion in population flow from Wuhan (WH) and migration scale (MS) from Jan 1 to Jan 23 (Wuhan lockdown), 2020 (Baidu, 2020). Then we calculated the destination migration scale flow from Wuhan (Popmob) according to the following equation: Meanwhile, considering the delay effect, we defined Popmob1 as Popmob of one day before, and Popmob9 was defined as Popmob of 9 days before, and so on naming Popmob0 to Popmob9. Besides, the cumulative Popmob (Popmobsum) from Jan 1 to each corresponding date was also calculated. The distances from Wuhan to the geometric centre of each prefecture (DisWH) were calculated. The third part is the level of the adopted control measures (Reslevel) in 366 prefectures. These data were collected from the National Health Commission of the People's Republic of China (2021) and provincial or municipal health commissions. China's public health alert system is categorized into four levels in terms of the nature of the incidents, the extent of harm and scope: Level-I (extremely significant), Level-II (significant), Level-III (major) and Level-IV (normal) (The Central People's Government of the People's Republic of China, 2020). We quantified these response levels that no response, Level-IV, Level-III, Level-II and Level-I were ranked 0, 1, 2, 3 and 4, respectively. Meanwhile, considering the delay effect, Reslevel1 was defined as the response level one day before, Reslevel20 was defined as the response level 20 days before, and so was the others.
Days from Jan 10 (Time) of each record was also extracted and regarded as a potential influence parameter considering that the temporal aspect may influence a lot in the early-stage transmission.
More specific information of all the aforementioned data was shown in Appendix A.

Factor filtering
In this study, we collected 113 factors. Such high dimensional data would massively increase the difficulty of machine learning. Besides, some factors may not have a strong correlation to COVID-19 occurrence or COVID-19 intensity where such interruption might bring unexpected noise to our analysis. Therefore, a preprocess of factor filtering was necessary.

COVID-19 occurrence
We built Gradient Boosting Decision Tree (GBDT) and Random Forest (RF) classification models on COV_O to find out the importance of each factor. These two machine learning-based classification models were built and modified based on the Python module 'Scikit-learn' (Pedregosa et al., 2012).
RF fits a certain number of decision trees trained on randomly selected subsamples of the training data by the bagging approach and the output of the RF classification model is the class selected by most trees (Kulkarni and Lowe, 2016;Kuo and Fu, 2021). GBDT is also in the form of ensemble decision trees but uses boosting approach and it usually overperforms RF (Piryonesi and El-Diraby, 2020;Kuo and Fu, 2021). In this preprocessing, we set the number of trees to 1000 for both two models.
Each type of data was normalized to the interval of [0, 1] based on the maximum and the minimum. The influence of each factor was calculated by both these two models. The influence was computed via Gini importance, which was derived from the Gini index (Breiman, 2001). Since our classification was binary according to two categories of COV_O, the Gini index was then computed by the following equation (Qi, 2012): where G k is the Gini index of node k, and p represents the fraction of the positive example assigned to this node. The Gini importance of each feature in a tree was the sum of the Gini index reduction and the feature importance was the sum of Gini importance in all trees (Qi, 2012). The higher feature importance means a stronger correlation.
For each model, we recorded 50 factors the influence of which ranked the least and deleted the factors which were listed in both these two results. In this way, the least related variables were all removed to avoid noise interruption in machine learning.

COVID-19 intensity
Similar to the preprocess on COVID-19 occurrence, GBDT and RF were also adopted to build regression models to explore the importance of each factor for COVID-19 intensity and we also applied the same normalization to each type of data. The number of trees was set to 1100 and 1500 for GBDT and RF regression models respectively. The last 50 influence factors were also recorded and the repetitive factors were dropped as well.

Estimation models 2.3.1. COVID-19 occurrence
Since we have 51,776 records the COV_O of which were 0 and only 4588 records were 1, we randomly select 10% from the former COV_O records and thus the distribution of these two categories was more balanced to avoid overfitting problem. Then we randomly split the training and test data for the following classification models by the proportion of 70% and 30%.
Four machine learning models were adopted to estimate the COVID-19 occurrence. We retained identical GBDT and RF classification applied in factor selection considering they were among the most popular realizations of ensemble learning (Piryonesi and El-Diraby, 2020). Moreover, we have built a ridge classification model and a 3-layer artificial neural network (ANN), then compared them with the aforementioned two models to find the optimal solution.
The ridge classifier is referred to a least-squares support vector machine (LSSVM) with a linear kernel that transforms the binary targets into (− 1, 1) and then does a regression task (Scikit-learn, 2021). To increase the reliability of COV_O estimation, we applied 10-fold cross-validation to our ridge classifier.
The ANN included three hidden layers where the neural nodes were 64, 48, and 16. All records were segmented into training, validation, and test sets by 60%, 20%, and 20%, respectively.
The results of all four classification models were evaluated based on their accuracy, precision and recall calculated by the following equations: where c i refers to the number of correct estimations of class i, n i refers to the true record number of the class i, m i is the number of class i records estimated by the model, and s is the total amount of all records. Considering the algorithm of the ridge classifier is based on linear regression, the coefficients of all selected factors were finally computed. The factors with an absolute coefficient over 0.1 were then listed and analysed.
The overall workflow on COVID-19 occurrence is shown in Fig. 1.

COVID-19 intensity
Based on 4588 records that presented non-zero COVID-19 intensity, we kept them all and also split them into training and test sets by 70% and 30%.
Consistently, we employed four machine learning techniques to estimate the COVID-19 intensity. GBDT and RF were still preserved due to their good performances. However, instead of ridge regression, in this experiment, we used elastic net (EN) with 10-fold cross-validation. Compared to ridge regression, EN can discard unimportant variables but keeps the stability, and it usually gives a significantly accurate prediction (Zou and Hastie, 2005). Besides, we also constructed a 2-layer ANN which had two hidden layers individually consisting of 96 and 10 nodes.
The results were finally assessed by the coefficient of determination (R 2 ) and mean squared error (MSE) of all estimations for both training and test sets. R 2 and MSE were calculated by the following equations: where m refers to the records of all non-zero COV_I, y i refers to the true COV_I of the i th record, ŷ i refers to the estimated COV_I of the i th record, and y is the mean COV_I of all records. The EN model is also a linear regression-based model and thus its coefficient analysis of all input variables was then conducted to explain how each factor would impact the COVID-19 intensity.
The overall workflow on COVID-19 intensity is shown in Fig. 2.

Results
The last 50 COV_O influence factors calculated by GBDT and RF classification are shown in Table 1. Based on these results, 26 factors (Table 2) were finally cross-selected and dropped before fitting the estimation model.
Apart from the AQI with a lag of 7 days, all the other AQI factors were proved to be unrelated to the number of patients and deleted. Some other air pollutants such as SO 2 , CO, PM 10 , PM 2.5 and also the migration scale flow from Wuhan with the lag of various days were identified lack of influence and then removed.
As for the last 50 COV_I influence factors (Table 3), some Air pollutants such as AQI, CO, NO 2 , PM 2.5 and SO 2 were removed based on the cross-selection results (Table 4). Besides, two social factorsthe migration scale from Wuhan and the response levelwith the lag of certain days were also removed.
The final performances of four machine learning models on the test sets for COVID-19 occurrence classification are shown in Fig. 3 as the format of confusion matrixes. The accuracy, precision, and recall were also calculated and shown in Table 5.
All these four techniques showed spectacular performances on the classification of test sets that their accuracies were identically over 88%, and GBDT presented the highest accuracy of 91.91%. Meanwhile, all the high precisions and recalls also indicated that these four classifiers were similarly convincing.
Considering that our Ridge classifier presented excellent performance, 38 factors that showed high influence with the absolute coefficient over 0.1 were retained and listed in Table 6.
As for the prediction on COVID-19 intensity, the results of all four regression models are presented in Fig. 4 and their assessment metrics are shown in Table 7.
The R 2 of the EN model was below 0.4 and its MSE reached over 1.0, which was not feasible enough when the dependent variable (natural logarithm of COV_I) ranged from − 4 to 6. In this case, the coefficient analysis was not applicable. The other three regression models (GBDT, RF, and 2-layer ANN) all performed well that their R 2 on both training and test sets were over 0.75.

Discussion
Most of our proposed machine learning classification or regression models were tested to be reliable. All four classification models provided convincing accuracies and GBDT was tested to be the optimal solution. GBDT performed the best on all three assessment indices compared to the other models.
Considering the good performance of our ridge classification model, the coefficient analysis of all the selected factors on COVID-19 occurrence, therefore, brought reasonable and inspiring findings: The analysis of our meteorological data showed that the maximum temperature had a positive impact on the occurrence of COVID-19 cases while the minimum temperature showed a negative impact. Although it was believed that the higher temperature would reduce the possibility of COVID-19 transmission (Liu et al., 2020), our findings indicated that the COVID-19 was more likely to occur where the temperature was extreme and such connection was stronger where the maximum temperature is higher. However, the meteorological impacts could be very complex considering that human interruptions such as the frequencies of outdoor activities were highly related to the weather. Our study also indicated that the minimum relative humidity had a positive impact on COVID-19 occurrence and this could be supported by current studies (Qiu et al., 2021).
The analysis of the atmospheric environmental quality data explained the correlation from different perspectives. It was believed that CO, NO 2 , O 3 , PM 10 , and PM 2.5 synergistically impact the COVID-19 occurrence with a lag of 8-9 days (Qiu et al., 2021). Our study verified that such lag did exist but such duration should be 6-7 days, except CO and O 3 . CO at 4-5 days in advance was found to be the inflection point that its lag effect was shorter than the other pollutants. Besides, O 3 did not show a significant inflection point and thus its lag effect was still yet to be explored. Another surprising finding was that NO 2 and O 3 both showed a negative correlation with COVID-19 occurrence, which was not aligned with the conclusion that most scholars thought they were supposed to increase the severity of transmission (Barnett-Itzhaki and Levi, 2021). However, we applied the lag effect to avoid the possible impacts of COVID-19 on the air pollutants which can be confusing when fitting the prediction model. Therefore, our study should be more reliable compared to the former researches.
In this study, it was found that COVID-19 patients were more likely to occur in the prefectures with higher GDP, denser population, more intensive population mobility and closer to the epicentre, which was aligned with some studies (Coccia, 2020;Qiu et al., 2021). Nevertheless, for population density, some studies stated that it was not significantly associated with the spreading of COVID-19 (Sun et al., 2020b). Considering this inconsistency, an explanation for our result was that the transportation capacity in prosperous cities was usually larger and thus increased the frequency of human interaction, which made it vulnerable to the COVID-19 epidemic. Whereas the medical facilities were usually more accessible and sufficient in big cities to prevent a further internal epidemic, and thus these factors might negatively influence the COVID-19 intensity instead of occurrence. However, this needed to be proved by further exploring.
Our result also presented that the advance control measure response level was not positively influencing the nationwide spreading until the lag of 15 days, where it was found to be the turning point from the negative coefficient to the positive. This finding was consistent with the widely acknowledged viewpoint that the NPIs (non-pharmaceutical interventions) continuing for two weeks can reduce the spreading effectively (Vardavas et al., 2021).
In our study, the finding on the geographical data revealed that COVID-19 occurrence is higher where the altitude is lower, which was consistent with the finding of Sun et al. (2020b). The landscape and the prefectural development status in China mainland, the prefectures located on higher elevation were closely associated with both the meteorological data as well as the socio-economic status. Hence, such a negative correlation was assumed partly due to the synergy of these two causes.
As for the prediction on the number of COVID-19 patients, RF presented the lowest MSE and the best R 2 on the training set but GBDT   Table 2 26 cross-selected COV_O influence factors to be deleted.  outperformed the others with its best estimation results on the test set. The performance of ANN on COVID-19 intensity estimation was brilliant even though it was not as feasible as GBDT. This could be explained by two reasons: (1) Given that all records were at the prefectural level, data of the same prefecture more or less had a homogeneity problem that most variables were similar values, and thus the neural network failed to train on enough effective data.
(2) The distribution of COVID-19 intensity records was still a little skewed even after the transformation of the natural logarithm. Although these three models were slightly overfitted, such results were still good enough considering the extremely unbalanced distribution of COVID-19 intensity. However, the poor fitting performance of the EN model indicated that this prediction on the number of patients is very complex rather than a single linear regression model can express. While most COVID-19 spreading prediction models using machine learning focused on the number of patients-only statistics instead of taking multifactor as the parameters (Ardabili et al., 2020;Pinter et al., 2020;Rustam et al., 2020), the models proposed in our study counted a series of variables to ensure convincingness. Besides, the brilliant performances of these models verified that a reliable forecast on the regional early-stage COVID-19 transmission could be accessible and it is worth further researches. Moreover, the early-stage infection of more contagious diseases rather than COVID-19 is also believed to be achievable based on our study.
Potential improvement on the analysis of early-stage COVID-19 transmission is worth more endeavour from different perspectives: (1) A wider range of variables such as vaccination may have indirect impacts which have not been fully understood, and they might contribute to a more comprehensive model, especially for complex COVID-19 intensity prediction; (2) The quantification of the control measures in other countries is yet to be studied for better use of our optimal model; (3) More specific correlation and the synergistic effect of different factors are worth further exploration.

Conclusion
Various factors synergistically influence the early-stage COVID-19 transmission in China: (1) COVID-19 was more likely to occur in prosperous cities closer to the epicentre and located on higher altitudes; (2) The extreme weather and higher minimum relative humidity increased the vulnerability to COVID-19 occurrence; (3) More air pollutants increased the risk of COVID-19 occurrence except for NO 2 and O 3 , and most of the pollutants did not show the impact until 6-7 days after; (4) The control effects of NPIs on COVID-19 occurrence became significant after two weeks. The accuracy of our optimal model used to predict the COVID-19 occurrence reached 91.91%. Besides, our optimal regression model presented the best R 2 of 0.778 on the prediction. However, the specific connections between the number of COVID-19 patients and each factor were complex and work further exploration. Overall, this is the first study that analysed multifactor early-stage COVID-19 transmission in China and precisely estimated the pandemic trend by machine learning. Our study brings guiding significance to the forecast of the early-stage regional COVID-19 epidemic and enables quicker response to prevent regional outbreaks from becoming a nationwide pandemic.

Authors' contributions
Juan Qiu and Rendong Li conceptualized the study, Yifei Han directed the study's implementation, drafted and finished the manuscript, Jinliang Huang review the drafts and supervised the study, Qihui Shao, Dongfeng Han and Xiyue Luo collected resources and helped to interpret the findings.

Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.