Correlation Determination between COVID-19 and Weather Parameters Using Time Series Forecasting: A Case Study in Pakistan

Infectious diseases like COVID-19 spread rapidly and have led to substantial economic loss worldwide, including in Pakistan. The effect of weather on the spread of COVID-19 needs more detailed examination, as some studies have claimed that weather conditions mitigate its spread. COVID-19 was declared a pandemic by the WHO and has been reported in about 210 countries and territories across Asia, Europe, North America, and other regions. Person-to-person contact and international air travel were the leading causes behind the spread of SARS-CoV-2 from its point of origin, besides natural forces. However, further spread and infection within a community or country can be aided by natural elements, such as the weather. Therefore, the correlation between COVID-19 and temperature can be better elucidated in countries like Pakistan, where SARS-CoV-2 has affected at least 0.37 million people. This study collected Pakistan’s COVID-19 infection and mortality data for ten months (March–December 2020). The related weather parameters, temperature and humidity, were also obtained for the same period. The collected data were processed and used to compare the performance of various time series prediction models in terms of mean squared error (MSE), root-mean-squared error (RMSE), and mean absolute percentage error (MAPE). Using time series models, this paper estimates the effect of humidity, temperature, and other weather parameters on COVID-19 transmission by obtaining the correlation among the total infected cases, the number of deaths, and the weather variables in a particular region. The results indicate that weather parameters hold more influence in estimating the total number of cases and deaths than other factors like community, age, and total population. Therefore, temperature and humidity are salient parameters for predicting COVID-19 cases. Moreover, it is concluded that the higher the temperature, the lower the mortality due to COVID-19 infection.


Introduction
A viral infection named COVID-19 was first discovered in mid-December 2019 in Wuhan, China [1]. It spread across the whole world, and eventually the WHO declared it a pandemic [2]. Figure 1 shows the map along with the total number of confirmed cases by province. Up to November 22, 2020, there were 58,475,749 COVID-19 cases, 1,385,775 deaths, and 40,459,596 recoveries across the world, of which 371,508 cases, 7,603 deaths, and 328,931 recoveries were in Pakistan [3]. Although SARS-CoV-2 originated in China, the world's most populous country, it has been effectively controlled in China's epicenter and other regions since February 2020 [4]. Daily COVID-19 cases in Pakistan peaked at 6,825 on June 14, 2020, declined to 331 on August 3, 2020, and began to rise again from the first week of November 2020. As there is not yet a cure, the main focus is to curb the spread through national lockdowns and quarantine measures [5]. Such a high daily number of cases warrants an immediate plan of action to control the outbreak effectively and underlines the need to prepare for future outbreaks in Pakistan and other nations.
Recently, scientists reported a close association between weather parameters and the main COVID-19 epidemic areas. Moreover, these areas are located in a relatively temperate region of the Northern Hemisphere [6]. Although the pandemic is a global issue, the outbreak epicenters of the world had a mean temperature of 5°C-11°C with 47%-79% humidity in the first two months of 2020. Based on these facts, our primary hypothesis is that virus spread is curtailed in areas of high temperature and humidity compared with areas of moderate temperature and humidity.
The first two cases in Pakistan appeared on February 26, 2020. Three more cases were recorded in the subsequent hours in different cities, and there was no contact among these COVID-19 patients. Unexpectedly, an increase in the number of affected persons was witnessed on April 14, 2020, with the highest number of cases in Punjab, that is, 2826. Sindh was second with 1452 patients and KPK was third with 800 patients; Gilgit Baltistan had 233 patients, and Baluchistan and Islamabad had 321 and 131 cases, respectively. AJK had the fewest cases, i.e., 43 [7]. As no proper cure has yet been discovered and multiple forms of SARS-CoV-2 also depend on seasonality [8], all these factors make the spread of SARS-CoV-2 more alarming and lethal. Short-term forecasting is essential to maintain the balance between social, economic, and health aspects in subsequent months [9]. To illustrate the nature of SARS-CoV-2 and to forecast its transmission, there is a dire need to explore the effect of weather on it. In this regard, the systematic approach of our study includes the following: (a) using existing data to predict the number of COVID-19 cases and the total number of deaths in Pakistan in upcoming months, with and without weather data; (b) determining the critical range of climatic factors and verifying these factors over various periods through statistical analysis; and (c) aiding Pakistan's government institutions and policymakers in adopting new strategies to strengthen existing preventive measures against the COVID-19 pandemic.
Demongeot et al. [10] identified that the virulence of SARS-CoV-2 and its lethal strains is downregulated in hot and humid climate conditions. The presumed temperature-dependent virulence of COVID-19 has also attracted considerable interest in the medical field. Building on this, our study aims to determine the critical factors relating temperature to the transmission kinetics of COVID-19, which increase in cold and dry weather.
Sajadi et al. [11] presented a simplified model describing a zone at high risk of COVID-19 outbreak. Bloom-Feshbach et al. [12] elaborated that COVID-19 is more prevalent in cold and temperate climates than in warm and tropical climates, in line with respiratory influenza viruses. To estimate the effect of physical distancing, Prem et al. [13] used an age-structured susceptible-exposed-infected-removed (SEIR) model. This study illustrated that physical distancing measures would be most effective if the return to work began in April. Eikenberry et al. [14] used an SEIR model to evaluate the potential community-wide impact of mask adoption by the public on the spread and control of COVID-19. The study recommended using masks nationwide and enforcing their use strictly.
Research applying machine learning tools to elucidate the impact of weather parameters on the transmission and circulation of COVID-19 is lacking and needs more attention. In addition, rising temperature may or may not lower SARS-CoV-2 spread, and the role of other weather factors is likewise still under debate.
Therefore, past studies are confined to particular models, and their findings are not conclusive. Hence, it is time to understand the relationship between weather variables and the epidemic spread of COVID-19 in Pakistan.

Data Collection.
The daily cumulative total number of confirmed cases and the total number of deaths were obtained from the official website of the National Institute of Health (NIH) in Islamabad, Pakistan. The National Institute of Health is an independent health research department under the Ministry of National Health Services of Pakistan. It is located in Islamabad and is engaged in various research activities and vaccine production. Daily COVID-19 diagnosed cases, recoveries, deaths, and COVID-19 diagnostic tests conducted across Pakistan were updated on the official website of NIH [15].

Examination.
The data, both for COVID-19 and weather, were collected from March 10 to December 20, 2020, and were further divided into training and testing datasets. The training dataset comprises the data from March 10 to November 15, 2020, and the testing dataset has data from November 15 to December 20, 2020. The test data were further analyzed for the cumulative number of cases and deaths with and without weather data. Figure 2 shows the division of the complete data into training and testing datasets.
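The chronological split described above can be sketched as follows. This is a minimal illustration, not the authors' code: the record layout and the function name `chronological_split` are hypothetical, and because the text lists November 15 in both ranges, the sketch assumes that day belongs to the training set only.

```python
from datetime import date, timedelta

# Date boundaries taken from the text; everything else is illustrative.
START = date(2020, 3, 10)
SPLIT = date(2020, 11, 15)
END = date(2020, 12, 20)


def chronological_split(records, split_day):
    """Split date-ordered records without shuffling.

    Records dated on or before split_day go to training and later records
    go to testing, so no day appears in both subsets (an assumption).
    """
    train = [r for r in records if r["day"] <= split_day]
    test = [r for r in records if r["day"] > split_day]
    return train, test


# Placeholder series: one record per day, with no real values attached.
n_days = (END - START).days + 1
records = [
    {"day": START + timedelta(days=i), "total_cases": None}
    for i in range(n_days)
]

train, test = chronological_split(records, SPLIT)
print(len(train), len(test))
```

Keeping the split chronological, rather than random, avoids training on observations that come after the forecast period.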

Methods.
We applied simple machine learning models, deep learning techniques, and statistical models to predict the total number of cases and total deaths with and without weather data for COVID-19. Time series prediction models such as ARIMA, linear regression, SVR, MLP, RNN, LSTM, and GRU were used. The statistical performance of the time series prediction models was measured in terms of mean squared error (MSE), root-mean-squared error (RMSE), and mean absolute percentage error (MAPE). For all these experiments, we used Python version 3.8, Scikit-learn version 0.21.0, and the deep learning library Keras v.2.2.5 with TensorFlow as the backend.

Autoregressive Integrated Moving Average (ARIMA).
ARIMA combines three components: autoregression (AR), integration (differencing of the data), and moving average (MA). These components are configured according to the problem at hand [16,17]. The time series form of the process is

$$x_t = \theta_0 + \phi_1 x_{t-1} + \phi_2 x_{t-2} + \cdots + \phi_p x_{t-p} + \varepsilon_t - \theta_1 \varepsilon_{t-1} - \theta_2 \varepsilon_{t-2} - \cdots - \theta_q \varepsilon_{t-q}. \tag{1}$$

In equation (1), $x_t$ and $\varepsilon_t$ denote the observed value and the random error at time $t$. The model parameters are $\phi_a$ ($a = 1, 2, \ldots, p$) and $\theta_b$ ($b = 0, 1, 2, \ldots, q$). The random errors $\varepsilon_t$ are assumed to have zero mean and constant variance $\sigma^2$. Equation (1) represents the ARIMA model, which is applied to problems in a wide range of applications.
Setting $q = 0$ in equation (1) gives an AR model of order $p$, and setting $p = 0$ gives an MA model of order $q$. Thus, the orders $(p, q)$ are both essential for determining the ARIMA model.
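To make the role of the orders concrete, the following sketch computes a one-step-ahead forecast for a process of the form x_t = θ_0 + Σ φ_a x_{t−a} + ε_t − Σ θ_b ε_{t−b}, with the future error replaced by its mean of zero. The sign convention for the moving average terms, the function name `arma_one_step`, and the coefficient values are assumptions chosen for illustration; differencing (the "I" in ARIMA) and coefficient estimation are not covered.

```python
def arma_one_step(history, residuals, phi, theta, theta0=0.0):
    """One-step-ahead forecast of an ARMA(p, q) process.

    history   : past observations, most recent last (at least p values)
    residuals : past one-step errors, most recent last (at least q values)
    phi       : AR coefficients [phi_1, ..., phi_p]
    theta     : MA coefficients [theta_1, ..., theta_q]
    theta0    : constant term
    """
    ar_part = sum(c * history[-k] for k, c in enumerate(phi, start=1))
    ma_part = sum(c * residuals[-k] for k, c in enumerate(theta, start=1))
    return theta0 + ar_part - ma_part


history = [10.0, 12.0, 13.0]
residuals = [0.2, -0.4]

# q = 0: the forecast uses past observations only (pure AR model).
ar_forecast = arma_one_step(history, residuals, phi=[0.5, 0.3], theta=[])

# p = 0: the forecast uses past errors only (pure MA model).
ma_forecast = arma_one_step(history, residuals, phi=[], theta=[0.5])

print(ar_forecast, ma_forecast)
```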

Linear Regression.
Linear regression can be defined as

$$Y = \alpha + bX + \varepsilon, \tag{2}$$

where $Y$ = dependent variable, $X$ = independent variable, $\alpha$ = intercept, $b$ = regression parameter (slope), and $\varepsilon$ = random error. The disadvantage of linear regression is that it only relates the mean of the dependent variable to the independent variables. Unfortunately, a mean is not a complete description of a single variable.
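As a small worked illustration of the model Y = α + bX + ε, the sketch below estimates the intercept and slope by ordinary least squares using the closed-form solution for a single predictor. The function name and the toy numbers are hypothetical and are not taken from the study's data.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (alpha, b) minimizing the sum of squared residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                 # slope
    alpha = mean_y - b * mean_x   # intercept
    return alpha, b


# Toy data lying exactly on the line y = 1 + 2x.
alpha, b = fit_simple_linear_regression([1, 2, 3, 4], [3, 5, 7, 9])
print(alpha, b)
```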

Support Vector Regression (SVR).
SVR involves identifying the support vectors (points) closest to the hyperplane so as to optimize the margin, which is determined by the deviation between the target value and a threshold. SVR employs kernel functions, which compute the similarity between two inputs, to handle nonlinear problems. We used the linear kernel function in our study. The main advantage of SVR is that it can capture nonlinearity in the data and use it to improve the prediction. It is also beneficial to adopt this approach in the present case study because the sample is small [18]. SVR for the data is

$$y = \sum_{i=1}^{M} w_i x_i + b, \tag{3}$$

where $w_i$ = input weights, $x_i$ = inputs, $y$ = actual values, $b$ = bias, and $M$ = total number of data samples. The objective of SVR is to minimize the magnitude of the weight vector, $\|W\|$:

$$\min \; \frac{1}{2}\|W\|^2. \tag{4}$$

The SVR formulation introduces two slack variables, $\xi$ and $\xi^*$. They are used to guard against anomalies, and $\frac{1}{2}\|W\|^2$ controls the flatness of the function. The trade-off between the two terms is governed by the parameter $C$. Then, equation (4) transforms into the following equation:

$$\min \; \frac{1}{2}\|W\|^2 + C \sum_{i=1}^{M} (\xi_i + \xi_i^*), \tag{5}$$

subject to the constraints

$$y_i - (W \cdot x_i + b) \le \varepsilon + \xi_i, \qquad (W \cdot x_i + b) - y_i \le \varepsilon + \xi_i^*, \qquad \xi_i, \xi_i^* \ge 0, \tag{6}$$

where $\varepsilon$ is the tolerated deviation. Finally, the SVR function is obtained as

$$f(x) = \sum_{i=1}^{M} (\alpha_i - \alpha_i^*) K(x_i, x) + b, \tag{7}$$

where $\alpha_i$ and $\alpha_i^*$ are the Lagrange multipliers and $K$ is the kernel function.
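The interplay between the tolerance, the slack, and the parameter C can be illustrated with the ε-insensitive loss, which equals the smallest feasible slack for a given prediction. The sketch below evaluates the standard regularized SVR objective for a fixed linear model; it does not solve the optimization problem, and the function names and numbers are assumptions for illustration only.

```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """Zero inside the epsilon tube, linear in the excess deviation outside."""
    return max(0.0, abs(y_true - y_pred) - epsilon)


def svr_objective(weights, bias, xs, ys, c, epsilon):
    """0.5 * ||w||^2 + C * (sum of epsilon-insensitive losses), linear model."""
    regularizer = 0.5 * sum(w * w for w in weights)
    total_slack = 0.0
    for x, y in zip(xs, ys):
        prediction = sum(w * xi for w, xi in zip(weights, x)) + bias
        total_slack += epsilon_insensitive_loss(y, prediction, epsilon)
    return regularizer + c * total_slack


# Two toy samples: the first falls inside the tube, the second outside it.
xs = [[1.0], [2.0]]
ys = [2.05, 5.0]
objective = svr_objective([2.0], 0.0, xs, ys, c=1.0, epsilon=0.1)
print(objective)
```

A larger C penalizes points outside the tube more heavily, while a larger ε widens the tube and lets more points incur no penalty.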

Multilayered Perceptron (MLP).
The multilayer perceptron (MLP) is a frequently used artificial neural network (ANN) for modeling and forecasting. For tasks on simple and moderately complex datasets, this method provides considerable accuracy. It is a fully connected feed-forward artificial neural network in which neurons are arranged in layers [19]. An MLP has an input layer, one or more hidden layers, and an output layer. The output layer in the presented research gives the total number of cases and deaths. The MLP used in this study has three neurons in the input layer, and each neuron corresponds to an input data point (total cases, total deaths, and days since infection). The MLP is easy to implement. Compared with more complex architectures, it yields high-quality models while keeping robustness and accuracy in prediction.
Because an MLP regressor can only return a single value, a separate model must be used for each output if the problem requires multiple output values. Although there may be similarities among the models, training a complete model for each target means that the dataset is used in full each time, so a better predictive model can be obtained for each problem. In the present study, three independent MLPs were employed.
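A forward pass through a small fully connected network with three inputs and a single output can be sketched as below. The hidden layer size, the ReLU activation, and all weight values are assumptions chosen for illustration; the study does not report them here, and training is omitted.

```python
def relu(values):
    """Elementwise rectified linear activation."""
    return [max(0.0, v) for v in values]


def dense(inputs, weights, biases):
    """Fully connected layer; weights[j] holds the weights of output neuron j."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]


def mlp_forward(inputs, hidden_w, hidden_b, out_w, out_b):
    """Three inputs -> one hidden layer -> a single regression output."""
    hidden = relu(dense(inputs, hidden_w, hidden_b))
    return dense(hidden, [out_w], [out_b])[0]


# Hypothetical weights for a network with two hidden neurons.
hidden_w = [[0.1, 0.2, 0.3], [-1.0, 0.0, 0.0]]
hidden_b = [0.0, 0.5]
out_w = [2.0, 3.0]
out_b = 0.1

prediction = mlp_forward([1.0, 2.0, 3.0], hidden_w, hidden_b, out_w, out_b)
print(prediction)
```

Since the network returns one value, predicting several targets would mean fitting one such network per target.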

Recurrent Neural Networks (RNNs).
In deep learning, it is assumed that hierarchical models are more successful than flat models in regression tasks [20]. Because an RNN holds hidden states that are shared across time, it can accumulate previous information. Moreover, because RNNs can process variable-length data, they are widely used in forecasting [21]. Our research aims to analyze and evaluate the proposed prediction model, RNN, with different hyperparameters. The essential aspect of RNNs is that they consider the impact of previous data on the generated output. Most importantly, an RNN is effective for learning temporal information [22]. LSTM and GRU are two robust RNN models. These models have shown superior precision and accuracy compared to the classic time series models, and studies of commonly used networks have found that they can produce multiple outputs in various application domains with time series [23,24]. Figure 3 shows the conceptual framework of the proposed model, depicting the splitting of data into training and testing data. Further, the testing data were evaluated using MSE, RMSE, and MAPE, while the training data were used to fit the time series prediction models ARIMA, linear regression, SVR, MLP, RNN, GRU, and LSTM.

Gated Recurrent Unit (GRU).
GRU was presented by [25] and addresses the vanishing gradient problem of a standard RNN. GRU is similar to LSTM, but it merges the LSTM's forget and input gates into a single update gate. The GRU further merges the cell state and the hidden state. It consists of a cell containing multiple operations that is repeated at each time step, and each operation can be a neural network. When the network is trained through backpropagation through time (BPTT), the gating can prevent gradients from vanishing [26]. The GRU layer, comprising reset gates and update gates, can learn long-term and short-term dependencies from the sequence [25]. The mathematical relationship among the different GRU quantities is given by

$$R_t = \sigma(X_t W_{xr} + H_{t-1} W_{hr} + b_r),$$
$$Z_t = \sigma(X_t W_{xz} + H_{t-1} W_{hz} + b_z),$$
$$\tilde{H}_t = \tanh(X_t W_{xh} + (R_t \odot H_{t-1}) W_{hh} + b_h),$$
$$H_t = Z_t \odot H_{t-1} + (1 - Z_t) \odot \tilde{H}_t,$$

where $X_t$ is the input at time step $t$, $H_{t-1}$ is the previous hidden state, $R_t$ and $Z_t$ are the reset and update gates, $\tilde{H}_t$ is the candidate hidden state, $\sigma$ is the logistic function, $W$ and $b$ are weight and bias parameters, and $\odot$ denotes elementwise multiplication.
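How the reset and update gates combine the previous hidden state with the new input can be seen in a scalar GRU step. The sketch assumes one common convention, in which the update gate z weights the previous state and 1 − z weights the candidate state (some references swap these roles); the parameter names and the dictionary layout are hypothetical.

```python
import math


def sigmoid(v):
    """Logistic function used by both gates."""
    return 1.0 / (1.0 + math.exp(-v))


def gru_step(x_t, h_prev, p):
    """One GRU time step for scalar input and scalar hidden state."""
    r = sigmoid(p["w_xr"] * x_t + p["w_hr"] * h_prev + p["b_r"])  # reset gate
    z = sigmoid(p["w_xz"] * x_t + p["w_hz"] * h_prev + p["b_z"])  # update gate
    h_cand = math.tanh(p["w_xh"] * x_t + p["w_hh"] * (r * h_prev) + p["b_h"])
    return z * h_prev + (1.0 - z) * h_cand


PARAM_NAMES = ["w_xr", "w_hr", "b_r", "w_xz", "w_hz", "b_z", "w_xh", "w_hh", "b_h"]

# With all parameters at zero both gates equal 0.5 and the candidate is 0,
# so the step simply halves the previous hidden state.
zero_params = dict.fromkeys(PARAM_NAMES, 0.0)
h_next = gru_step(1.0, 0.8, zero_params)

# A strongly positive update-gate bias keeps the previous state almost intact.
keep_params = dict(zero_params, b_z=50.0)
h_kept = gru_step(1.0, 0.8, keep_params)

print(h_next, h_kept)
```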

Long Short-Term Memory (LSTM).
A common application of LSTM is in speech recognition and data prediction. Its robust performance in forecasting, achieved by modeling the task as a sequence regression problem, has caught the attention of many scientists because of applications such as activity recognition, prediction, risk assessment, and fall detection [27,28]. As a deep learning methodology, it outperforms other forecasting methods [29]. LSTM is a type of RNN, and its original purpose was to overcome the errors of earlier algorithms when backpropagating the information contained in the most recent input event [30]. There are two reasons for using LSTM. First, it feeds the error back into the network to calibrate the model during training, and the error flow is deliberately regulated by gating mechanisms. Second, the LSTM network is insensitive to the lag between events in the time series. Therefore, when we are trying to derive an unknown prediction model, the LSTM algorithm is more effective than other machine learning methods (such as hidden Markov models and SVR) or other prediction methodologies (such as ARIMA) [29].
To control the flow of information, LSTM has input, output, and forget gates. These gates are composed of logistic functions of weighted sums, and the weights are learned through backpropagation during training. The input gate and the forget gate manage the cell state. The output is produced from the output gate as the hidden state, and it represents the memory passed on to the next step. This structure permits the network to remember over a long duration, whereas a traditional simple RNN does not have such memory. The key feature of LSTM is its ability to capture long-term dependencies and to process time series data. For example, given the input time series $X_t$ and the number of hidden units $h$, the gates are given by

$$I_t = \sigma(X_t W_{xi} + H_{t-1} W_{hi} + b_i),$$
$$F_t = \sigma(X_t W_{xf} + H_{t-1} W_{hf} + b_f),$$
$$O_t = \sigma(X_t W_{xo} + H_{t-1} W_{ho} + b_o),$$
$$\tilde{C}_t = \tanh(X_t W_{xc} + H_{t-1} W_{hc} + b_c),$$

where $W_{xi}$, $W_{xf}$, $W_{xo}$ and $W_{hi}$, $W_{hf}$, $W_{ho}$ are weight parameters and $b_i$, $b_f$, $b_o$ denote bias parameters; $W_{xc}$ and $W_{hc}$ are weight parameters and $b_c$ is a bias parameter.
The estimation of $C_t$ depends on the information from the previous memory cell ($C_{t-1}$) and the candidate memory cell at the current time step ($\tilde{C}_t$):

$$C_t = F_t \odot C_{t-1} + I_t \odot \tilde{C}_t, \qquad H_t = O_t \odot \tanh(C_t),$$

where $\odot$ denotes elementwise multiplication.
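The way the gates combine the previous memory cell with the candidate memory can be traced in a scalar LSTM step. This is a sketch of the standard LSTM cell with scalar weights, in which the forget gate scales the previous cell state, the input gate scales the candidate, and the output gate scales the exposed hidden state; the parameter names follow the subscripts used in the text, while the dictionary layout and numeric values are assumptions.

```python
import math


def sigmoid(v):
    """Logistic function used by the three gates."""
    return 1.0 / (1.0 + math.exp(-v))


def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM time step for scalar input, hidden state, and cell state."""
    i = sigmoid(p["w_xi"] * x_t + p["w_hi"] * h_prev + p["b_i"])  # input gate
    f = sigmoid(p["w_xf"] * x_t + p["w_hf"] * h_prev + p["b_f"])  # forget gate
    o = sigmoid(p["w_xo"] * x_t + p["w_ho"] * h_prev + p["b_o"])  # output gate
    c_cand = math.tanh(p["w_xc"] * x_t + p["w_hc"] * h_prev + p["b_c"])
    c_t = f * c_prev + i * c_cand   # new memory cell
    h_t = o * math.tanh(c_t)        # new hidden state
    return h_t, c_t


PARAM_NAMES = [
    "w_xi", "w_hi", "b_i", "w_xf", "w_hf", "b_f",
    "w_xo", "w_ho", "b_o", "w_xc", "w_hc", "b_c",
]

# With all parameters at zero every gate equals 0.5 and the candidate is 0,
# so the cell state is halved at each step.
zero_params = dict.fromkeys(PARAM_NAMES, 0.0)
h_t, c_t = lstm_step(1.0, 0.0, 1.0, zero_params)

# A strongly positive forget-gate bias preserves the previous cell state.
remember = dict(zero_params, b_f=50.0)
_, c_kept = lstm_step(1.0, 0.0, 1.0, remember)

print(h_t, c_t, c_kept)
```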

Performance Metrics.
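The mean squared error (MSE) measures the average of the squares of the errors, that is, the average squared difference between the estimated values and the actual values. MSE is a risk function, corresponding to the expected value of the squared error loss:

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2,$$

where $y_i$ is the actual value, $\hat{y}_i$ is the predicted value, and $n$ is the number of observations.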

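Root-Mean-Squared Error.
The root-mean-squared error (RMSE) is a frequently used measure of the differences between the values (sample or population values) predicted by a model or an estimator and the values observed:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2},$$

where $y_i$ is the observed value, $\hat{y}_i$ is the predicted value, and $n$ is the number of observations.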

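Mean Absolute Percentage Error.
The mean absolute percentage error (MAPE) is a measure of the prediction accuracy of a forecasting method in statistics, for example, in trend estimation; it is also used as a loss function for regression problems in machine learning. It usually expresses the accuracy as a ratio defined by the formula

$$\mathrm{MAPE} = \frac{1}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right|,$$

where $y_i$ is the actual value, $\hat{y}_i$ is the predicted value, and $n$ is the number of observations.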
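The three metrics can be computed directly from paired actual and predicted values, as in the sketch below. MAPE is returned as a ratio rather than multiplied by 100, which appears to match the magnitude of the values reported later in the paper, though that reading is an assumption; the function names and toy numbers are illustrative.

```python
import math


def mse(actual, predicted):
    """Mean of the squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Square root of the MSE, in the same units as the data."""
    return math.sqrt(mse(actual, predicted))


def mape(actual, predicted):
    """Mean absolute error relative to the actual values (as a ratio).

    Undefined when any actual value is zero.
    """
    return sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


# Toy values for illustration only.
actual = [100.0, 200.0, 400.0]
predicted = [110.0, 190.0, 400.0]

print(mse(actual, predicted), rmse(actual, predicted), mape(actual, predicted))
```

Because RMSE is the square root of MSE, the two always rank models identically, while MAPE can rank them differently when the scale of the target changes over the test period.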

Results
Evaluating COVID-19 transmission using mathematical models requires training on a large dataset. The size of the dataset affects the performance of the proposed algorithms and plays a considerable role in training. The dataset is divided into two parts, the training and the testing datasets. The training dataset is employed during model development, whereas the testing dataset is used to validate the model on data not previously used [31,32]. The relationship between COVID-19 and weather factors in the case of Pakistan is examined in this study. The number of confirmed COVID-19 cases (dependent variable) was log-transformed to bring it closer to a normal distribution, as the original data are highly skewed in the selected area. For the period up to November 15, 2020, the training data relate the case statistics to Pakistan's humidity and temperature data. The hypothesis is that high humidity and temperature (weather variables) coincide with a lower count of SARS-CoV-2 cases. Figures 4(a) and 4(b) show scatter plots of the number of confirmed infections against temperature and humidity in Pakistan. From these findings, it is understood that, as temperature and humidity rise, the numbers of infected cases and deaths decline. However, it cannot be ignored that when sunlight hours increase, interaction among people increases. As a result, the infection rate may rise. People residing in urban areas are also strongly affected because of the higher population density, which makes the spread of COVID-19 worse. Several other parameters that can affect COVID-19 spread could also be considered as potential drivers. Population density also matters in epidemic spreading. Older people are more susceptible to the epidemic.
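The log transformation and the association with a weather variable can be illustrated with a short sketch. The paper does not state which correlation coefficient was used, so the Pearson coefficient here is an assumption, and the numbers are made up purely to show the mechanics; they are not data from the study and do not support any conclusion.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up illustrative values (NOT study data).
temperature_c = [12.0, 18.0, 24.0, 30.0, 36.0]
daily_cases = [900, 700, 400, 250, 120]

# The natural logarithm compresses the long right tail of skewed counts.
# Counts of zero would need a shifted transform such as log(1 + x).
log_cases = [math.log(c) for c in daily_cases]

r = pearson_r(temperature_c, log_cases)
print(round(r, 3))
```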
Figures 6(a) and 6(b) depict the total number of cases without and with weather data, respectively, while Figures 6(c) and 6(d) show the same for deaths. In both cases, the difference between the actual and predicted curves is larger in Figures 6(a) and 6(c) than in Figures 6(b) and 6(d). The predictions made with weather data thus show that the addition of weather parameters improved the predicted results.
In order to understand whether the inclusion of the weather parameters, that is, temperature and humidity, affects the results, we created more comprehensive time series prediction models using ARIMA, linear regression, SVR, MLP, GRU, and LSTM. These time series prediction models make it easier to elucidate the impact of weather parameters on epidemic spreading. In addition, they help to illustrate the relationship among the number of confirmed cases, deaths, and weather factors. Tables 1 and 2 present the total number of cases (actual vs. predicted) and Tables 3 and 4 present the total number of fatalities (actual vs. predicted) with and without weather variables, where the performance of these models is indicated in terms of MSE, RMSE, and MAPE.

Discussion
This study aims to compare the output of seven time series prediction models, with and without weather data, on the total number of COVID-19 cases and deaths. The agreement between the actual and predicted total number of cases with weather data for COVID-19 is promising and evident. LSTM's ability to handle smaller datasets than the other models (linear regression, SVR, MLP, RNN, and GRU), which possibly require longer series to capture correlated fluctuations in time series data, has made it a better choice. Conversely, RNN and its updated version GRU deliver comparatively balanced forecasting performance according to the evaluation metrics (RMSE and MAPE), although their explained variance is less clear-cut. The performance of the time series models ARIMA, linear regression, SVR, MLP, RNN, GRU, and LSTM in terms of MSE, RMSE, and MAPE when predicting the total number of cases without weather parameters (temperature and humidity) is shown in Table 2. It is clear that the values of MSE, RMSE, and MAPE for all time series prediction models were higher without the addition of weather data; for example, GRU showed values of 180.8178718, 13.4468536, and 0.018989281 for MSE, RMSE, and MAPE without weather data. In contrast, the values were 140.0163399, 11.83285003, and 0.01641411, respectively, for the number of cases with weather parameters.
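The size of the improvement can be expressed as a relative reduction in each error metric, using only the GRU values quoted above. The helper name `relative_reduction` is hypothetical, and the calculation is a reading aid rather than part of the authors' analysis.

```python
def relative_reduction(baseline, improved):
    """Fractional drop in an error metric relative to the baseline."""
    return (baseline - improved) / baseline


# GRU errors for total cases as quoted in the text.
gru_without_weather = {"MSE": 180.8178718, "RMSE": 13.4468536, "MAPE": 0.018989281}
gru_with_weather = {"MSE": 140.0163399, "RMSE": 11.83285003, "MAPE": 0.01641411}

reductions = {
    name: relative_reduction(gru_without_weather[name], gru_with_weather[name])
    for name in gru_without_weather
}

for name, value in reductions.items():
    print(f"{name}: {value:.1%} lower with weather data")

# Consistency check: in each setting RMSE should be the square root of MSE.
for scores in (gru_without_weather, gru_with_weather):
    assert abs(scores["RMSE"] ** 2 - scores["MSE"]) < 1e-3
```

The reduction in MSE is larger than the reduction in RMSE because squaring amplifies differences between the two settings.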
Similarly, Table 3 shows the application of the time series prediction models to the number of deaths in Pakistan, considering the weather parameters temperature and humidity. LSTM shows the best MSE, RMSE, and MAPE values, that is, 1711, 41.36423576, and 0.492211157, respectively. Table 4 shows that when the performance of the time series models ARIMA, linear regression, SVR, MLP, RNN, GRU, and LSTM is evaluated in terms of MSE, RMSE, and MAPE without temperature and humidity, the accuracy of the models decreases. In both Tables 3 and 4

Mathematical Problems in Engineering
Based on our study results shown in Tables 1-4, it can be seen that weather parameters like humidity and temperature can influence the spread of SARS-CoV-2. From the results, we can conclude that epidemic spreading can increase when atmospheric temperature and humidity fall. On the other hand, when both temperature and humidity are high, the infection rate of SARS-CoV-2 declines. In forecasting the total number of cases and total deaths with and without weather data in Pakistan, the current research compared the time series models ARIMA, linear regression, SVR, MLP, RNN, GRU, and LSTM fitted to the training datasets. The present findings show the superior performance of LSTM, which achieved higher accuracy and precision than the other time series prediction models.
In this study, we focused on the number of cases and deaths in Pakistan. First, each model is trained. Then, we forecast each variable. The parameters of the constructed ARIMA, linear regression, SVR, MLP, RNN, GRU, and LSTM models based on the training datasets are presented in Table 5.

Conclusion
The present investigation analyzed the effect of prime weather factors (temperature and humidity) on the number of reported cases and deaths due to COVID-19 in Pakistan. Different time series prediction models such as ARIMA, linear regression, SVR, MLP, RNN, GRU, and LSTM were used, and the performance of each model was analyzed in terms of MSE, RMSE, and MAPE. The results illustrated that LSTM could better predict the COVID-19 spread than the other models. From the present results, we can conclude that weather holds significance in COVID-19 prediction. Thus, it is advised to wear masks and personal protective equipment, keep social distancing, and continue isolation (on infection or suspicion of infection) until the temperature rises or the vaccine is fully deployed. Further, predicting the COVID-19 spread and incidence by considering other weather parameters like rainfall, wind speed, and so forth should provide additional clues to mitigate the epidemic.

Conflicts of Interest
The authors declare that there are no conflicts of interest.

Authors' Contributions
Humera Batool conceptualized the study, developed the methodology, performed formal analysis, reviewed and edited the article, performed validation, and performed visualization. Lixin Tian reviewed the article, performed supervision, and performed project administration.