Predictive modeling of COVID-19 death cases in Pakistan

Background The world is presently facing the challenges posed by COVID-19 (2019-nCoV), especially in the public health sector, and these challenges are dangerous to both health and life. The disease results in an acute respiratory infection that may result in pain and death. In Pakistan, the disease curve shows a vertical trend by almost 256K established cases of the diseases and 6035 documented death cases till August 5, 2020. Objective The primary purpose of this study is to provide the statistical model to predict the trend of COVID-19 death cases in Pakistan. The age and gender of COVID-19 victims were represented using a descriptive study. Method ology: Three regression models, which include Linear, logarithmic, and quadratic, were employed in this study for the modelling of COVID-19 death cases in Pakistan. These three models were compared based on R2, Adjusted R2, AIC, and BIC criterions. The data utilized for the modelling was obtained from the National Institute of Health of Pakistan from February 26, 2020 to August 5, 2020. Conclusion The finding deduced after the prediction modelling is that the rate of mortality would decrease by the end of October. The total number of deaths will reach its maximum point; then, it will gradually decrease. This indicates that the curve of total deaths will continue to be flat, i.e., it will shift to be constant, which is also the upper bound of the underlying function of absolute death.


Introduction
The COVID-19 pandemic has emerged very rapidly worldwide, affecting nearly 5,488,825 individuals with 349,095 deaths (WHO, 2020a(WHO, , 2020b. Initially, COVID-19 was thought to be a zoonotic virus (bat to human transmission); however, recent studies and the exponential increases in the incidence of COVID-19 indicate complete evidence of transmission from person to person (Adeniyi et al., 2020;Li et al., 2020a;Ogundokun and Awotunde, 2020;WHO, 2020aWHO, , 2020b. The first human exposure case was connected to a "wet market" from Wuhan, Hubei Province, China, in late December 2019 (Abid et al., 2020;Lu et al., 2020;Lukman et al., 2020;Ogundokun, Lukman, Kibria, Awotunde, & Aladeitan, 2020a, 2020b. The source of transmission was via droplets when an individual infected cough; it then entered into the human body and caused deteriorating effects on the intestines, spleen, and lungs. Even a single cough of corona infected individuals can affect three healthy individuals and six immunocompromised patients (Hoffmann et al., 2020;Ogundokun et al., 2020aOgundokun et al., , 2020b. The issue of COVID-19 in Pakistan arrived from the Iranian territory as several thousands of citizens travel to pilgrimage the spiritual place in Iran. After which Pakistan decided to close its border from the entry of individuals from Iran on February 23, 2020 (Lu et al., 2020). Apart from Iranian pilgrimages, several cases were traced to Afghanistan (Raza et al., 2020). Initially, the first two cases in Pakistan were announced on February 26, 2020 by the government, and it was established that the two patients had a travel history from Iran. To curtail the outbreak of the COVID-19, the federal government launched a quarantine policy on the Pak-Iran border city of Taftan (WHO, 2020a(WHO, , 2020b. Up till August 5, 2020, the number of confirmed deaths cases was 6035 (GOP, 2020). The government of Pakistan has continued to enforce blended rules about social separating. Pakistan was forced to lock down mosques, huge get-togethers, mass gatherings, shopping malls, private institutions, universities, marriage halls. The government is taking strict actions and reassuring priests about the wellbeing measures. The current situation is unfavourable for Pakistanis that the cases keep increasing; therefore, specialists were encouraged to force a lockdown in numerous urban areas, yet this wasn't easy. Many people didn't keep to the rules of the lockdown in Italy or China, so it would be challenging to authorize in a nation like Pakistan. However, to date, partial lockdowns under section-144 have been enforced in all Pakistan (Chadsuthi et al., 2012).
Medical researchers often use linear regression to understand the relationship between drug dosage and blood pressure of patients. Quadratic regression model serves the purpose of modelling when a set of data shaped like a parabola and logarithmic regression models have been extensively used for modelling intensity of sound, yields of chemical reactions, production of goods, and growth of infants. Several statistical models can predict essential insights for public health interventions by observing "what if" scenarios. Therefore, this study aimed to predict changes in the cumulative number of COVID-19 related deaths for the coming weeks in Pakistan. This would help evaluate the impact of quarantine, social distancing, masks wearing, and smart lockdowns in the country. Three regression models were chosen, which were conventionally used in the literature for modelling and prediction purposes. Different model selection criteria have been extensively used in the literature like Kullback-Leibler divergence, Akaike information criteria, PRESS statistic, Bayesian information criteria, coefficient of determination, adjusted coefficient of determination, Mallow's C p . R 2 is one of the conventional criteria which has been used for model selection. The closer it is to 1, the better is the fit. The goodness of fit means how close an estimated value of Y is to its actual value in the given sample observations. But it increases with the increase in the predictors, so it is not the best choice because it may also increase the variance of forecast error. Adjusted R 2 is another choice as it accommodates the problem of considerable conflict. The most reliable techniques for model selection nowadays are Akaike Information Criteria (AIC) and Bayesian Information Criteria (BIC), as it imposes a penalty for adding regressors to the model. They set a harsher punishment than R 2 and Adjusted R 2 . The main advantage of using AIC and BIC is that they are beneficial for forecasting purposes.

Related works
Machine learning and predictive approaches have been widely applied in the earlier researches in the part of infectious ailments, which time series forecasting is a branch of. Sources include models of leptospirosis and its rainfall-temperature relationship (Chadsuthi et al., 2012), including temporal associations amid the continuing figure of cases of Plasmodium falciparum and El Niño Southern Oscillation (ENSO) (Hanf et al., 2011). Different methods have often been adopted for modelling pathogens that transpire in recurrent or repeated cycles, for instance, seasonal virus, for which a variety of researches have been released which utilized time-series demonstration to forecast possible epidemics. In (Song et al., 2016), and ARIMA (Adhikari & Agrawal, 2013) method was built to predict the regular occurrence of infection in China for 2012, whereas in (Yin et al., 2020), a predictive time series method (Tempel) was projected for influenza change estimation. Further sources include research by Lee et al. (Lee et al., 2017), who developed a time series method utilizing daily virus-linked tweet totals and used it to deliver instantaneous infection distribution evaluation. Zhang et al. (Zhang et al., 2019), designed a SARIMA method (Adhikari & Agrawal, 2013) utilizing Australian infection investigation and resident Internet pursuit data to forecast periodic flu contagion in the northern hemisphere. Time series analysis was used in (Soebiyanto et al., 2010) to examine the function of temperature variables in the public health of virus spread in 2 warm-environment areas, Hong Kong and Maricopa County (Arizona U.S.). Dominguez et al. (Dominguez et al., 1996) utilizing an alternative time series method to investigate the actions of 2 infection incidence markers in the Barcelona area to enhance their identification. As far as COVID-19 predictions are concerned, there remained a flow in the systematic research available over the preceding months. Much of these researches depend on forecasting metrics linked to coronavirus, for instance, active cases with demises in China, someplace the virus first evolved. In (Roosa et al., 2020), real-time estimates of the total sum of confirmed infected individuals were generated in the China district utilizing threedifferent phenomenological methods commonly used to predict transmittable ailments, for instance, SARS, Aids, contagion, infection, and dengue. In similar research, Yang et al.  merged residents relocation data and public health data to form a Susceptible e Exposed e Infectious e Removed (SEIR) method and amalgamate it with artificial intelligence system prepared on the 2003 SARS datasets to forecast China's contagion arc. In (Li et al., 2020b),the asymmetrical feature was used to model the average and an overall number of diseases and demises, including the associated pandemic whirling opinions in China. An improved loaded auto-encoder was established in  to forecast the epidemic conveyance dynamics and to estimate the sum of documented COVID-19 crisis crosswise China. In contrast, Al-qaness et al. (Al-qaness et al., 2020) projected an amalgamation of an adaptive neurofuzzy inference system (ANFIS) and a salp-swarm-procedure-improved flower pollination algorithm (FPA) to envisage established COVID-19 crisis. Simple mean-field models were used in (Fanelli and Piazza, 2020), an analysis covering China including 2 European nations, Italy and France, to forecast the distribution of the pandemic, and most importantly the height and duration of its outbreak in both of those nations.
Cai, Jia, Feng, Li, Hsu & Lee (Cai et al., 2020) implemented the Multi-Task Gaussian Process (MTGP) regression method to boost wind speed arithmetical forecasts is studied in this article. In the proposed system, the Numerical Weather Forecasters (NWF) forecasts are first combined with a Support Vector Regressor (SVR). Pandey, Chaudhary, Gupta & Pal (Pandey et al., 2020) employed SEIR and regression models for forecast built on datasets gathered from John Hopkins University repository in India. Model efficiency was measured using RMSLE and 1.52 for the SEIR model, and 1.75 for the regression method was obtained. The fault degree of RMSLE amid the SEIR and Regression methods was 2.01. To explain the progress of the COVID-19 contagion procedure, Hou et al. (Hou et al., 2020) established a properly varied SEIR compartmentalized method. The acceptable outcomes of the properly diverse SEIR method presumed that the latent individuals' interaction degree is amid six and eighteen, reflecting the potential effect on the disease infection rate of isolation and quarantine interventions. The findings indicate that strategies can efficiently decrease the overall sum of COVID-19 contagions and deferred the ultimate duration of diseases by decreasing the touch proportion, for instance, seclusion and confinement. Multivariate COX regression was used by Ji et al. (Ji et al., 2020)to classify the risk factors associated with development, and then implemented into the nomograph to construct an innovative estimation recording method. To test the consistency of the novel method, ROC was used (Ji et al., 2020). Hao, Xu, Hu, Wang (Hao et al., 2020) employed Elman neural network, long short-term memory (LSTM), and support vector machine (SVM). An SVM with fuzzy granulation was employed to forecast the evolution range of recently established incidents, recent demises, and recently recovered persons. To derive the association amid various features and the dispersal degree of COVID-19, Malki et al. (Malki et al., 2020) suggested different regressor machine, learning models. The machine learning procedures used in this analysis evaluate the effect on the transmission of COVID-19 of weather elements, for instance, temperature and humidity by removing the association amid the sum of reported incidence and weather elements in some provinces. In 2020, a risk model for forecasting essential diseases such as death was developed by Schalekamp et al. (Schalekamp et al., 2020) Including clinical, CXR and laboratory results. They used multivariable logistic regression. Verdict arch examination was also conducted, and a hazard simulator was imitated.

Modelling Covid-19 death cases and functional forms
The following are the three regression models that were compared for the modeling and prediction purposes. The linear regression model can be expressed as follow; The logarithmic regression model has the following functional form The quadratic regression model used has the following functional form; Table 1 shows the estimations of parameters and value of AIC and BIC from three models for the corona deaths. R 2 for the linear regression model is 0.928, and the Adjusted R 2 is 0.861. The value of the coefficient of determination for logarithmic regression is 0.705, which showed that independent variables explain 70.5% of the variation in the dependent variable as compared to the R 2 value of quadratic regression (0.997). This is much higher than logarithmic and linear regression (0.994) but does not guarantee the excellent fit of the model because as we increase the number of independent variables, the value of R 2 changes. The essential criteria which have been extensively used in the literature for model comparison purpose are Akaike information and Bayesian information criteria. Akaike information criterion (AIC) is a fined technique based on insample fit to estimate the likelihood of a model to predict/evaluate the future values. AIC is an estimator of out of sample prediction error and thereby the relative quality of statistical models for a given set of data. Given a collection of models for the datasets, AIC estimates the rate of each model close to each of the other models. We ought to choose AIC and BIC criteria for the selection of a good model. The optimal model is selected based on the highest R 2 and minimum AIC and BIC. From Table 1, it can be seen clearly that the quadratic regression model shows the best results for every model selection criterion. It has the minimum value of AIC and BIC among all three regression models, which is 330.71 and 141.81 respectively.

Results and interpretation
The value of the coefficient of determination is 0.997 showing that independent variables explain 99.7% variation in the dependent variable. The amount of adjusted R 2 is 99.4% and also showing the same trend. The value of Durbin Watson dstatistic is 0.01, which lies in the autocorrelation area. So, there is evidence of autocorrelation, but this also does not have an impact on the prediction date. There is evidence of multicollinearity as the value of VIF is one which is showing a linear relationship between the linear and the quadratic trend, but this does not have an impact on the prediction date. The multicollinearity could have been avoided by taking the transformation of the variable. Fig. 1 shows the comparison of fitting regression models. The observed data were plotted against the fitted data of all three models. Quadratic regression shows a better fit as compared to the other two models. Fig. 3 shows the scatter plot, which indicates that there exists no precise pattern, and the points are diffused.
For this reason, there does not exist heteroscedasticity in the quadratic regression model. Fig. 2 shows the probability plot of residuals that is meaning that the residuals follow the normal distribution. The model could be used for prediction purposes because all assumptions were met.
In the presentation of data, the number of deaths in Pakistan demonstrated by the was obtained from the National Institute of Health of Pakistan from February 26, 2020 up to August 5, 2020. After testing the primary unit root and some functional formats, the daily data fits well and suggests a statistically appropriate model (Raza et al., 2020). The estimations of parameters through the quadratic model for the corona deaths were mentioned in Table 2, so, the quadratic regression equation for the corona deaths is; The quadratic regression equation, trends with a negatively signed coefficient allows the fitted ln DY t to reach a maximum (both local and global) and then to change its direction from increasing to decreasing. The model is relatively statistically adequate for prediction purposes. Table 3 shows the observed and fitted data of deaths cases.  Fig. 4 shows the predicted and fitted deaths cases due to COVID-19 in Pakistan, which is offering a good fit for the model and suitability for prediction purposes.

Discussion
In this paper, we have proposed three regression models for the prediction of death cases by COVID-19 in Pakistan and selected quadratic modelling based on the model selection criterion. There are four stages of the epidemic, S1: exponential,   S2: power law, S3: linear and S4: flat (VermaAli, 2020). The death cases in Pakistan have entered the phase of balanced and quadratic regression in this term is giving an excellent fit to the data. The same model has been used by (Tominaga, 2020) in which he showed that such a regression model is good even in the early stages of the epidemic, which is generally said to increase exponentially and monotonically. Quadratic regression modelling has also been involved in the prediction purpose for Fenton treatment of municipal landfill leachate (Lak et al., 2012). This model included both significant linear and quadratic parameters. This method of modelling has also been suggested in the situations where an estimation of the possible date of flattening the curve of the cases of infected individuals (Leon, 2020). The same modelling technique has been used for projections for first-wave COVID-19 deaths across the U.S. using social-distancing measures derived from mobile phones (Woody et al., 2020). The quadratic time trend model was also applied to the log of new cases, that accurately predict the trajectory of the epidemic in China (Linton, 2020).

Conclusion
WHO data of the whole world, together with the initial statistics about China, indicates that the daily cases and the number of patients who have been recovered from this disease are trending high. Although there are deaths because of this disease, it's not trending upward. The number of deaths has been analyzed concerning gender, and it was concluded from the data of different countries that men are vulnerable to COVID-19 than women. This may be due to heart diseases, blood pressure, and smoking habits in men, which makes them weaker towards COVID-19 than women. It has also been observed from different countries that most affected age groups by this virus vary from country to country. The least affected age group from this virus around the globe is below 18 years. There is a significant difference between the average deaths and recovery cases since the recovered patients are right in numbers than fatalities. Studies showed that as far as issues and death rates are concerned, age and gender impacted differently. The Quadratic regression model has been selected from three regression models based on the model selection criteria; conventionally used methods are AIC and BIC for discussing the death cases. The model which has the smallest value of AIC and BIC among all the regression models, that model is used for modelling and prediction. After applying the predictive model, the rate of mortality is predicted to decrease by the end of October. The total number of deaths will be reached at the maximum point; then, it will gradually decrease. This indicates that the curve of total deaths will continue to be flat, i.e., it will shift into a constant that is also the upper bound of the underlying function of total deaths. For the deterministic part of the model, the definition remains. This interpretation holds for the deterministic aspect of the model. The coronavirus carriers are anonymous, and everyone is a potential carrier of the virus that could cause great havoc to society. The outbreak may rise to an unmanageable scenario. With the increased number of deaths, the government should consider lockdown decision with strict rules and regulations as well as the public should follow simple and basic prevention guidelines.

Declaration of competing interest
The authors declared there is no conflict of interest during this study.