Analysis and forecasts for trends of COVID-19 in Pakistan using Bayesian models

Background COVID-19 is currently on full flow in Pakistan. Given the health facilities in the country, there are serious threats in the upcoming months which could be very testing for all the stakeholders. Therefore, there is a need to analyze and forecast the trends of COVID-19 in Pakistan. Methods We have analyzed and forecasted the patterns of this pandemic in the country, for next 30 days, using Bayesian structural time series models. The causal impacts of lifting lockdown have also been investigated using intervention analysis under Bayesian structural time series models. The forecasting accuracy of the proposed models has been compared with frequently used autoregressive integrated moving average models. The validity of the proposed model has been investigated using similar datasets from neighboring countries including Iran and India. Results We observed the improved forecasting accuracy of Bayesian structural time series models as compared to frequently used autoregressive integrated moving average models. As far as the forecasts are concerned, on August 10, 2020, the country is expected to have 333,308 positive cases with 95% prediction interval [275,034–391,077]. Similarly, the number of deaths in the country is expected to reach 7,187 [5,978–8,390] and recoveries may grow to 279,602 [208,420–295,740]. The lifting of lockdown has caused an absolute increase of 98,768 confirmed cases with 95% interval [85,544–111,018], during the post-lockdown period. The positive aspect of the forecasts is that the number of active cases is expected to decrease to 63,706 [18,614–95,337], on August 10, 2020. This is the time for the concerned authorities to further restrict the active cases so that the recession of the outbreak continues in the next month.


INTRODUCTION
During December, 2019, the historical pandemic started in Wuhan, China (Paules, Marston & Fauci, 2020). This pandemic was named as a novel coronavirus COVID-19 by the World Health Organization (WHO, 2020). Though this virus has lower death rates when compared to Middle East Respiratory Syndrome (MERS) and Severe Acute Respiratory Syndrome (SARS), this virus has higher transmission rates (Tomar & Gupta, 2020). Due to its higher transmissibility, the virus has covered almost the whole world (WHO, 2020). In Pakistan, the first positive case was observed on February 26, 2020. The country imposed a complete lockdown on March 23, 2020. However, the struggling economy of the country forced the government to lift the lockdown on May 9, 2020. During the lockdown, the number of new cases was low (Yousaf et al., 2020); however, the outbreak has increased pace since lifting of the lockdown. There are further expectations that the number of cases and deaths will increase more rapidly in the future (Yousaf et al., 2020).
There has been an earlier attempt to forecast the number of cases, recoveries and deaths in Pakistan (Yousaf et al., 2020;Khan, Saeed & Ali, 2020). However, the contribution by Yousaf et al. (2020) was based on quite smaller datasets which is serious issue for producing reliable predictions (Moftakhar, Seif & Safe, 2020). Furthermore, these forecasts underestimated the number of confirmed cases and deaths which may be due to changing post-lockdown trends in the country. On the other hand, the contribution by Khan, Saeed & Ali (2020) did not considered the causal impacts of the lockdown in the country. In addition, the said contributions used the ARIMA models (Yousaf et al., 2020) and vector autoregressive models (Khan, Saeed & Ali, 2020) for the forecasts. The ARIMA models have been frequently used to forecast the patterns of this pandemic (Gupta & Pal, 2020;Majeed, Adeleke & Popoola, 2020;Benvenuto et al., 2020;Kumar et al., 2020). However, these models have some limitations. Firstly, the forecasts from these models are dependent on the previous behavior of the data along with preceding forecast errors, which means the forecasting error accumulates over time. In addition, the forecasting accuracy of these models is affected in presence of covariates (Brockwell & Davis, 2002). To avoid such issues, the Bayesian structural time series (BSTS) models are employed (Scott & Varian, 2013;Brodersen et al., 2015). The BSTS models (i) allow the inclusion of prior information (ii) allow the model parameters to evolve over time (iii) can handle large number of covariates using spike and slab prior (iv) are less dependent on certain hypothesized specifications (v) can investigate individual components of a time series (McQuire et al., 2019). Interestingly, number of time series models, such as ARIMA (used by Yousaf et al., 2020) and vector autoregressive models (used by Khan, Saeed & Ali, 2020), can be expressed in the state space form (Scott & Varian, 2014b). These models have the capability to provide reliable forecasts for future outbreak of different diseases (Scott & Varian, 2014a). The said models have been used earlier in forecasting the health issues by using alcohol and alcohol licensing policies (de Vocht et al., 2017;McQuire et al., 2019). The proposed models provided 14% improvement (while comparing with ARIMA models) in forecasting the consumer sentiments (Scott & Varian, 2014a).
The strict social distancing in Wuhan facilitated China to restrict the spread of the pandemic in other provinces (Li et al., 2020). However, in Pakistan the lockdown was lifted at a quite earlier stage of the pandemic. Unfortunately, to the best of our knowledge, the effects of ending the lockdown have not been studied yet. Therefore, the analysis of the impacts of lifting the lockdown in the country is very important. The causal impacts of different interventions can also be computed using BSTS models. Unlike ARIMA models, these models compute the dynamic confidence interval for the evolving impact, based on difference between actual and counterfactual series. Due to their added advantages, these models are superior to conventional models (Brodersen et al., 2015).
We have conducted a study to investigate the various parameters of COVID-19 in Pakistan, for next 30 days. Since the Bayesian methods often produce better results as compared classical methods (Kundu & Joarder, 2006;Kundu, 2008;Pak, Parham & Saraj, 2013), the BSTS models have been proposed for the forecasting. The causal impacts of lifting the lockdown in the country have also been investigated conducting intervention analysis within BSTS models. The improved forecast accuracy has been observed using BSTS models, instead of repeatedly used ARIMA models. We have also included the datasets from neighboring countries (India and Iran) to investigate the performance of the proposed models for data regarding other countries. The detailed study suggested that the cumulative number of confirmed cases and deaths is expected to increase exponentially during next month. However, the recoveries are expected to increase faster than confirmed cases, so the active number of cases will decrease during the next month. The study also revealed that lifting the lockdown at the earlier stage has seriously increased the trajectory of the outbreak in the country.

MATERIALS AND METHODS
The data have been obtained from the published reports of the National Institute of Health (NIH), Islamabad, Pakistan. The data contain the details regarding cumulative (and daily) number of cases, deaths, recoveries and tests in the country (NIH, 2020). NIH updates the data on daily basis since February 26, 2020. Additionally, data for Iran and India are obtained from reports of Our World in Data (2020). As we have used the published data by NIH and Our World in Data, ethical approval was not required.

ARIMA models
The ARIMA model is mathematically written as ARIMA (p, d, q). The model is based on three parameters 'p, d and q'. The parameter 'p' represents the autoregressive terms, 'd' defines the number of non-seasonal differences and 'q' denotes the number of moving terms. The ARIMA (p, d, q) model can be written as (1) where, h p represents the terms of autoregressive operator, e q are the coefficients of the error terms, q are the values of moving average operator and Z t is d-order differenced time series.

Structural time series models
Visualization of a time series as a product of aggregating different components is very useful. The decomposition of each layer facilitates the direct individual interpretation of the model. The structural time series models have these features. The basic form of the structural time series models (STM) is as follows: where Y t is the observed value, T t represents the trend component, S t denotes the seasonal component, and E t is the error term. Hence, the STM is a dynamic system composed of trend and seasonal components perturbed by random errors. The STM is capable of reflecting the important characteristics of the data and enhancing prediction power of the model by including repressors. These models also utilize the previous knowledge about the parameters by adding the prior information.
The Gaussian STM can be expressed as Equation (2) links the observed data Y t with the unobserved latent variable t , hence called observation equation. In (2), Y t is k Â 1 vector of values, W t is a k Â m matrix involving known values, t is unobserved k Â 1 state vector and x t are the randomly and independently distributed Gaussian error terms with zero mean and variance È t . On the other hand, Eq. (3) is called transition equation as it defines the behavior of latent state over time. It is simply an autoregressive model of t , defined by the unobservable Markov Chain process observed by Y t . Here, the X t is k Â k transition matrix, H t is k Â m error control matrix (indentifies the rows of transition equation having non-zero error terms), and l t another Gaussian random error term with mean zero and variance t . Therefore, the STM contain the underlying stochastic process determined by t . Since t are unobservable, a vector of observations is used to compute the system. The initial information in shape of prior information is assumed for which is normally distributed with mean t and variance Q 0 which is independent of x t and l t for all t.

Markov Chain Monte Carlo (MCMC) method
Suppose b is a set of all model parameters and a ¼ a 1 ; a 2 ; …; a m ð Þrepresents the full state sequence. Further suppose that the prior distribution for b is p b ð Þ as well as distribution p a 0 b j ð Þ on the initial state values. Then, the sample from the posterior distribution Þcan be obtained using MCMC method in the following manner. Draw a sequence of a 1 ; b 1 ð Þ; a 1 ; b 1 ð Þ; … from a Markov chain having stationary distribution as p a; b Y 1:n j ð Þ . The sampler alternates between the data augmentation step and parameter simulation step. The data augmentation step and parameter estimation step simulate from conditional distributions p a Y 1:n ; b j ð Þand p b Y 1:n ; a j ð Þ , respectively. The data augmentation step has been carried out using the algorithm proposed by Durbin & Koopman (2002) which provides the improvement in the algorithm proposed by De Jong & Shephard (1995). Once the draws for the state are available, the draws for the parameters are easy to obtain for all state components excluding the static regression coefficients. The draws for the static regression coefficients have been obtained using the algorithm proposed by Ghosh & Clyde (2011). The detailed studies regarding implementation of MCMC algorithms can be seen from the contributions of Fitzgerald (2001), Bugallo, Martino & Corander (2015) and Martino (2018).

Causal impacts
Equations (2) and (3) also investigate the post-lockdown discrepancy between observed number of new cases, deaths and active cases and a simulation based series that was expected without lifting the lockdown. Such investigations provide facilitations in assessing the causal impacts of lifting lockdown in the country. These causal impacts can be computed by the following steps.
Step-I: Fit STM using observations under the lockdown period.
Step-II: Use fitted model from Step-I to produce forecasts for post-lockdown period.
Step-III: The difference between forecasted series and actual series during the post-lockdown period is computed and interpreted as causal impacts of lifting the lockdown in the country. The causal impact model incorporates the behavior of two time series (i) the behavior of the response series in the pre-intervention period (ii) the behavior of a time series that were predictive of the response series in the pre-intervention period. If the control series did not encountered the intervention, it can be assumed that the relationship that existed between the treatment and control in the pre-intervention period may continue in the post-intervention period (Brodersen et al., 2015). We used number tests as the control series, which were predictive of the number of cases/deaths prior to the lifting of the lockdown. The number of tests was itself not affected by the lifting of the lockdown; hence the assumptions for the implementation of the causal impact model were not affected.
We have used R software to conduct the analysis regarding the study. The computations for ARIMA models have been obtained using auto.arima function available in forecast package, while bsts package has been used to obtain the forecasts for the BSTS models (Scott, 2020). We have employed the local level model for the trend component. The number seasons have been chosen to be seven to evaluate the weekly pattern of the series. The root mean square percentage error (RMSPE) and root median square percentage error (RMdSPE) have been used to compare the forecast accuracy of proposed models with ARIMA models. The said measures can efficiently determine the forecast accuracy of a model (Hyndman & Koehler, 2006;Bowerman, O'Connell & Koehler, 2004). Based on these measures, we have investigated the predictive power of BSTS models using different starting points. To investigate whether the residuals are white noise, we used Ljung Box test at different lags. We have also included the datasets from neighboring countries (India and Iran) to observe the applicability of the proposed models in different situations. As we observed the improved forecasting capacity of BSTS models, the detailed forecasts have only been reported under BSTS models for brevity. The forecasts under BSTS models are based on prior information and current data (likelihood function). The bsts package uses spike-slab prior for the analysis. The prior information is combined with the likelihood function to produce the posterior distribution. The BSTS models often use the default prior. The default prior for the variance r 2 ð Þ can be formulated as: r À2 $ G 10 À2 ; 10 À2 s 2 On the other hand in case of several controls, the spike and slab prior is used for the coefficients (Scott & Varian, 2014b). The prior parameters in spike and slab are selected using Zellner's prior method (Liang et al., 2008;Zellner, 1986).The posterior distribution is estimated using Gibbs sampler (George & McCulloch, 1997). The detailed studies regarding implementation of MCMC algorithms can be seen from the contributions of (Fitzgerald, 2001;Bugallo, Martino & Corander, 2015;Martino, 2018). In addition the Kalman filter and Bayesian model averaging were employed to obtain the forecasts. The causal impacts of lifting the lockdown are investigated by conducting intervention analysis within the BSTS models. The CausalImpact package in R (Brodersen & Hauser, 2020) has been used for estimating the said impacts.

RESULTS AND DISCUSSIONS
On July 11, 2020, Pakistan had 246,351 number of total confirmed cases, 93,217 active cases, 5,123 deaths and 153,134 recoveries regarding COVID-19. The ratio of cumulative confirmed cases per 100 tests (RCT) stood at 16.01. Similarly, the ratio of cumulative deaths per 100 confirmed cases (RDC) was 2.08, and the ratio of cumulative number of recoveries per 100 confirmed cases (RRC) was 62.12. These figures have already challenged the healthcare system of Pakistan having total number of 215,436 doctors, 108,137 nurses, 19,218 community health workers (CHWs) and 41,689 labs across the country (WHO, 2020). The government figures also suggest that there are 6,664 crucial care beds and about 2,500 Ventilators for the COVID-19 patients in the country (NIH, 2020). These figures simply indicate that Pakistan has very limited healthcare infrastructure to cope up this pandemic. Hence, analysis of different parameters of this epidemic is fundamental for watchful decision making in the country. Using the dataset obtained from NIH, we compared the forecasting accuracy of BSTS models with ARIMA models. The comparison is made on the basis of RMSPE and RMdSPE. The current trends of the pandemic have been represented in Fig. 1. The results regarding comparison of predictive powers of the proposed models are given in Table 1. The diagnostic checking for the proposed models, using Ljung Box test, has been reported in Table 2. The future trends of the outbreak are presented in Table 3 and in Fig. 2.
The separated analysis of trend, seasonality and regression components has been presented  Fig. 3. On the other hand, the causal impact of lifting lockdown in the country has been discussed in Fig. 4. Figure 1 reports the current trajectories of COVID-19 in Pakistan, where the zeroth day denotes February 25, 2020. In particular, this figure includes the trends of RCT, RDC, RRC and RAC. Though RAC has decreasing trend and RRC has increasing trend, the issue for the country is that RDC is constant over time, which means that rate of deaths is directly proportional to number of cases. In addition, the RCT has increased over time, which is quit alarming. It means the relative pace of spread has increased over time. The comparison of forecast accuracy for ARIMA and BSTS models are placed in Table 1. The numerical results for RMSPE and RMdSPE for each country (Pakistan, Iran and India) have been reported in the said Table. We have used different starting points to check the predictive power of the proposed models. In particular we have used four different  said figures suggested the good convergence of the estimates. Further, QQ plots for the residual terms have been given in Figs. 9-12. In QQ plots, the black line is the reference line for the standard normal distribution and means of MCMC draws have been represented by blue dots. Since the dots are clustered around the straight line, the normality assumption holds well for the residuals. In additions, we have also plotted the posterior distribution of the autocorrelation function (ACF) of residuals using boxplots, for Pakistan, in Figs. 13-16. Since the ACF plots have dampened out at successive lags, so the residual terms are stationary for each variable. Further, the numerical results regarding diagnostic checking of the proposed models have been presented in Table 2. From these results, it can be seen that P-values for the Ljung Box test (Q-statistic) are greater than 0.05 for all variables at different lags. So, we do not have sufficient evidence to assume the residuals as dependent. The calibration for the models using 90%, 80%, 70%, 60% and 50% data has been reported in Figs. 17-21, respectively. From the said figures, it can be seen that actual data are within the corresponding 95% prediction intervals. Hence, the models can efficiently be used to derive the forecast for the epidemic. Table 3 and Fig. 2 represent the forecasts for various parameters of the epidemic using BSTS models. It should be noted that this figure has logarithmic scale for y-axis, so the straight line indicate the exponential growth. From this figure, it can be assessed that estimated series is quite close to the observed data (mostly overlapping the observed series), which indicates the model efficiency. Further, the cumulative number of positive cases, recoveries and deaths is expected to increase exponentially during the next 30 days. However, the growth in number of recoveries is expected to be faster than that of confirmed case, which is a good sign for the country. To be more specific, on August 10, 2020, the expected number of positive cases in Pakistan will be 333,308 with 95% prediction interval [275,077]. Similarly, the number of deaths in the country is expected to reach 7,187 [5,390] and recoveries may grow to 279,602 [208,420-295,740]. However, it was very encouraging to observe that the active cases are likely to decrease to 63,706 [17,614-95,337], which is lower as compared to current number. This relief is mainly due to rapid growth in the recoveries in the country. The demand for the healthcare facilities is expected to be slightly lower in the next month.  The BSTS models also allow us to investigate the patterns of trend, seasonality and regressions individually. We have investigated the contribution of these components for cumulative number of positive cases, as the patterns for the other parameters were alike. As the cumulative number of positive cases depends on the cumulative number of tests, we considered the cumulative number of tests conducted in the country as covariate. The contributions of the said components have been presented in Fig. 3. This figure elucidates that the cumulative number of positive cases is increasing exponentially in the country. The contribution of seasonality is slightly increasing over time. In addition, the contribution of regression component is also has an upward trend, this is mainly due to increase in rate of confirmed cases. The causal impact of lifting the lockdown in the country has also been discussed by conducting intervention analysis using BSTS models. The results have been given in Fig. 4. This figure simply indicates the significant increase in pace of the outbreak (in black lines) as compared to expected, had the lockdown not lifted (in grey shade). The probability of causal impact was 0.9989, which was quite high indicating that the lifting the lockdown has increased the outbreack of the epidemic significantly. On July 11, 2020, the total number of confirmed cases was 246,351; however under lockdown the number of cases would have been 147,583 with 95% interval [135,807]. So, there is an absolute increase of 98,768 confirmed cases with 95% interval [85,018]. Hence, the country has conceded 56% increase (in the confirmed cases) than the expected, if the lockdown would have continued. In addition, during the lockdown, the average rate of positive cases per 100 tests was 8.68%, while for the post lockdown period this rate stands at 13.21%. Similarly, the rate of deaths per 100 positive cases, for the lockdown period, was 1.46% which increased to 2.06% in the after the lockdown. On the whole, the ending lockdown has significantly increased the load on the Pakistani healthcare system. It is also a truth that the country cannot afford a prolonged lockdown. However, identifying the hotspots and enforcing lockdown in those areas may help. In addition, the serious effort to induce the people to abide by the SOPs in the country is fundamental. indicated a rapid future growth in the pandemic in respective countries. Our results are also comparable with the earlier study carried out in Pakistan (Yousaf et al., 2020). However, this study did not cover the impact lockdown in the country. In addition, we have obtained these forecasts using a more flexible model. Farzana Noor performed the experiments, analyzed the data, authored or reviewed drafts of the paper, and approved the final draft. Amjad Ali performed the experiments, authored or reviewed drafts of the paper, and approved the final draft.

Data Availability
The following information was supplied regarding data availability: The data and the R code are available in the Supplemental Files.

Supplemental Information
Supplemental information for this article can be found online at http://dx.doi.org/10.7717/ peerj.11537#supplemental-information.