The peak and size of COVID-19 in India: SARIMA and forecast

Following the USA, India ranked second globally for COVID-19 cases in the pandemic year 2020. The study explores the epidemiological stage of COVID-19 disease in India by estimating two key features: the peak and the size of COVID-19 cases. Data for this study were retrieved from the publicly available COVID19-India application programming interface (API). An exponential model was applied to estimate the growth rates of daily COVID-19 cases. A seasonal auto regressive integrated moving average (SARIMA) model was developed for the growth rates to predict daily COVID-19 cases. The exponential model reveals a shift and a modest decline in the growth of daily COVID-19 cases. The study shows that the SARIMA model is suitable for projecting daily COVID-19 cases. The forecasted peak value was approximately 104,000 daily COVID-19 cases on 19 September 2020, whereas the real-time peak value was 97,861 COVID-19 cases, observed on 16 September 2020. The projected size of COVID-19 disease was 105 lakhs (10.5 million) versus a real-time size of 103 lakhs (10.3 million) at the end of December 2020. The forecasts and projections are close to the real-time peak value and size of COVID-19 cases in India, and the study successfully explores the epidemiological stage of COVID-19 disease in India.


Introduction
Many of the countries ranked highest in the pandemic, such as the United States of America (USA), Brazil, Russia and the United Kingdom (UK), have experienced a second wave of coronavirus disease 2019 (COVID-19), the disease caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) (Atangana, 2020;WHO, 2020). India is one of the countries that has not yet shown a second wave of the COVID-19 pandemic. However, following the USA, India ranks second in the size of COVID-19 disease (Dong, Du, & Gardner, 2020;JHU CSSE, 2020). Given such trends, India presents a different predicament. Unlike other countries, India showed an upsurge of COVID-19 with a delayed and high peak value, whereas other countries, including the USA, reached their first peak much earlier than India and have since seen ups and downs over the course of the COVID-19 pandemic.
However, in August 2020, India surpassed the USA's peak value and reached the highest count of daily COVID-19 cases in the world. The exponential rise in COVID-19 cases delayed the peak value, indicating a subtle impact of non-pharmaceutical interventions (NPIs) in India. In India, some contributing factors such as income support, debt relief (Dev, 2020;Sengupta & Jha, 2020) and internal movement restrictions (Bhagat et al., 2020) might have led to the delay in attaining a peak value. After some conditional relaxations during April 2020 (MOHFW, 2020a), India implemented a cluster-containment strategy of red, orange, and green zones for movement restrictions during May 2020 (MOHFW, 2020b), where green zones had the least restrictions and red zones had the strictest restrictions.
This study aims to comprehend the epidemiological stage of COVID-19 disease in India. COVID-19 cases rose exponentially over the seven months from 30 January 2020 to August 2020. Therefore, it is important to analyse the daily confirmed cases of COVID-19 disease in India to estimate the peak value and the size of the COVID-19 cases, taking the end of September and the end of December 2020 as the critical periods. We explored the COVID-19 disease data by applying exponential models, developing a Seasonal Auto Regressive Integrated Moving Average (SARIMA) model, and assessing that model. Based on the application of these models, we analysed the growth rates of COVID-19 cases and provided an expected count of the peak and an expected size of COVID-19 cases in India.

Data source
We retrieved data from the COVID19-India application programming interface (API), an open-access and publicly available database of COVID-19 disease cases (COVID19-India API, 2020). We retrieved data between 30 January 2020 and 31 August 2020, and a period of five-and-a-half months is used for the analysis of daily COVID-19 cases. In addition, the daily COVID-19 cases between 01 September 2020 and 30 September 2020 were used for validating the outcomes for the next 30 days.

Exponential model
The exponential model is expressed as
$$y_t = C\,e^{kt} \qquad (1)$$
where $t$ is time or date, $y_t$ is the number of daily COVID-19 cases at time $t$, $C$ is the initial value, and $B = e^{k}$ with $k$ as the growth or decay rate.
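As an illustration of Equation (1), the sketch below estimates C and k by ordinary least squares on the log scale, ln(y_t) = ln(C) + k t. This is a simplification and not necessarily the estimation procedure used in the study, which reports a nonlinear estimation method. The function name `fit_exponential` and the sample series are hypothetical.

```python
import math

def fit_exponential(t, y):
    """Fit y = C * exp(k * t) by least squares on ln(y) = ln(C) + k * t.

    Assumption: every y value is positive. This log-linear fit is a stand-in
    for the nonlinear estimation reported in the study.
    """
    n = len(t)
    log_y = [math.log(v) for v in y]
    t_mean = sum(t) / n
    log_y_mean = sum(log_y) / n
    sxx = sum((ti - t_mean) ** 2 for ti in t)
    sxy = sum((ti - t_mean) * (li - log_y_mean) for ti, li in zip(t, log_y))
    k = sxy / sxx                               # growth (or decay) rate
    c = math.exp(log_y_mean - k * t_mean)       # initial value C
    return c, k

# Hypothetical noise-free series generated with C = 50 and B = e^k = 1.047
days = list(range(30))
cases = [50 * 1.047 ** d for d in days]

c_hat, k_hat = fit_exponential(days, cases)
b_hat = math.exp(k_hat)                         # B = e^k
print(round(c_hat, 3), round(b_hat, 3))
```

On noise-free data the fit recovers the generating values of C and B; on real case counts the log-linear and nonlinear estimates can differ because they weight errors differently.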

Forecast and projection of COVID-19 cases
Auto regressive integrated moving average (ARIMA) was applied to the parameter k of the exponential model (Ceylan, 2020;Singh et al., 2020) from mid-March 2020 to 31 August 2020 to predict daily cases from 01 September 2020 until 28 February 2021. ARIMA (p, d, q), where p is the order of autoregression, d is the degree of differencing, and q is the order of the moving average, is expressed as
$$y_t = \mu + \sum_{n=1}^{p} \phi_n\, y_{t-n} + \sum_{n=1}^{q} \theta_n\, \varepsilon_{t-n} + \varepsilon_t \qquad (2)$$
where $y_t$ is the differenced series of COVID-19 cases, $y_{t-n}$ are the lagged values of $y_t$ of order n, $\varepsilon_t$ is white noise, $\varepsilon_{t-n}$ are the lagged errors of order n, $\mu$ is an intercept term, $\phi$ is an autoregressive parameter, and $\theta$ is a moving average parameter.
An ARIMA model is efficient in dealing with non-stationary and stationary time series data (Duan & Zhang, 2020;Lotto et al., 2017). The seasonal ARIMA, or SARIMA, model for time series is written as ARIMA (p, d, q) (P, D, Q)[m], where (P, D, Q) represents the (p, d, q) for the seasonal part of the time series, and m refers to the number of observations per cycle. The SARIMA model's accuracy is measured by root mean square error (RMSE) and mean absolute percentage error (MAPE).
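The two accuracy measures named above can be sketched as follows. The helper names `rmse` and `mape` and the sample values are hypothetical; in the study these measures are computed on the modelled growth rates, whose values are not reproduced here.

```python
import math

def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def mape(actual, predicted):
    """Mean absolute percentage error, in percent.

    Assumption: no actual value is zero, otherwise the ratio is undefined.
    """
    ratios = [abs((a - p) / a) for a, p in zip(actual, predicted)]
    return 100 * sum(ratios) / len(ratios)

# Hypothetical observed and fitted values
observed = [100.0, 200.0, 400.0]
fitted = [110.0, 190.0, 400.0]

rmse_value = rmse(observed, fitted)
mape_value = mape(observed, fitted)
print(round(rmse_value, 3), round(mape_value, 3))
```

RMSE is expressed in the units of the series, while MAPE is scale-free, which is why the two are usually reported together.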
The natural logarithm of the constant 'C' of the exponential model showed a linear increase during the studied period. A simple linear regression was applied to predict the values of 'C' over time:
$$\ln(C) = \alpha + \beta\, t \qquad (3)$$
where $\alpha$ is the constant and $\beta$ is the slope of the model.
The Pearson correlation test was applied to measure the linear association between projected values and real-time data from 01 September 2020 until the peak timing. The statistical software used was Stata (StataCorp, 2019), R (R Core Team, 2017) and RStudio (RStudio Team, 2020).
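For readers who want to reproduce the validation step, the Pearson correlation coefficient between projected and observed daily cases can be computed as in the minimal sketch below. The function name `pearson_r` and the two short series are hypothetical and do not come from the study's data.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxy = sum((a - x_mean) * (b - y_mean) for a, b in zip(x, y))
    sxx = sum((a - x_mean) ** 2 for a in x)
    syy = sum((b - y_mean) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical projected and observed daily counts
projected = [1.0, 2.0, 3.0, 4.0]
observed = [2.0, 4.0, 6.0, 8.0]

r_same_direction = pearson_r(projected, observed)
r_opposite = pearson_r(projected, list(reversed(observed)))
print(r_same_direction, r_opposite)
```

The coefficient measures only linear association, so a high value does not by itself imply that the projected and observed levels agree.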

Exponential model
The study included 3,687,944 confirmed cases of COVID-19 disease for analysis during the seven months between 30 January 2020 and 31 August 2020. In addition, the study used COVID-19 cases in September 2020 for validating the outcomes. Figure 1 shows the non-linear fits based on Equation (1) over the daily COVID-19 cases. It also shows the adjusted R-square of the exponential model's natural logarithm from 15 March 2020 until 31 August 2020, as the sample size was small until mid-March 2020.
The adjusted R-square values (brown coloured connected circles) of the exponential model's natural logarithm were more than 85% from May to August 2020. This high adjusted R-square confirms the large proportion of variance explained by the exponential model's natural logarithm and indicates a good fit of the model to COVID-19 cases. The model parameters 'C' and 'B' (= e^k) of the exponential model were estimated by applying the nonlinear estimation method. These model parameters have been significant (P<.001) since mid-April 2020. The parameter 'B', or the slope of the exponential model, explains the acceleration or deceleration of COVID-19 cases. The slope parameter 'B' was 1.09 (P<.001) by mid-April, and declined to 1.047 (P<.001) and 1.023 (P<.001) by the end of May 2020 and August 2020, respectively (Table 1). Based on these two slope parameters, the plot of non-linear fits until May 2020 and August 2020 (deep-sky-blue (dark and light) coloured long-dash lines) displays the exponential increase in daily COVID-19 cases and a deceleration in COVID-19 cases (Figure 1).

Forecast, projection and the peak and size of COVID-19 cases

Model assessment: ARIMA/SARIMA
We fitted ARIMA/SARIMA models using Equation (2) to obtain the forecasts and projections of daily COVID-19 cases. The time series plot of growth rates, the autocorrelation function (ACF), and the partial autocorrelation function (PACF) (Figure 2) reveal seasonality and non-stationarity in the growth rates of daily COVID-19 cases. Based on the ACF and PACF, second-order differencing was applied to make the time series stationary. The corrected AIC (AICc) value of the ARIMA (2, 2, 1) model was -347.3. Based on the seasonal part of the ACF and PACF, the SARIMA (2, 2, 1) (1, 0, 1)[7] model was then applied, which had the lowest AICc value of -357.3 (Table 2). A seven-day cycle was applied as the moving average of seven days shows a linear trend in COVID-19 cases. The diagnostics of the developed SARIMA model confirm no trends in the residuals, no outliers, and nearly constant variance (Figure 2). Specifically, the ACF plot of residuals shows no significant autocorrelations. The residuals resemble a standard normal variate. The P values for the Ljung-Box statistic (Q) were above 0.05 (Table 2). In sum, the residual diagnostics confirm that the developed SARIMA (2, 2, 1) (1, 0, 1)[7] model is appropriate to the trends and seasonality in growth rates. The root mean square error (RMSE) and mean absolute percentage error (MAPE) of the SARIMA model were 0.049 and 0.775, respectively, which are very small.
Table 3 shows the estimates of the coefficients of the SARIMA (2, 2, 1) (1, 0, 1)[7] model. The MA1, SAR1 and SMA1 coefficients are significant at the 1% level, whereas the AR1 and AR2 coefficients are not. The estimates of the autoregressive terms of order 1 (AR1) and of order 2 (AR2) are 0.182 (P=0.089) and 0.160 (P=0.177), respectively. The estimate of the moving average term of order 1 (MA1) is -0.955 (P<.001). The estimates of the seasonal autoregressive term of order 1 (SAR1) and the seasonal moving average term of order 1 (SMA1) are 0.940 (P<.001) and -0.778 (P<.001), respectively.
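For reference, the SARIMA (2, 2, 1) (1, 0, 1)[7] model can be written out in lag-operator form. The notation here is an assumption and does not appear in the study: $L$ is the lag operator ($L k_t = k_{t-1}$), $k_t$ denotes the growth rate series, and the moving average terms enter with a plus sign, as in the convention of R's `arima`. Under the opposite sign convention the MA coefficients change sign.

```latex
% SARIMA(2,2,1)(1,0,1)[7] in lag-operator form (notation assumed, see lead-in)
\[
(1 - \phi_1 L - \phi_2 L^2)\,(1 - \Phi_1 L^{7})\,(1 - L)^{2}\, k_t
  = (1 + \theta_1 L)\,(1 + \Theta_1 L^{7})\,\varepsilon_t
\]
% Reported estimates (Table 3):
% \phi_1 = 0.182 (AR1), \phi_2 = 0.160 (AR2), \theta_1 = -0.955 (MA1),
% \Phi_1 = 0.940 (SAR1), \Theta_1 = -0.778 (SMA1)
```

The factor $(1 - L)^2$ is the second-order differencing (d = 2), and the terms in $L^7$ carry the seven-day seasonal cycle.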
The significant values of MA1, SAR1, and SMA1 of the SARIMA model indicate a notable effect on the estimates of COVID-19 cases. SAR1 has a positive effect on the growth rate and hence on COVID-19 cases, whereas MA1 and SMA1 have a negative effect on the growth rate. Altogether, there is a deceleration in the growth rate, and so COVID-19 cases would decline over time.
Table 4 shows the parameter estimates of Equation (3) for extrapolating the constant 'C'. The slope parameter (beta coefficient) is 0.060, which is small and significant at the 1% level, and the adjusted R-square is 75.3%.
Figure 4 shows the trends in daily COVID-19 cases (grey coloured circles with pipes) until August 2020 and projected daily COVID-19 cases (green coloured circles with pipes) based on the SARIMA model (Equation 2) and the exponential model (Equation 1) from 1 September 2020 until 28 February 2021.

Forecast and projection: peak and size of COVID-19 cases
The estimates of the SARIMA model shown in Table 3 were used for forecasting growth rates between 01 September 2020 and 28 February 2021. Along with the estimates of the constant 'C' shown in Table 4, the projected growth rates were used in the exponential Equation (1) to estimate projected values of COVID-19 cases until 28 February 2021.
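One way to read this step is that each projected day combines a constant extrapolated with Equation (3) and a forecasted growth rate inserted into Equation (1). The sketch below follows that reading. The function names `project_cases` and `find_peak`, the time index, and all numeric inputs are hypothetical, and the exact procedure used in the study may differ.

```python
import math

def project_cases(alpha, beta, k_forecast, t_start):
    """Project daily cases as y_t = C_t * exp(k_t * t).

    Assumptions: C_t comes from ln(C_t) = alpha + beta * t (Equation 3), and
    k_forecast holds one forecasted growth rate per day, starting at t_start.
    """
    projected = []
    for offset, k_t in enumerate(k_forecast):
        t = t_start + offset
        c_t = math.exp(alpha + beta * t)          # extrapolated constant C
        projected.append(c_t * math.exp(k_t * t)) # Equation (1) with k_t
    return projected

def find_peak(projected, t_start):
    """Return (day index, value) of the largest projected daily count."""
    best = max(range(len(projected)), key=lambda i: projected[i])
    return t_start + best, projected[best]

# Hypothetical inputs: a slowly rising C and a declining growth rate
alpha, beta = 2.0, 0.01
k_forecast = [0.030, 0.0305, 0.029, 0.027]

series = project_cases(alpha, beta, k_forecast, t_start=100)
peak_day, peak_value = find_peak(series, t_start=100)
print(peak_day, round(peak_value, 1))
```

In this reading the projected peak is the day on which the decline in the forecasted growth rate first outweighs the rise in the extrapolated constant.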
The forecasted peak value is estimated at 104,081 daily COVID-19 cases on 19 September 2020, compared to the real-time peak value of 97,861 COVID-19 cases on 16 September 2020. The Pearson correlation coefficient between projected daily COVID-19 cases and real-time daily COVID-19 cases was 67% for the period of 19 days between 1 and 19 September 2020 and 57% between 19 and 30 September 2020.
The projections put the cumulative number of COVID-19 cases, i.e., the size of COVID-19 disease, at a total of 105 lakhs at the end of December 2020.
Figure 4: Projection of daily COVID-19 cases, India, September 2020 to February 2021.

Discussion
The study demonstrates the epidemiological stage of COVID-19 disease in India by estimating the peak value and the size of COVID-19 cases based on 3,687,944 COVID-19 cases between 30 January and 31 August 2020. A two-parameter exponential model was applied to the COVID-19 cases to examine their growth or decay rate. A SARIMA model was developed to forecast the growth or decay rate for future dates, and COVID-19 cases were projected from these forecasts.
The results show that the adjusted R-square of the exponential model was higher than 85% in the studied period (Table 1). The fit of the exponential model shows a shift that confirms a deceleration in the growth rates (Figure 1).
Based on the developed SARIMA model (Table 2), the projection shows a slow, cyclic decline in the growth rates of daily COVID-19 cases since September 2020. Projections show a forecasted peak value of 104,081 daily COVID-19 cases on 19 September 2020 versus a real-time peak value of 97,861 COVID-19 cases on 16 September 2020 (Figure 4). The declining growth rates, underpinned by the significant negative values of SMA1 and MA1 of the SARIMA model, are consistent with a peak in the third week of September 2020 (Table 3).
More prominently, we predicted the timing of the peak and its magnitude very close to real-time data, for the first time. The projected daily COVID-19 cases nearly coincided with the real-time data points between 1 September and 30 September 2020. This comparison provides an independent cross-check, in contrast to other sensitivity analyses such as comparisons with other countries or with training data. Furthermore, projections demonstrate that the total cumulative number of COVID-19 cases would be approximately 105 lakhs by the end of December 2020. This projection is close to the real-time count of 103 lakhs COVID-19 cases at the end of December 2020.
Knowledge of the peak value and its timing is crucial information for government authorities. Particularly for India, precautionary measures vary by state, as do standard measures and factors such as social distancing, self-quarantine, asymptomatic carriers and transmission rate (Gurdasani et al., 2021;Nabi, 2020;Sarkar et al., 2020). This study used more than five months' data and hence presented the results for a long-term prediction. For the first time, the study shows a close forecast of the peak value of the daily COVID-19 cases, its timing and the size of COVID-19 disease in India.
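The closeness of these figures can be quantified with simple arithmetic. The sketch below uses only the values reported in the text and the conversion 1 lakh = 100,000; the variable names are illustrative.

```python
from datetime import date

LAKH = 100_000  # 1 lakh = 100,000

# Peak of daily cases: forecast versus real-time value
forecast_peak, observed_peak = 104_081, 97_861
forecast_date, observed_date = date(2020, 9, 19), date(2020, 9, 16)

peak_error_pct = 100 * (forecast_peak - observed_peak) / observed_peak
timing_error_days = (forecast_date - observed_date).days

# Size (cumulative cases) at the end of December 2020
projected_size, observed_size = 105 * LAKH, 103 * LAKH
size_error_pct = 100 * (projected_size - observed_size) / observed_size

print(round(peak_error_pct, 1), timing_error_days, round(size_error_pct, 1))
```

The forecasted peak is about 6.4% above the real-time peak and three days later, and the projected size is about 1.9% above the real-time size.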

Limitations of the study
The study is based on daily COVID-19 cases during the seven months between 30 January 2020 and 31 August 2020. The projection analysis was performed at the national level only, because projections for smaller states are limited by small samples and by differences in the length of the period of COVID-19 disease. Apart from statistical and time series assumptions, ARIMA or SARIMA can make projections only within the limits of a linear relationship (Zhang et al., 2013) and depend on requiring a new cycle (Nobre et al., 2001;Wang et al., 2020). The behavioural aspects and the socio-economic and demographic characteristics of the population were not available for consideration in trend analysis and projections because of data constraints. In view of these data constraints, we took trend and seasonality into account for robust analysis and long-term projection.

Conclusion
The study shows that the SARIMA model is suitable for projecting daily confirmed cases of COVID-19 disease. Based on moments of the distribution of the daily COVID-19 cases, the study reduces the uncertainty about the peak and the size of COVID-19 disease in India. Based on the applied methodology, we show a peak value of 104,081 daily COVID-19 cases on 19 September 2020, whereas the real-time peak value of 97,861 COVID-19 cases was observed on 16 September 2020. The forecast and projection of daily COVID-19 cases are very close to the real-time peak value and the size of COVID-19 cases for India. The projected COVID-19 cases are 105 lakhs versus real-time COVID-19 cases of 103 lakhs by the end of December 2020. The study successfully explores the epidemiological stage of COVID-19 disease in India. The study strongly suggests keeping track of the growth rates of daily COVID-19 cases obtained from the exponential model to understand the flattening and size of COVID-19 disease in India.