Volatility estimation for COVID-19 daily rates using Kalman filtering technique

This paper discusses the use of stochastic modeling in the prognosis of Coronavirus Disease 2019 (COVID-19) cases. COVID-19 is a new disease that is highly infectious and dangerous. It has deeply shaken the world, claiming the lives of over a million people and bringing the world to a lockdown. Early detection of COVID-19 is therefore essential for the patients' timely treatment and for preventive measures. A filtering technique with time-varying parameters is presented to predict the stochastic volatility (SV) of COVID-19 cases. The time-varying parameters are estimated using the Kalman filtering technique based on the stochastic component of data volatility. Kalman filtering is essential as it removes insignificant information (noise) from the data. We forecast the one-step-ahead predicted volatility with ±3 standard prediction errors, with the model parameters obtained by Maximum Likelihood Estimation. We conclude that Kalman filtering in conjunction with the SV model is a reliable predictive model for COVID-19 since it is less constrained by the past autoregressive information.


Introduction
Forecasting of time series with the estimation of time-varying parameters is useful for many statistical, probabilistic, and optimization processes that allow models to consider past observations and detect the disease pattern. Researchers and developers are increasingly using stochastic models to track and prevent diseases over time and to gain a more comprehensive understanding of them. Recently, many researchers, journalists, and amateur data enthusiasts have been working on stochastic models to help people monitor the coronavirus's spread and effects over time.
COVID-19 is a respiratory illness caused by a novel coronavirus, a member of a virus family that affects humans, other mammals, and birds. This viral disease has become a major global disaster. Outbreaks of the novel coronavirus were initially detected in Wuhan, China (in 2019), and have since spread to numerous countries worldwide. Nearly 40 million cases had been identified in 188 countries by October 15, 2020, with over one million fatalities and 27 million recoveries [1].
Experts have confirmed that the virus can spread rapidly from one person to another and infect the lungs through the respiratory system. When people are close to each other (less than six feet apart), the virus spreads through droplets generated by coughing, sneezing, and talking. Most of the droplets fall to the ground or onto surfaces rather than traveling long distances in the air. Infected individuals show varying symptoms, including coughing, fever, throat infections, kidney failure, and respiratory problems. A less common route of infection is touching a contaminated surface and then touching one's face. The disease is most contagious during the first three days after the onset of symptoms, although it can also be transmitted by people who have not yet developed symptoms or who remain asymptomatic [2,3].
COVID-19 has been affecting the US for months, and researchers are working hard to determine the virus's characteristics (why some people are more affected than others, what we can do to slow its spread, and where it is likely to move next). The data indicate a spike within a short time; it is therefore useful to analyze the disease case rate to know how the disease is spreading, what impact the pandemic has on people, and whether the preventive measures are effective [4,5]. The dynamics of the COVID-19 cases are now believed to involve volatility clustering and show typical non-linear characteristics [22].
This study develops a stochastic model to predict the volatility of COVID-19 rates per day. Volatility models are commonly used for financial data because such data contain extreme fluctuations. The SV model with Kalman filtering is used in this analysis because COVID-19 data also show high spikes within a short time period. A challenge in applying SV models to COVID-19 data is the estimation of the time-varying parameters, because the volatility cannot be observed directly from the data. Our approach consists of observing only a time series of daily rates, followed by a filtering procedure. The rates are not a Markov process: the likelihood of the current observation is a function of the entire history of the COVID-19 cases, not just of the last observation. Moreover, the rates of infected cases exhibit long memory over time, which suggests a persistent dependence of present information on the past. We assume that the conditional log-volatility follows a stationary autoregressive process in the COVID-19 data. The long memory and stationarity are verified by testing the parameters and by the unit root test presented later. As direct likelihood estimation of the SV model is cumbersome, Kalman filtering is used to estimate the time-varying parameters via Maximum Likelihood Estimation [9].
The overview of this paper is as follows: Section 2 describes the research methodology of Kalman filtering [16] and volatility models. Section 3 deals with the dynamic behavior of the datasets. We discuss the background and some useful information about COVID-19. In Section 4, the long memory and stationarity tests are analyzed. Section 5 provides the results and discussion of our model's suitability regarding the estimation of model parameters for COVID-19 data. Finally, Section 6 contains the conclusion of this study.

Research methodology
This section describes the Kalman filtering technique and the volatility modeling used to estimate the stochastic volatility (SV) of the daily rates of COVID-19 cases.

Kalman filtering
We begin with a state-space model [14]:

$$ z_t = C\theta_t + \epsilon_t, \tag{1} $$

where $z_t$ is the observed data (space), $\theta_t$ is the unobserved data (state vector) with a coefficient matrix $C$, and $\epsilon_t$ is a Gaussian error term. As $\theta_t$ is unobserved, we use the following autoregressive equation:

$$ \theta_t = B\theta_{t-1} + \upsilon_t, \tag{2} $$

where $B$ is an $n \times n$ transition matrix and $\upsilon_t$ is a Gaussian error term with mean 0 and variance $\sigma_\upsilon$. The unobserved data $\theta_t$ can be estimated from Eqs. (1) and (2) given the data $z^s = \{z_1, \ldots, z_s\}$. In this study we take $s = t$, and the resulting recursive estimation is called filtering. The filtering technique helps to obtain an accurate estimate from noisy information. Writing $\hat{\theta}_t$ for the estimate of $\theta_t$, the error terms are defined as $\eta_t = \theta_t - \hat{\theta}_t$. The covariances of the two noise terms, denoted here by $Q$ and $R$, are assumed to be stationary over time:

$$ Q = E[\upsilon_t \upsilon_t^{T}], \qquad R = E[\epsilon_t \epsilon_t^{T}]. \tag{3} $$

The best filter is found by minimizing the mean squared error $E(\eta_t^2)$, which is equivalent to minimizing the trace of $W_t$, the error covariance matrix at time $t$:

$$ W_t = E[\eta_t \eta_t^{T}] = E[(\theta_t - \hat{\theta}_t)(\theta_t - \hat{\theta}_t)^{T}]. $$

Now the state vector can be updated with an innovation process. The state update equation is

$$ \hat{\theta}_t = \theta_t' + K_t (z_t - C\theta_t'), \tag{4} $$

where $\theta_t'$ is the prior estimate of $\theta_t$, $K_t$ is the Kalman gain, and $(z_t - C\theta_t')$ is the innovation or measurement residual. Substituting Eq. (1) into Eq. (4) gives $\eta_t = (I - K_t C)(\theta_t - \theta_t') - K_t \epsilon_t$, so the error covariance matrix at time $t$ is

$$ W_t = (I - K_t C)\, W_t' \,(I - K_t C)^{T} + K_t R K_t^{T}, \tag{5} $$

where $W_t' = E[(\theta_t - \theta_t')(\theta_t - \theta_t')^{T}]$ is the prior error covariance. The cross terms vanish because the error of the prior estimate is not correlated with the measurement noise. To compute the Kalman gain $K_t$, we minimize $\mathrm{tr}(W_t)$ by differentiating it with respect to $K_t$ and setting the derivative to zero:

$$ \frac{d\,\mathrm{tr}(W_t)}{dK_t} = -2\,(C W_t')^{T} + 2 K_t (C W_t' C^{T} + R) = 0, \tag{6} $$

which gives

$$ K_t = W_t' C^{T} (C W_t' C^{T} + R)^{-1}. \tag{7} $$

Eq. (7) is called the Kalman gain equation. With the optimal gain, Eq. (5) simplifies to

$$ W_t = (I - K_t C)\, W_t'. \tag{8} $$

The prior estimate of the error covariance for the next step can be expressed as

$$ W_{t+1}' = \phi\, W_t\, \phi^{T} + Q, \tag{9} $$

where $\phi$ is the stationary transition matrix ($B$ in Eq. (2)); with the optimal gain, the recursion ending in Eq. (9) gives the minimum MSE. For the details of filtering, the readers are referred to [12].
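The prediction and update steps described above can be sketched in a few lines of NumPy. This is a minimal illustration and not the authors' implementation: the function name `kalman_filter`, the noise covariances `Q` and `R`, and the synthetic scalar example (transition coefficient 0.9, state noise variance 0.1, observation noise variance 1) are all assumptions chosen for demonstration.

```python
import numpy as np

def kalman_filter(z, B, C, Q, R, theta0, W0):
    """Filter observations z (shape n x m) for the model
    z_t = C theta_t + eps_t,  theta_t = B theta_{t-1} + upsilon_t."""
    n, k = len(z), len(theta0)
    theta = np.asarray(theta0, dtype=float)
    W = np.asarray(W0, dtype=float)
    theta_filt = np.zeros((n, k))
    W_filt = np.zeros((n, k, k))
    identity = np.eye(k)
    for t in range(n):
        # Prior estimate and prior error covariance (prediction step)
        theta_prior = B @ theta
        W_prior = B @ W @ B.T + Q
        # Kalman gain computed from the prior covariance
        S = C @ W_prior @ C.T + R
        K = W_prior @ C.T @ np.linalg.inv(S)
        # State update with the innovation z_t - C theta_prior
        theta = theta_prior + K @ (z[t] - C @ theta_prior)
        W = (identity - K @ C) @ W_prior
        theta_filt[t], W_filt[t] = theta, W
    return theta_filt, W_filt

# Synthetic scalar example (all numbers are illustrative assumptions)
rng = np.random.default_rng(0)
n = 500
B = np.array([[0.9]])
C = np.array([[1.0]])
Q = np.array([[0.1]])
R = np.array([[1.0]])

state = np.zeros(n)
for t in range(1, n):
    state[t] = 0.9 * state[t - 1] + rng.normal(0.0, np.sqrt(0.1))
z = state + rng.normal(0.0, 1.0, size=n)

theta_filt, W_filt = kalman_filter(z.reshape(-1, 1), B, C, Q, R,
                                   np.zeros(1), np.eye(1))
mse_raw = np.mean((z - state) ** 2)               # error of the raw observations
mse_filt = np.mean((theta_filt[:, 0] - state) ** 2)  # error after filtering
print(mse_raw, mse_filt, W_filt[-1, 0, 0])
```

In this toy setting the filtered estimate should track the hidden state more closely than the raw observations do, and the filtered error variance settles to a constant after a few steps because the model matrices do not change over time.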

Volatility modeling
This subsection presents the volatility modeling of the daily rates of COVID-19 data. A stochastic component is used to compute the volatility by following an innovation sequence. The innovation is fully independent of the observations used in this study [13]. The data volatility is estimated through an unobservable process that changes stochastically. We express the rates $r_t$ as a product of two components of the process:

$$ r_t = \sigma_t \beta_t, \tag{10} $$

where $\sigma_t$ is the volatility and $\beta_t$ is a noise term. We assume that the noise term is a sequence of Gaussian white noise [15], and that there is no dependency between this sequence and the data volatility.
To estimate the stochastic volatility, the log-squared rates of the data are used as follows:

$$ \psi_t = v_t + \alpha_t, \tag{11} $$

where $\psi_t = \log r_t^2$, $v_t = \log \sigma_t^2$, and $\alpha_t = \log \beta_t^2$. The log-squared rates thus have two parts, namely the unobserved volatility $v_t$ and the unobserved noise $\alpha_t$. The unobserved volatility varies with time through an autoregressive equation [11]:

$$ v_t = a_0 + a_1 v_{t-1} + \gamma_t, \tag{12} $$

where $\gamma_t$ is a white Gaussian noise term with variance $\sigma_\gamma^2$. Eqs. (11) and (12) contain the time-varying parameters and are called the stochastic volatility model. In this model, the noise term is represented by a mixture of two Normal distributions, one with zero mean and the other with non-zero mean. So the observed data $z_t$ can be expressed as follows:

$$ z_t = v_t + \delta_t. \tag{13} $$

The noise part $\delta_t$ can be written as a linear combination of Normal random variables $x_{t0}$ and $x_{t1}$ weighted by a Bernoulli random variable $B_t$ [20] with probability $\pi$:

$$ \delta_t = B_t x_{t0} + (1 - B_t)\, x_{t1}, \tag{14} $$

where $x_{t0}$ is Normally distributed with mean 0 and variance $u_0^2$, and $x_{t1}$ is Normally distributed with mean $\mu_1$ and variance $u_1^2$. In our study, we assume that $x_{t0}$, $x_{t1}$, and $B_t$ are all independently and identically distributed, and the probabilities of $B_t$ are defined as $\Pr\{B_t = 0\} = \pi_0$ and $\Pr\{B_t = 1\} = \pi_1$, where $\pi_0 + \pi_1 = 1$. In the SV model, $a_0$, $a_1$, $\sigma_\gamma$, $\mu_1$, $u_0$, and $u_1$ are time-varying parameters, so our approach is to estimate them using the Kalman filtering technique described in Section "Kalman filtering".
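The log-squared transformation can be illustrated on simulated data. The sketch below is only a toy example under assumed parameter values ($a_0 = -0.1$, $a_1 = 0.95$, $\sigma_\gamma = 0.2$); the helper `simulate_sv` is hypothetical and does not use the COVID-19 data. It uses $\sigma_t = \exp(v_t/2)$, which follows from $v_t = \log\sigma_t^2$, and checks that the log-squared rates split additively into the log-volatility and a noise term.

```python
import numpy as np

def simulate_sv(n, a0, a1, sigma_gamma, seed=0):
    """Simulate rates r_t = sigma_t * beta_t with AR(1) log-volatility v_t."""
    rng = np.random.default_rng(seed)
    v = np.empty(n)
    v[0] = a0 / (1 - a1)  # start at the stationary mean of v_t
    for t in range(1, n):
        v[t] = a0 + a1 * v[t - 1] + rng.normal(0.0, sigma_gamma)
    beta = rng.normal(size=n)    # Gaussian white noise
    r = np.exp(v / 2) * beta     # sigma_t = exp(v_t / 2)
    return r, v

# Assumed (illustrative) parameter values
r, v = simulate_sv(2000, a0=-0.1, a1=0.95, sigma_gamma=0.2)

psi = np.log(r ** 2)   # log-squared rates
alpha = psi - v        # remaining noise term, equal to log(beta_t^2)

# For standard Gaussian beta_t, log(beta_t^2) has mean about -1.27
print(alpha.mean(), alpha.var())
```

The noise term $\log\beta_t^2$ is strongly left-skewed rather than Gaussian, which is one motivation for replacing it with a two-component Normal mixture.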

Dynamic behavior of the datasets
In this section, we present the background of the COVID-19 datasets used in the paper. It is the dynamic behavior of the data that encourages us to apply our methodology in this paper.

Data background
We collected the daily number of laboratory- and hospital-confirmed COVID-19 cases and deaths released by the World Health Organization (WHO) from January 10, 2020 to May 15, 2020 to construct a real-time database [6]. Highly affected countries such as the United States, China, Italy, and Spain were included in this study, and a comparison of daily cases is illustrated in Fig. 1. Daily new deaths and new cases at 10-day intervals in all four countries were then plotted for the first 90 days (see Figs. 2-5). The Healthcare Access and Quality (HAQ) Index for the most affected countries with confirmed COVID-19 cases reported by the WHO is derived from a previously published study by the GBD (Global Burden of Disease) 2016 Healthcare Access and Quality Collaborators [7].

Descriptive statistics
The dataset includes data from four different countries: the United States, China, Italy, and Spain. We assemble daily new cases and new deaths for the first 90 days for each country and calculate the percentage change [8]. Next, we applied some statistical analysis to the datasets for additional information. Table 1 provides information about the percentage change of daily new cases for each country. Spain has a lower mean value than other countries. The standard deviation for the maximum cases is higher in China, as presented in Table 1. The skewness and kurtosis give summary information about the shape of a distribution. The kurtosis is positive for all the countries and is largest for the USA; interpreted as excess kurtosis, this indicates heavier tails and a sharper peak than those of a normal distribution.
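As a rough illustration of these summary statistics, the sketch below computes the percentage change of a daily series together with its mean, standard deviation, skewness, and excess kurtosis. The helper names `percent_change` and `describe` and the case counts are hypothetical; they are not taken from the WHO data or from Table 1.

```python
import numpy as np

def percent_change(x):
    """Day-over-day percentage change; assumes no zero counts in x[:-1]."""
    x = np.asarray(x, dtype=float)
    return 100.0 * (x[1:] - x[:-1]) / x[:-1]

def describe(x):
    """Mean, sample standard deviation, skewness, and excess kurtosis."""
    x = np.asarray(x, dtype=float)
    standardized = (x - x.mean()) / x.std(ddof=0)
    return {
        "mean": x.mean(),
        "std": x.std(ddof=1),
        "skewness": np.mean(standardized ** 3),
        # Excess kurtosis: 0 for a normal distribution, > 0 for heavier tails
        "excess_kurtosis": np.mean(standardized ** 4) - 3.0,
    }

# Hypothetical daily new cases (made-up numbers for illustration only)
cases = [100, 120, 150, 140, 200, 260, 250, 300]
rates = percent_change(cases)
summary = describe(rates)
print(rates)
print(summary)
```

Whether a reported kurtosis is the raw fourth standardized moment or the excess over 3 depends on the software used, so the convention should be checked before interpreting its sign.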

Stationary and long memory approaches
This section analyzes the time series by testing for stationarity and long memory in the COVID-19 cases data. A stationary series and long memory series are relatively easy to predict. The assumption is that the data's statistical properties will be the same in the future as they were in the past. We now briefly discuss the long memory and stationary test when they are applied to the datasets.

Stationary test
To test the stationarity of the COVID-19 data, we used the Augmented Dickey-Fuller test (ADF test) [21]. It is a hypothesis test used to determine the presence of a unit root in a series, and it accommodates higher-order autoregressive processes. The null hypothesis is that the data have a unit root, against the alternative of no unit root. A p-value below the significance level leads to rejection of the null hypothesis, i.e., to the conclusion that the dataset has no unit root [17]. The summary statistics of this test for the datasets are presented in Table 2.
We see that all the p-values of the four datasets are lower than the significance level (0.01) at lag 4, which suggests that the alternative hypothesis of no unit root is acceptable, meaning that the datasets of the four countries are stationary.
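The regression behind the ADF test can be sketched with NumPy alone. The function `adf_statistic` below is a hypothetical helper, not the software used for Table 2: it returns only the test statistic for the constant-only case and does not compute p-values. The statistic is compared with approximately -3.43, the asymptotic 1% critical value for that case. Both series are simulated.

```python
import numpy as np

def adf_statistic(y, lags=4):
    """t-statistic of rho in the regression
    dy_t = c + rho * y_{t-1} + sum_i phi_i * dy_{t-i} + e_t."""
    y = np.asarray(y, dtype=float)
    dy = np.diff(y)
    target = dy[lags:]
    columns = [np.ones(len(target)), y[lags:-1]]  # constant and lagged level
    for i in range(1, lags + 1):
        columns.append(dy[lags - i:len(dy) - i])  # lagged differences
    X = np.column_stack(columns)
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    resid = target - X @ coef
    sigma2 = resid @ resid / (len(target) - X.shape[1])
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return coef[1] / np.sqrt(cov[1, 1])

# Simulated examples: a stationary AR(1) series and a random walk
rng = np.random.default_rng(1)
n = 500
noise = rng.normal(size=n)
stationary = np.zeros(n)
for t in range(1, n):
    stationary[t] = 0.5 * stationary[t - 1] + noise[t]
random_walk = np.cumsum(rng.normal(size=n))

stat_stationary = adf_statistic(stationary, lags=4)
stat_random_walk = adf_statistic(random_walk, lags=4)
print(stat_stationary, stat_random_walk)
```

A statistic far below the critical value, as expected for the stationary series, leads to rejecting the unit-root null; a random walk typically yields a statistic much closer to zero.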

Long memory test
As the data follow stationary behavior at a specific lag, we next analyze the long memory effects of the data. The fractional difference parameter identifies a long memory pattern of data in an Autoregressive Fractionally Integrated Moving Average (ARFIMA) model [18]. The process is considered to have long memory when the fractional difference parameter (known as the long memory parameter) lies in the interval (0, 0.5). Since the parametric ARFIMA model is fitted to the Gaussian stationary data $r_t$, traditional Maximum Likelihood (ML) can be used to estimate the model parameter. However, the traditional ML estimator (MLE) requires a large number of operations to optimize its likelihood function for a Gaussian random field, so it is not computationally efficient. We therefore used a more efficient algorithm, the Whittle likelihood, which provides a spectral approximation to the log-likelihood [19]. Using the stationarity property, the Whittle approximation reduces the number of operations of the MLE from $O(n^3)$ to $O(n \log n)$. Table 3 shows the parameter estimates with standard errors for the COVID-19 data from the four countries. The estimated parameter is less than 0.5 for each country's dataset. Furthermore, the standard errors are very low, meaning that the estimates are stable around the actual value. So the datasets used in the study follow long memory patterns, i.e., persistent behavior.
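A minimal sketch of Whittle estimation for the simplest case, ARFIMA(0, d, 0), is given below. It is an illustration under stated assumptions rather than the procedure behind Table 3: the helper names are hypothetical, the series is simulated with an assumed $d = 0.3$ through a truncated moving-average representation, and the objective is minimized by a plain grid search. The periodogram is computed with the FFT, which is where the $O(n \log n)$ cost comes from.

```python
import numpy as np

def simulate_arfima_0d0(n, d, n_coef=2000, seed=0):
    """Approximate ARFIMA(0, d, 0) sample via a truncated MA representation."""
    rng = np.random.default_rng(seed)
    k = np.arange(1, n_coef)
    # MA coefficients: psi_0 = 1, psi_k = psi_{k-1} * (k - 1 + d) / k
    psi = np.cumprod(np.concatenate(([1.0], (k - 1 + d) / k)))
    e = rng.normal(size=n + n_coef - 1)
    return np.convolve(e, psi, mode="valid")

def whittle_d(x, grid=None):
    """Whittle estimate of d with spectral shape g(lam) = (2 sin(lam/2))^(-2d)."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    n = len(x)
    m = (n - 1) // 2
    lam = 2 * np.pi * np.arange(1, m + 1) / n
    # Periodogram at the Fourier frequencies, computed with the FFT
    periodogram = np.abs(np.fft.fft(x)[1:m + 1]) ** 2 / (2 * np.pi * n)
    log_base = np.log(2 * np.sin(lam / 2))
    if grid is None:
        grid = np.linspace(-0.49, 0.49, 197)
    # Whittle objective with the innovation variance profiled out
    objective = [
        np.log(np.mean(periodogram * np.exp(2 * d * log_base)))
        - 2 * d * np.mean(log_base)
        for d in grid
    ]
    return grid[int(np.argmin(objective))]

x = simulate_arfima_0d0(4096, d=0.3)
d_hat = whittle_d(x)
print(d_hat)
```

An estimate inside (0, 0.5) is read as evidence of stationary long memory; a numerical optimizer could replace the grid search when finer resolution is needed.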

Results & discussion
This section presents the estimation of the time-varying parameters of the SV model using the Kalman filtering technique. To fit the Kalman filter, we first initialize the parameters $a_0$, $a_1$, $\sigma_\gamma$, $\mu_1$, $u_0$, and $u_1$ for estimation. The initialization was chosen so as to obtain the log-volatility over time. The parameter $\sigma_\gamma^2$ represents the variance of the log-volatility process and measures the randomness of future data volatility. To estimate the parameters at time $t$, the MLE algorithm was used with the innovation processes in Eqs. (12) and (13). In this case, we used the normally distributed autoregressive conditional heteroscedasticity assumption on the white noise term $\beta_t$ [10].
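The link between the filter, the likelihood, and the ±3 standard prediction error bands can be sketched as follows. This is a simplified stand-in and not the estimation procedure of this study: instead of the two-component Normal mixture, the observation noise $\log\beta_t^2$ is approximated by a single Gaussian with mean about -1.27 and variance $\pi^2/2$ (a Gaussian quasi-likelihood). The function `sv_kalman`, the parameter values, and the simulated data are all assumptions.

```python
import numpy as np

ALPHA_MEAN = -1.2704        # mean of log(beta^2) for standard Gaussian beta
ALPHA_VAR = np.pi ** 2 / 2  # variance of log(beta^2)

def sv_kalman(psi, a0, a1, sigma_gamma):
    """One-step-ahead log-volatility predictions, their variances, and the
    Gaussian quasi log-likelihood built from the innovations."""
    n = len(psi)
    v_pred = np.empty(n)
    p_pred = np.empty(n)
    v = a0 / (1 - a1)                     # stationary mean
    p = sigma_gamma ** 2 / (1 - a1 ** 2)  # stationary variance
    loglik = 0.0
    for t in range(n):
        # One-step-ahead prediction of the log-volatility
        v_pred[t] = a0 + a1 * v
        p_pred[t] = a1 ** 2 * p + sigma_gamma ** 2
        # Innovation and its variance
        innovation = psi[t] - (v_pred[t] + ALPHA_MEAN)
        s = p_pred[t] + ALPHA_VAR
        loglik -= 0.5 * (np.log(2 * np.pi * s) + innovation ** 2 / s)
        # Update with the Kalman gain
        gain = p_pred[t] / s
        v = v_pred[t] + gain * innovation
        p = (1 - gain) * p_pred[t]
    return v_pred, p_pred, loglik

# Simulated log-squared rates under assumed parameter values
rng = np.random.default_rng(2)
n, a0, a1, sigma_gamma = 2000, -0.1, 0.95, 0.2
v_true = np.empty(n)
v_true[0] = a0 / (1 - a1)
for t in range(1, n):
    v_true[t] = a0 + a1 * v_true[t - 1] + rng.normal(0.0, sigma_gamma)
psi = v_true + np.log(rng.normal(size=n) ** 2)

v_pred, p_pred, ll_true = sv_kalman(psi, a0, a1, sigma_gamma)
_, _, ll_wrong = sv_kalman(psi, a0, 0.2, sigma_gamma)

# One-step-ahead predictions with +/- 3 standard prediction errors
lower = v_pred - 3 * np.sqrt(p_pred)
upper = v_pred + 3 * np.sqrt(p_pred)
coverage = np.mean((v_true >= lower) & (v_true <= upper))
print(ll_true, ll_wrong, coverage)
```

Maximizing the returned log-likelihood over $(a_0, a_1, \sigma_\gamma)$ with a numerical optimizer would give quasi-ML estimates; the mixture model adds the parameters $\mu_1$, $u_0$, and $u_1$ and requires a filter that also tracks the mixture component.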
The parameter estimates ($a_0$, $a_1$, $\sigma_\gamma$, $\mu_1$, $u_0$, and $u_1$) and the sample paths of data volatility after filtering are presented in Tables 4-7 and Figs. 6-9. The standard errors are low, which suggests that the estimates are close to the true parameters. The parameter $\sigma_\gamma^2$ is the variance of the log-volatility process, which measures the uncertainty of future data volatility. If the value of $\sigma_\gamma^2$ is zero, it is not possible to identify the SV model. The parameter $a_1$ is considered a measure of the persistence of shocks to the volatility. Tables 4-7 show that $a_1$ is less than 1 for the data volatility of all four countries. So we conclude that the latent volatility process is stationary, leading to the stationarity of $z_t$ for the COVID-19 cases, which confirms the results of Section "Stationary and long memory approaches".
The parameters $a_1$ and $\sigma_\gamma$ represent the dynamics of the volatility evolution of COVID-19 cases. The tables also show that $a_1$ is close to 1 and that $\sigma_\gamma$ is different from 0 for all four countries, which suggests that the volatility evolves unevenly over time. We conclude that the COVID-19 case rates may be heteroscedastic by nature, meaning that the conditional volatility may be non-constant over time. The summary statistics in these tables are therefore useful for controlling the risk or mitigating the effect of COVID-19 cases.

Conclusion
This paper discusses the daily rates of COVID-19 cases from four different countries, namely the United States, Spain, Italy, and China. The data show a stochastic nature over time, so we estimated the stochastic volatility of the daily rates. The daily rates of COVID-19 cases show persistent behavior, meaning that the movements of the time series are correlated with their past observations and reflect stationarity at some past time lags. The persistent behavior was analyzed with the ARFIMA model parameter using the Whittle likelihood (see the Long memory test subsection). It is an effective method for reducing the number of MLE operations using the stationarity property of the COVID-19 data, and it provides a good spectral approximation to the log-likelihood. In addition, the stochastic feature of stationary data helps to model the high fluctuations or high rate of COVID-19 cases with much certainty.
In this study, we used the Kalman filtering technique in conjunction with the SV model for forecasting the data volatility. The process filters out the unnecessary information from the data and provides the estimation of time-varying parameters that support non-constant conditional volatility. The results suggest that this stationary process forecasts the volatility effectively with Kalman filtering. The one-step-ahead log-volatility with ±3 standard prediction errors was shown over time (see Figs. 6-9), and the low errors of parameter estimation (see Tables 4-7) imply that the estimation is around the actual value. So the analysis is useful to detect a high rate of COVID-19 cases at a particular time. As we applied the test case to four leading countries in COVID-19 cases, it can be applied to any country with a high rate of new cases and new deaths. Although the vaccine for COVID-19 is now available, the number of cases is increasing. Therefore, detecting the high rate would allow us to raise awareness of self-protection and to take all the possible protective steps such as practicing social distancing, improving personal hygiene, covering the face with a mask, and other methods prescribed by experts.

Fig. 6. Sample path of one-step-ahead log-volatility, with ±3 standard prediction errors for USA COVID cases.
Fig. 7. Sample path of one-step-ahead log-volatility, with ±3 standard prediction errors for China COVID cases.
Fig. 8. Sample path of one-step-ahead log-volatility, with ±3 standard prediction errors for Spain COVID cases.
Fig. 9. Sample path of one-step-ahead log-volatility, with ±3 standard prediction errors for Italy COVID cases.