Baseline accuracy of forecasting COVID-19 cases in Moscow region on a year in retrospect using basic statistical and machine learning methods

The large amount of data accumulated so far on the dynamics of the COVID-19 outbreak makes it possible to assess the accuracy of forecasting methods in retrospect. This work compares a set of basic time series analysis methods for forecasting the number of confirmed cases 14 days ahead: machine learning methods, exponential smoothing, autoregressive methods, along with variants of SIR and SEIR. On the year-long data for Moscow, the best basic model is shown to be SEIR in which the basic reproduction number R0 is predicted using a regression model, achieving a mean error of 16% by the MAPE metric. The resulting accuracy can be considered a baseline for a more complex prospective model based on the presented approach.


Introduction
Currently, despite many publications devoted to describing the spread of the COVID-19 epidemic, models of this process that would be efficient and useful in practice are still lacking [1]. The need for such models in planning policies to contain the outbreak is undeniable. The task of assessing the accuracy of different forecasting methods in retrospect is therefore highly relevant.
In the literature, forecasting the dynamics of the epidemic is generally considered as the task of forecasting the numbers of confirmed cases [2; 3; 4], recoveries [2; 4] or deaths [2; 4; 5]. The forecast is usually based on the input data for a few previous days, and predicts the respective time series for some days ahead. The input data vectors for a day may contain additional features indirectly characterising the evolving situation: mitigation measures [6; 7], statistics of search queries containing relevant keywords [8], data inferred from social network texts [9], and so on. Also, different works consider different forecasting horizons: 1 day ahead [3; 10], 7 days [11], 10 days [4; 5; 12] or more; note the competitions "COVID-19 Data Challenge" by SberBank and the Zindi competition.
A vast variety of forecasting methods is used in different works:
• the tomorrow-like-today approach, also referred to as the naive approach [13].
However, the works devoted to comparing different approaches [4; 11; 13; 20] show that the more sophisticated methods sometimes give no significant increase in accuracy over the simplistic ones: for instance, complex machine learning methods do not outperform single ones when forecasting for 10 days [4]. Therefore, a task of practical importance, considered in this paper, is to assess the accuracy levels achieved by basic forecasting methods on the available retrospective data. This will prepare the basis for further research and improvement of models for forecasting the dynamics of the epidemic.

Data
We use data from the Yandex DataLens service for a period of one year (from March 2020 to March 2021) for the Moscow region, which contains daily numbers of confirmed cases, recoveries and deaths.
We normalize the data per 100,000 population and preprocess it by discarding the points before the day of the first detected cases and replacing every value less than or equal to 0 with a value interpolated from its neighboring non-zero values.
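The preprocessing steps described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `preprocess`, the use of linear interpolation, and carrying the last valid value forward over a trailing gap are all assumptions, since the text does not specify the interpolation scheme.

```python
def preprocess(daily_counts, population, per=100_000):
    """Normalize a daily series per `per` inhabitants and repair values <= 0."""
    # Discard the points before the first day with detected cases.
    first = next((i for i, v in enumerate(daily_counts) if v > 0), None)
    if first is None:
        return []
    series = [v * per / population for v in daily_counts[first:]]

    valid = [i for i, v in enumerate(series) if v > 0]
    repaired = list(series)
    for i, v in enumerate(series):
        if v > 0:
            continue
        # series[0] is always valid, so a left neighbor always exists.
        left = max(j for j in valid if j < i)
        right = min((j for j in valid if j > i), default=None)
        if right is None:
            # Trailing gap: carry the last valid value forward (assumption).
            repaired[i] = series[left]
        else:
            # Linear interpolation between the nearest valid neighbors (assumption).
            weight = (i - left) / (right - left)
            repaired[i] = series[left] + weight * (series[right] - series[left])
    return repaired
```

For example, a hypothetical series `[0, 0, 10, 0, 30, -5, 50]` with a population of 100,000 loses its two leading zeros, and the remaining non-positive entries are replaced by the midpoints of their neighbors.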
In this work, we focus on predicting the number of confirmed cases, considering it to be the most real-time indicator of the dynamics of the outbreak. The target value for forecasting is the total increase in the number of confirmed cases over a certain period of time (14 days), because we believe it to be less dependent on stochastic factors unrelated to the epidemic spread (delays in testing, in reporting, etc.).

Dummy
As the simplest basic time series prediction model we use the tomorrow-like-today approach, in which the future value of the time series is taken to be equal to its current value. Despite its simplicity, this approach often proves hard to outperform, and is therefore used here for comparison purposes.
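A minimal sketch of the tomorrow-like-today forecast is given below. The function name `naive_forecast` and the choice to return one value per day of the horizon are illustrative assumptions.

```python
def naive_forecast(history, horizon=14):
    """Repeat the last observed value for every day of the forecast horizon."""
    if not history:
        raise ValueError("history must contain at least one observation")
    return [history[-1]] * horizon
```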

Statistical models
Out of this class of models, we use the Holt-Winters exponential smoothing model from the statsmodels library ¶ with the "trend" parameter set to "additive" and other parameters kept at their defaults (denoted further as Smoothing), and the ARIMA autoregressive model with automatic selection of optimal model parameters from the pmdarima library + (denoted as AutoARIMA).
The input data for statistical models is the entire time series (preprocessed as described in Section 2), from its very beginning and up to the day for which prediction is being made. The output is the values of the preprocessed time series for the 14 days following the current day.
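To illustrate what exponential smoothing with an additive trend does, the sketch below hand-rolls Holt's linear-trend method in plain Python. It is a stand-in for illustration only and is not the statsmodels model used in this work: the function name `holt_forecast` and the fixed coefficients `alpha` and `beta` are assumptions, not values a library fit would select.

```python
def holt_forecast(series, horizon=14, alpha=0.5, beta=0.5):
    """Holt's linear-trend smoothing with fixed, illustrative coefficients."""
    if len(series) < 2:
        raise ValueError("need at least two observations")
    # Initialize the level with the first value and the trend with the first difference.
    level = series[0]
    trend = series[1] - series[0]
    for value in series[1:]:
        previous_level = level
        # Blend the new observation with the one-step-ahead prediction.
        level = alpha * value + (1 - alpha) * (previous_level + trend)
        # Blend the latest change in level with the previous trend estimate.
        trend = beta * (level - previous_level) + (1 - beta) * trend
    # Extrapolate the last level along the last trend for each day ahead.
    return [level + h * trend for h in range(1, horizon + 1)]
```

The whole series up to the current day is passed in, and the output covers the days following it, matching the input and output convention described above.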

Machine learning methods
We use the following popular machine learning methods: Support Vector Regression with a linear kernel and max_iter=5000 (further referred to as LinearSVR); least-squares linear regression with the parameter normalize=True (LR); an ensemble of decision tree models with n_estimators=100 (RandomForest); gradient boosting with n_estimators=100 (GBR).
All of the above models are taken from the sklearn library [23], with all hyperparameters kept default except those specified above. The input data for the models is a vector consisting of the values of the preprocessed series for the last 14 days before the current day, while the output is the value of the series for the day that is 14 days after the current day.
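The input and output layout described above can be sketched as a sliding-window dataset builder. The function name `make_windows` is hypothetical, and the sketch assumes that "the last 14 days before the current day" excludes the current day itself.

```python
def make_windows(series, n_lags=14, horizon=14):
    """Build (input window, target) pairs from a preprocessed series."""
    inputs, targets = [], []
    # t indexes the current day; the target lies `horizon` days after it.
    for t in range(n_lags, len(series) - horizon):
        inputs.append(series[t - n_lags:t])   # the n_lags days before day t
        targets.append(series[t + horizon])   # the value horizon days after day t
    return inputs, targets
```

Each row of `inputs` could then be passed to a regressor as a feature vector, with the matching entry of `targets` as the value to predict.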

Population models
SIR with parameters adjusted independently for each day
In the SIR model [24; 25] considered here (further referred to as SIR), a 14-day forecast is performed each day, with the model parameters fitted to the past 14 days. The parameters to fit are the population size N potentially susceptible to infection, the susceptible-to-infected transition rate β and the infected-to-removed (recovered or deceased) transition rate γ. The compartment sizes of the susceptible S, the infected I and the removed R for the day before the period under forecast are set on the basis of the known values of the Confirmed, Recovered and Deceased series: R = Recovered + Deceased, I = Confirmed − R, S = N − I − R. The parameters are fitted using the least-squares algorithm minimizing the mean square error for the I and R series over the 14 past days.

¶ https://www.statsmodels.org/stable/
+ https://alkaline-ml.com/pmdarima/
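The compartment initialization for SIR, together with a forward simulation, can be sketched as follows. The function names are hypothetical, and the forward Euler scheme with a one-day step is an assumed discretization that the text does not specify. Parameter fitting is omitted here.

```python
def initial_compartments(confirmed, recovered, deceased, n):
    """Set S, I, R from the cumulative Confirmed, Recovered and Deceased values."""
    r = recovered + deceased
    i = confirmed - r
    s = n - i - r
    return s, i, r


def simulate_sir(s, i, r, beta, gamma, n, days=14):
    """Advance the SIR compartments day by day (forward Euler, step of one day)."""
    trajectory = []
    for _ in range(days):
        new_infections = beta * s * i / n   # susceptible -> infected
        new_removals = gamma * i            # infected -> removed
        s -= new_infections
        i += new_infections - new_removals
        r += new_removals
        trajectory.append((s, i, r))
    return trajectory
```

In a fitting loop, a least-squares routine would repeatedly call such a simulation over the past 14 days and compare the simulated I and R with the observed series.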

SIR and SEIR with parameters changing in expert-preset dates
The model parameters (β and γ for SIR; β, γ, and δ for SEIR) are constant in the intervals delimited by preset dates:
$$\beta(t) = \beta_i \quad \text{for } \mathrm{date}_i \le t < \mathrm{date}_{i+1}.$$
The dynamics of γ(t) and δ(t) are defined analogously. The dates date_i correspond to significant events that affect the evolving outbreak. Dates for the Moscow region are obtained from the Wikipedia webpage:
(i) 31-03-2020 - the start of lockdown (stay-at-home order);
(ii) 12-04-2020 - closure of almost all enterprises and organizations, introduction of mandatory permits for leaving home;
(iii) 12-05-2020 - imposition of mandatory wearing of masks in public transport, resumption of the work of industrial organizations and construction facilities;
(iv) 09-06-2020 - the end of lockdown;
(v) 01-09-2020 - end of summer vacation and the beginning of the school year;
(vi) 29-09-2020 - imposition of remote learning;
(vii) 31-12-2020 - the new year holidays;
(viii) 15-01-2021 - lifting of most restrictive measures except masks.
The 14-day forecast for a day t involves fitting the parameters β_i, γ_i, and δ_i by the non-linear least-squares algorithm minimizing the mean square error of the I series over all past data up to t. Here, i includes only dates no later than t minus 7 days; if t is less than 7 days from date_i, then β_i, γ_i, and δ_i do not take new values at date_i because such an interval is considered too short for parameter fitting.
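The piecewise-constant parameters and the 7-day rule can be sketched as below. The dates are those listed in the text; the function names, and the convention that one extra value applies before the first preset date, are assumptions made for illustration.

```python
from bisect import bisect_right
from datetime import date, timedelta

# Expert-preset dates for the Moscow region, in chronological order.
EVENT_DATES = [
    date(2020, 3, 31), date(2020, 4, 12), date(2020, 5, 12), date(2020, 6, 9),
    date(2020, 9, 1), date(2020, 9, 29), date(2020, 12, 31), date(2021, 1, 15),
]


def active_breakpoints(event_dates, forecast_day, min_gap_days=7):
    """Keep only the dates that are no later than forecast_day minus min_gap_days."""
    cutoff = forecast_day - timedelta(days=min_gap_days)
    return [d for d in event_dates if d <= cutoff]


def piecewise_value(values, breakpoints, day):
    """Return the parameter value in force on `day`.

    `values` holds len(breakpoints) + 1 entries: values[0] applies before the
    first breakpoint (assumption), values[k + 1] from breakpoints[k] onward.
    """
    return values[bisect_right(breakpoints, day)]
```

For a forecast made on 15-04-2020, only the 31-03-2020 date is active, because 12-04-2020 is fewer than 7 days in the past, so β keeps the value it took on 31-03-2020.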
Further, the SIR and SEIR models with such parameter fitting are referred to as SIR expert and SEIR expert.

SEIR and SIR with regression model for R0(t)
In this model, the parameter β(t) = γ · R0(t) is obtained on the basis of the basic reproduction number R0(t), which describes how many people an average infected person infects in a day. R0(t) is calculated from the daily confirmed cases within a sliding window (Eq. 2), where Daily cases(t) is the number of daily confirmed cases for day t, t is the number of the current day, and w is the window size (we use w = 4 in accordance with the approach of Yandex). γ is found using the least-squares algorithm minimizing the mean square error of I over the 14 past days. Initial compartment states are obtained from the known daily values of confirmed cases, recoveries and deaths. N is fixed at 100 000 in accordance with the prior normalization of the series of confirmed and recovered cases.
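Since the exact expression for R0(t) is not reproduced in this text, the sketch below assumes one common windowed form: the ratio of the cases registered in the last w days to the cases registered in the w days before that. Both this form and its indexing are assumptions, as is the function name `reproduction_number`.

```python
def reproduction_number(daily_cases, t, w=4):
    """Assumed windowed estimate: cases in the last w days / cases in the w days before."""
    if t - 2 * w + 1 < 0 or t >= len(daily_cases):
        raise ValueError("day t must have 2 * w days of data up to and including it")
    recent = sum(daily_cases[t - w + 1:t + 1])            # days t-w+1 .. t
    previous = sum(daily_cases[t - 2 * w + 1:t - w + 1])  # days t-2w+1 .. t-w
    if previous == 0:
        raise ZeroDivisionError("no cases in the earlier window")
    return recent / previous
```

Under this assumed form, a value above 1 means that the most recent window saw more cases than the one before it.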
The future values of R0(t) are forecast using a regression model based on Epsilon Support Vector Regression (SVR) with the parameters C=100, kernel="rbf". During fitting, SVR receives the number of the day t as its input and outputs R0(t). SVR is fitted and γ is found for every day using the past values of R0(t) calculated by Eq. 2. The 14-day forecast is obtained by having SVR predict R0 for each of the 14 days ahead, while γ is kept fixed at the value obtained for the current day.
The SEIR model is employed in the same fashion. The initial value of the Exposed compartment is taken to be its value obtained for the previous day. The δ parameter is searched for analogously to γ. Parameter search is carried out by minimizing the mean square error of either the I series (further referred to as "SEIR SVR R(t) v1"), or of both the I and R series ("SEIR SVR R(t) v2").
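The forward step of the SEIR forecast with a time-dependent β(t) = γ · R0(t) can be sketched as follows. This is an illustration under stated assumptions: δ is taken to be the exposed-to-infected transition rate, the integration is forward Euler with a one-day step, and `r0_forecast` is a plain list standing in for the 14 values that the regression model would predict. The function name is hypothetical.

```python
def simulate_seir(s, e, i, r, r0_forecast, gamma, delta, n=100_000.0):
    """Advance S, E, I, R one day per entry of r0_forecast (forward Euler)."""
    trajectory = []
    for r0 in r0_forecast:
        beta = gamma * r0                  # beta(t) = gamma * R0(t)
        new_exposed = beta * s * i / n     # susceptible -> exposed
        new_infected = delta * e           # exposed -> infected (assumed role of delta)
        new_removed = gamma * i            # infected -> removed
        s -= new_exposed
        e += new_exposed - new_infected
        i += new_infected - new_removed
        r += new_removed
        trajectory.append((s, e, i, r))
    return trajectory
```

With γ and δ fixed at their fitted values for the current day, the list of predicted R0 values is the only input that changes along the horizon.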

Results
The forecasting accuracy is assessed by the mean absolute percentage error (MAPE) metric:
$$\mathrm{MAPE} = \frac{100\%}{n} \sum_{t=1}^{n} \left| \frac{A_t - F_t}{A_t} \right|,$$
where n is the number of days in the series, and A_t and F_t are the true and predicted values of the preprocessed time series for day t, respectively.
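Under the standard definition of MAPE, the mean over all days of |A_t − F_t| / |A_t| expressed in percent, the metric can be computed as in the sketch below. The function name `mape` is illustrative, and the code assumes that no true value is zero, which the preprocessing of non-positive values is meant to ensure.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("series must be non-empty and of equal length")
    total = sum(abs((a - f) / a) for a, f in zip(actual, predicted))
    return 100.0 * total / len(actual)
```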
In order to evaluate each model at different stages of the epidemic, we split the time series into 5 equal parts (folds). When using machine learning models, the first part (Fold 0) is used as their training set; therefore, for the sake of consistency, the results of the models are only compared for the remaining 4 folds (Fold 1 - Fold 4).
[...] worse than that of the Dummy model. Results better than those of the Dummy model are achieved by the following models: Smoothing, AutoARIMA, SEIR expert, and the SEIR and SIR models with the regression model for R0(t).
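The split into five consecutive folds can be sketched as follows. The function name `split_folds` is hypothetical, and attaching any remainder of the integer division to the last fold is an assumption, since the text only states that the parts are equal.

```python
def split_folds(series, n_folds=5):
    """Split a series into n_folds consecutive parts of (nearly) equal length."""
    size = len(series) // n_folds
    folds = [series[k * size:(k + 1) * size] for k in range(n_folds - 1)]
    # The last fold absorbs any leftover days (assumption).
    folds.append(series[(n_folds - 1) * size:])
    return folds
```

Fold 0 would then serve as the training set for the machine learning models, and Folds 1 to 4 would be used for comparing all models.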

Conclusion
The main conclusion from the results obtained is that, among the basic methods, the most effective instrument for predicting the dynamics of the coronavirus epidemic is SEIR with the regression model for R0(t). In future work, we plan to use this approach with more advanced models for predicting the time-dependent parameters of the SEIR and SIR models.