THE APPLICATION OF ARIMA MODEL TO ANALYZE COVID-19 INCIDENCE PATTERN IN SEVERAL COUNTRIES

The COVID-19 pandemic continues to spread and already shows a recurrence in many countries, despite several social distancing and vaccination measures implemented all around the world. Epidemiological data are available, and we use the Auto-Regressive Integrated Moving Average (ARIMA) model to analyze incidence pattern and to generate short-term forecasts of cumulative reported cases in Morocco, France, Italy, Spain and USA, using daily reported cumulative cases data from Worldometers, and we report 5-day and 10-day ahead forecasts of cumulative cases and check a posteriori the precision of this forecasting, by confronting it to the real data observed. In the discussion, we propose a link between the ARIMA, elevation and average temperature in several countries’ modelling approaches, for allowing the comparison between their explicative abilities.


INTRODUCTION
The novel coronavirus outbreak , as named by the World Health Organization (WHO) on 11th February 2020, began in Hubei Province, China, in December 2019 and continues to cause infections in multiple countries. The COVID-19 represents the newest zoonotic Coronavirus disease that crossed species to affect humans and spread in an unprecedented manner.
The outbreak was declared a pandemic and a Public Health Emergency on 30 January 2020, by WHO. To control this pandemic, the governments have enacted a range of social distancing strategies, such as city-wide lockdowns and isolation of suspected cases plus a vaccination policy. The numbers of cumulated cases and deaths continue to accumulate every day and despite a slowing due to strict lockdowns combined with isolation, quarantine measures and vaccination, many countries entered in second, third until fourth waves of the outbreak Seligmann et al. [38]. While the transmission potential of this novel coronavirus can reach high values, the epidemiological features as dependencies on geo-climatic or demographic variables, and the mechanisms of transmission and host susceptibility of the viral agent SARS-CoV-2 of COVID-19 outbreak are still unclear (Demongeot et al., [10]; Demongeot and Seligmann [11]; Seligmann et al., [38]). In this paper, we use the Auto-Regressive Integrated Moving Average (ARIMA) model to generate 5-day and 10-day ahead forecasts of the cumulative reported cases in the countries as Morocco, France, Italy, Spain, UK and USA.
Currently, several mathematical methods are applied in disease incidence prediction such as linear regression, artificial neural networks and grey box models. The ARIMA model is commonly used in infectious disease time series prediction, especially for series that have a cyclic or repeated pattern. The model was conceived for economics applications, but is well convenient in the medical field nowadays. The principle of the model contains filtering out the high-frequency noise in the data, detecting local trends based on linear dependence and forecasting by extrapolating trends as it has been already done for some countries in the COVID-19 case, in order to explain for example correlations observed with geo-climatic parameters as mean temperature and elevation (Deb and Majumdar [1]; Demongeot et al., [10][11][12][13]; Perez et al., [14]; Faye et al., [15]; Ilie et al., [24]; Behambar et al., [30]; Seligmann et al., [38]). Despite its high predictive performance, the model has some limitations which decrease its scope of application. The model assumes a linear relationship between the dependent and independent variables while the actual data often present nonlinear relationships. Besides, the model assumes that the mean and variance of time series are independent of time, which means stationary of order 2. Thus, more than one approach should be tested to choose the better one as in former studies for some epidemics (Kane et al., [25]; Luo et al., [28]; Rubaihayo et al., [35]; Soebiyanto et al., [40]; Wei et al., [42]; Abioye et al., [45]).

Empirical analysis 2.1.1 Data sources
The daily incidence data of COVID-19 from January to May 2020 were collected from the Euro-  4 DEMONGEOT, OSHINUBI, RACHDI, HOBBAD, ALAHIANE, IGGUI, GAUDART, OUASSOU Data from January 1 to May 10 2020 were used to build the ARIMA model, data from May 11 to May 16 2020 to evaluate the forecasting precision by the model, and before July for showing the geoclimatic dependencies of the SARS-CoV-2 incidence (Figure 1). Simulations are using Gnu-R ARIMA software.

Study and visual representation of data
When predicting the evolution of a time series process, it is useful to find a model generating past values in order to extrapolate the simulation results of this model to the future. For performing that task, the model must sufficiently describe the past, and be robust in its predictions. The time series we aim to predict is the number of cumulated confirmed cases Yi,k of the COVID-19 outbreak at day i in country or region k. The time unit is the day. A first approach to time series data confirms that the logarithmic transformation allows us to describe the phenomenon in a functionally simple manner and with the added advantage of expressing the evolution of the number of cases in logarithmic form.
The series log(Yi,k) appears on Figure 1 Top right at first glance to suggest a non-linear evolution close to the quadratic one. It should be pointed out that this quadratic evolution would only describe the phenomenon at the start of epidemy, and would no longer be the generating model, where p means the order of auto-regression, d the degree of trend difference, q the order of moving average, P the seasonal auto-regression lag, D the degree of seasonal difference, Q the seasonal moving average, and s the length of the cyclical pattern Wei et al., [42]. In this paper, we use only the auto-regressive integrated moving average ARIMA(p,d,q) method with time-dependent parameters. Time series stationarity, parameter estimation, model checking and prediction will be done to establish this ARIMA model as in (Demongeot et al., [10]; Luo et al., [28]; Rubaihayo et al., [35]), by using like in [10] the statistical software facility given in: www.statsmodels.org/stable/generated/statsmodels.tsa.arima_model.ARIMA.html.
The goal of the study is to identify if the coefficient estimates for the time trend change after a certain point, and if so, to examine it further to find out the nature of change and potential causes, with the following model:

ARIMA MODEL TO ANALYZE COVID-19 INCIDENCE PATTERN
where i,k = (Yi,k-Yi-1,k)/Pi,k denotes the incidence rate of Covid-19 of region k at time i in {1,…,T} (every time point i denotes a single day), Pi,k denotes the overall population of region k at time i, f(i,,k) is the trend function and k is the day when the first confirmed case is observed in the region k, i.e., we have: Yi,k = 0 for i < k and Yi,k ≥ 0 for i ≥ k. Lj,i,k is the dummy variable signifying the lockdown, j captures the effect of lockdown, for j = 1,2,3 and Ui,k is the error process. In the real data, it is observed that the number of cumulated cases grows with time and then stabilizes after a certain time. Since we consider the following structure for log(i,k) in the model (1), by assuming an ARIMA(p,d,q) structure for the error process, while f is supposed to be a polynomial quadratic trend in time (i-). In particular, the coefficients for the linear and quadratic terms in f are considered to be different for i-k<k and i-k≥k.
The parameter k is estimated from the data and it tells us when the trend of the growth changes its pattern in region k. Combining everything, we will use the following equation for the model: Here, p and q denote respectively the auto-regressive and moving-average orders of the ARIMA error process and Ui,k = i,k denotes a standard normal white noise process, and we have: -for n=1,2, ßn,i,k = ßn,1,k, and for i<k, ßn,i,k = ßn,2,k for i≥k (3) -for j=1,2,3, Lj,i,k=0 for i<j,k, and Lj,i,k=1, for i≥j,k, (4) where  is a binary indicator of pre (j,k=0) and post (j,k=1) infection waves j=1,2,3. We estimate k, j,k and j,k (1<j<3), p and q from data using Akaike information criterion (AIC).
The best ARIMA model for each country was studied in the paper as parameters p=6, d=1 and q=0 as in Demongeot et al., [10].

The basic reproduction number R0
There are several methods for estimating the basic reproduction number R0, equal to the mean number of new infections caused by an infected person at day j among the susceptible population.
The time-dependent method Obadia et al., [31] has been used for estimating R0 at time t, denoted R(t), which allows to the detection of the end of the first wave, e.g., the 18 th of May 2020 in Morocco ( Figure 2).  , it is possible to study their correlation with geoclimatic variables like the mean temperature or the mean elevation, and demographic variables like the median age in many countries (Demongeot et al., [10]; Seligmann et al., [38]). For that purpose, the ARIMA technique allows for i) extracting the trend using the moving average method with a long window  The first wave of the COVID-19 outbreak occurred at the same time on the beginning of February 2020 in France and on the beginning of April 2020 for many other countries like Qatar and India ( Figure 6). The ARIMA software gives the auto-correlation function of the original new cases (Figure 6), and the estimation of its initial slope used for studying correlations (Table 1 and Figure   7), showing for elevation less than 1000 m a decrease of the auto-decorrelation length for new cases (estimated by the initial slope of the auto-correlation function), corresponding to a diminution of the contagiousness period (Demongeot et al., [11]; Seligmann et al., [39]).   Opposite of the initial auto-correlation slope

Mean elevation (in m)
Opposite of the initial auto-correlation slope transmitter (the patient in a period of contagiousness) to the recipient (the susceptible individual), in an atmosphere at high mean temperature and altitude, which destroys by heat and radiation the essential components of the virus (capsid and RNA). Regarding the altitude, the cubic adjustment shows a paradoxical effect at low altitude due to countries with high mean temperature and low altitude.

Forecasts for Morocco and France
It is possible to use the ARIMA model to do one-week forecasts of new COVID-19 cases. We have done it at the end of the first wave in Morocco and France (Figure 8), showing roughly the start of the decrease, but over-estimating the intensity of this trend.

03-02
In both cases, the forecast for Morocco and France shows an overestimation of the decrease at the end of the first wave possibly due to the influence of the periodic drop in counts of new daily cases at the end of the week, visible in many countries Demongeot et al., [12].

Second wave in Armenia
We can do the same study for the second wave of the COVID-19 outbreak as for the first one. On Figure 9, the example of Armenia shows a linear moving average (in red) canceling the periodic effect of partial settlements of weekends already observed for Morocco, to be subtracted from the original new cases (in blue) to obtain a stationary residue (in black). On Figure 10, the auto-correlation function of new cases is used for estimating the length of the contagiousness period, here about 8 days (at intersection with the value of 0.25, limit of the significance with p-value = 0.05 for a time series of 45 days).  The study for different countries of the value of its initial slope ( Table 2) shows an increase until 70°F, then a decrease with mean temperature more than 70°F, which corresponds to an adaptation of the viral pathogenicity caused by numerous mutations observed at high altitude due for example to the mutagenic power of UVs ( Figure 11). Many mutations with high contagiousness have in fact been observed in countries with a high mean temperature (Brazil, Colombia, South Africa).

Median age and incidence of COVID-19
The dependence of the incidence of COVID-19 on median age in the observed countries can change between two consecutive waves as shown in Figures 13 and 14, probably due to an adaptation of the virus to the demographic profile of the population in which it propagates, with a positive correlation of the incidence vs the median age (R=0.41) during the first wave ( Figure 13) and a negative correlation (R = -0.41) during the second wave ( Figure 14). One possible explanation lies in i) the exhaustion of the targets most sensitive to viruses (elderly people with chronic comorbidities) and ii) the application of mitigation measures (prevention, protection, eviction) between the two waves, which mainly concerned classes at risk, including the elderly.

CONCLUSION
The third wave of the COVID-19 outbreak started already in many countries and forecasting its start has not been possible (like for the start of the second wave) from the ARIMA approach, but the data about new cases are already showing similarities with the first wave, in particular for those concerning the correlation with temperature and elevation in the countries in which it occurred Seligmann et al., [38][39].
The same work has to be made for the third and fourth waves as for the two first ones, notably in the directions discussed in Section 4, in what concerns the influence of four factors on the outbreak dynamics: i) the age, because an adaptation has been observed during the second wave which seems to concern younger patients leading to discuss the value of R0 in a heterogeneous population, ii) the duration of the contagiousness, which seems to be longer than in previous waves, iii) the entropy of the distribution of the daily reproduction rates (as already pointed out in Demongeot et al., [12]; Oshinubi et al., [32][33]; Rhodes and Demetrius [34]) and which may correspond to a change in the immune defense sequence, with suppression of the apparent improvement due to innate immunity (causing a U-shaped profile of the distribution, hence a decrease of its entropy) and iv) the influence of geoclimatic but also of socio-economic factors. Eventually, the frequent over-deviation of the new daily cases observed in third and fourth waves in many countries (pos-Median Slope of exponential regression of COVID-19 incidence ARIMA MODEL TO ANALYZE COVID-19 INCIDENCE PATTERN sibly due to the entanglement of several successive or simultaneous health measures like distancing, lockdown and vaccination) would lead to replace in the future ARIMA models by generalized additive models (GAM) with a negative binomial regression. Indeed, descriptive ARIMA models and are known not to replace explanatory models based on plausible contagion mechanisms, such as ODE models and finding the best model to represent the COVID-19 data remains an open challenge despite notable advances in this direction (Demongeot et al., [13]; Griette et al., [20]; Wei et al., [42]). The spatial diffusion of SARS Cov-2 and its variants is also an important subject, not addressed here. Only Figure 1 Bottom shows a spatial difference between the French regions. To address this problem, we can, on a descriptive level, make use of the statistical techniques of spatial interpolation by kriging already used for malaria in (Gaudart et al., [16][17]) and of detection of spatial heterogeneities in public health data Guttmann et al., [21][22][23]. On the explanatory level, the spatiotemporal modeling using the partial differential equations (PDE) (Gaudart et al., [17-18]) would make it possible to estimate the speed of propagation, as well as its direction (NorthEast / SouthWest for the first wave in France). The analysis of the spatio-temporal heterogeneities mentioned above would also make it possible to propose a model containing delays between the causes (like date of contagion) and the effects (like limits of the period of the subsequent contagiousness), and also to work on the structure of the noise, cause of intrinsic fluctuations in epidemic data (in particular linked to variations in daily reproduction rates Demongeot et al., [12]), which is only addressed here through the residue of the ARIMA model. All of these important points will be covered in future articles.