Skip to main content
Advertisement
  • Loading metrics

Time warping between main epidemic time series in epidemiological surveillance

  • Jean-David Morel,

    Roles Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Validation, Visualization, Writing – original draft, Writing – review & editing

    Affiliation Laboratoire de Physiologie Intégrative et Systémique, Ecole Polytechnique Fédérale de Lausanne, Lausanne, Switzerland

  • Jean-Michel Morel,

    Roles Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Validation, Visualization, Writing – original draft, Writing – review & editing

    Affiliation City University of Hong Kong, Department of Mathematics, Tat Chee Ave, Kowloon Tong, Hong Kong

  • Luis Alvarez

    Roles Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Software, Validation, Visualization, Writing – original draft, Writing – review & editing

    lalvarez@ulpgc.es

    Affiliation Departamento de Informática y Sistemas, Campus de Tafira, Universidad de Las Palmas de Gran Canaria, Spain

Abstract

The most common reported epidemic time series in epidemiological surveillance are the daily or weekly incidence of new cases, the hospital admission count, the ICU admission count, and the death toll, which played such a prominent role in the struggle to monitor the Covid-19 pandemic. We show that pairs of such curves are related to each other by a generalized renewal equation depending on a smooth time varying delay and a smooth ratio generalizing the reproduction number. Such a functional relation is also explored for pairs of simultaneous curves measuring the same indicator in two neighboring countries. Given two such simultaneous time series, we develop, based on a signal processing approach, an efficient numerical method for computing their time varying delay and ratio curves, and we verify that its results are consistent. Indeed, they experimentally verify symmetry and transitivity requirements and we also show, using realistic simulated data, that the method faithfully recovers time delays and ratios. We discuss several real examples where the method seems to display interpretable time delays and ratios. The proposed method generalizes and unifies many recent related attempts to take advantage of the plurality of these health data across regions or countries and time, providing a better understanding of the relationship between them. An implementation of the method is publicly available at the EpiInvert CRAN package.

Author summary

To monitor an epidemic, it is crucial to understand the relationship between the incidence of new cases, the hospital admission count, the ICU admission count, and the death toll time series. The relationship between any pair of such indicators can be formulated in terms of temporal delays and ratios which evolve across time. Given two such time series, we develop, based on a signal processing approach, an efficient numerical method for computing their time varying delay and ratio curves. Using realistic simulated data, we show that the method faithfully recovers time delays and ratios. In addition, we discuss several applications to real epidemic data where the method seems to output interpretable time delays and ratios. The obtained relationship between these epidemic time series is a key issue in epidemiological surveillance. The proposed technique provides a new tool to visualize, compare, and understand the evolution of key epidemiological time series. An implementation of the method is publicly available at the EpiInvert CRAN package.

Introduction

The COVID-19 epidemic has provided us with information, for many countries, on the evolution of a number of key instantaneous epidemic measurements such as the number of cases, deaths, hospitalizations, or intensive care units (ICUs) occupation. In this work, we explore the functional relationship between these time series. Let f(t) and g(t) be a pair of time series. We will assume that there exists a smooth time varying delay, s(t), between the values of both curves (such as the average temporal delay between cases and deaths). Moreover, we assume that once the delay between both curves is compensated, the ratio between their values, r(t), follows a smooth function that represents the causal relationship between both time series. Therefore we look for functions s(t) and r(t) such that (1) Of course, this problem is ill-posed because there are an infinite number of combinations of functions r(t) and s(t) that satisfy the above equation. Yet, one can transform the problem into a variational method where an energy is minimized. The first term of the energy is the quadratic difference between both members of (1) and the second term is a measure of the smoothness of functions r(t) and s(t), namely the quadratic norm of their derivatives. Minimizing jointly such an energy leads to find the smoothest time delays s(t) and smoothest ratio r(t) satisfying (1), by the classic Tikhonov-Arsenin solution for solving ill-posed problems [2]. The problem is then posed in terms of “we want the Eq (1) to verify as well as possible but for r(t) and s(t) as smooth as possible”. In Fig 1 we illustrate an example of this approach when comparing, for the United Kingdom, between February 2020 and November 2023, the time series of cases and deaths. Key dates are indicated : the vertical bars indicate beginning or end of lockdown or social distancing measures and each shaded colored rectangle corresponds to a different wave, associated with the emergence of a different variant. We observe that the estimated r(t) and s(t) are quite smooth. A good agreement is obtained between r(t)f(t) and g(t + s(t)), except at the beginning of the epidemic where r(t)f(t) is much smaller than g(t + s(t)), and s(t) is negative. This fact is explainable by the very significant underestimation of the number of cases at the beginning of the epidemic, where the number of deaths was growing much faster than the capacity to test the population.

thumbnail
Fig 1.

Illustration for the United Kingdom of the proposed warping method to compute the time warping s(t) and ratio r(t) relating the daily case incidence f(t) and the daily death count g(t) through the warping equation r(t)f(t) = g(t + s(t)). (A): f(t) in cyan and g(t) in red. (B): comparison of r(t)f(t) with g(t + s(t)) showing how well the warping equation is satisfied. (C): the estimated r(t) in cyan and s(t) in red. Both are constrained by the method to be smooth to avoid overfitting. The color shaded rectangles represent the time intervals of the first five SARS-CoV-2 pandemic waves (as described in [1]).

https://doi.org/10.1371/journal.pcbi.1011757.g001

Our proposed method aims at facilitating the comparison of epidemiological times series by computing and visualizing the latent variables s(t) and r(t) that link two time series. According to [3]:

Understanding and monitoring the epidemiological time delay dynamics of SARS-CoV-2 infection provides insights that are key to discerning changes in the phenotype of the virus, the demographics impacted, the efficacy of treatment, and the ability of the health service to manage large volumes of patients.

The authors of this paper studied how the pandemic has evolved in the United Kingdom through the temporal changes to the epidemiological time delay distributions between clinical outcomes. In [4] ten time periods were used to track changes in delays and the probabilities of outcomes using parametric statistical models in France. In each period, the variability of the delay was modeled using a parametric distribution. In [5], the difference between in-hospital mortality across four time periods was studied. [6] studied the evolution, in the USA, of in-hospital mortality by month of admission.

Comparison between epidemic waves is also a usual way to study the time evolution of epidemiological indicators. For instance, in [7], authors show that the in-hospital mortality was lower during the second wave of COVID-19 than in the first wave. Mortality predictors and differences in mortality between COVID-19 waves were identified using logistic regression in [8].

Despite these findings in specific examples, to our knowledge, no method exists to this date to continuously model and estimate the relationship between epidemiological curves. For example, the mortality rate may vary gradually depending on the improvement of medical treatments, the gradual introduction of vaccination or the gradual emergence of new variants. Until now, the way to study this variation has been to separate the epidemic time interval in different periods and assume a constant value in each period. This strategy requires selecting time periods, which is not an easy task because there exist many ways to do it, and calculating a constant value for each period. Knowledge of how the value of the variable varies within each period is lost in the process. In this paper, we propose a different approach where instead of separating the epidemic time interval in different periods, we model, using a generalized renewal equation, the evolution of the delay s(t) and the ratio r(t) between indicators as smooth functions defined in the whole epidemic time interval.

Our goal is to provide epidemiologists with tools providing a joint visualization, enabling the interpretation of two epidemiological curves related by causal relationship, such as incidence and death, or hospitalization and admission in ICU. While a direct visualization is possible through a mere superposition of two curves sharing the same time interval, we add and display in form of curves the two hidden variables connecting them : their time delay and their ratio. As we develop our Model section, Eq (1) is an extension of the renewal equation, which is normally used for modeling a single incidence curve. This equation is extended here to relate two curves concerning the same population. In that way, we also extend and measure a generalized reproduction number r(t) which estimates (for example) an instantaneous ratio between incidence and mortality, and a generalized time interval s(t) that estimates for example the instantaneous time delay between detection and death.

Model

Dynamic time warping: The technique

Dynamic time warping (DTW) is a method to match two signals f(t) and g(t) by finding a nondecreasing shift function w(t) such that f(t) is as close as possible to g(w(t)). Variants of dynamic time warping like the one proposed in [9] match the slopes of the signals rather than the signals themselves. If the signals are sampled, thus encoded as two vectors, Bellman’s dynamic programming principle [10] applies and yields an optimal solution for any time discrete version of the cost E(w) ≕ ∫ c(f(t) − g(w(t))2dt, where c is a positive cost function. As developed in [11], a discrete path optimization is performed in the 2D space of time correspondences, where only diagonal (t, s) → (t + 1, s + 1) horizontal (t, s) → (t + 1, s), and vertical (t, s) → (t, s + 1) moves are authorized. This discrete setting is efficiently solved by dynamic programming, and it can be extended to an optimal warping with time variable penalty [12]. We refer to [13] for a review of this algorithm. In this paper, we shall prefer a continuous formulation to the discrete one, in the line of [14], where a continuous time warping functional is defined, matching multiple copies of the same audio by a signal alignment that is robust to the presence of sparse outliers with arbitrary magnitude. The continuous formulations allow for a continuous iterative optimization by gradient descent and remove limitations on the form of the functional or discretization of the warp. As proposed in [15], a continuous penalty on the derivative of the warp can be used to ensure that this derivative does not vanish and does not become too high.

In the same line, the warping of signals has a 2D classic equivalent in the optical flow problem introduced in a seminal paper [16]. Applied to two successive images of a video, the method finds a vector field (2D warp) matching both images and enforces the warping field to be smooth by combining a regularity term with the matching fidelity term. In its initial formulation, the differential method worked only if the motion does not exceed one pixel. A more general computation of the optical flow is developed in [17] as a hierarchical multi-scale implementation of the optical flow using the Laplacian pyramid to represent both compared images at different scales. These ideas are developed in [18] which implements a multi-scale version of Horn-Schunk optical flow. The same hierarchical differential method is obviously applicable to pairs of similar signals.

Epidemiological applications of time warping

In [19] the authors used dynamic warping to forecast the future spread of COVID-19 by exploiting the identified lead-lag effects between different countries. The past epidemic transmission delay between nations is determined through dynamic time warping between cumulative case curves. Then the cumulative case curve of the leading country is used to predict the Covid-19 spread in the following nation.

In [20] the authors similarly present an algorithm for the clustering of confirmed COVID-19 cases at the county level in the United States. Dynamic time warping is used as a k-means clustering distance metric. The obtained clusters enable retrospective interpretation of the pandemic and informative inputs for case prediction models. In [21], the same method is developed for the Polish voivodeships. Similarities in time series are found by dynamic time warping between COVID-19 infections and deaths in pairs of vovoideships. The authors argue that a time warping method is important because the coronavirus epidemic did not start in all voivodeships at the same time. The method jointly analyzes the number of infected and deceased people in each province. Dynamic warping is applied to match pairs of (smoothed) incidence curves, pairs of cumulative case curves, and pairs of cumulative death curves.

In [22] the authors assessed the similarity between the time series of energy commodity prices and daily COVID-19 cases. They argue that the pandemic affects all aspects of the global economy and propose to assess the connections between the number of COVID-19 cases and the energy commodities sector. These connections are achieved by using Dynamic Time Warping to compute DTW distances between all time series—and use them to group the energy commodities according to their price change. In that way, the authors demonstrate that some commodities such as natural gas are strongly associated with the development of the pandemic. In [23] dynamic time warping was used to compare temporal patterns in patient disease trajectories. This enables the assessment of comorbidities in population-based studies, as it permits to identify more complex disease patterns than those derived from pairwise disease associations. The disease-history vectors of patients of a regional Spanish health dataset were represented as time sequences of ordered disease diagnoses. Statistically significant pairwise disease associations were identified and their temporal directionality was assessed. Subsequently, an unsupervised clustering algorithm, based on DTW, was applied on the common disease trajectories in order to group them according to their shared temporal patterns.

Limitations of time warping for epidemiological signals.

We have listed in the preceding paragraphs the main computational contributions to time warping methods and several use cases in epidemiology. The main limitation of time warping comes from its equation. Given two signals f(t) and g(t), a (generally smooth) time shift s(t) is sought such that f(t) ≃ g(t + s(t)). This method is applicable to different but roughly proportional curves of the same phenomenon, for example to the incidence in neighboring regions. But it fails to fully model the relationship between curves that are linked to the same pandemic but are of different nature and scale, such as the pair formed by the incidence and death curves. While the examination of these curves suggests interpretable time delays, they also have a time varying relative rate, which may for example be linked to infectivity changes of a variant, to changes in the treatment, to vaccination, to evolution of the hospital treatment procedure, etc. Thus time warping, which shifts but preserves the values, is not applicable to such pairs. This is why we extend here the warping equation to introduce a varying rate or ratio r(t). The equation we aim at solving for each pair of curves is (1), namely

We shall relate this equation to the classic epidemiological renewal equation in the next paragraph. Yet, in the renewal equation the time interval is assumed known. Here s(t) could be interpreted as the mean of the time interval, but is unknown and must also be evaluated by the method. In the next paragraph we explain why (1) can be seen as a generalization of the renewal equation, and r(t) an extension of the reproduction number Rt.

The renewal equation

The reproduction number r(t) (in the literature often written Rt or Rt) is a key epidemiological parameter evaluating the transmission rate of a disease over time. It is defined as the average number of new infections caused by a single infected individual at time t in a partially susceptible population [24]. The reproduction number r(t) can be computed from the daily observation of the incidence curve i(t), but requires empirical knowledge of the probability distribution k(s) of the delay between two infections [25, 26].

There are two different models for the incidence curve and its corresponding infection delay k(s). In a theoretical model, i(t) would represent the real daily number of new infections, and k(s) is sometimes called generation time [27, 28] and represents the probability distribution of the time between infection of a primary case and infections in secondary cases, this time delay distribution being assumed stationary (independent of t). In practice, neither parameter is easily observable because the infected are rarely detected before the appearance of symptoms and tests will be negative until the virus has multiplied over several days. What is routinely recorded by health organizations is the number of new detected, incident cases. When dealing with this real incidence curve, k(s) is called serial interval [27, 28]. The serial interval is defined as the delay between the onset of symptoms in a primary case and the onset of symptoms in secondary cases [28].

A fundamental equation links r(t) to i(t) and k(s). It is the renewal equation, first formulated for birth-death processes in a 1907 note of Alfred Lotka [29]. We adopt the Nishiura et al. formulation [30, 31], (2) where f0 and f are the maximal and minimal observed times between a primary and a secondary case. In the renewal equation, the incidence curve i(t) is causally linked to itself through an internal delay represented by the probability distribution k(s) and a reproducing factor r(t). We now consider generalizations of this renewal that link time series that can still be related by causality, time delay, and a reproducing factor.

Our first example is the reproducing link from cases to deaths. Consider the time delay probability distribution kt(s) that a person reported dead at time t was reported positive at time ts. Then we call r(t) the probability that a person diagnosed at time t will eventually die from the disease. With these assumptions, we obtain a generation equation linking cases to deaths, namely (3) where d(t) denotes the daily death toll and i(t) the incidence curve. A strong simplification of the model may replace kt by the Dirac mass concentrated at its center of mass σ(t) ≕ ∑s skt(s). Then (3) boils down to (4) which amounts to assuming that all persons dead at time t have been detected at time tσ(t). Let us now relate Eq (4) to the simpler Eq (1). This can be done by setting . Then (4) rewrites where is the reverse change of variable. Setting we get which is precisely (1). It remains to ask why the change of variable is licit. This is ensured if |σ′(t)| < 1, so that is strictly increasing. We shall ensure such a condition in our formalization of the problem.

Mathematical formalization and resolution of (1)

The energy.

Given two causally related time series f(t) and g(t), we introduced the problem of finding a smooth time warping s(t) and a smooth reproduction number r(t) such that (5)

This leads us to solve the following variational formulation of the problem where r(t) and s(t) are estimated by minimizing the error functional (6)

The first term of the energy (6) penalizes a poor matching between r(t)f(t) and g(t + s(t)). The second and third terms of the energy impose that r(t) and s(t) be smooth, and the strength of this smoothing is controlled by the weights wr and ws. The larger the values of these weights the more regular r(t) and s(t) will be. The fourth term penalizes warps s(t) that would fall outside an expected interval [smin, smax] (the function (x)+ is defined as (x)+ = x if x > 0 and (x)+ = 0 otherwise).

Normalization and discretization.

To compute the minima of the energy (6), we first normalize the time series f(t) and g(t) by dividing them by their medians. In this way, the minimization is independent of the time series magnitudes. Once the energy is minimized, the obtained value of r(t) is scaled to compensate for this normalization. Next, we express the energy in a discrete setting as (7) where rk = r(tk), sk = s(tk), fk = f(tk), and .

Alternate minimization.

We use an alternate iterative gradient descent type method to compute and . Given we compute by expressing the derivatives of the energy with respect to and equating them to 0. This yields a system of linear equations in the cases k = 0 and k = K are managed using the boundary conditions. If the user has selected a value or for r0 or rK, then we add to the system the associated equation or Otherwise we add the equations or

Given , we compute using a first order approximation of . We express (then the unknown becomes ) and we replace in the energy by (where ). Differentiating the energy with respect to we obtain for k = 1, …, K − 1, a linear equation in where the boundary conditions in k = 0 and k = K are managed in the same way as for We perform iterations of the alternate scheme to compute until convergence.

Initialization.

Notice that we need to initialize to start the iteration. We use two approaches to get an initialization of The first one is to put in correspondence the local maxima of f(t) and g(t). That provides the value of as the shift between the local maxima. For the rest of the points, we use linear interpolation to compute between the shift obtained for the local maxima. The second approach to initialize is to divide [0, K] in equally spaced subintervals [km, km+1] and for each subinterval, we compute as the value s ∈ [smin, smax] which maximizes the covariance between and . For the rest of points kkm we initialize using linear interpolation. To get the best result for {rk, sk} we minimize the energy using both mentioned different strategies for the initialization of and we keep the one providing the lowest value of the energy.

Parameters.

The main parameters of the energy are wr, ws, which determine the regularity of the ratio r(t) and of the time shift s(t), the interval [smin, smax] with determines the expected range of values for s(t), and the optional boundary values of s(t) and r(t) at the ends of the time interval [0, T]. All of these parameters can be tuned by the user. Unless otherwise indicated, the values of these parameters used in the experiments presented in this work are wr = 1000, ws = 10, [smin, smax] = [−10, 25] and, by default, optional boundary values are not used. The parameter wo only intervenes in the case where s(t) goes out of the interval [smin, smax]. Since an interval [smin, smax], large enough to include the possible values of s(t) is normally used, the term in the energy with wo is rarely used. We fixed wo = 1010 in the energy minimization implementation.

Material and methods

We can apply our method to any pair of epidemiological time series where we can expect that the generalized renewal Eq (1) is satisfied. It is easily checked if the model provides a good matching between both time series (see Fig 1).

The time series to be compared must provide data with the same fixed time step (for instance, daily or weekly). Given the continuous formulation of the generalized renewal Eq (1), as showed below in the experimental results, the method works better if the time series are corrected by adequate smoothing of the strong weekly oscillation, also called administrative noise.

All the experiments presented in this paper can be reproduced using the implementation of the method (named EpiIndicators) publicly available at the EpiInvert CRAN R package [32]. We use data publicly available provided by Our World in data, OWID, (https://covid.ourworldindata.org/data/owid-covid-data.csv). In particular, we use the following time series:

  • new_cases : new daily confirmed cases of COVID-19
  • new_deaths : new daily deaths attributed to COVID-19
  • icu_patients : number of COVID-19 patients in intensive care units (ICUs) on a given day
  • hosp_patients : number of COVID-19 patients in hospital on a given day
  • weekly_icu_admissions : number of COVID-19 patients newly admitted to intensive care units (ICUs) in a given week (reporting date and the preceding 6 days)
  • weekly_hosp_admissions : number of COVID-19 patients newly admitted to hospitals in a given week (reporting date and the preceding 6 days)

The raw registered number of daily cases and deaths are very irregular curves that contain a strong weekly seasonality due to the way countries register the number of cases and deaths every day of the week. To remove the noise and the weekly seasonality of these curves, we use, EpiInvert, the method proposed in [33] which estimates smooth trend curves for the incidence and death. In the experiments presented below, we use these restored versions of the incidence and deaths as the epidemic time series.

Results and discussion

Comparison of cases and deaths in the United Kingdom

In Fig 1 we show the results obtained when comparing cases and deaths in the United Kingdom. We shall compare these results with the ones obtained by other authors on related time series. In [3], the authors study the variation of the time delay between symptom onset and death in 4 epidemiological periods from 2020-01-01 to 2021-01-20 in the United Kingdom. In Table 1 we present some of the results. Globally, the obtained time delay increases in a significant way in the first part of the year and declines in the last period. We compared this result with the one displayed in Fig 1. Note that the data are different: an incident case is recorded at the time of its registration while in [3] the case statistics is attributed to the time of symptom onset. We nevertheless observe the same trend: a significant increase in the first part of the year and a decline by the end of 2020. The maximum of our obtained delay, s(t), is 18.14 and the maximum of the time delay obtained in [3] is 24.69. This difference is justified by the delay between symptom onset and the inclusion of the case in the incidence statistics. This delay depends on the country’s data collection policy. Thus these data are too different to perform a quantitative numerical comparison of the results; we limited ourselves to a study of the coherence of the results.

thumbnail
Table 1. Results, obtained in [3], for the mean of the time from symptom onset to death, segmented by symptom onset date using Lognornal distributions.

https://doi.org/10.1371/journal.pcbi.1011757.t001

In [1], the authors estimate, for the United Kingdom, the death-rate, in the population, for the first five SARS-CoV-2 pandemic waves illustrated as color shaded areas in Fig 1. It should be noted that the article calculates the mortality rate of large population groups, without taking into account whether they have been infected or not. Therefore, the mortality rates are much lower than the ones computed in Fig 1, which refer to the incident population. Calculating the average of the r(t) calculated by our method in each period we obtain the comparison of Table 2. The results show good agreement, apart from two explainable discrepancies. In wave 1, the mortality rates diverge strongly because of the low testing capacity at the beginning of the epidemic. At this stage, most incident cases are only detected in hospitals while the results of [1] are based on a neutral population. In the later waves, the mortality rate obtained by [1] is about 10 times smaller than the one obtained on the incident population, with the exception of wave 4. This would mean that approximately 10% of the observed population was infected. In wave 4, the first wave of Omicron, more people were infected, which causes the rate given in [1] to go up, but the one measured on the incident population instead goes down, which is explainable by the fact that the first Omicron variant was very contagious but less lethal.

thumbnail
Table 2. Comparison, for the United Kingdom, of the death-rate per capita obtained in [1] for the first five SARS-CoV-2 pandemic waves and the average of r(t) in the same periods estimated using the case and deaths time series.

https://doi.org/10.1371/journal.pcbi.1011757.t002

Comparison of hospitalizations and deaths

For France, we compare, in Fig 2, the hospitalizations and deaths restored using EpiInvert. In the data that we use, provided by the OWID, the hospitalizations are given by the weekly aggregated hospitalizations but updated every day with a different value, that is, we have a daily value of the time series. To get an approximation of the daily hospitalizations we divide this quantity by 7. In Fig 2 we observe a good agreement between r(t)f(t) and g(t + s(t)) with quite smooth r(t) and s(t).

thumbnail
Fig 2.

For France, we use as time series f(t): the daily value of the weekly hospitalizations divided by 7 and g(t): the restored number of daily deaths using EpiInvert. In (A) we plot f(t) and g(t). In (B) we plot the estimated r(t)f(t) and g(t + s(t)). In (C) we plot r(t) and s(t).

https://doi.org/10.1371/journal.pcbi.1011757.g002

The colored rectangles in Fig 2 represent the time intervals used in [4] to study the variation of the mortality rate in hospitals. in Fig 3 we present a comparison between the mortality rate obtained in [4] and the value of r(t) showed in Fig 2. Notice that the data are quite different because we use the registered values of weekly hospitalization and deaths and the authors of [4] use the values obtained on a cohort of patients. The reporting delays of the data can be also quite different for each kind of data. With this caveat, the obtained comparative results show a similar order of magnitude. Taking into account that the periods used in [4] are very short (the first six periods belong to the first wave), a mortality decay from 25% to 10% in a few months seems excessive. In the first period for example, in less than 3 weeks and at the beginning of the epidemic, mortality goes down from 24.9% to 20.6%, which is not very credible, especially in the first weeks when everything was getting worse. In that sense, our result that shows a worsening death rate at the beginning before starting to decrease seems more reasonable. We can add to the preceding caveat that the number of cases is very small between sections 5 and 7, which deteriorates the quality of the results when calculating separately by sections. The trend is nevertheless similar in both estimates: for the first variant, the mortality decreases progressively during the first wave, to grow again from the emergence of the Alpha variant in the fall of 2020. At the peaks of the waves, the results of both estimates are closer to each other, which is logical as more data were available.

thumbnail
Fig 3. For France, we show the evolution, during the first epidemic year of the mortality rate in hospital obtained in [4] and the average of r(t) in the same periods showed in Fig 2.

https://doi.org/10.1371/journal.pcbi.1011757.g003

Influence of the weights wr and ws

The lower the values of the weights wr and ws, the better the match between f(t)r(t) and g(t + s(t)), but the functions r(t) and s(t) are more irregular. At the end of the paper we include a justification of the default choices of these parameters using realistic simulated data. To illustrate the influence of such choices, we compare in Fig 4 hospitalizations and deaths in France when lowering wr and ws. As expected, the match between r(t)f(t) and g(t + s(t)) is better than in Fig 2, but r(t) and s(t) are more irregular. In fact, we observe significant oscillations of r(t) and s(t) in very short periods of time that suggest that the values for wr and ws are too low in this experiment, so that both corrected curves overfit.

thumbnail
Fig 4. Same experiment as in Fig 2 but using wr = 10 (instead of the default value wr = 1000) and ws = 1 (instead of the default value ws = 10).

https://doi.org/10.1371/journal.pcbi.1011757.g004

Comparing the result of EpiInvert with the smooth values of cases provided by the OWID

As stated above, the regularization of the raw incidence and death data is an important pre-processing step to improve the quality of the results. In this experiment, we compare the influence of the regularization step when processing the same raw time series with two different methods. In Fig 5 we compare the smooth version of the raw incidence provided by the OWID in France with the smooth version of the raw incidence obtained by EpiInvert, a functionality included in the EpiInvert CRAN package [32]. We observe that the OWID smooth version of cases is more irregular than EpiInvert, it can be zero sometimes and the median of the delay s(t) is about 1.81 days. This experiment also illustrates another use of our method: comparing different versions of a same epidemiological curve and quantifying their difference.

thumbnail
Fig 5.

We use as time series f(t): the restored number of daily cases in France using EpiInvert and g(t): the smooth values of daily cases provided by the OWID. In (A) we plot f(t) and g(t). In (B) we plot the estimated r(t)f(t) and g(t + s(t)). In (C) we plot r(t) and s(t).

https://doi.org/10.1371/journal.pcbi.1011757.g005

Verifying the model by a simulation study

To evaluate the capacity of EpiIndicators to correctly estimate r(t) and s(t) we developed a realistic simulation framework where both functions were known. We used as second time serie g(t) the restored number of deaths in France using EpiInvert and we generated the first time serie f(t) from simulated r(t) and s(t) using the formula (8) in this way, we obtained a perfect matching between both curves. The function r(t) was obtained using ‘atan()’ type functions where we simulated a decreasing behavior for r(t), s(t) was simulated in the same way with an increasing behavior. After simulating f(t) and g(t) in that way with a “ground truth” (s(t), r(t)), we applied EpiIndicators to the pair f(t), g(t) and compared the obtained values of r(t) and s(t) with the ground truth. The results are shown in Figs 6 and 7. We observe a good agreement between the estimated values of r(t), s(t), and the ground truth. Notice that we cannot expect the values to be exactly equal, because in the areas where g(t + s(t)) is very small or constant, there is not enough information to compute s(t). In that case, the energy regularity term dominates, causing the estimated value to be slightly smoother than the ground truth. On the contrary, at the peaks of the epidemic waves, the information is more robust and we can expect the calculation of r(t) and s(t) to be more accurate.

thumbnail
Fig 6.

We use as time series g(t) the restored number of deaths in France using EpiInvert and f(t) is defined as where s(t) and r(t) are a simulated ground truth. In (A) we plot f(t) and g(t). In (B) we plot the estimated r(t)f(t) and g(t + s(t)) using the recovered values for r(t) and s(t). In (C) we plot the recovered values for r(t) and s(t).

https://doi.org/10.1371/journal.pcbi.1011757.g006

thumbnail
Fig 7. Comparison of the values of r(t) and s(t) estimated by the model and their ground truth using the simulated data.

https://doi.org/10.1371/journal.pcbi.1011757.g007

To illustrate the influence of the regularization of the time series in the quality of the r(t) and s(t) estimations, we show in Fig 8 the results obtained using the same simulation but using as function g(t) the original raw number of reported deaths instead of the regularized version using EpiInvert. We observe that the estimations of r(t) and s(t) are strongly perturbed by the noise and weekly bias of the raw data: compare with the results in Figs 6 and 7.

thumbnail
Fig 8.

We use as time series g(t) the raw number of reported deaths in France and f(t) is defined as where s(t) and r(t) are a simulated ground truth. In (A) we plot f(t) and g(t). In (B) we plot the estimated r(t)f(t) and g(t+ s(t)) using the recovered values for r(t) and s(t). In (C) we plot the recovered values for r(t) and s(t).

https://doi.org/10.1371/journal.pcbi.1011757.g008

Next, we studied, using these simulated data, the influence of the order in which the time series are taken. Let and be the ratio and delay obtained using g(t) as the first curve and f(t) as the second curve. Then, accordingly with our model, we have that (9)

To check if this relation is satisfied we studied the normalized error distribution (10)

The results are shown in Fig 9. The small size of the normalized error (10) shows the performance of the method and its stability when we change the role of the time series. In the beginning, we find larger errors due to boundary effects in the estimations.

A sanity check: Verifying the transitivity of the estimations of r(t) and s(t)

In this paragraph, we study the transitivity of the estimation when three time series are involved. Let f1(t), f2(t) and f3(t) be these three time series. We took as example the France incidence curve f1(t) restored using EpiInvert; f2(t) is the number of new hospital admissions and f3(t) is the number of new deaths restored using EpiInvert. The hospitalization data was taken from the COVID-19 European Hub. We will study the relationship between the ratio and delay between the three time series. Let ri,j(t) and si,j(t) be the ratio and delay between time series fi(t) and fj(t). Then we have (11)

Using this relation for {i, j} = {1, 2}, {2, 3}, {1, 3} we obtain the following transitivity relation: (12) therefore the transitivity condition can be expressed using the equations (13) (14)

Next, we check if the relations (13) and (14) were satisfied for the proposed example. To do so we estimated ri,j(t) and si,j(t) using the proposed method. Then, in Fig 10, we plotted r1,3(t) and r2,3(t + s1,2(t))r1,2(t) in (A) and s1,3(t) and s1,2(t) + s2,3(t + s1,2(t)) in (B). We notice a reasonable transitivity between the estimations of the ratio and delay for the three time series, except at the beginning of the epidemic, likely due to the poor quality of the initial incidence curve.

thumbnail
Fig 10.

A sanity check: verifying the transitivity of the estimations of r(t) and s(t) when three time series are involved: f1(t), the France incidence curve, f2(t), the number of new hospital admissions and f3(t) the number of new deaths. In (A) we plot r1,3(t) and r2,3(t + s1,2(t))r1,2(t) and in (B) we plot s1,3(t) and s1,2(t) + s2,3(t + s1,2(t)).

https://doi.org/10.1371/journal.pcbi.1011757.g010

Comparison of epidemiological waves between countries

In Fig 11, we use our method to compare the deaths between Italy and France. It must be pointed out that this comparison is not fully legitimate. The populations being distinct, there is no formal argument to involve a generalized renewal equation between both incidence curves. Furthermore, the health policies in France in Italy were different. This application of our method is justified by the fact that we can expect a similar trend of the epidemiological time series in neighboring countries modulus a shift given by the temporal difference in the entry of the different variants of the virus in the countries. Variations in r(t) can be expected due to the different health policies. Indeed, the results in Fig 11 show a very good agreement between r(t)f(t) and g(t + s(t)), s(t) is quite smooth and r(t) is quite stable, oscillating most of the times in a short range of values ([0.8, 1.2]). The epidemic started in Italy before France. This is reflected in that s(0) ≈ 11.8 days. Then s(t) decreases sharply and stabilizes near zero, indicating a coordination of the epidemic waves in both countries. We also know that the first wave of the epidemic in Italy generated a large number of deaths; this is reflected in that r(0) ≈ 1.12. The median of r(t) for the whole period is 0.885, which is very close to the ratio between the populations of Italy and France (around 0.875), which suggests that, globally, the mortality rate in both countries has been similar. The factor 1.2 in the first three months of the pandemic indicates a larger mortality in Italy where only adaptive regional lockdown was applied. In France a strict lockdown was applied in the corresponding period. We can conclude that the overall mortality in France and Italy was proportional to their population, in spite of different policies.

thumbnail
Fig 11.

We use as time series f(t): the restored number of daily deaths in Italy using EpiInvert and g(t): the restored number of daily deaths in France using EpiInvert. In (A) we plot f(t) and g(t). In (B) we plot the estimated r(t)f(t) and g(t + s(t)). In (C) we plot r(t) and s(t).

https://doi.org/10.1371/journal.pcbi.1011757.g011

Applications to other time series

It might be argued that the generalized renewal equation proposed here, r(t)f(t) = g(t + s(t)) might be applied to more general situations than pandemic-related curves, namely any case where two curves are causally related, such as temperature and flu incidence. We refer to [34] for a study of the climate influence on dengue propagation in Peru. This study estimated the timing difference of dengue epidemics between jungle and coastal regions, with differences significantly associated with the seasonal cycle of mean temperature. We took a similar but more straightforward example and performed a comparative analysis of the weekly France flu incidence (data obtained at sentiweb.fr) and the corresponding weekly average temperature evolution (data obtained at opendatasoft.com) in the 2010-2019 period. Fig 12(A) displays the cyan incidence curve and a red oscillatory curve corresponding to f(t) = max(T(t)) − T(t) where T(t) is the temperature. This inversion ensures that low temperatures correspond to high values of f(t) and enables a direct application of our proposed warping method ensuring that r(t) is positive. A visual examination of both curves shows a coincidence between yearly temperature minima and flu incidence peaks with a short delay. Fig 12(B) displays a comparison of both curves after applying the estimated temporal shift s(t) and ratio r(t). This result nevertheless shows an imperfect match. This fact, in addition to the high oscillatory behavior of the red curve s(t) and the cyan curve r(t) in Fig 12(C) means that our model does not provide a good result matching temperature and flu incidence. Indeed, we find that, in general, the temperature is not proportional to the flu incidence modulus a shift s(t). The ratio r(t) and the time delay s(t) provide no more meaningful insight than what was known beforehand, that low temperatures favor the emergence of flu. This negative experiment leads us to restrict the application of our method to cases where the compared curves are legitimately linked by a general renewal equation of the proposed form, r(t)f(t) = g(t + s(t)).

thumbnail
Fig 12.

Results obtained using as f(t) the weekly flu incidence and g(t) = maxt(T(t)) − T(t) where T(t) is the weekly average of the temperature in France. The black vertical lines correspond to the peaks of the flu incidence.

https://doi.org/10.1371/journal.pcbi.1011757.g012

Justification of the choice of the regularization weights wr and ws

To justify the choice of the regularization weights wr and ws we shall use the realistic simulated data shown in Fig 6. The advantage of this simulated data is that we know the ground truth for r(t) and s(t). We can therefore compute the RMSE between this ground truth and the method’s r(t) and s(t) estimations for arbitrary choices of wr and ws. To study the influence of wr and ws on the RMSE we added a perturbation to f(t) and g(t) given by a multiplicative Gaussian noise with mean 1 and standard deviation 0.3. On the results, we also applied a 7-day centered window average. To get an intuitive relative error for r(t), we divided the RMSE by the mean of the true r(t). For s(t) this normalization step was not needed because the reference time unit is always the day. in Figs 13 and 14 we show the obtained RMSEs. We find a number of combinations with accurate results using wr between 100 and 10000 and ws between 1 and 100. The results are therefore quite stable in a broad range of values of wr and ws. In particular, the default choices of the parameters wr = 1000 and ws = 10 provide the best RMSE result for r(t) and among the best ones for s(t).

thumbnail
Fig 13.

RMSE between the true r(t) and the one obtained by our method divided by the mean of the true r(t), using different choices for the regularization weights wr and ws.

https://doi.org/10.1371/journal.pcbi.1011757.g013

thumbnail
Fig 14. RMSE between the true s(t) and the one obtained by our method using different choices for the regularization weights wr and ws.

https://doi.org/10.1371/journal.pcbi.1011757.g014

Conclusion

The results of the method are informative and realistic: we found that the proposed method enables a comparison between related epidemiological time series. A very good agreement of r(t)f(t) and g(t + s(t)) was generally observed in our experiments, for smooth r(t) and s(t), and the intervals where some disagreement was observed were explainable by real-world events. Our method also leads to propose generalizations of the renewal equation that enable relating curves of different data, thus generalizing the concept of reproducing number Rt to heterogeneous classes.

The main objection that can be raised by the method is its over-determination. Indeed, there are multiple solutions of (1), some trivial like s(t) = 0 and , and many more. Indeed we can fix an arbitrary and smooth shift s(t), and obtain again a solution pair r(t), s(t) to (1) as . Thus, the joint imposition of regularity to r(t) and s(t) is crucial for our purposes. The good results obtained in the simulated study where a realistic ground truth for r(t), s(t) is properly recovered by the method, and the symmetry and transitivity of the method confirm that a sufficient regularity imposition on r(t) and s(t) properly solves the over-determination problem. From the modeling viewpoint, we verified that the behavior of r(t) and s(t) was interpretable and coherent with related results obtained by previous works.

References

  1. 1. Nab L, Parker E, Andrews C et al. Changes in COVID-19-related mortality across key demographic and clinical subgroups in England from 2020 to 2022: a retrospective cohort study using the OpenSAFELY platform. The Lancet Public Health. 2023;8 (5):364–377. pmid:37120260
  2. 2. Tikhonov AN, Arsenin VY. Solutions of ill-posed problems. Winston; 1977.
  3. 3. Ward T, Johnsen A. Understanding an evolving pandemic: An analysis of the clinical time delay distributions of COVID-19 in the United Kingdom. PLoS ONE. 2021;16(10). pmid:34669712
  4. 4. Lefrancq N,Paireau J, Nathanaël H et al. Evolution of outcomes for patients hospitalised during the first 9 months of the SARS-CoV-2 pandemic in France: A retrospective national surveillance data analysis. Lancet Reg Health Eur. 2021;5:100087. pmid:34104903
  5. 5. Kurtz P, Bastos LSL, Dantas LF, et al. Evolving changes in mortality of 13,301 critically ill adult patients with COVID-19 over 8 months. Intensive Care Med. 2021;47(5):538–548. pmid:33852032
  6. 6. Finelli L, Gupta V, Petigara T, Yu K, Bauer KA, Puzniak LA. Mortality Among US Patients Hospitalized With SARS-CoV-2 Infection in 2020. JAMA Netw Open. 2021;4(4). pmid:33830226
  7. 7. Roelens M, Martin A, Friker B et al. Evolution of COVID-19 mortality over time: results from the Swiss hospital surveillance system (CH-SUR). Swiss Med Wkly. 2021;151:w30105. pmid:34843180
  8. 8. Carbonell R, Urgelés S, Rodríguez A, et al. Mortality comparison between the first and second/third waves among 3,795 critical COVID-19 patients with pneumonia admitted to the ICU: A multicentre retrospective cohort study. Lancet Reg Health Eur. 2021;11:100243. pmid:34751263
  9. 9. Keogh EJ, Pazzani MJ. Derivative dynamic time warping. In: Proceedings of the 2001 SIAM international conference on data mining. SIAM; 2001. p. 1–11.
  10. 10. Bellman R. The theory of dynamic programming. Bulletin of the American Mathematical Society. 1954;60(6):503–515.
  11. 11. Müller M. Dynamic time warping. Information retrieval for music and motion. 2007; p. 69–84.
  12. 12. Clifford D, Stone G, Montoliu I, Rezzi S, Martin FP, Guy P, et al. Alignment using variable penalty dynamic time warping. Analytical chemistry. 2009;81(3):1000–1007. pmid:19138127
  13. 13. Senin P. Dynamic time warping algorithm review. Information and Computer Science Department University of Hawaii at Manoa Honolulu, USA. 2008;855(1-23):40.
  14. 14. Sprechmann P, Bronstein A, Morel JM, Sapiro G. Audio restoration from multiple copies. In: 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE; 2013. p. 878–882.
  15. 15. Wang K, Gasser T. Alignment of curves by dynamic time warping. The annals of Statistics. 1997;25(3):1251–1276.
  16. 16. Horn BK, Schunck BG. Determining optical flow. Artificial intelligence. 1981;17(1-3):185–203.
  17. 17. Beauchemin SS, Barron JL. The computation of optical flow. ACM computing surveys (CSUR). 1995;27(3):433–466.
  18. 18. Meinhardt-Llopis E, Sánchez J. Horn-schunck optical flow with a multi-scale strategy. Image Processing on line. 2013;.
  19. 19. Stübinger J, Schneider L. Epidemiology of coronavirus COVID-19: Forecasting the future incidence in different countries. In: Healthcare. vol. 8. MDPI; 2020. p. 99.
  20. 20. Jin Q. Time Warping clustering for the forecast and analysis of COVID-19. In: 2020 IEEE MIT Undergraduate Research Technology Conference (URTC). IEEE; 2020. p. 1–5.
  21. 21. Landmesser JM, et al. The use of the dynamic time warping (DTW) method to describe the COVID-19 dynamics in Poland. Oeconomia Copernicana. 2021;12(3):539–556.
  22. 22. Dmytrów K, Landmesser J, Bieszk-Stolorz B. The connections between COVID-19 and the energy commodities prices: Evidence through the Dynamic Time Warping method. Energies. 2021;14(13):4024.
  23. 23. Giannoula A, Gutierrez-Sacristán A,Bravo Á, Sanz F, Furlong LI. Identifying temporal patterns in patient disease trajectories using dynamic time warping: a population-based study. Scientific reports. 2018;8(1):1–14. pmid:29523868
  24. 24. Rodpothong P, Auewarakul P. Viral evolution and transmission effectiveness. World Journal of Virology. 2012;1(5):131. pmid:24175217
  25. 25. He X, Lau EH, Wu P, Deng X, Wang J, Hao X, et al. Temporal dynamics in viral shedding and transmissibility of COVID-19. Nature medicine. 2020;26(5):672–675. pmid:32296168
  26. 26. Ashcroft P, Huisman JS, Lehtinen S, Bouman JA, Althaus CL, Regoes RR, et al. COVID-19 infectivity profile correction. arXiv preprint arXiv:200706602. 2020;.
  27. 27. Wallinga J, Teunis P. Different epidemic curves for severe acute respiratory syndrome reveal similar impacts of control measures. American Journal of epidemiology. 2004;160(6):509–516. pmid:15353409
  28. 28. Cori A, Ferguson NM, Fraser C, Cauchemez S. A new framework and software to estimate time-varying reproduction numbers during epidemics. American journal of epidemiology. 2013;178(9):1505–1512. pmid:24043437
  29. 29. Lotka AJ. Relation between birth rates and death rates. Science. 1907;26(653):21–22. pmid:17754777
  30. 30. Nishiura H. Time variations in the transmissibility of pandemic influenza in Prussia, Germany, from 1918–19. Theoretical Biology and Medical Modelling. 2007;4(1):20. pmid:17547753
  31. 31. Nishiura H, Chowell G. In: Chowell G, Hyman JM, Bettencourt LMA, Castillo-Chavez C, editors. The Effective Reproduction Number as a Prelude to Statistical Estimation of Time-Dependent Epidemic Trends. Dordrecht: Springer Netherlands; 2009. p. 103–121. Available from: https://doi.org/10.1007/978-90-481-2313-1_5.
  32. 32. Alvarez L, Morel JD, Morel JM. EpiInvert R package. CRAN. 2022;.
  33. 33. Alvarez L, Colom M, Morel JD, Morel JM. Computing the daily reproduction number of COVID-19 by inverting the renewal equation using a variational technique. PNAS Proceedings of the National Academy of Sciences of the United States of America. 2021;118(50):1–10. pmid:34876517
  34. 34. Chowell G, Cazelles B, Broutin H, Munayco CV. The influence of geographic and climate factors on the timing of dengue epidemics in Perú, 1994-2008. BMC infectious diseases. 2011;11(1):1–15. pmid:21651779