Filtering and improved Uncertainty Quantification in the dynamic estimation of effective reproduction numbers

The effective reproduction number R_t measures an infectious disease's transmissibility as the number of secondary infections in one reproduction time in a population having both susceptible and non-susceptible hosts. Current approaches do not correctly quantify the uncertainty in estimating R_t, as reflected by the observed variability in contagion patterns. We elaborate on the Bayesian estimation of R_t by improving on the Poisson sampling model of Cori et al. (2013). By adding an autoregressive latent process, we build a Dynamic Linear Model on the log of the observed R_t's, resulting in a filtering-type Bayesian inference. We use a conjugate analysis, and all calculations are explicit. Results show an improved uncertainty quantification in the estimation of the R_t's, with a reliable method that could safely be used by non-experts and within other forecasting systems. We illustrate our approach with recent data from the COVID19 epidemic in Mexico.


Introduction
The effective reproduction number R_t is the proportion of new cases generated by active cases at calendar time t. Estimating R_t has proved to be an important tool for monitoring ongoing epidemics and for assessing interventions or general changes in contagion patterns (e.g. Ferguson et al., 2006; Fraser et al., 2004). This has been the case in the past and even more so recently with the COVID19 epidemic. In Germany, a calculation of R_t is used as a public monitor of the evolution of the COVID19 epidemic and has direct consequences on public life and decision making (RKI, 2020). Regarding the estimation of R_t, in basically all realistic cases a proxy for incidence data has to be used and, along with the inherent stochastic variability in reported cases, proxies are subject to noise.
There are methods to estimate R_t as a by-product of larger mechanistic compartmental models (e.g. Capistran et al., 2020). The main limitation of these methods is precisely that an underlying mechanistic model must be postulated and calibrated beforehand, thus inheriting possible model drawbacks (see Bettencourt and Ribeiro, 2008; Cintrón-Arias et al., 2009, for example). Recently, more generic methods for estimating R_t, requiring incidence data only, have been proposed (Cauchemez et al., 2006; Wallinga and Teunis, 2004; Fraser, 2007; Cori et al., 2013; Thompson et al., 2019). Cauchemez et al. (2006) is hardly recommended for non-experts, and simpler analytic methods are sought.
Let I_t be the incidence of cases at time t = 1, 2, …, to be used for the estimation of R_t. Using the renewal equation and other assumptions, Fraser (2007) explains that the effective reproduction number may be defined and estimated with

R_t = I_t / Λ_t,  with  Λ_t = ∑_{s=1}^{t-1} w_s I_{t-s}.  (1)

Here w_s is a probability mass function (i.e. w_s ≥ 0 and ∑_{s=1}^{∞} w_s = 1) that accounts for the infectious disease generation interval. It is assumed that infected individuals have a generation interval w_s, dependent on the number s of days since infection but independent of current time t. At time t, w_s I_{t-s} represents a ''force of infection'' of individuals infected s days ago. Then Λ_t represents the ''total infectiousness of infected individuals'' (Cori et al., 2013). It may also be seen as an estimate of the current total of active cases.
R_t is then the ratio of secondary cases produced by the current total of active cases. From here onwards we will assume that time is measured in days.
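Definition (1) is simple to evaluate directly. The following minimal sketch (our illustration; the function names are ours, not from the paper's software) computes Λ_t and the raw ratio I_t/Λ_t from an incidence series and a generation-interval pmf; note that multiplying the incidence by any constant leaves the ratios unchanged:

```python
def total_infectiousness(I, w):
    """Lambda_t = sum_{s=1}^{t-1} w_s * I_{t-s}; I[0] holds I_1."""
    Lam = []
    for t in range(len(I)):
        Lam.append(sum(w[s - 1] * I[t - s]
                       for s in range(1, t + 1) if s - 1 < len(w)))
    return Lam

def naive_rt(I, w):
    """Point estimate R_t = I_t / Lambda_t (NaN while Lambda_t is 0)."""
    Lam = total_infectiousness(I, w)
    return [I[t] / Lam[t] if Lam[t] > 0 else float("nan")
            for t in range(len(I))]
```

For example, with a one-day generation interval (w = [1.0]) and incidence doubling every day, the ratio is 2 on every day after the first, and rescaling the incidence by 10 gives identical ratios.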
Note that I_t may be only a proxy of the total number of infected people — counting, for example, only people arriving at hospitals to seek help. In most common cases, the number of new infected individuals at time t, I*_t, including asymptomatic infections, is impossible to measure. Ideally, as a proxy, we may regard instead I_t = c I*_t, with c an unknown constant.

Fig. 1. Incidence data (left; see Capistran et al., 2020). Data timestamps correspond to symptom's onset date. The corresponding posterior of R_t according to the Cori et al. (2013) Poisson model (right). Also, the Mérida incidence data multiplied by 10, and its corresponding posterior (dark red). The uncertainty presented by the posterior responds to the incidence absolute values and not to the variation in reproduction numbers. The generation interval used here is explained in Fig. 2. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Since Λ_t is linear in the incidence, Λ_t = c Λ*_t, and the ratio R_t = I_t/Λ_t = I*_t/Λ*_t measures relative, not absolute, infections and is indeed independent of c. Any estimation of R_t should reflect this, providing the same results if a different c is used or if the proxy is given proportional to the total population, etc. Moreover, the uncertainty we have on the actual R_t should not be bound to the absolute values of I_t but to the recent variation in the observed ratios I_t/Λ_t.
Moreover, epidemic data have far more caveats that need to be addressed. Reporting delays, along with right truncation, are very common in epidemic data. The incidence data timestamp is also a significant source of uncertainty. In the cases where timestamps correspond to the reporting date, incidence data is a convolution of infections and delays (Gostic et al., 2020). Several methods address this issue by deconvolving the observed trajectories to infections in terms of uncertain reporting delay distributions (Bhatt et al., 2020). In cases for which the symptom's onset date is known, the incidence is used directly instead of deconvolving a delay from onset to reporting, allowing for a more precise estimation of the infection date (Huisman et al., 2021). We do not address the deconvolution of infections issue in the present work since all data used have the symptom's onset date as a timestamp. Only under a thorough investigation of the reporting system at hand may incidence data be corrected to obtain a reasonably good proxy for estimating R_t (Salmon et al., 2016; McGough et al., 2020). As we say, only ideally is it the case that I_t = c I*_t. Even after standard data preprocessing techniques are applied, conditions may change during an epidemic with the net result that c changes over time, for example. Undoubtedly, the use of raw incidence data, without proper preprocessing, will most likely lead to incorrect/biased estimates of R_t (Gostic et al., 2020). We do not discuss data preprocessing any further and assume I_t = c I*_t throughout, with c an unknown fixed constant.
In Cori et al. (2013) the estimation of R_t, based on the proxy incidence I_t, uses (1) to form the statement that E(I_t | I_{t-1}, I_{t-2}, …, I_1) = R_t Λ_t. A probability model is then proposed for the observed I_t conditional on I_{t-1}, I_{t-2}, …, I_1 with the expected value R_t Λ_t. Cori et al. (2013) proposed the Poisson model I_t | I_{t-1}, I_{t-2}, …, I_1 ∼ Po(R_t Λ_t). Incidence data are commonly overdispersed and the immediate improvement on this Poisson model would be to use a more general count data model such as a Negative Binomial (Lindén and Mäntyniemi, 2011). Dispersion parameters may be fixed or estimated in the Negative Binomial. However, directly postulating a model for incidence data may result in estimates that depend on the unknown constant c. Moreover, as far as estimating R_t is concerned, the actual variability of I_t is not in principle relevant, only the variability observed in the ratios I_t/Λ_t.
The resulting inference of Cori et al. (2013), where they assume R_t constant over τ = 7 days (with a Gamma(5, 1) conjugate prior), is illustrated in Fig. 1. In particular, the coefficient of variation (CV) of the posterior distribution for R_t is (a + ∑_{s=t-τ+1}^{t} I_s)^{-1/2} (see Cori et al., 2013, p. 3, web material). The CV will decrease if we increase the arbitrary constant c. It depends only on the absolute current values of the I_t's and not on the intrinsic variation of the underlying R_t's; see Fig. 1. Further elaboration on this Poisson model may be found in recent literature. Thompson et al. (2019) improve this Bayesian inference of R_t by also estimating w_s and adding the corresponding variability to the posterior of R_t. Nouvellet et al. (2018) use Cori et al. (2013) embedded in a more complex forecast system. The need for a statistical estimation of R_t is apparent, with a reasonable estimation of the uncertainty in all inferences.
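The window-based conjugate update just described can be sketched as follows (our reading of the Gamma-Poisson conjugacy, with a Gamma(a, b) shape/scale prior as an assumed parameterization; the function name is ours). The toy run below reproduces the scaling problem: multiplying the incidence by c = 10 barely moves the posterior mean but shrinks the CV, even though R_t is unchanged.

```python
import math

def cori_posterior(I, Lam, t, tau=7, a=5.0, b=1.0):
    """Gamma posterior for R_t under the Poisson model with a Gamma(a, b)
    (shape, scale) prior and R_t assumed constant over the window of tau
    days ending at t (0-based). Returns (posterior mean, CV)."""
    shape = a + sum(I[t - tau + 1: t + 1])
    rate = 1.0 / b + sum(Lam[t - tau + 1: t + 1])
    return shape / rate, 1.0 / math.sqrt(shape)

# Constant incidence with I_t = Lambda_t, so the true R_t is 1.
I = [50.0] * 14
Lam = [50.0] * 14
mean1, cv1 = cori_posterior(I, Lam, 13)
mean10, cv10 = cori_posterior([10 * x for x in I], [10 * x for x in Lam], 13)
# cv10 is much smaller than cv1, although nothing epidemiological changed.
```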
Recently, and prompted by the current uses of R_t estimates during the COVID19 epidemic, Gostic et al. (2020) review current methods and recommend the use of Cori et al. (2013), in contrast to other methods not based on (1); see Gostic et al. (2020) and references therein. Gostic et al. (2020) stress three further points in the estimation of R_t, namely: (1) Data for estimating the infectivity function w_s are rarely available, and proxies for estimating w_s should be used with care. (2) Correct incidence data should always be used, representing the number of new infections every day, taking special care to correct for reporting delays and other issues. (3) Smoothing of the R_t estimation is a matter of concern. Cori et al. (2013) use a smoothing window of τ = 7 days, but this should be used judiciously depending on the data at hand, although no definite guidelines are suggested in Gostic et al. (2020). However, Gostic et al. (2020) focus on other comparison issues in the estimation of R_t and do not concentrate on the adequacy of the uncertainty expressed by the posterior distributions of Cori et al. (2013). Moreover, Abbott et al. (2020) present an R_t estimation for the COVID19 pandemic, based on a hierarchical model that tries to account for reporting delays, sub-reporting, etc., using MCMC. Besides the actual details, the basic idea of modeling the R_t's as an autoregressive time series is used, as we propose doing (with the AR(1)); see Section 2 below.
Our aim here is to improve the Uncertainty Quantification (UQ) in the estimation of R_t, from the definition in (1), using an analytic Bayesian approach (i.e. non-MCMC) that may be robustly embedded in other systems or used by non-experts.
We follow the same basic idea of Cori et al. (2013) but interpret (1) differently. Namely, define y_t = log(I_t) − log(Λ_t); then y_t is an observation of θ_t = log(R_t), where log(I_t) = log(c) + log(I*_t) and log(Λ_t) = log(c) + log(Λ*_t), for some unknown constant c, so that c cancels in y_t. As usual when taking logarithms of count data to postulate a Normal model, care is needed with low counts.

Table 1. Updating formulas for the parameters of the non-central Student-t posterior distribution for θ_t, θ_t | y_{1:t} ∼ t_{n_t}(m_t, C_t).
As recommended by Cori et al. (2013), R_t should only be estimated for incidence with I_t ≥ 10. This further justifies the use of the above Normal model. Therefore, with the observed log R_t's, i.e. the y_t's, we will try to estimate the θ_t = log(R_t)'s. Cori et al. (2013) assumed R_t constant over τ days, obtaining some smoothing. A further improvement on the approach of Cori et al. (2013) is that we explicitly and formally model smoothing, or coherence, among the θ_t's. This we do by postulating an autoregressive prior for the θ_t's, modeling the fact that θ_t should be similar (but not necessarily equal) to θ_{t-1}. This creates a Dynamic Linear Model and, using Bayesian updating, forms a filtering type inference for the sequence.
We will continue to assume the generation interval w_s fixed and dedicate our effort to improving the UQ in the Bayesian estimation of R_t. As mentioned above, we do not discuss data preprocessing and assume that I_t is a correct proxy for estimating R_t. We do not provide an analysis of the effects of incidence data anomalies on our estimation of R_t. The paper is organized as follows. We present the details of our approach in the next section, and in Section 3 we present two examples using recent COVID19 incidence data from Mexico. Finally, in Section 4 we discuss our results.

Materials and methods
A Dynamic Linear Model (West and Harrison, 1997) is proposed for the estimation of the θ_t = log(R_t)'s. Namely,

y_t = θ_t + ε_t,  ε_t ∼ N(0, λ_t^{-1}).

With θ_0 ∼ N(m_0, λ_0^{-1} C*_0), an autoregressive AR(1) (prior) model is proposed for the θ_t's (see Section 2.1): we state that θ_t is equal to θ_{t-1} plus some noise ω_t. The precision process λ_t, i.e. σ_t² = λ_t^{-1}, is the variance of the observed log R_t's, the y_t's, and is assumed unknown. A multiplicative evolution λ_t = λ_{t-1} η_t/β, with η_t ∼ Beta(β n_{t-1}/2, (1 − β) n_{t-1}/2), is assumed to estimate these variances; the n_t are the discounted degrees of freedom, n_t = β n_{t-1} + 1 (see West and Harrison, 1997, sec. 10.8, for details). That is, we assume that the precision λ_t is similar to λ_{t-1}, multiplied by the random innovation or shock η_t/β. Since E(η_t) = β, η_t/β has a shifted distribution with mean 1. This keeps all λ_t positive. The discount factor β and the hyperparameters m_0, C*_0, w*, n_0 are part of the modeling and prior definition (see below) and considered known.

The marginal posterior of θ_t may be calculated analytically and corresponds to a non-central Student-t distribution with n_t degrees of freedom,

θ_t | y_{1:t} ∼ t_{n_t}(m_t, C_t),  (2)

where y_{1:t} = (y_1, y_2, …, y_t); see Table 1 for the recursive calculation of the central parameter m_t and the dispersion parameter C_t. This constitutes a filtering type estimation (Kalman, 1960) of the θ_t's, with discounted variance (West and Harrison, 1997, sec. 10.8). The marginal posterior distribution of R_t may be obtained from (2), namely that of R_t = exp(θ_t). In particular, quantiles for this posterior may be calculated by transforming the corresponding quantiles of (2), since R_t = exp(θ_t) is monotonic.
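A filtering pass of this kind can be sketched with the standard discount-DLM recursions of West and Harrison (1997, sec. 10.8) for a random-walk state with unknown observational variance. This is our sketch under that assumption — it is not a verbatim transcription of the paper's Table 1, whose exact entries are not reproduced here.

```python
def dlm_filter(y, m0=0.0, C0=1.0, n0=2.0, s0=1.0, tau=7):
    """Random-walk DLM with discounted variance learning (sketch of
    West & Harrison 1997, sec. 10.8). beta and w* follow the rules in
    the text: beta = 1 - 1/(2*tau), w* = 2/tau.
    Returns the Student-t posterior parameters (m_t, C_t, n_t) per day."""
    beta, wstar = 1.0 - 1.0 / (2 * tau), 2.0 / tau
    m, C, n, d = m0, C0, n0, n0 * s0
    s = d / n                          # running variance estimate
    out = []
    for yt in y:
        a, R = m, C + wstar * s        # random-walk state prediction
        Q = R + s                      # one-step forecast variance of y_t
        e, A = yt - a, R / (R + s)     # forecast error and adaptive gain
        n = beta * n + 1.0             # discounted degrees of freedom
        d = beta * d + s * e * e / Q
        s_new = d / n                  # updated observational variance
        m, C = a + A * e, (s_new / s) * (R - A * A * Q)
        s = s_new
        out.append((m, C, n))
    return out
```

On a long constant series y_t = 0.3 the filter's location converges to 0.3 and n_t approaches its limit (1 − β)^{-1} = 2τ = 14, as stated in Section 2.1.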

Autoregressive modeling and prior distributions
The assumption of Cori et al. (2013) that R_t remains constant over τ days is relaxed here with the AR(1) prior model for the θ_t's, θ_t = θ_{t-1} + ω_t, ω_t ∼ N(0, λ_t^{-1} w*). The variance discount factor β is key for learning from recent variability in the y_t's and gradually ceasing to learn from more distant y_t's. It is proved that the degrees of freedom n_t, which we may visualize as the ''effective'' sample size — the number of sample points in the past actually used to estimate the variance component — has the property n_t → (1 − β)^{-1} (West and Harrison, 1997, sec. 10.8). We postulate that (1 − β)^{-1} = 2τ, or β = 1 − (2τ)^{-1}. That is, the limit to learning from the past is 2τ days. Regarding the prior θ_0 ∼ N(m_0, λ_0^{-1} C*_0), we center it at m_0 = 0 (i.e. R_0 = 1) with the same variance as the observed y_t, that is C*_0 = 1. Regarding the prior η_1 ∼ Beta(β n_0/2, (1 − β) n_0/2), note that by design it is centered at β. Using n_0 = 2 provides a diffuse prior with large variance, and leads to the default non-informative Beta(1/2, 1/2) in the case of β = 1/2. The variance for the AR(1) model for θ_t is crucial, as it controls the memory and smoothness in our approach. Our a priori statement on this variance is the value of the variance multiplier w*. The prior model for θ_t is in fact a summation of Gaussian shocks (as in the Wiener process). It is clear then that θ_t | θ_{t-k} ∼ N(θ_{t-k}, k λ_t^{-1} w*). If this variance is similar to, or larger than, the variance of the observational process, that is k w* ≥ 1, then little smoothing is obtained at lag k since θ_t can vary as much as, or more than, θ_{t-k}. Taking k = τ, we set τ w* = 2, thus w* = 2/τ, to also limit the level of smoothing to less than τ days. We postulate τ w* = 2 as a pragmatic, large enough variance multiplier that leads to quite similar smoothing to that obtained by Cori et al. (2013) using the same value of τ. This is illustrated in the examples presented in the next Section.
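The rules above (β = 1 − (2τ)^{-1}, w* = 2/τ, and the limiting effective sample size (1 − β)^{-1} = 2τ) reduce all smoothing hyperparameters to the single choice of τ. A small helper (ours, for illustration) makes this explicit:

```python
def hyperparameters(tau=7):
    """Discount factor, AR(1) variance multiplier, and limiting
    effective sample size from the smoothing window tau, following the
    rules beta = 1 - 1/(2*tau) and w* = 2/tau given in the text."""
    beta = 1.0 - 1.0 / (2 * tau)
    wstar = 2.0 / tau
    memory = 1.0 / (1.0 - beta)   # limiting degrees of freedom = 2*tau
    return beta, wstar, memory
```

For τ = 7 this gives β = 13/14, w* = 2/7 and an effective memory of 14 days.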

The generation interval
A key input to calculating Λ_t is postulating w_s through an infectious disease generation interval. w_s may be seen as the probability (∑_{s=1}^{∞} w_s = 1) that an infected person infects other people s days from infection (Fraser, 2007). A ready way to define w_s is through a pdf g with cdf G, letting w_s = G(s) − G(s − 1). For the COVID19 epidemic, we postulate an Erlang(3, 8/3) pdf, depicted in Fig. 2(a). The expected value is at 8 days and the maximum infectivity at 5.5 days, decaying near zero after 20 days. This is based on early (Team et al., 2020; Verity et al., 2020) as well as recent reports on the viral load and implied infectiousness of the disease (see Cevik et al., 2020, fig. 2). The use of the Erlang distribution in epidemics has been suggested elsewhere (Champredon et al., 2018).
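The discretization w_s = G(s) − G(s − 1) is straightforward to carry out; the sketch below (ours, for illustration) uses the closed-form Erlang cdf and renormalizes over a finite truncation horizon:

```python
import math

def erlang_cdf(x, shape=3, scale=8 / 3):
    """Closed-form Erlang cdf: G(x) = 1 - e^{-x/scale} * sum_{i<shape} (x/scale)^i / i!."""
    lam = x / scale
    return 1.0 - math.exp(-lam) * sum(lam ** i / math.factorial(i)
                                      for i in range(shape))

def discretize_erlang(shape=3, scale=8 / 3, n_days=30):
    """Daily generation-interval pmf w_s = G(s) - G(s-1), s = 1..n_days,
    renormalized over the truncation horizon."""
    w = [erlang_cdf(s, shape, scale) - erlang_cdf(s - 1, shape, scale)
         for s in range(1, n_days + 1)]
    total = sum(w)
    return [ws / total for ws in w]
```

With the default Erlang(3, 8/3), the resulting pmf sums to one and has its mean near the 8-day expected value quoted above (slightly above it, since w_s assigns the mass of the interval (s − 1, s] to day s).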
To illustrate that our inferences are stable, we also present in Fig. 2(a) several other Erlang densities as generation intervals, varying around the Erlang(3, 8/3), that may also be considered adequate for COVID19. Since all calculations for the posterior distribution of R_t are analytic and performed in milliseconds, calculating each posterior in each case represents no great CPU burden. This approach is used in Thompson et al. (2019) to include the inherent uncertain nature of any generation interval density. In Thompson et al. (2019) a methodology is proposed to estimate the generation interval based on previous data of known pairs of index and secondary cases, from detailed contact tracing of infected individuals, for example. Variability is imposed on the parameters of a postulated family of densities (in our case, Gamma densities that are close to the Erlang(3, 8/3)) and from that the w_s's are calculated. Formally speaking, this constitutes a hierarchical model in which the parameters of the generation interval density become random. The posterior distribution for R_t ceases to have a closed form, but generating a Monte Carlo (MC) sample from it is straightforward.

Fig. 2. (a) Alternative Erlang generation interval densities, varied following Thompson et al. (2019). The resulting posterior distributions of R_t on 23JUN2020 (b) and 03JUL2020 (c) for our method, for the Mérida COVID19 incidence data presented in Fig. 1. An additional Erlang(30, 20/30), representing an extreme alternative generation interval density, and its corresponding posterior for R_t, dashed green in (a), (b) and (c), respectively.
One simply simulates one set of parameters for the generation interval, calculates the corresponding posterior, and in turn simulates one value from it. Iterating this algorithm will create a MC sample from the posterior of R_t, without needing to rely on MCMC. All 20 posteriors shown in Fig. 2(b) were calculated in 0.5 s; thus a sample of 1000 would take 8 min, approximately.
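A minimal sketch of this resampling loop is given below (ours, for illustration only): it jitters the Erlang scale by an assumed ±20% and propagates the uncertainty to the point estimate I_t/Λ_t rather than through the full DLM posterior, which would follow the same simulate-then-evaluate pattern.

```python
import math
import random

def erlang_cdf(x, shape, scale):
    lam = x / scale
    return 1.0 - math.exp(-lam) * sum(lam ** i / math.factorial(i)
                                      for i in range(shape))

def gi_pmf(shape, scale, n_days=30):
    """Daily generation-interval pmf w_s = G(s) - G(s-1), renormalized."""
    w = [erlang_cdf(s, shape, scale) - erlang_cdf(s - 1, shape, scale)
         for s in range(1, n_days + 1)]
    z = sum(w)
    return [x / z for x in w]

def rt_draws(I, t, n_draws=200, seed=42):
    """For a fixed day t (0-based), draw generation-interval scales around
    8/3 (a hypothetical +/-20% jitter), recompute Lambda_t, and return the
    point estimates I_t / Lambda_t, one per draw."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_draws):
        scale = (8 / 3) * rng.uniform(0.8, 1.2)   # assumed jitter
        w = gi_pmf(3, scale)
        Lam = sum(w[s - 1] * I[t - s]
                  for s in range(1, t + 1) if s - 1 < len(w))
        draws.append(I[t] / Lam)
    return draws
```

As a sanity check, with a long constant incidence series every draw returns 1, whatever generation interval is sampled, since Λ_t then equals I_t.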
The resulting posterior distributions for R_t using our method, for Mérida, México, on 23JUN2020 and 03JUL2020, are presented in Figs. 2(b) and (c). Certainly, under broad regular circumstances, the Bayesian posterior operator is smooth with respect to changes in the prior and likelihood (see Christen and Pérez-Garmendia, 2020, for example), and therefore the resulting posteriors are compacted around the original posterior. Naturally, if a more distant generation interval density is used, the posterior of R_t will move accordingly, as seen with the dashed green Erlang(30, 20/30) serial interval (with mean now at 20 days) and its corresponding posteriors in Fig. 2. More interesting is that more variability is seen, as expected, when there is variability in the incidence data, as seen on 23JUN2020 and 03JUL2020; compare with the Mérida incidence data in Fig. 1 and the posteriors in Fig. 3. We continue with an Erlang(3, 8/3) as our generation interval distribution for COVID19 in the two examples presented below.

Results
We use the same data sets presented in Fig. 1 to illustrate our approach, plotting the posterior time series from the last date in the series to 60 days before; see Fig. 3. Also plotted in Fig. 3 are the posteriors for the R_t's using Cori et al. (2013) and the observed ratios I_t/Λ_t. Regarding the Mérida data, note how the Cori et al. (2013) posterior is basically insensitive to the variability observed in the data; the posterior inter-quantile ranges remain basically equal. For our approach, note that from 05.05 to 19.05 the large variability in the observed ratios leads to large posterior ranges, whereas from 02.06 to 09.06 a more stable observed ratio results in far shorter posterior inter-quantile ranges. Later on, by the end of June, more spread posteriors are seen again, given the observed second wave of large and unstable ratios. A sharp increase in R_t is followed by a sharp decrease, from above 2 to nearly 1, within 10 days, until 03.07. Our AR(1) modeling results in a smooth decay in the position of the posterior for R_t, and a shorter spread, as a response to the short stable spell of observed ratios. For these same 10 days, a similar behavior is seen in the position of Cori et al. (2013)'s posterior, but this is a result of the moving window of τ days, as explained in the previous section. As already mentioned, the UQ resulting from the posterior of Cori et al. (2013) responds only to the absolute values of the I_t's, which in this case are around 25 to 100 in May and June.
Regarding the results for Mexico City, presented in Fig. 3, note how the posterior of Cori et al. (2013) basically collapses around the median, given the large counts of this large metropolitan area (with a population of more than 21 million, I_t varies around 1000 to 2000 in May and June). On the contrary, the UQ provided by our posterior of the R_t's presents larger spreads in early May, when an unstable spell is observed in the ratios I_t/Λ_t. Later on, after most of June within a more stable period, the uncertainty decreases, with far sharper posteriors by the end of the month and the first days of July.
Regarding the smoothing obtained in the estimation of R_t with respect to the observed ratios I_t/Λ_t (gray lines in Fig. 3), see how sharp changes in the latter are reasonably smoothed out by the former. Our strategy for setting the variance multiplier hyperparameter w* in the AR(1) model for θ_t (i.e. w* = 2/τ, with τ = 7 days, see Section 2.1) results in smoothing similar to that of Cori et al. (2013), although with larger and adaptive inter-quantile ranges, as already commented. The smoothness obtained by Cori et al. (2013) is a result of assuming a fixed R_t over τ = 7 days and calculating the posterior independently for each t. Ours is the result of the DLM and the autoregressive prior model on θ_t.

Discussion
We present a new approach for estimating the effective reproduction number R_t with the following improvements: (1) it does not depend on a compartmental model specific to a particular disease; (2) using previous ideas (Fraser, 2007; Cori et al., 2013), we construct a novel statistical model and Bayesian inference approach to estimate R_t; (3) the resulting UQ is better suited, responding to the recent variation in the observed ratios I_t/Λ_t; (4) an AR(1) process is used to promote smoothness in the estimation; and (5) all calculations are simple, leading to a Bayesian updating (filtering) type analysis that may be used by non-experts. In fact, a simple-to-use Python program with an implementation of our method (and also the original of Cori et al., 2013) may be downloaded from https://github.com/andreschristen/RTs. We are also working on an alternative implementation in a spreadsheet.
The generation interval w_s is a crucial input to this and other similar methods to estimate R_t. Note that, based on preliminary data on infected patients, some basic knowledge can be gained to postulate a reasonable w_s. This was the case since the early stages of the COVID19 epidemic (Team et al., 2020). It is easier to gain information on w_s since it represents information on the infectivity, bound to an average infected person. Social contact patterns, lockdown measures, etc., do not play a role in w_s (from the underlying assumption that the infectivity kernel β(t, s) in the renewal equation I(t) = ∫ β(t, s) I(t − s) ds has the form β(t, s) = R(t) w(s); see Fraser, 2007). The social/epidemiological factors influencing contagion patterns are coded in R_t, an appealing feature of this type of approach: one estimates the latter to monitor changes in epidemic data from the onset.
Estimating the infectious disease generation interval requires time-consuming clinical studies that are commonly not available during an ongoing epidemic with a new virus such as Sars-Cov-2. Thompson et al. (2019) present an attempt to include uncertainty in the estimation of w_s into the posterior distribution of R_t of Cori et al. (2013). Including an estimate of w_s in our method, along with its corresponding uncertainty, is formally straightforward within this Bayesian framework, either by including a likelihood with additional observables or by postulating a prior process for w_s. Indeed, this will complicate the approach and most likely MCMC methods will need to be used.

Fig. 3. Posterior R_t's for the incidence data presented in Fig. 1, for the last 60 days. From the posterior marginal of each θ_t in (2) we calculate its 10%, 25%, 50%, 75% and 90% quantiles, and these are transformed back to obtain quantiles of R_t. These are plotted vertically with blue shades and the median with a red line. The Poisson model posterior of R_t of Cori et al. (2013) is also plotted, using the same quantiles, with green shades and a black line for the median. The posterior of Cori et al. (2013) for Mexico City is so sharp that the quantile shades are hardly seen. The gray lines are the observed ratios I_t/Λ_t. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
As already mentioned, a reliable estimation of R_t is quite relevant, to be used as such, or as part of other larger forecasting or monitoring systems. Bettencourt and Ribeiro (2008) elaborate on a simple relation of future incidence with predicted R_t's. A reliable forecasting system for R_t could then be used for short term forecasting of incidence. A related idea is used in Nouvellet et al. (2018) for incidence forecasting and other purposes. Being at time t, the process θ_{t+k}, k = 1, 2, …, h, can be easily simulated, by simulating θ_t and other parameters from the posterior at t and in turn simulating θ_{t+k} | θ_{t+k−1} until θ_{t+h}, to produce MC samples of the posterior predictive distribution for R_{t+h}. A feasible path for incidence forecasting based on forecasts of R_t is envisaged, but we need to leave this possibility for future research.
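The forward simulation just described can be sketched as follows (our illustration, under a simplifying assumption: a point estimate s_t of the observational variance is plugged in for the random-walk innovations rather than integrating the precision out; parameter names are ours).

```python
import math
import random

def forecast_rt(m_t, C_t, n_t, s_t, wstar=2 / 7, horizon=7,
                n_sims=2000, seed=1):
    """Draw theta_t from its Student-t filtering posterior t_{n_t}(m_t, C_t),
    propagate the random walk theta_{t+k} = theta_{t+k-1} + omega with
    omega ~ N(0, s_t * w*) (plug-in variance), and return MC samples of
    R_{t+horizon} = exp(theta_{t+horizon})."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n_sims):
        # Student-t draw via Normal / sqrt(chi-square / dof).
        z = rng.gauss(0.0, 1.0)
        chi2 = rng.gammavariate(n_t / 2.0, 2.0)   # chi-square with n_t dof
        theta = m_t + math.sqrt(C_t) * z / math.sqrt(chi2 / n_t)
        for _ in range(horizon):
            theta += rng.gauss(0.0, math.sqrt(s_t * wstar))
        samples.append(math.exp(theta))
    return samples
```

Starting the simulation at a posterior centered at θ_t = 0 (i.e. R_t = 1), the predictive median stays near 1 while the spread grows with the horizon, as expected of a random walk.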

Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.