Reconciling early-outbreak estimates of the basic reproductive number and its uncertainty: framework and applications to the novel coronavirus (SARS-CoV-2) outbreak

A novel coronavirus (SARS-CoV-2) emerged as a global threat in December 2019. As the epidemic progresses, disease modellers continue to focus on estimating the basic reproductive number R0—the average number of secondary cases caused by a primary case in an otherwise susceptible population. The modelling approaches and resulting estimates of R0 during the beginning of the outbreak vary widely, despite relying on similar data sources. Here, we present a statistical framework for comparing and combining different estimates of R0 across a wide range of models by decomposing the basic reproductive number into three key quantities: the exponential growth rate, the mean generation interval and the generation-interval dispersion. We apply our framework to early estimates of R0 for the SARS-CoV-2 outbreak, showing that many R0 estimates are overly confident. Our results emphasize the importance of propagating uncertainties in all components of R0, including the shape of the generation-interval distribution, in efforts to estimate R0 at the outset of an epidemic.


Introduction
Since December 2019, a novel coronavirus (SARS-CoV-2) has been spreading globally [1]. Although the virus is likely to have originated from animal hosts [2], the ability of SARS-CoV-2 to directly transmit between humans, particularly without symptoms, has posed a greater threat for its spread [3]. As of 11 May 2020, more than 4 million cases of the coronavirus disease 2019 (COVID -19) have been confirmed internationally [4].
As SARS-CoV-2 began to spread in parts of China outside Hubei province, as well as in other countries, many analyses of the outbreak were published as pre-prints [5][6][7][8][9][10] and in peer-reviewed journals [11][12][13][14]. These analyses focused on estimating the basic reproductive number R 0 -the average number of secondary cases generated by a primary case in a fully susceptible population [15,16]-in order to assess the pandemic potential of SARS-CoV-2. Rapid dissemination of these early analyses played an important role in shaping the response to the outbreak [17].
We commend these researchers for their timely contribution and those who made the data publicly available. However, the estimates of R 0 from different research groups (as well as the associated degrees of uncertainty) vary considerably even though most analyses rely on similar data-reports of confirmed cases from China, particularly from Wuhan City. Comparing a disparate set of estimates of R 0 can be difficult when the estimation methods and their underlying assumptions vary widely. In some cases, similar methods can give different estimates; in other cases, different methods can give similar estimates. Understanding the differences between R 0 estimates is critical to controlling an epidemic as R 0 provides information about the level of intervention required to prevent further transmission [15], and about the potential final size of the outbreak [15,18].
Here, we show that a wide range of approaches to estimating R 0 can be understood and compared in terms of estimates of three quantities: the exponential growth rate r, the mean generation interval G and the generation-interval dispersion κ. The generation interval, defined as the interval between the time when an individual becomes infected and the time when that individual infects another individual [19], characterizes the relationship between r and R 0 [20][21][22][23]; therefore, estimates of R 0 depend directly on their assumptions about the generation-interval distribution and the exponential growth rate. Early in an epidemic, information is scarce and there is uncertainty surrounding both case reports (affecting the estimates of the exponential growth rate) and contact tracing (affecting the estimates of the generation-interval distribution). Ignoring these uncertainties leads to overly confident conclusions.
To formalize the estimation of uncertainty at the onset of an outbreak, we present a statistical framework for averaging across estimates of the basic reproductive number R 0 from multiple studies. We apply the method to seven disparate models published online as pre-prints between 23 and 26 January 2020 that estimate R 0 for the SARS-CoV-2 outbreak in Wuhan City, China [5][6][7][8][9][10]24]. Previous studies have directly calculated the average of reported R 0 values [17,25] but such methods mask differences in underlying model assumptions and statistical methods. Instead, we model the estimate of R 0 (as well as the associated generation-interval parameters, G, and κ) from each study with probability distributions that account for the uncertainty in the estimates; this allows us to re-estimate the corresponding distributions of the exponential growth rates r. We then use a Bayesian multi-level model to average the three key quantities (r, G and κ). The resulting pooled estimates (μ r , μ G and m k ) are used to calculate the pooled estimate of the basic reproductive number, R pool . Using pooled estimates allows us to average appropriately across the uncertainties present in modelling approaches and in their underlying assumptions. We use these pooled estimates to illustrate the importance of propagating different sources of error, particularly uncertainty in both the growth rate and the generation interval.

Description of the studies
We gathered information on estimates of R 0 for the SARS-CoV-2 outbreak in Wuhan City, China and their model assumptions from seven articles that were published online between 23 and 26 January 2020. Five studies [7][8][9][10]24] were uploaded to preprint servers (bioRxiv, medRxiv and SSRN); one report was posted on the website of Imperial College London [6]; and one report was posted on nextstrain.org [5] (table 1).

Model assumptions
Despite a wide range of models considered across the studies, all of them assume that the epidemic initially grows exponentially. The IDEA model (used in study 7) includes a discount parameter d that allows the model to deviate from exponential growth when d ≠ 0 [28], but study 7 estimates d = 0 across all parameters they consider. Even though some studies consider reported cases up to 26 January 2020-3 days after the travel restriction that took place on 23 January 2020 [29]-the exponential growth assumption can still describe the number of reported cases reasonably well; given the incubation period of around 5 days [30] as well as reporting delays of around 5 days [31], the majority of reported cases during the study periods are likely to have been infected prior to the travel ban.
When the epidemic is growing exponentially, the estimated basic reproductive number is determined by the exponential growth rate r and the intrinsic generation-interval distribution g(τ), which describes the infection time of secondary cases caused by a primary case in a fully susceptible population [32], via the Euler-Lotka equation [22]: Therefore, it is sufficient to consider the estimates and assumptions about the exponential growth rates and the shapes of the generation-interval distributions to understand disparate estimates of the basic reproductive number. All model assumptions reduce to properties of the exponential growth rate r and the shape of the generation-interval distribution g(τ).
For example, if a model relies on overly confident assumptions about the underlying observation (how new cases are reported) or process (how new cases are generated) model, the estimated confidence/credible intervals associated with the exponential growth rates or parameters of the generation-interval distributions (from each study) will necessarily be narrow. As most studies do not report their estimates of the exponential growth rate, we first summarize model outcomes using reported (either estimated or assumed) values of the basic reproductive number R 0 , mean generation interval G and generationinterval dispersion κ, represented by the squared coefficient of variation (table 1)-we re-estimate the corresponding exponential growth rates from these values later. Study 2 only reports their assumptions about the mean generation interval; for simplicity, we assume κ = 0.5 in our analysis. Study 6 presents R 0 estimates under 12 different scenarios regarding reporting rates (0-, 0.5-, one-or twofold increase in reporting rate) and the shapes of the generation-interval distributions based on previous coronavirus outbreaks (Middle East respiratory syndrome, MERS; severe acute respiratory syndrome, SARS; and their average)-we use their baseline scenario in our analysis to remain consistent with other studies, which do not account for changes in the reporting rate. While estimates of R 0 and the associated confidence intervals for study 6 in table 1 are based on G ¼ 8 d, we account for the uncertainty they consider for G in our formal analysis.
While most studies report confidence/credible intervals to quantify uncertainties associated with their estimates, some use different measures. In particular, study 2 reports a range of R 0 for the worst and best case scenarios, which correspond to the values of R 0 such that 95% and 5% of the simulated total number of cases by 18 January 2020 are greater than or equal to 4000, respectively; for simplicity, we treat these intervals as a royalsocietypublishing.org/journal/rsif J. R. Soc. Interface 17: 20200144 Table 1. Summary of the models, analysed data, reported estimates of the basic reproductive number, and the assumptions about the generation-interval distributions. Model details, estimates of R 0 and their assumptions about the shape of the generation interval distributions were collected from seven studies. Generation-interval dispersion represent the squared coefficients of variation in generation intervals.  These intervals reflect R 0 values for best and worst scenarios. We treat these intervals as a 90% confidence/credible interval in our analysis. b We assume κ = 0.5 in our analysis. c The authors presented R 0 estimates under different assumptions regarding the reporting rate; we use their baseline scenario in our analysis to remain consistent with other studies, which do not account for changes in the reporting rate.
royalsocietypublishing.org/journal/rsif J. R. Soc. Interface 17: 20200144 90% confidence/credible interval in our analysis. Uncertainty ranges reported by study 1 and study 7 are assumed to be uniform ranges. Some of these studies have now been published in peerreviewed journals [12,14] or have been updated with better uncertainty quantification [33]. As our primary focus is on the resolution of uncertainty in the available information during the earliest stages of an epidemic, rather than to provide more precise or accurate estimates of R 0 , we focus strictly on estimates that were published between 23 and 26 January 2020.

Gamma approximation framework for linking r and R 0
Here, we use the gamma approximation framework to the generation-interval distribution [20,[34][35][36][37][38] to (i) characterize the amount of uncertainty present in the exponential growth rates and the shape of the generation-interval distribution and (ii) assess the degree to which these uncertainties affect the estimate of R 0 . The gamma distribution provides a reasonable approximation for generation-interval distributions of many diseases, including Ebola, measles and rabies [20]. Studies 1, 5, 6 and 7 also used a gamma distribution (including the special cases of Dirac delta and exponential distributions) to model the generation-interval distribution for SARS-CoV-2. Assuming that generation intervals follow a gamma distribution with mean generation interval G and generation-interval dispersion κ, represented by the squared coefficient of variation of a gamma distribution, we have [20]: This equation demonstrates that a generation-interval distribution that has a larger mean (higher G) or is less variable (lower κ) gives a higher estimate of R 0 for the same value of r [22].

Re-estimation of the exponential growth rate
As most studies do not report their estimates of the exponential growth rate, we first re-estimate the exponential growth rate that corresponds to their model assumptions. Since the estimate of the basic reproductive number R 0 is determined by the exponential growth rate and the shape of generation-interval distributions, we can calculate the exponential growth rate from the basic reproductive number R 0 , the mean generation interval G and the generation-interval dispersion κ. First, to account for uncertainties in these parameters, we model reported values of the basic reproductive number R 0 , the mean generation interval G and the generation-interval dispersion κ with appropriate probability distributions. We use gamma distributions to model values reported with confidence/credible intervals (CI) and uniform distributions to model values reported with ranges; when confidence/credible intervals are reported, we parametrize the gamma distribution such that (i) its mean matches the estimated value and (ii) the probability that a random variable following the specified gamma distribution falls between the lower and upper confidence/credible limits is equal to the reported confidence/credible level. This probability is not necessarily based on equi-tailed quantiles. For example, study 3 estimated R 0 ¼ 2:92 (95% CI: 2.28-3.67); we model this estimate as a gamma distribution with a mean of 2.92 and a shape parameter of 67, which has a 95% probability of containing a value between 2.28 and 3.67 (see table 2 for a complete description).
For each study i, we construct a family of parameter sets by drawing 10 5 random samples from the corresponding probability distributions (table 2) that represent the estimates of (R 0 ) i,m and the assumed values of G i,m and κ i,m and calculate the exponential growth rate r i,m by inverting equation (2.2): where m = 1, …, 10 5 . This allows us to approximate the probability distributions of the exponential growth rates estimated by each study. Uncertainties in the probability distributions that we calculate for the estimated exponential growth rates reflect model assumptions, statistical methods, and also the quality of the data that each study relies on. This approach of reestimating the exponential growth rate does not affect the uncertainty captured by our analysis because we are re-estimating the probability distribution of r i that is consistent with the reported values of (R 0 ) i , G i and κ i ; in other words, we still obtain the same degree of associated uncertainty in (R 0 ) i if we calculate it from r i , G i and κ i . For study 6, we fix G ¼ 8 d and use the gamma distribution (table 2) that corresponds to R 0 ¼ 5:47 (95% CI: 4.16-7.10) during the re-estimation step for r to remain consistent with the original study, which assumed G ¼ 8 d for this particular estimate. We account for uncertainties in G for study 6 (table 1) in all other steps in order to properly incorporate parameter uncertainties in the estimate of R 0 . Study 7 uses the IDEA model [28], through which the authors effectively fit an exponential curve to the number of confirmed cases without propagating any statistical uncertainty. Instead of modelling R 0 with a probability distribution and recalculating r, we use r = 0.114 d −1 , which accounts for all uncertainty in the reported R 0 when combined with the considered range of G in the original article. Table 2. Probability distributions for R 0 , G and κ. We use these probability distributions to obtain a probability distribution for the exponential growth rate r. The gamma distribution is parametrized by its mean and shape. Constant values are fixed according to exp (r G) b Uniform (6, 10) 0 a We do not account for this uncertainty during our re-estimation of the exponential growth rate r because the reported estimate of R 0 and its uncertainty assumes G ¼ 8. We still account for this uncertainty in our pooled estimates (μ G ). b Instead of modeling R 0 with a probability distribution and re-estimating r, we use r = 0.114 days −1 (see text).

Pooled estimates
We construct pooled estimates for each parameter (r, G and κ) using a Bayesian multilevel modelling approach, which assumes that the parameter estimates across different studies are all drawn from the same gamma distributions: (2:4) where μ r , μ G , m k represent the pooled estimates, and σ r , σ G and s k represent between-study standard deviations. The pooled estimates, which are represented as probability distributions rather than point estimates, allow us to average across different modelling approaches while accounting for the uncertainties in their assumptions. Here, we do so by averaging across reported values, without explicitly re-fitting their models. We use a Markov chain Monte Carlo approach (cf. §2.7) and account for uncertainties associated with r i , G i and κ i (and correlations among them), by drawing a random set from the family of parameter sets (r i,m , G i,m , k i,m ) for each study i at each Metropolis-Hastings step. Since the gamma distribution does not allow κ = 0 (this corresponds to a Dirac delta generation-interval distribution), we substitute κ = 0.02 for study 7. Although this approach nominally treats all studies equally, the overall pooled estimate will still be weighted by the certainty of the reported estimates (e.g. r i will be sampled from a narrow distribution and therefore have stronger influence on μ r if the reported confidence/credible interval on r i is narrow).
Our approach does not account for non-independence between the parameter estimates made by different modellers. In this case, most estimates primarily depend on reported cases from China, particularly from Wuhan City. Differences among estimates are primarily driven by differences in estimation methods and underlying assumptions, rather than by epidemiological differences. The pooled estimates can become sharper (i.e. have narrower credible intervals) as we add more models even when the models or the data no longer add more information about the epidemic. Since SARS-CoV-2 spread primarily in Wuhan City, China, during this period, it is not possible to include independent sources of data from other countries. Thus, the pooled estimates should be interpreted with care.

Prior distributions
We use weakly informative priors on hyperparameters (m r , m G , m k , s r , s G , s k ): (2:5) These priors are chosen such that their 95% quantile ranges are sufficiently wider than biologically realistic parameter ranges. Specifically, 95% quantile ranges for μ r , μ G and m k are 0:02-0:40 d À1 , 0.8-19.5 d and 0.1-1.4, respectively; 95% prior quantile range for R 0 then corresponds to 1.05-12.00. Parameters that are outside these ranges are biologically unrealistic for SARS-CoV-2 outbreaks. Therefore, we do not expect our results to be sensitive to these priors. We follow recommendations outlined in Gelman et al. [39], parametrizing the top-level gamma distributions in terms of their means and standard deviations and imposing weakly informative prior distributions on between-study standard deviations, i.e. half-normal (0, 10). We initially used gamma priors with small shape parameters (<1) on between-study shape parameters (=μ 2 /σ 2 ) but found this put too much prior probability on large between-study variances-a known problem [39]. Alternative choices of prior for the between-study shape parameters are also suboptimal. Imposing strong priors (e.g. half-t (μ = 0, σ = 1, ν = 4)) assumes a priori that between-study variance is large (and therefore does not pool different estimates sufficiently). Overly weak priors (e.g. half-Cauchy (0,5)) lead to inefficient sampling and poor convergence.

Markov chain Monte Carlo
We run four independent Markov chain Monte Carlo chains each consisting of 500 000 burnin steps and 500 000 sampling steps using the Metropolis-Hastings algorithm. Proposal distributions are modelled using independent normal distributions. Initial values and variances of the proposal distributions are chosen by trial-and-error to ensure a reasonable acceptance rate (around 10%) and convergence within 1 000 000 steps. Posterior samples are thinned to every 1000 steps to remove autocorrelations among posterior samples. Convergence is assessed by ensuring that the Gelman-Rubin statistic is below 1.01 [40] and the effective sample size is greater than 1000 for all hyperparameters (m r , m G , m k , s r , s G , s k ); trace plots and marginal posterior distribution plots are presented in appendix A. Ninety-five per cent credible intervals (CI) are calculated by computing 2.5% and 97.5% quantiles from the marginal posterior distribution for each hyperparameter.

Comparing estimates of the basic reproductive number
In order to compare estimates of the basic reproductive number R 0 (and particularly their associated uncertainties) across different studies, we need a consistent measure of uncertainty. Instead of using reported uncertainty ranges from the original studies, we re-calculate the basic reproductive number from the parameter sets (r i , G i and κ i ) for each study using equation (2.2) and calculate the median and 95% equi-tailed quantile. We refer to these estimates as the base estimates. The distribution of the basic reproductive number for each study corresponds to the assumed distributions in table 2 for all studies except for study 6. The assumed distribution in study 6 in table 2 neglects uncertainty in the mean generation interval G, whereas the base estimates account for this uncertainty. Furthermore, since the distributions in table 2 are constructed by matching the mean and the probabilities associated with the reported uncertainty ranges, the exact values of the base estimates and their 95% quantiles differ slightly from the reported values in table 1. We compare the base estimates with a pooled estimate of the basic reproductive number (R pool ) based on the pooled estimates of underlying parameters (by substituting μ r , μ G , m k in equation (2.2)).

Sensitivity analysis
In order to understand how uncertainties in each component (r i , G i and κ i ) affect the estimate of (R 0 ) i from each study i, we replace r i , G i and κ i with our pooled estimates (μ r , μ G and m k , respectively) one at a time and recalculate the basic reproductive number R 0 . We refer to the resulting estimates of R 0 as 'substitute' estimates. For example, the r-substitute estimate for study i is computed as: where κ i and G i are taken from their corresponding parameter sets and μ r is drawn from the posterior distribution. This procedure allows us to assess the sensitivity of the estimates of R 0 across appropriate ranges of uncertainties. We compare royalsocietypublishing.org/journal/rsif J. R. Soc. Interface 17: 20200144 substitute estimates with the base estimates of R 0 (based on r i , G i and κ i ). Despite the large uncertainty associated with the underlying parameters, most studies consider narrower ranges of uncertainties in these parameters. No studies take into account how uncertainty in the generation-interval dispersion affects their estimates of R 0 : all studies assumed fixed values for κ, ranging from 0 to 1. The estimates of the between-study standard deviations further suggest that there is a large variability in the underlying parameters among the seven studies, particularly in r and κ: σ r = 0.07 d −1 (95% CI: 0.04-0.19 d −1 ), σ G = 1.02 d (95% CI: 0.54-2.50 d) and s k ¼ 0:51 (95% CI: 0.24-1.52). This variability is likely driven by the differences in modelling approaches and assumptions. Figure 2 shows how propagating uncertainty in underlying parameters affects estimates and CIs for R 0 . For illustrative purposes, we use our pooled estimates, which may represent a reasonable proxy for the state of knowledge as of 23-26 January 2020 (figure 2a). Comparing the estimates that include only some sources of uncertainty to the pooled estimate (R pool ¼ 3:0; 95% CI: 2.1-4.6; see 'all' in figure 2), we see that propagating error from the growth rate (as done by all but one of the studies reviewed) is absolutely crucial: uncertainty in the pooled estimates for both middle bars (μ G and m k ), which lack growth-rate uncertainty, is overly narrow. In this case, propagating error from the mean generation interval has a negligible effect compared to propagating  Figure 1. Comparisons of the reported parameter values with our pooled estimates. We inferred point estimates (black), uniform distributions (orange) or confidence/credible intervals ( purple) for each parameter from each study, and combined them into pooled estimates using a Bayesian multilevel model (red). Points represent medians calculated from the parameter set (r i , G i , k i ) for each study i (orange and purple). Error bars represent 95% equi-tailed quantiles calculated from the parameter set (r i , G i , k i ) for each study i. Red density plots represent distributions of 2000 posterior samples. Open triangle: we assumed κ = 0.5 for study 2, which does not report generation-interval assumptions.  Figure 2. Effects of the exponential growth rate r, mean generation interval G and generation-interval dispersion κ on the estimates of the basic reproductive number R 0 . We compare estimates of R 0 under nine scenarios that propagate different parameter uncertainties (a) based on our pooled estimates (μ r , μ G and m k ) and (b) assuming a fourfold reduction in uncertainty of our pooled estimate of the exponential growth rate (usingm r ¼ (m r þ 3 Â median(m r ))=4 instead of μ r ). Each uncertainty type represents R 0 estimates based the posterior distributions of one of three parameters (μ r , μ G and m k ) while using median estimates of two other parameters. The 'none' type represents R 0 estimate based on the median estimates of μ r , μ G and m k . The 'all' type represents R 0 estimates based on the joint posterior distributions of μ r , μ G and m k (also corresponds to R pool ). Points represent the median estimates. Vertical error bars represent the 95% credible intervals.
We further explore how the effects of uncertainties in generation-interval distributions change when the estimate of the exponential growth rate is more certain. This hypothetical example reflects scenarios, in which increased data availability allows researchers to estimate r with more certainty. To simulate estimates of the exponential growth rate with narrower uncertainty, we usem r ¼ (m r þ 3 Â median(m r ))=4 instead of μ r (figure 2b); thenm r has the same median as μ r but the associated 95% CI is four times narrower (0:16-0:19 d À1 ). As uncertainty associated with the exponential growth rate decreases, accounting for uncertainties in generation intervals becomes even more important. Propagating error only from the growth rate (m r in figure 2b) gives very narrow credible intervals in this case. Propagating errors from the mean generation interval (μ G in figure 2b) or generation-interval dispersion (m k in figure 2b) gives more realistic but still narrow credible intervals. Finally, figure 3 compares the reported estimates (table 1) with the base estimates (based on r i , G i and κ i for each study i) as well as 21 substitute estimates (3 parameter substitutions × 7 studies). The base estimates, which are probability-based approximations of the reported estimates, are broadly consistent with the reported estimates. All but eight substitute estimates have wider credible intervals compared to their corresponding base estimates-the cases with more certain substitute estimates are the G-substitute estimates for studies 1, 5 and 7, r-substitute estimates for studies 1 and 2 and κ-substitute estimates for studies 3, 6 and 7. Accounting for uncertainties in the estimate of r has the largest effect on the estimates of R 0 in most cases ( figure 3). For example, the r-substitute estimate of R 0 for study 7 is R 0 ¼ 3:9 (95% CI: 2.3-8.8), which is much wider than the uncertainty range reported by the authors (2.0-3.1). This is consistent with our earlier results (figure 2) that demonstrated the importance of accounting for uncertainty in the exponential growth rate r. In addition, the pooled estimate of the basic reproductive number (R pool ¼ 3:0; 95% CI: 2.1-4.6) has wider credible intervals than the base estimates for all studies except for study 6.

Discussion
Estimating the basic reproductive number R 0 is crucial for predicting the course of an outbreak and planning intervention strategies. However, comparing disparate estimates of R 0 can be difficult when they rely on different methods and assumptions. Here, we use a gamma approximation framework [20] to decompose R 0 estimates into three key quantities (r, G and κ) and apply a multilevel Bayesian framework to compare estimates of R 0 for the SARS-CoV-2 outbreak. Our results demonstrate the importance of accounting for uncertainties associated with the underlying generation-interval distributions, including uncertainties in the degree of dispersion in the generation intervals.
Our analysis shows that many early estimates of R 0 rely on overly confident assumptions. The neglect of uncertainties in the generation-interval dispersion is particularly important because it determines the shape of the r-R 0 relationship (figure 1): reducing κ from 1 (assuming exponentially distributed generation intervals) to 0 (assuming fixed generation intervals) changes the r-R 0 relationship from linear to exponential (see equation (2.2)). Assuming fixed parameter values here will lead to overly confident conclusions [41].
Omitting consideration of uncertainty in the generationinterval dispersion also explains the sensitivity of R 0 estimates to the exponential growth rate, particularly in study 7 ( figure 3). Since study 7 assumes a fixed generation interval   Figure 3. Sensitivity of the reported R 0 estimates with respect to our pooled estimates of the underlying parameters. We calculate substitute estimates by replacing the reported parameter values (growth rate r, mean generation interval G and generation-interval dispersion κ) with our corresponding pooled estimates (μ r , μ G and m k ) one at a time and recalculating R 0 . The pooled estimate represents R pool , which is calculated from the joint posterior distribution of μ r , μ G and m k ; this corresponds to replacing all reported parameter values with our pooled estimates, which gives identical results across all studies. The reported estimates refer to estimates listed in table 1. Points represent the medians of the reported, base, substitute and pooled estimates. Vertical error bars represent the 95% credible intervals of our base, substitute and pooled estimates (based on 2000 posterior samples). Horizontal dashed lines represent the 95% credible intervals of our pooled estimate.
royalsocietypublishing.org/journal/rsif J. R. Soc. Interface 17: 20200144 (κ = 0), they implicitly assume an exponential r-R 0 relationship, making their estimate too sensitive to r. Similarly, the credible intervals associated with the base estimates of studies 3 (κ = 0.2), 6 (κ = 0.2) and 7 (κ = 0) are wider than the credible intervals associated with their corresponding κsubstitute estimates, which rely on wider generation-interval distributions (m k ¼ 0:50; 95% CI: 0.26-1.10) and, therefore, are less sensitive to uncertainties in r and G. One exception is study 1: this estimate of R 0 is most sensitive to generation-interval dispersion κ, because the study assumes an exponentially distributed generation interval (κ = 1). Estimates that rely on this assumption implicitly assume a linear r-R 0 relationship.
As most studies rely on overly confident assumptions, the credible intervals associated with the base estimates of R 0 should tend to be narrower than the credible intervals of the pooled estimate (R pool ¼ 3:0; 95% CI: 2.1-4.6). While the point estimate of R pool is similar to other reported values from this date range, its credible interval is wider than the credible intervals of the base estimates of all but one study. This result does not mean that assumptions underlying the pooled estimate are too weak; rather, this credible interval more accurately reflects the level of uncertainties present in the information that was available when these models were fitted. In fact, because the pooled estimate does not account for overlap in data sources used by the models, it is more likely to be over-confident than under-confident. Because our median estimate averages over the various studies, particular studies have higher or lower median estimates. In particular, while the baseline example we used from study 6 may appear to be an outlier, the authors of this study also explore different scenarios involving changes in reporting rate over time, under which their estimates of R 0 are similar to other reported estimates.
Of the seven studies that we review, at least one of them directly fit their models to the cumulative number of confirmed cases. This approach is appealing because of its simplicity and apparent robustness, but fitting a model to cumulative incidence neglects autocorrelation between successive counts of cumulative cases. As a result, this approach both biases parameter estimates and gives overly narrow confidence/credible intervals [42,43]. Narrow uncertainties in the estimates of the exponential growth rate are probably driven by this approach.
Many sources of noise affect real-world incidence data, including both dynamical, or 'process', noise (randomness that directly or indirectly affects the actual number of cases occurring); and observation noise (randomness underlying how many of these cases are reported). Disease modellers face the choice of incorporating one or both of these in their data-fitting and modelling steps. Neglecting one or the other is not always a serious problem, particularly if the goal is inferring parameters rather than directly making forecasts [43]. Modellers should, however, be aware that oversimplifying the error model can give overly narrow confidence/credible intervals [42,44].
Our simple framework neglects some other important phenomena. Examples that seem relevant to this outbreak include: changing reporting rates; reporting delays (including the effects of weekends and holidays); and changing generation intervals. For emerging pathogens such as SARS-CoV-2, there may be an early period of time when the reporting rate is very low due to limited awareness or diagnostic resources; for example, Zhao et al. [10] (study 6) demonstrated that estimates of R 0 can change from 5.47 (95% CI: 4.16-7.10) to 3.30 (95% CI: 2.73-3.96) when they assume twofold changes in the reporting rate between 17 January, when the official diagnostic guidelines were released [45], and 20 January. Delays between key epidemiological timings (e.g. infection, symptom onset and detection) can also shift the shape of an observed epidemic curve and, therefore, affect parameter estimates as well as predictions of the course of an outbreak [46]. Even though a time-invariant delay between infection and detection may not affect the estimate of the growth rate, it can still affect the associated credible intervals. Other factors related to reporting-including changes in case definition, saturation in diagnostic test capacity, transparency of data, and representativeness of samples-will also affect estimation and inference. Finally, generation intervals can become shorter throughout an epidemic, as intervention strategies such as isolation of detected cases can reduce the infectious period [47]; since we are primarily focusing on the outbreak in Wuhan City before confinement, generation intervals are unlikely to change significantly. All of these factors, including fitting to cumulative curves or ignoring process error, affect the estimation of the exponential growth rate (as well as the associated uncertainties), which in turn affects the estimation of the basic reproductive number. Emergence of a new strain with different transmissibility could also affect disease dynamics, and complicate inference; this study does not address this possibility.
Here, we focus on the estimates of R 0 that are published within a very short time frame (23-26 January 2020). Since these estimates were published as pre-prints, rather than in peer-reviewed journals, the quality of the analyses as well as the resulting estimates were not necessarily finalized. For example, study 4 initially estimated R 0 ¼ 3:8 (95% CI: 3.6-4.0; Read et al. [9]) but revised their estimate on 28 January 2020 to R 0 ¼ 3:11 (95% CI: 2.39-4.13; Read et al. [33]); we do not include their revised estimates in our analysis in order to focus on information available at the very beginning of the outbreak. Some studies also lack detailed description of their methods, data, and/or assumptions. The variation in quality of these analyses adds further uncertainty to their results that is not captured by their uncertainty quantification (e.g. reported confidence/credible intervals) or by our analysis.
During early phases of an outbreak, it is reasonable to assume that the epidemic grows exponentially [15]. However, as the number of susceptible individuals decreases or behaviour changes in response to perception of the epidemic, the growth rate will decrease: estimates of r used for R 0 should account for the possibility that r is decreasing through time. Although our analysis applies strictly to the earliest stages of an epidemic, we expect certain lessons to hold more generally: confidence/credible intervals must combine as many sources of uncertainty as possible. In fact, as epidemics progress and more data become available, it is likely that inferences about exponential growth rate (and other epidemiological parameters) will generally become more precise; thus the risk of over-confidence (when uncertainty about the generation-interval distribution is neglected) will become greater. Incorporating estimates of the dynamics of susceptibility (e.g. using properly calibrated serological studies [48]) is also important for characterizing transmission as the outbreak progresses.
royalsocietypublishing.org/journal/rsif J. R. Soc. Interface 17: 20200144 We strongly emphasize the value of attention to accurate characterization of the transmission chains via both contact tracing and improved statistical frameworks for inferring generation-interval distributions from such data [49]. A combined effort between public-health workers and modellers in this direction is crucial both for predicting the course of an epidemic and for controlling it. We also emphasize the value of transparency from modellers. Model estimates during an outbreak, even in pre-prints, should include code links and complete explanations. Methods based on open-source tools allow for maximal reproducibility [50].
Despite our focus on estimating R 0 at the onset of an outbreak, many of the issues persist now. For example, Flaxman et al. [51] recently estimated the basic reproductive number for SARS-CoV-2 outbreaks in 11 European countries to be around 3.8 (2.4-5.6), on average. While these estimates appear to be broadly consistent with earlier estimates from China, comparing the exponential growth rate and the underlying generation-interval distributions suggest otherwise. The later paper assumes a shorter mean generation interval ( G ¼ 6:5 d) but similar generation-interval dispersion (κ = 0.38); based on these values, the exponential growth rate has to be considerably higher (r = 0.27 d −1 ) to obtain R 0 ¼ 3:8 than the exponential growth rate observed in China (μ r = 0.17 d −1 ; 95% CI: 0:12-0:25 d À1 ). Naively comparing estimates of the basic reproductive number without accounting for differences in underlying assumptions can lead to over-interpretation of apparent differences in the estimates.
We have provided a basis for comparing exponentialgrowth based estimates of R 0 and its associated uncertainty in terms of three components: the exponential growth rate, mean generation interval and generation interval dispersion. We hope this framework will help researchers understand and reconcile disparate estimates of disease transmission early in an epidemic.