Main

The instantaneous reproduction number, denoted Rt at time t, is an important and popular temporal measure of the transmissibility of an unfolding infectious disease epidemic1. This parameter defines the average number of secondary infections generated by a primary one at t, providing a critical threshold for delineating growing epidemics (Rt > 1) from those likely to become controlled (Rt < 1). Estimates of Rt derived from surveillance data are widely used to evaluate the efficacies of interventions1,2 (for example, lockdowns), forecast upcoming disease burden3,4 (for example, hospitalizations), inform policy-making5 and improve public health awareness6.

The reliability of these estimates depends fundamentally on the quality and timeliness of the surveillance data available. Practical epidemic monitoring is subject to various errors or imperfections that can obscure or bias inferred transmission dynamics7. Prime among these are under-reporting and reporting delays, which can scale and smear Rt estimates, potentially misinforming public health authorities8,9. Under-reporting causes some fraction of infections to never be reported, while delays redistribute reports of infections incorrectly across time. The ideal data source for estimating Rt is the time series of new or incident infections, It.

Unfortunately, infections are difficult to observe directly and proxies such as reported cases, deaths, hospitalizations, prevalence and viral surveys from wastewater must be used to gauge epidemic transmissibility5,10. Each of these data streams provides a noisy approximation to the unknown It but with distinct and important relative advantages. We focus on the most popular ones: the epidemic curve of reported cases Ct at t, and that of death counts Dt, and investigate how their innate noise sources differentially limit Rt inference quality.

Ct records the most routinely available data, that is, counts of new cases11, but is limited by delays and under-reporting. Ascertainment delays smear or reorder the case incidence and may emerge from fixed surveillance capacities, weekend effects and lags in diagnosing symptomatic patients (for example, the time from infection to a positive test)8,12. Delays may be classed as occurred but not yet reported (OBNR), when source times of delayed cases eventually become known (these delays cause right censoring of the case counts), or what we term never reported (NEVR), when source times of past cases are never uncovered13,14,15.

Case under-reporting or underascertainment strongly distorts the true, but unknown, infection incidence curve, altering its size and shape9,16. Temporal fluctuations in testing intensity, behavior-based reporting (for example, by severity of symptoms)17, undetected asymptomatic carriers and other surveillance bottlenecks can cause underascertainment or inconsistent reporting18,19. Constant reporting (CONR) describes the situation when the case detection fraction or probability is stable. We term the more realistic scenario, in which this probability varies appreciably with time, variable reporting (VARR).

Dt counts newly reported deaths attributable to the pathogen being studied and is also subject to under-reporting and reporting delays, but with two main differences10. First, death reporting delays incorporate an extra lag for the intrinsic time it takes an infection to culminate in mortality (this also subsumes hospitalization periods). Second, apart from the under-reporting fraction of deaths, there is another scaling factor known as the infection–fatality ratio (ifr), which defines the proportion of infections that result in mortality1,20. We visualize how the noise types underlying case and death curves distort infection incidence in Fig. 1.

Fig. 1: Under-reporting and delayed reporting noise.

We simulate true infection incidence It over t (in days) from a renewal model (equation (9) with Ebola virus dynamics) with reproduction number Rt that switches from supercritical to subcritical spread due to an intervention. a, Under-reported case curves (50 realizations, various colors) with reporting fractions sampled from the distribution P(ρt) for sample fraction ρt plotted in the inset (mean sample fraction is \(\bar{\rho }\)). b, Delays in case reports (50 realizations, various colors) from the distribution P(δx) for delay δx (in days) plotted in the left-hand inset (mean delay is \(\bar{\delta }\)). We also provide the true Rt as the right-hand inset (red). The main question of this study is how we quantify which of scenarios a or b incurs the larger loss of the information originally available from It, ideally without simulation.


Although the influences of surveillance latencies and underascertainment fractions on key parameters, such as Rt, are known8,19,21,22 and much ongoing work attempts to compensate for these noise sources10,23,24,25, there exists no formal framework for assessing and exposing how they inherently limit information available for estimating epidemic dynamics. Most studies utilize simulation-based approaches (with some exceptions, for example, refs. 9,22) to characterize surveillance defects. While invaluable, these preclude generalizable insights into how epidemic monitoring shapes parameter inference.

Here we develop one such analytic framework for quantifying the information within epidemic data. Using Fisher information theory we derive a measure of how much usable information an epidemic time series contains for inferring Rt at every time. This yields metrics for cross-comparing different types of surveillance time series, as we are able to explicitly quantify how under-reporting (both CONR and VARR) and reporting delays (exactly for OBNR with a tight upper bound for NEVR) degrade available information. As this metric only depends on the properties of surveillance (and not Rt or It), we extract simulation-agnostic insights into which types of surveillance noise are the least and most detrimental.

We prove for constrained mean reporting fractions and mean delays that it is preferable to minimize variability among reporting fractions but to maximize the variability of the reporting delay distribution such that a minority of infections face large delays but the majority possess short lags to notification. This proceeds from standard experimental design theory applied to our metric, which shows that the information embedded within an epidemic curve depends on the product of the geometric means of the reporting fractions and cumulative delay probabilities corrupting this curve. This central result also provides a non-dimensional score for summarizing and ranking the reliability of (or uncertainty within) different surveillance data for inferring pathogen transmissibility.

Finally, we apply this framework to explore and critique a common claim in the literature, which asserts that death curves are more robust for inferring transmissibility than case curves. This claim is usually made for acute infectious diseases such as COVID-19 and pandemic influenza1,20, where cases are severely under-reported, with symptom-based fluctuations in reporting. In such settings it seems plausible to reason that deaths are less likely undercounted and more reliable for Rt inference. However, we compute our metrics using COVID-19 reporting rate estimates18,26 and discover few instances in which death curves are definitively more informative or reliable than case counts.

While this may not rule out the possibility of having a more reliable death time series, it elucidates and exposes how the different noise terms within the two data sources corrupt information and presents a methodology for exploring these types of questions more precisely. We illustrate how to compute our metrics practically using empirical COVID-19 and Ebola virus disease (EVD) noise distributions and outline how other common data such as hospitalizations, prevalence and wastewater virus surveys conform to our framework. We hope that the tools developed here will improve the quantification of noise and information and highlight key areas where enhanced surveillance strategies can maximize impact.

Results

Methods overview

We summarize the salient points from Methods and outline the main arguments that underpin all subsequent Results sections. Our analysis is centered on the renewal model27, which is widely applied to describe the dynamics of epidemics of COVID-19, EVD, influenza, dengue, measles and many others21. This model posits that each It is Poisson (Pois) distributed with a mean that is the product of Rt and the total infectiousness (Λt). Here Λt defines how past infections engender new ones on the basis of w, the generation time distribution of the pathogen. In equation (9) and Supplementary Table 1 we provide precise definitions of these variables (as well as others later in the text).

An important problem in infectious disease epidemiology is the estimation of Rt across the duration of an epidemic5. However, as infections cannot be observed, we commonly have to infer Rt from noisy proxies such as the time series of reported cases or deaths. These can be described by generalized renewal models that include terms for practical noise sources such as under-reporting and delays in reporting28. We define these models in equations (1) and (2) and detail the properties of various noise sources in Methods. Our aim is to understand and quantify how much information for inferring Rt, as a fraction of what would be available if infections were observable, can be extracted from these proxies.

We pursue this aim by adapting concepts from statistical theory and information geometry. We first construct the log-likelihood function of the parameter time series \({R}_{1}^{\tau }:=\{{R}_{t}:1\le t\le \tau \}\), with τ as the present or last observation time and t scaled in units (for example, weeks) so that each Rt can be assumed independent. This function is \(\ell ({R}_{1}^{\tau })=\mathop{\sum }\nolimits_{t = 1}^{\tau }\ell ({R}_{t})\) with \(\ell ({R}_{t}):=\log {\mathbb{P}}({I}_{t}\,| \,{R}_{t})\) computed from the Poisson distribution of the renewal model. Equation (10) results and admits the maximum likelihood estimates (MLEs) \({\hat{R}}_{t}\) for all t as its maxima. The reliability of these MLEs is characterized by the Fisher information (FI) of \({R}_{1}^{\tau }\) from the time series or curve of incident infections \({I}_{1}^{\tau }:=\{{I}_{t}:1\le t\le \tau \}\). We also often compute FI values under the robust transform \({{{{\mathcal{R}}}}}_{t}=2\sqrt{{R}_{t}}\), which has useful statistical properties29.

Larger FI values imply smaller asymptotic uncertainty around the MLEs30. We obtain \({{\mathbb{F}}}_{I}({R}_{t})\), the FI of Rt, in equation (11) by evaluating the average curvature of the log-likelihood function. We then formulate the total information, \({\mathbb{T}}({I}_{1}^{\tau })\), as a product of \({{\mathbb{F}}}_{I}({R}_{t})\) terms across t in equation (12). This follows from the independence of the Rt variables and is a measure of the reliability of the infection time series. It also delimits the maximum possible precision around the MLEs of Rt for any time series. Since \({I}_{1}^{\tau }\) is often unobservable, \({\mathbb{T}}({I}_{1}^{\tau })\) is generally not computable and is a theoretical maximum. However, our subsequent results circumvent this issue.

Using the models of equations (1) and (2), we employ this same recipe of constructing a log-likelihood and computing MLEs and FI values, but now for practical time series or data streams that are corrupted by under-reporting and delays. This yields equations (13)–(18), which contain the ingredients for deriving the total information in case, death and any other incidence data that is related to \({I}_{1}^{\tau }\) via a generalized renewal model (this includes prevalence, hospitalizations and virus abundance found in wastewater). We derive a key result for \({\mathbb{T}}({C}_{1}^{\tau })\) in equation (3), showing exactly how case data \({C}_{1}^{\tau }:=\{{C}_{t}:1\le t\le \tau \}\) cause a loss in Rt estimate reliability.

Building on this expression we develop metrics \(\eta ({C}_{1}^{\tau })\) and \(\theta ({C}_{1}^{\tau })\) in equations (4) and (5), which effectively quantify \(\frac{{\mathbb{T}}({C}_{1}^{\tau })}{{\mathbb{T}}({I}_{1}^{\tau })}\), that is, the level of informativeness of case data relative to true infections. The smaller these metrics are, the more information is lost due to surveillance noise. Importantly, these metrics are analytic, require no knowledge of \({I}_{1}^{\tau }\) or the generation time distribution (both are difficult to observe) and are interpretable, since each noise type contributes a separate geometric mean term. Further, they play an integral role in defining the statistical complexity of the generalized renewal model describing that time series, as we find in equation (6).

Repeating the above recipe we derive analogous metrics for death data \({D}_{1}^{\tau }:=\{{D}_{t}:1\le t\le \tau \}\), by characterizing the ratio \(\frac{{\mathbb{T}}({D}_{1}^{\tau })}{{\mathbb{T}}({I}_{1}^{\tau })}\) in equations (7) and (8). We can similarly compute ratios for hospitalizations, prevalence and viral wastewater data by inputting appropriate delay and under-reporting terms. We complete our results by including empirical estimates of case and death noise sources within our framework to compare \(\frac{{\mathbb{T}}({C}_{1}^{\tau })}{{\mathbb{T}}({I}_{1}^{\tau })}\) and \(\frac{{\mathbb{T}}({D}_{1}^{\tau })}{{\mathbb{T}}({I}_{1}^{\tau })}\) for COVID-19 and EVD and hence determine whether case or death data are likely more reliable for inferring \({R}_{1}^{\tau }\).

Renewal models with noisy observations

We denote the empirically observed or reported number of cases at time step or unit t, subject to noise from both under-reporting and reporting delays, as Ct with \({C}_{1}^{\tau }\) as the epidemic case curve. This curve is obtained from routine outbreak surveillance and is a corrupted version of the true incidence \({I}_{1}^{\tau }\) (ref. 10), modeled by equation (9). These noise sources (see Methods for statistical descriptions) are parametrized by reporting fractions \({\rho }_{1}^{\tau }:=\{{\rho }_{t}:1\le t\le \tau \}\) and a delay distribution δ := {δx: x ≥ 0}. Here ρt is the fraction of infections reported as cases at t and δx the probability of a lag from infection time to case report of x units.

We assume that these noise sources are estimated from auxiliary line-list or contact tracing data12,31. As a result, we can construct equation (1) as in ref. 25 (Methods). Note that if noise source estimates are unavailable then \({R}_{1}^{\tau }\) becomes statistically non-identifiable or ill-defined.

$${C}_{t} \sim {{{\rm{Pois}}}}\left(\mathop{\sum }\limits_{x=1}^{t}{\delta }_{t-x}{\rho }_{x}{\varLambda }_{x}{R}_{x}\right).$$
(1)

This noisy renewal model suggests that Ct (unlike It) contains partial information about the entire time series of reproduction numbers for x ≤ t as mediated by the delay and reporting probabilities. Perfect reporting corresponds to ρx = 1 for all x, δ0 = 1 (with all other δx = 0) and means Ct → It. Simpler models featuring only one of these noise sources (see Methods) are obtained by individually removing the other from equation (1).
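
To make equation (1) concrete, the short Python sketch below first simulates a true incidence curve from the renewal model of equation (9) and then draws a delayed, under-reported case curve from equation (1). It is a minimal illustration under assumed inputs (the generation time pmf, Rt profile, reporting fraction and delay pmf are hypothetical placeholders), not the released analysis code.

```python
import numpy as np

rng = np.random.default_rng(1)

def renewal_incidence(R, w, I0=10):
    """Simulate true infections I_t ~ Pois(Lambda_t * R_t), as in equation (9)."""
    tau = len(R)
    I = np.zeros(tau, dtype=int)
    I[0] = I0
    for t in range(1, tau):
        lam = np.dot(w[:t][::-1], I[:t])   # Lambda_t = sum_{x<t} w_{t-x} I_x (w[0] holds w_1)
        I[t] = rng.poisson(lam * R[t])
    return I

def noisy_reports(I, R, w, rho, delta):
    """Reported cases C_t ~ Pois(sum_x delta_{t-x} rho_x Lambda_x R_x), as in equation (1)."""
    tau = len(I)
    Lam = np.array([np.dot(w[:t][::-1], I[:t]) if t > 0 else 0.0 for t in range(tau)])
    mean_C = np.zeros(tau)
    for t in range(tau):
        for x in range(t + 1):
            mean_C[t] += delta[t - x] * rho[x] * Lam[x] * R[x]
    return rng.poisson(mean_C)

# Hypothetical inputs: exponential-tailed generation times and an intervention at day 40
tau = 80
w = np.exp(-0.5 * np.arange(1, tau + 1))
w /= w.sum()                                      # generation time pmf w_1, w_2, ...
R = np.where(np.arange(tau) < 40, 1.5, 0.7)       # R_t drops after control measures
rho = np.full(tau, 0.4)                           # CONR scheme with 40% reporting
delta = np.exp(-0.3 * np.arange(tau))
delta /= delta.sum()                              # reporting delay pmf delta_0, delta_1, ...

I = renewal_incidence(R, w)
C = noisy_reports(I, R, w, rho, delta)
print(I.sum(), C.sum())
```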

Other practical epidemic surveillance data such as the time series of new deaths or hospitalizations conform to the framework in equation (1) either directly or with additional effective delay and under-reporting stages20. The main one we investigate here is the count of new deaths (due to infections) across time, which we denote \({D}_{1}^{\tau }\). The death curve involves a reporting delay that includes the intrinsic lag from infection to death. We let γ := {γx: x ≥ 0} represent the distribution of lag from infection to observed death and \({\sigma }_{1}^{\tau }:=\{{\sigma }_{t}:1\le t\le \tau \}\) be the fraction of deaths that is reported.

An important additional component when describing the chain from \({I}_{1}^{\tau }\) to \({D}_{1}^{\tau }\) is ifrt, which is the probability at t that an infection culminates in a death event10. Fusing these components yields the following model for Dt.

$${D}_{t} \sim {{{\rm{Pois}}}}\left(\mathop{\sum }\limits_{x=1}^{t}{{{{\rm{ifr}}}}}_{x}{\gamma }_{t-x}{\sigma }_{x}{\varLambda }_{x}{R}_{x}\right).$$
(2)

In a later section we explain how analogs of equations (1) and (2) also fit other data streams such as hospitalizations and prevalence. Some studies1,32 replace this Pois formulation with a negative binomial (NB) distribution to model extra variance in these data. In Supplementary Notes we show that this does not disrupt our subsequent results on the relative informativeness of surveillance data, but the NB formulation is less tractable and unsuitable for extracting generalizable, simulation-free insights.
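
A corresponding sketch of the Poisson death model in equation (2); here a hypothetical ifr and death reporting fraction play the role of the case reporting terms, γ supplies the infection-to-death delay, and ΛtRt is a placeholder array rather than a simulated epidemic.

```python
import numpy as np

rng = np.random.default_rng(4)

# Equation (2): deaths are a thinned, delayed convolution of Lambda_t * R_t.
# All inputs are hypothetical placeholders standing in for real epidemic estimates.
tau = 80
lam_R = np.linspace(50.0, 300.0, tau)                     # Lambda_t * R_t, e.g. from a renewal simulation
ifr_t = np.full(tau, 0.01)                                # infection-fatality ratio
sigma_t = np.full(tau, 0.9)                               # fraction of deaths that are reported
gamma = np.exp(-0.1 * np.arange(tau))
gamma /= gamma.sum()                                      # infection-to-death delay pmf gamma_0, gamma_1, ...

mean_D = np.array([np.sum(gamma[t - np.arange(t + 1)] * ifr_t[:t + 1]
                          * sigma_t[:t + 1] * lam_R[:t + 1]) for t in range(tau)])
D = rng.poisson(mean_D)
print(D.sum())
```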

Reliability measures for surveillance data

We analyze the information in the generalized renewal models of equations (1) and (2) by computing the FI for each Rt or its transform \({{{{\mathcal{R}}}}}_{t}=2\sqrt{{R}_{t}}\) (see Methods for details). This is denoted \({{\mathbb{F}}}_{C}(.)\) for case data, and as derived in Methods (equation (15)) allows us to obtain the total information, \({\mathbb{T}}({C}_{1}^{\tau })\), contained in those data about the parameter time series \({R}_{1}^{\tau }\) or \({{{{\mathcal{R}}}}}_{1}^{\tau }\) as a product across t of \({{\mathbb{F}}}_{C}(.)\) terms. This total information, \({\mathbb{T}}({C}_{1}^{\tau })\), relates inversely to the smallest joint uncertainty around unbiased estimates of all our parameters33. As larger \({\mathbb{T}}({C}_{1}^{\tau })\) implies reduced overall uncertainty, this is a rigorous measure of the statistical reliability of noisy data sources for inferring pathogen transmissibility.

We first consider the OBNR delay case under VARR reporting rates. Since the FI matrix under OBNR delays is diagonal, with each element given by equation (15), we can adapt equation (12) to derive

$${\mathbb{T}}({C}_{1}^{\tau })=\mathop{\prod }\limits_{t=1}^{\tau }\sqrt{{{\mathbb{F}}}_{C}({{{{\mathcal{R}}}}}_{t})}=\mathop{\prod }\limits_{t=1}^{\tau }\sqrt{{\rho }_{t}{F}_{\tau -t}{\varLambda }_{t}}.$$
(3)

Here we have applied the \({{{{\mathcal{R}}}}}_{t}=2\sqrt{{R}_{t}}\) transformation to show that the total information in this noisy stream can be obtained without knowing Rt. In the absence of this transform we would have \(\mathop{\prod }\nolimits_{t = 1}^{\tau }\sqrt{{\rho }_{t}{F}_{\tau -t}{\varLambda }_{t}{R}_{t}^{-1}}\).

Since \({C}_{1}^{\tau }\) is a distortion of the true infection incidence \({I}_{1}^{\tau }\) we normalize equation (3) by equation (12) to develop a reliability metric, \(\eta ({C}_{1}^{\tau }):={\mathbb{T}}({C}_{1}^{\tau }){\mathbb{T}}{({I}_{1}^{\tau })}^{-1}\). This is given in the following equation and valid under both Rt and \({{{{\mathcal{R}}}}}_{t}\).

$$0\le \eta ({C}_{1}^{\tau })=\mathop{\prod }\limits_{t=1}^{\tau }\sqrt{{\rho }_{t}{F}_{\tau -t}}\le 1.$$
(4)

We can relate this reliability measure to a fixed, effective reporting fraction, \(\theta ({C}_{1}^{\tau })\), which causes an equivalent information loss. Applying equation (4), we obtain \(\eta ({C}_{1}^{\tau })=\sqrt{\theta {({C}_{1}^{\tau })}^{\tau }}\), which yields the following equation. Here, \({\mathbb{G}}(.)\) indicates the geometric mean of its arguments over 1 ≤ t ≤ τ.

$$\theta ({C}_{1}^{\tau })=\mathop{\prod }\limits_{t=1}^{\tau }\sqrt[\tau ]{{\rho }_{t}{F}_{\tau -t}}={\mathbb{G}}({\rho }_{t}){\mathbb{G}}({F}_{\tau -t}).$$
(5)

Equation (5) is a central result of this work. It states that the total information content of a noisy epidemic curve is independently modulated by the geometric mean of its reporting fractions, \({\mathbb{G}}({\rho }_{t})\), and that of its cumulative delay probabilities, \({\mathbb{G}}({F}_{\tau -t})\). Moreover, equation (5) provides a framework for gaining analytic insights into the separate influences of both noise sources from different surveillance data and for ranking the overall quality of these diverse data. For example, from the properties of geometric means, we know that \({\mathbb{G}}(.)\) is bounded by the smallest and largest noise terms across t. Importantly, equation (5) has no dependence on Λt, which is generally unknown and sensitive to difficult-to-infer changes in the generation time distribution34.
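
Because equation (5) depends only on the noise probabilities, it can be computed directly from the reporting fractions and the delay pmf. A minimal sketch with hypothetical noise values (the function name is ours):

```python
import numpy as np

def theta_metric(rho, delay_pmf):
    """Effective reporting fraction from equation (5): the product of the geometric
    means of the rho_t and of the cumulative delay probabilities F_{tau-t}."""
    rho = np.asarray(rho, dtype=float)
    tau = len(rho)
    F = np.cumsum(delay_pmf)              # F_j = sum_{x<=j} delta_x
    F_terms = F[tau - 1::-1]              # F_{tau-t} for t = 1, ..., tau
    geo = lambda v: np.exp(np.mean(np.log(v)))
    return geo(rho) * geo(F_terms)

# Hypothetical noise: 30% reporting throughout and a geometric reporting delay
tau = 60
rho = np.full(tau, 0.3)
delta = 0.25 * 0.75 ** np.arange(200)     # Geo(0.25) delay pmf delta_0, delta_1, ...
print(round(theta_metric(rho, delta), 3))
```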

Equation (5) applies to OBNR delays exactly and upper bounds the reliability of data streams with NEVR delays (see Methods for derivation). Tractable results for NEVR delays are not possible and would necessitate numerical computation of Hessian matrices of \(-\log {\mathbb{P}}({C}_{1}^{\tau }\,| \,{R}_{1}^{\tau })\) (we outline the log-likelihoods and other equations for a τ = 3 example in Supplementary Notes). However, we find that the equation (5) upper bound is tight for two elementary settings. The first is under a constant or deterministic delay of d, that is, \({\delta }_{x = d}=1\). Equation (1) reduces to \({C}_{t} \sim {{{\rm{Pois}}}}\left({\rho }_{t-d}{\varLambda }_{t-d}{R}_{t-d}\right)\). As each Ct only informs on \({R}_{t-d}\), OBNR and NEVR delays are the same and are corrected by truncation. Degenerate delays such as these can serve as useful elements for constructing complex distributions35.

The second occurs when transmissibility is constant or stable, that is, Rt = R for all t. This applies to inferring the basic reproduction number (R0) during initial phases of an outbreak5. We can sum equation (15) to obtain \({{\mathbb{F}}}_{C}(R)=\mathop{\sum }\nolimits_{t = 1}^{\tau }{\rho }_{t}{F}_{\tau -t}{\varLambda }_{t}{R}^{-1}\) for OBNR delays. We can calculate the FI for NEVR delays from equation (16), which admits a derivative \(\frac{\partial \ell (R)}{\partial R}=\mathop{\sum }\nolimits_{t = 1}^{\tau }\left({C}_{t}{R}^{-1}-{F}_{\tau -t}{\rho }_{t}{\varLambda }_{t}\right)\) and hence an FI and MLE that are precisely equal to those for OBNR delays (derived in Methods). This proves a convergence in the impact of two fundamentally different delay noise sources and emphasizes that noise has to be contextualized with the complexity of the signal to be inferred. Simpler signals, such as a stationary R that remains robust to the shifts and reordering of \({I}_{1}^{\tau }\) caused by delays, may be notably less susceptible to fluctuations in noise probabilities.
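
A brief numerical illustration of this constant-R setting under hypothetical placeholder inputs; setting the derivative above to zero gives the MLE \(\hat{R}=\mathop{\sum }\nolimits_{t}{C}_{t}/\mathop{\sum }\nolimits_{t}{F}_{\tau -t}{\rho }_{t}{\varLambda }_{t}\), which, like the FI, coincides for OBNR and NEVR delays.

```python
import numpy as np

rng = np.random.default_rng(2)

# Constant transmissibility R: closed-form MLE and FI under hypothetical inputs
# (Lambda_t would normally be computed from the observed epidemic and w).
tau = 50
rho = np.full(tau, 0.3)                          # reporting fractions rho_t
F = 1 - 0.8 ** np.arange(tau, 0, -1)             # F_{tau-t} for a geometric delay, t = 1..tau
Lam = np.linspace(20.0, 200.0, tau)              # placeholder total infectiousness Lambda_t
R_true = 1.4

C = rng.poisson(rho * F * Lam * R_true)          # case counts sourced at t and seen by tau (OBNR)

R_hat = C.sum() / (rho * F * Lam).sum()          # MLE, identical for OBNR and NEVR delays
FI = (rho * F * Lam).sum() / R_hat               # F_C(R) evaluated at the MLE
half = 1.96 / np.sqrt(FI)
print(f"R_hat = {R_hat:.3f}, 95% CI approx ({R_hat - half:.3f}, {R_hat + half:.3f})")
```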

Ranking noise sources by their information loss

The metric proposed in equation (5) provides a general framework for scoring proxies of incidence (for example, epidemic case curves, death counts, hospitalizations and others) using only their noise probabilities and without the need for simulations. We explore the implications of equation (5) for both understanding noise and ranking these proxies. The geometric mean decomposition allows us to separately dissect the influences of under-reporting and delays. We start by applying experimental design theory36,37 to characterize the best and worst noise types for inferring effective reproduction numbers.

We consider \({\mathbb{G}}({\rho }_{t})\), the geometric mean of the reporting probabilities across time. If we assume the mean sampling fraction \(\bar{\rho }=\frac{1}{\tau }\mathop{\sum }\nolimits_{t = 1}^{\tau }{\rho }_{t}\) is fixed (for example, by some overall surveillance capacity) then we immediately know from design theory that \(\bar{\rho }=\arg \mathop{\max }\limits_{{{{\boldsymbol{\rho }}}}}{\mathbb{G}}({\rho }_{t})\). This means that of all the possible distributions of sampling fractions fitting that constraint, ρ, CONR with probability \(\bar{\rho }\) is the most informative38. This result supports earlier studies recognizing that CONR is preferred to VARR, although they investigate estimator bias and not information loss9,21.

Accordingly, we also discover that the worst sampling distribution is maximally variable. This involves setting ρt ~ 1 for some time subset \({\mathbb{S}}\) such that \({\sum }_{t\in {\mathbb{S}}}{\rho }_{t}=\tau \bar{\rho }\) with all other ρt ~ 0 (we use approximate signs as we assume non-zero sampling probabilities). Relaxing this constraint, equation (5) presents a framework for comparing different reporting protocols. We demonstrate these ideas in Fig. 2, where ρt ~ Beta(a, b), that is, each reporting fraction is a sample from a Beta distribution. Reporting protocols differ in (a, b) choices. We select \({10}^{4}\) ρt samples from each of 2,000 distributions with \({10}^{-1}\le b\le {10}^{2}\) and a computed to fulfill the mean constraint \(\bar{\rho }\). Variations in the resulting \(\theta ({C}_{1}^{\tau })\) metrics indicate the influence of reporting fraction uncertainties under this mean.
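
The Fig. 2 experiment can be sketched as follows (a simplified reconstruction under our assumptions, not the released code): for each shape parameter b we choose a so that the Beta mean equals \(\bar{\rho }\), sample reporting fractions and record their variance and geometric mean, which (up to the delay term) sets \(\theta ({C}_{1}^{\tau })\).

```python
import numpy as np

rng = np.random.default_rng(3)
rho_bar, n_samples = 0.4, 10_000
geo = lambda v: np.exp(np.mean(np.log(v)))

results = []
for b in np.logspace(-1, 2, 50):          # Beta(a, b) shape parameter, 10^-1 <= b <= 10^2
    a = rho_bar * b / (1 - rho_bar)       # sets the Beta mean a/(a + b) to rho_bar
    rho_t = rng.beta(a, b, size=n_samples)
    results.append((np.var(rho_t), geo(rho_t)))

# G(rho_t), and hence theta, falls as the variance of the reporting fractions grows;
# the CONR limit G(rho_t) = rho_bar is the ceiling.
for var_rho, g in results[::12]:
    print(f"var(rho_t) = {var_rho:.3f}   G(rho_t) = {g:.3f}   (ceiling {rho_bar})")
```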

Fig. 2: The information loss in under-reporting.

We investigate the effective information metric (\(\theta ({C}_{1}^{\tau })\)) for VARR strategies with reporting fraction ρt drawn from Beta distributions with different shapes. Smaller values of \(\theta ({C}_{1}^{\tau })\) indicate more substantial information losses. a, Changes in \(\theta ({C}_{1}^{\tau })\) with the mean reporting probability (\(\bar{\rho }\)) and its variance var(ρt) (inset, where each color indicates the various schemes with a given \(\bar{\rho }\)). The gray line (dashed) is the optimal CONR protocol. b, The Beta sampling distributions and their resulting var(ρt) and \(\theta ({C}_{1}^{\tau })\) (inset). The most and least variable reporting strategies for a given \(\bar{\rho }\) are in blue and red, respectively.


Figure 2a shows that \(\theta ({C}_{1}^{\tau })\) generally increases with the mean reporting probability \(\bar{\rho }\). However, this improvement can be denatured by the variance, var(ρt), of the reporting scheme (inset, where each color indicates the various schemes with a given \(\bar{\rho }\)). The CONR scheme is outlined with a gray line (dashed), and as derived is the most informative. Figure 2b confirms our theoretical intuition on how var(ρt) reduces total information, with the extreme (worst) sampling scheme outlined above in blue and the most stable protocol in red. There are many ways to construct ρt protocols. We chose Beta distributions because they can express diverse reporting probability shapes using only two parameters.

Similarly, we investigate reporting delays via \({\mathbb{G}}({F}_{\tau -t})\), the geometric mean of the cumulative delay or latency distribution across time. Applying a mean delay constraint \(\bar{\delta }={\sum }_{x\ge 0}x{\delta }_{x}=\mathop{\sum }\nolimits_{t = 1}^{\tau }(1-{F}_{\tau -t})\) (for example, reflecting operational limits on the speed of case notification), we adapt experimental design principles. As we effectively maximize an FI determinant (see derivation of equation (5)) our results are termed D optimal38. These suggest that \(\mathop{\max }\limits_{{{{\boldsymbol{\delta }}}}}{\mathbb{G}}({F}_{\tau -t})\) is achieved by cumulative distributions with the most uniform shape. These possess the largest δ0 within this constraint. Delay distributions with substantial dispersion (for example, heavy tails) attain this optimum while fixed delays (where \({\delta }_{x\sim \bar{\delta }}=1\) and 0 otherwise) lead to the largest information loss under this constraint.

This may seem counterintuitive, as deterministic delays best preserve information outside of that delay and can be treated by truncating the observed epidemic time series: for example, for a fixed weekly lag we can ignore the last week of data. However, this causes a bottleneck. No information is available for that truncated week, eliminating any possibility of timely inference (and making epidemic control difficult39). In contrast, a maximally dispersed delay distribution slightly lags the majority of cases, achieving the mean constraint with large latencies on a few cases. This ensures that, overall, we gain more actionable information about the time series.

We illustrate this point in Fig. 3, where we verify the usefulness of equation (5) as a framework for comparing the information loss induced by delay distributions of various shapes and forms. We model δ as \({{{\rm{NB}}}}(k,\,\frac{\bar{\delta }}{\bar{\delta }+k})\), with k describing the dispersion of the delay. Figure 3a demonstrates how our \(\theta ({C}_{1}^{\tau })\) metric varies with k (30 values taken between \({10}^{-1}\) and \({10}^{2}\)) at various fixed mean constraints (\(3\le \bar{\delta }\le 30\), each given as a separate color). In line with the theory, we find that decreasing k (increasing dispersion of the delay distribution) improves information at any given \(\bar{\delta }\).
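
The delay side of this experiment reduces to computing \({\mathbb{G}}({F}_{\tau -t})\) for NB delay pmfs with a fixed mean. A short sketch; note that scipy's NB parametrization differs from the one quoted above, so we match on the mean instead.

```python
import numpy as np
from scipy.stats import nbinom

tau, delta_bar = 100, 10.0
geo = lambda v: np.exp(np.mean(np.log(v)))

for k in [0.1, 0.5, 1.0, 5.0, 50.0]:          # smaller k = more dispersed delay distribution
    p = k / (k + delta_bar)                   # scipy's NB(k, p) then has mean k(1 - p)/p = delta_bar
    F = nbinom.cdf(np.arange(tau), k, p)      # F_j = P(delay <= j)
    print(f"k = {k:5.1f}   G(F_terms) = {geo(F[::-1]):.3f}")
```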

Fig. 3: The information loss from delays.

We compute \(\theta ({C}_{1}^{\tau })\) for delay distributions that are NB with various dispersions k. Smaller values of \(\theta ({C}_{1}^{\tau })\) indicate more substantial information losses, while smaller k indicates a more dispersed or variable reporting delay distribution. a, Influence of mean delay (\(\bar{\delta }\)) and k on \(\theta ({C}_{1}^{\tau })\), with colored curves representing different \(\bar{\delta }\) and the black dashed line as a reference value. Inset: variations in \(\theta ({C}_{1}^{\tau })\) at a given \(\bar{\delta }\) (matching colors) due to the dispersion k of the reporting delay distribution. b, Influence of the shape of the cumulative delay distributions, \({F}_{\tau -t}\), at different k (increasing from blue to red) on our metric \(\theta ({C}_{1}^{\tau })\) (inset with corresponding colors).


The importance of both the shape and mean of reporting delays is indicated in the inset as well as by the number of distributions (seen as intersects of the dashed black line) that result in the same \(\theta ({C}_{1}^{\tau })\). Figure 3b plots corresponding cumulative delay probability distributions, validating our assertion from design theory that the best delays (blue, with metric in inset) are dispersed, forcing the cumulative probability of reporting delays up to τ − t time units (\({F}_{\tau -t}\)) to be high very early on (maximizing δ0 and leading to the most uniform shape). In contrast, the worst delay distributions are more deterministic (red, larger k). These curves are for OBNR delays and upper bound the performance expected from NEVR delays except for the settings described in the previous section, where the two types coincide.

Comparing different epidemic data streams

Our metric (equation (5)) not only allows the comparison of different under-reporting schemes and reporting delay protocols (Ranking noise sources by their information loss) but also provides a common score for assessing the reliability or informativeness of diverse data streams for inferring \({R}_{1}^{\tau }\). The best stream, from this information theoretic viewpoint, maximizes the product of the geometric means \({\mathbb{G}}(.)\) of the \({F}_{\tau -t}\) and ρt. Many common surveillance data types used for inferring pathogen transmissibility have been modeled within the framework of equation (1) and therefore admit related θ(.) metrics. Examples include time series of deaths, hospitalizations, the prevalence of infections and incidence proxies generated from viral surveys of wastewater.

We detail death count data in the next section but note that its model, given in equation (2), is a simple extension of equation (1). Hospitalizations may be described similarly with the ifr term replaced by the proportion of infections hospitalized and the intrinsic delay distribution now defining the lag from infection to hospital admission5. The infection prevalence conforms to equation (1) because it can be represented as a convolution of the infections with a duration of infectiousness distribution, which essentially contributes a reporting delay40. Viral surveys also fit equation (1). They offer a downsampled proxy of incidence, which is delayed by a shedding load distribution defining the lag before infections are detected in wastewater41. Consequently, our metrics are widely applicable.
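
To illustrate this generality, a θ(.) score for a hypothetical hospitalization stream follows by substituting the proportion of infections hospitalized for ρt and an infection-to-admission pmf for δ. The numbers below are placeholders rather than literature estimates.

```python
import numpy as np
from scipy.stats import nbinom

geo = lambda v: np.exp(np.mean(np.log(v)))

def theta(report_frac, delay_pmf, tau):
    """theta = G(reporting fractions) x G(F_{tau-t}), as in equation (5)."""
    F = np.cumsum(delay_pmf)[:tau][::-1]
    return geo(np.asarray(report_frac, dtype=float)[:tau]) * geo(F)

tau = 100
hosp_frac = np.full(tau, 0.03)                            # hypothetical infection-hospitalization ratio
adm_delay = nbinom.pmf(np.arange(300), 2.0, 2.0 / 12.0)   # hypothetical infection-to-admission pmf (mean 10)
print(round(theta(hosp_frac, adm_delay, tau), 4))
```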

While in this study we focus on developing methodology for estimating and contrasting the information from the above surveillance data, we find that our metric is also important for defining the complexity of a noisy renewal epidemic model. Specifically, we rederive equation (5) as a key term of its description length (L). Description length theory evaluates the complexity of a model from how succinctly it describes its data (for example, in bits)33,42. This measure accounts for model structure and data quality and admits the approximation \({L}_{C}\sim -\ell ({\hat{R}}_{1}^{\tau })+\frac{p}{2}\log \frac{m}{2\uppi }+\log \int \sqrt{\det \left[\frac{1}{m}{{\mathbb{F}}}_{C}({R}_{1}^{\tau })\right]}\,{{\mathrm{d}}}{R}_{1}^{\tau }\). Here the first term indicates model fit by assessing the log-likelihood at our MLEs \({\hat{R}}_{1}^{\tau }\). The second term includes data quality through the number of parameters (p) and data size (m). The final term defines how model structure shapes complexity with the integral across the parameter space of \({R}_{1}^{\tau }\).

This formulation was adapted for renewal model selection problems in ref. 43 assuming perfect reporting. We extend this and show that our proposed total information \({\mathbb{T}}({C}_{1}^{\tau })\) plays a central role. Given some epidemic curve \({C}_{1}^{\tau }\) we can rewrite the previous integral as \(-\frac{p}{2}\log m+\log \mathop{\prod }\nolimits_{t = 1}^{\tau }\int \sqrt{{{\mathbb{F}}}_{C}({R}_{t})}\,\,{{\mathrm{d}}}{R}_{t}\) and observe that m = p = τ. It is known that under a robust transform such as \({{{{\mathcal{R}}}}}_{t}=2\sqrt{{R}_{t}}\) this integral is conserved33,37. Consequently, \(\int \sqrt{{{\mathbb{F}}}_{C}({R}_{t})}\,\,{{\mathrm{d}}}{R}_{t}=\sqrt{{{\mathbb{F}}}_{C}({{{{\mathcal{R}}}}}_{t})}\int\nolimits_{0}^{2\sqrt{{R}_{\max }}}1\,\,{{\mathrm{d}}}{{{{\mathcal{R}}}}}_{t}\) with Rmax as some maximum value that every Rt can take. Combining these expressions we obtain the following equation, highlighting the importance of our total information metric.

$${L}_{C}\sim -\ell ({\hat{R}}_{1}^{\tau })+\frac{\tau }{2}\log \frac{2{R}_{\max }}{\pi }+\log {\mathbb{T}}({C}_{1}^{\tau }).$$
(6)

If we have two potential data sources for inferring \({R}_{1}^{\tau }\) then we should select the one with the smaller LC value. Since the middle term in equation (6) remains unchanged in this comparison, the key points when comparing model complexity relate to the level of fit to the data and the total FI of the model given those data42. Our metrics therefore play a central role when comparing different data streams.
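
A sketch of such a comparison under equation (6): because the \(\frac{\tau }{2}\log \frac{2{R}_{\max }}{\pi }\) term cancels, only the fitted log-likelihoods (placeholders here) and the \(\log {\mathbb{T}}(.)\) terms, computed via equation (3), enter the difference.

```python
import numpy as np

def log_total_info(rho, F_rev, Lam):
    """log T from equation (3): sum over t of log sqrt(rho_t F_{tau-t} Lambda_t)."""
    return 0.5 * np.sum(np.log(rho * F_rev * Lam))

def delta_L(loglik_A, logT_A, loglik_B, logT_B):
    """L_A - L_B from equation (6); the (tau/2) log(2 R_max / pi) term cancels.
    Negative values favor stream A (shorter description length)."""
    return (-loglik_A + logT_A) - (-loglik_B + logT_B)

# Hypothetical inputs for two streams observing the same epidemic
tau = 50
Lam = np.linspace(20.0, 200.0, tau)
logT_cases = log_total_info(np.full(tau, 0.3), 1 - 0.8 ** np.arange(tau, 0, -1), Lam)
logT_deaths = log_total_info(np.full(tau, 0.01), 1 - 0.9 ** np.arange(tau, 0, -1), Lam)
print(round(delta_L(-240.0, logT_cases, -235.0, logT_deaths), 1))   # placeholder log-likelihoods
```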

Are COVID-19 deaths or cases more informative?

In the above sections we developed a framework for comparing the information within diverse but noisy data streams. We now apply these results to better understand the relative reliabilities of two popular sources of information about transmissibility \({R}_{1}^{\tau }\): the time series of new cases \({C}_{1}^{\tau }\) and of new death counts \({D}_{1}^{\tau }\). Both data streams have been extensively used across the ongoing COVID-19 pandemic to better characterize pathogen spread5. Known issues stemming from fluctuations in the ascertainment of COVID-19 cases18,19 have motivated some studies to assert \({D}_{1}^{\tau }\) as the more informative and hence trustworthy data for estimating \({R}_{1}^{\tau }\) (refs. 1,20).

These works have reasonably assumed that deaths are more likely to be reliably ascertained. Case reporting can be substantially biased by testing policy inconsistencies and behavioral changes (for example, symptom-based healthcare seeking). In contrast, given their severity, deaths should be less likely to be underascertained5. However, no analysis, as far as we are aware, has explicitly tested this assumption. Here we make some progress towards better comprehending the relative merits of the two data streams. We start by computing ratios of our metric in equation (5) for both \({C}_{1}^{\tau }\) and \({D}_{1}^{\tau }\) via equations (1) and (2).

This results in \(\theta ({C}_{1}^{\tau })={\mathbb{G}}({\rho }_{t}){\mathbb{G}}({F}_{\tau -t})\) for cases and, by analogy, \(\theta ({D}_{1}^{\tau })={\mathbb{G}}({\sigma }_{t}{{{{\rm{ifr}}}}}_{t}){\mathbb{G}}({H}_{\tau -t})\) for deaths. In the same way that ρt defines the proportion of infections reported as cases, the product σtifrt defines the proportion of infections that are reported as deaths. This follows because ifrt is the fraction of infections that engender deaths and σt is the proportion of those deaths that are reported. \({H}_{\tau -t}:=\mathop{\sum }\nolimits_{x = 0}^{\tau -t}{\gamma }_{x}\) describes the cumulative probability of delays from infection to death up to τ − t time units in duration.

Using the shorthand \({C}_{1}^{\tau }\succcurlyeq {D}_{1}^{\tau }\) to denote \(\theta ({C}_{1}^{\tau })\ge \theta ({D}_{1}^{\tau })\), that is, that the case data are at least as informative in terms of total information, we obtain the following equation. We rearrange terms to place reporting fractions and delays on different sides by decomposing the geometric mean of a product into the product of the geometric means of its terms.

$${C}_{1}^{\tau }\succcurlyeq {D}_{1}^{\tau }:{\mathbb{G}}\left(\frac{{\rho }_{t}}{{\sigma }_{t}{{{{\rm{ifr}}}}}_{t}}\right)\ge {\mathbb{G}}\left(\frac{{H}_{\tau -t}}{{F}_{\tau -t}}\right).$$
(7)

Equation (7) states that cases are more informative when the geometric mean of the case to death reporting fractions is at least as large as that of the death and case cumulative delays. Studies preferring death data effectively claim that the variation in ρt (which we proved in a previous section always decreases the geometric mean for a given mean constraint) is sufficiently strong to mask the influences of ifrt, σt and any expected variations in those quantities.

Proponents of using death data to infer \({R}_{1}^{\tau }\) recognize that the infection-to-death delay (with cumulative distribution \({H}_{\tau -t}\)) is appreciably larger in mean than corresponding reporting lags from infection (\({F}_{\tau -t}\)) and therefore unsuitable for real-time estimation (where this extra lag denatures recent information as we showed in earlier sections). We allow for all of these adjustments. We assume that the ifr is constant (maximizing \({\mathbb{G}}({{{{\rm{ifr}}}}}_{t})\)) and that death ascertainment is perfect (σt = 1). Even for purely retrospective estimation with correction for delays we expect \({\mathbb{G}}\left(\frac{{H}_{\tau -t}}{{F}_{\tau -t}}\right)\le 1\). We set this to 1, maximizing the informativeness of \({D}_{1}^{\tau }\).

Combining these assumptions we reduce equation (7) to the following equation. This presents a sufficient condition for case data to be more reliable than the death time series.

$${C}_{1}^{\tau }\succcurlyeq {D}_{1}^{\tau }:{\mathbb{G}}{({\rho }_{t})}_{\bar{\rho }\in [0.07,0.38]}\ge {{{\rm{ifr}}}}\sim 0.01.$$
(8)

Here we choose a relatively large ifr for COVID-19 of 1% (ref. 44). Case reporting fraction estimates range from about 7% to 38% (ref. 18), which we apply to constrain \(\bar{\rho }\), the mean ρt. Inputting these estimates, we examine possible ρt sampling distributions under the Beta(a, b) formulation from earlier sections. Our main results are in Fig. 4. We take \({10}^{4}\) samples of ρt from each of 2,000 distributions parametrized over \({10}^{-1}\le b\le {10}^{2}\) with a set to satisfy our \(\bar{\rho }\) reporting constraints.
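
A sketch of this computation; the exact percentages depend on the small floor placed on the sampled ρt (sampling probabilities are assumed non-zero, as noted earlier), so this need not reproduce Fig. 4 exactly.

```python
import numpy as np

rng = np.random.default_rng(5)
geo = lambda v: np.exp(np.mean(np.log(v)))
ifr = 0.01
n_dists, n_samples = 2_000, 10_000

for rho_bar in [0.07, 0.38]:                      # empirical COVID-19 reporting range (ref. 18)
    exceeds = 0
    for b in np.logspace(-1, 2, n_dists):         # 10^-1 <= b <= 10^2
        a = rho_bar * b / (1 - rho_bar)           # fixes the Beta mean at rho_bar
        rho_t = rng.beta(a, b, size=n_samples)
        rho_t = np.clip(rho_t, 1e-12, None)       # sampling probabilities assumed non-zero (see text)
        exceeds += geo(rho_t) > ifr               # equation (8): case data win if G(rho_t) >= ifr
    print(f"rho_bar = {rho_bar}: {100 * exceeds / n_dists:.0f}% of schemes exceed the ifr threshold")
```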

Fig. 4: The relative information in epidemic case data and death counts.

Using \(\theta ({C}_{1}^{\tau })\) we compare the information in case curves \({C}_{1}^{\tau }\) and death counts \({D}_{1}^{\tau }\) under assumptions that lead to equation (8). We examine various case reporting strategies parametrized as Beta distributions with \(\bar{\rho }\) from 0.07 to 0.38 (ref. 18) and compare the resulting \(\theta ({C}_{1}^{\tau })\) against the equivalent from deaths (which reduces to just the infection–fatality ratio, \(\theta ({D}_{1}^{\tau })={{{\rm{ifr}}}}\)). a, \(\theta ({C}_{1}^{\tau })\) for reported case data at different \(\bar{\rho }\) (each color represents a fixed \(\bar{\rho }\)) as compared with ifr (black dashed threshold). The best reporting strategy is in gray. Inset: proportion of case reporting distributions from the main plot for which \(\theta ({C}_{1}^{\tau }) > {{{\rm{ifr}}}}\). b, Those distributions \({\mathbb{P}}({\rho }_{t})\) at the ends of the empirical \(\bar{\rho }\) range with red indicating when \(\theta ({C}_{1}^{\tau }) > {{{\rm{ifr}}}}\).


Figure 4a plots our metric against these constraints and the ifr threshold. Whenever \(\theta ({C}_{1}^{\tau })\ge {{{\rm{ifr}}}}\) we find that case data are more reliable. This appears to occur for many possible combinations of ρt. The inset charts the proportion of Beta distributions that cross the threshold. This varies from about 45% at \(\bar{\rho }=0.07\) to 90% at \(\bar{\rho }=0.38\). While these figures will differ depending on how likely a given level of variability is, they offer robust evidence that death counts are not necessarily more reliable. Even when deaths are perfectly ascertained (σt = 1) the small ifr term in \({D}_{1}^{\tau }\) means that 99% of the original incidence data are lost, contributing appreciable uncertainty.

These points are reinforced by the design choices we have made, which inflate the relative information in the death time series. In reality, σt < 1, ifr < 0.01, neither is constant44,45 and the uncertainty we include around \({\bar{\rho }}_{t}\) is wider than that inferred in ref. 18. Our results are therefore resilient to uncertainties in noise source estimates. Figure 4b displays the distributions of our sampling fractions, with red (blue) indicating which shapes provide more (less) information than death data (equation (8)). Our results also hold for both real-time and retrospective analyses, as we ignored the noise induced by the additional delays that death data contain (relative to case reports) when we maximized \({\mathbb{G}}\left(\frac{{H}_{\tau -t}}{{F}_{\tau -t}}\right)\).

Consequently, death data cannot be assumed, without rigorous and context-specific examination, to be generally more epidemiologically meaningful. For example, while \({D}_{1}^{\tau }\) is unlikely to be more reliable in well mixed populations, it may be in high-risk settings (for example, care homes) where the local ifr is notably larger. Vaccines and improved healthcare, which substantially reduce ifr values in most contexts, will make death time series less informative about \({R}_{1}^{\tau }\). However, pathogens such as Ebola virus, which induce large ifr parameters, might result in death data that are more reliable than their case counts. We explore these points and demonstrate the practical applicability of our metrics in the next section.

Practical applications of information metrics

Our metrics provide an interpretable, simulation-agnostic and easily computable approach to quantifying the relative reliability of different epidemic time series. Because θ(.) is independent of usually unknown Rt and Λt terms, it is robust to generation time misspecification and only requires estimates of noise terms for its calculation (hence no epidemic curve simulations are needed). Moreover, it depends purely on the geometric means of noise variables, which can be decomposed such that the influence of any noise source is clearly interpreted from the magnitude of its specific mean (see equation (5)).

These properties make θ(.) of practical use and we illustrate the benefits of our methodology using COVID-19 and EVD examples. In contrast to Fig. 4 where we maximized the information in deaths and minimized that from cases to bolster our rejection of the assertion that death data are definitively more informative, here we focus on inputting empirical noise distributions derived from real data. When distributions are unavailable we describe noise uncertainties via maximum entropy distributions based on what estimates are available (these are geometric, Geo, if a mean is given and uniform, Unif, over 95% credible intervals).

For COVID-19 we once again examine if death data are more reliable. From equation (7) we conclude \({C}_{1}^{\tau }\succcurlyeq {D}_{1}^{\tau }\) if \(\frac{{\mathbb{G}}({\rho }_{t})}{{\mathbb{G}}({\sigma }_{t}){\mathbb{G}}({{{{\rm{ifr}}}}}_{t})}\ge \frac{{\mathbb{G}}({H}_{\tau -t})}{{F}_{0}}\). This follows as \({F}_{0}={\delta }_{0}=\min {\mathbb{G}}({F}_{\tau -t})\) and ensures (if we are using NEVR delays) that we do not take ratios of upper bounds, as \({\mathbb{G}}({H}_{\tau -t})\) already bounds the information in the infection-to-death delay. If delays are OBNR then equation (7) will be exact. We model ρt ~ Unif(0.06, 0.08) (ref. 18), \({\delta }_{x} \sim {{{\rm{Geo}}}}(\frac{1}{1+10.8})\) (ref. 46), \({\sigma }_{t} \sim {{{\rm{Unif}}}}(\frac{1}{1.34},\frac{1}{1.29})\) (ref. 45), \({{{{\rm{ifr}}}}}_{t} \sim {{{\rm{Unif}}}}(\frac{0.53}{100},\frac{0.82}{100})\) (ref. 44) and \({\gamma }_{x} \sim {{{\rm{NB}}}}(\frac{1}{1+1.1},\frac{21}{21+\frac{1}{1+1.1}})\) (ref. 47) and sample from these distributions \({10}^{4}\) times. We compute the terms in the inequality above and represent the relative information as \(\log \theta ({C}_{1}^{\tau })-\log \theta ({D}_{1}^{\tau })\) for easy visualization.

This leads to the top panel of Fig. 5. Despite our use of the smallest reporting proportions from ref. 18 we find that death data are less reliable. For EVD, we test the alternative hypothesis that case data are less reliable in the bottom panel of Fig. 5. We decide \({D}_{1}^{\tau }\succcurlyeq {C}_{1}^{\tau }\) if \(\frac{{\mathbb{G}}({\rho }_{t})}{{\mathbb{G}}({\sigma }_{t}){\mathbb{G}}({{{{\rm{ifr}}}}}_{t})}\ge {H}_{0}\), as we know \({H}_{0}={\gamma }_{0}=\min {\mathbb{G}}({H}_{\tau -t})\) and \(\max {\mathbb{G}}({F}_{\tau -t})=1\). We let σt = 1 (no estimates were easily available) and model ρt ~ Unif(0.33, 0.83) (ref. 16), ifrt ~ Unif(0.69, 0.73) (ref. 48) and \({\gamma }_{x} \sim {{{\rm{NB}}}}(1.5,\frac{21.4}{21.4+1.5})\) (roughly from ref. 48). The negative values of \(\log \theta ({C}_{1}^{\tau })-\log \theta ({D}_{1}^{\tau })\) in Fig. 5 suggest EVD death data as the more informative source. However, this can easily change if \({\sigma }_{t}\ll 1\), as the difference is not as strong as for COVID-19.
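
The COVID-19 comparison in the top panel of Fig. 5 can be reproduced along the following lines. The infection-to-death parametrization below is our reading of the quoted NB distribution (matched on a mean of 21 days), and the horizon τ is an assumption, so the released code may differ in detail.

```python
import numpy as np
from scipy.stats import nbinom

rng = np.random.default_rng(6)
n, tau = 10_000, 100                                 # Monte Carlo samples and assumed horizon
geo = lambda v: np.exp(np.mean(np.log(v), axis=-1))

# COVID-19 noise terms sampled as quoted in the text
rho = rng.uniform(0.06, 0.08, (n, tau))              # case reporting fractions (ref. 18)
sigma = rng.uniform(1 / 1.34, 1 / 1.29, (n, tau))    # death reporting fractions (ref. 45)
ifr = rng.uniform(0.0053, 0.0082, (n, tau))          # infection-fatality ratio (ref. 44)

F0 = 1 / (1 + 10.8)                                  # delta_0 of the Geo case delay = min G(F_{tau-t})
k = 1 / 2.1                                          # infection-to-death dispersion (our reading of ref. 47)
H = nbinom.cdf(np.arange(tau), k, k / (k + 21.0))    # cumulative infection-to-death delay, mean 21 days
GH = np.exp(np.mean(np.log(H)))                      # G(H_{tau-t})

# Conservative form of equation (7), reported as log theta(C) - log theta(D)
log_diff = np.log(geo(rho) * F0) - np.log(geo(sigma) * geo(ifr) * GH)
print(f"median log-difference = {np.median(log_diff):.2f}; "
      f"{100 * np.mean(log_diff > 0):.0f}% of samples favor case data")
```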

Fig. 5: The relative information in case and death data for EVD and COVID-19 case studies.

We compute information metrics θ(.) for case (\({C}_{1}^{\tau }\)) and death (\({D}_{1}^{\tau }\)) time series using empirically derived under-reporting and delay noise distributions for COVID-19 (top panel) and EVD (bottom panel). See the main text for the specific distributions used, which account for the uncertainty in noise estimates (in the absence of knowledge of this uncertainty, maximum entropy distributions are applied). We take \({10}^{4}\) samples from each distribution and calculate the logarithmic difference \(\log \theta ({C}_{1}^{\tau })-\log \theta ({D}_{1}^{\tau })\), with positive or negative values indicating when case or death data have the higher information content, respectively.


While we tried to keep estimates as realistic as possible, the point of Fig. 5 is to demonstrate how our metrics may be practically applied given noise estimates. Sampling from appropriate distributions means that we can propagate the uncertainty on these estimates into our metrics. We provide open source code for modifying this template analysis to include any user-defined distributions at https://github.com/kpzoo/information-in-epidemic-curves. As high-resolution outbreak data collection initiatives such as Global.health49 and REACT7 progress, enhancing surveillance and our quantification of noise sources, we expect our framework to grow in practical utility.

Discussion

Public health policy-making is becoming progressively data driven. Key infectious disease parameters6 such as instantaneous reproduction numbers and growth rates, fitted to heterogeneous outbreak data sources (for example, case, death and hospitalization incidence curves), are increasingly contributing to the evidence base for understanding pathogen spread, projecting epidemic burden and designing effective interventions4,6,50. However, the validity and value of such parameters depends substantially on the quality of the available surveillance data1,7. Although many studies have made important advances in underscoring and correcting errors in these data12,31, a framework to directly and generally quantify epidemic data quality is needed.

We made progress towards such a framework by finding the total information, \({\mathbb{T}}(.)\), available from epidemic curves corrupted by reporting delays and under-reporting. These are predominant noise sources that limit surveillance quality and apply to common outbreak data for inferring pathogen transmissibility such as cases, death counts, hospitalizations, infection prevalence and wastewater virus surveys. By maximizing \({\mathbb{T}}(.)\), we minimize the overall uncertainty of our transmissibility estimates, hence measuring the reliability of this data stream. This approach yielded a non-dimensional metric θ(.) that allows analytic and generalizable insights into how noisy surveillance data degrade estimate precision.

Our framework provided insight into the nuances of noise, elucidating how the mean and variability of delay and under-reporting schemes both matter. For example, fluctuating reporting protocols with larger mean may outperform more stable ones at lower mean. Moreover, under mean surveillance constraints, our metrics revealed that constant under-reporting of cases minimizes loss of information, while constant delays in reporting maximize this loss. The first result bolsters conventional thinking9, while the second highlights the need for timely data39.

Because the reporting of cases can vary substantially when tracking acute diseases such as COVID-19, various studies have assumed death data to be more reliable5. Using our metrics, we were able to qualify this claim. We found that ifr acts as a reporting fraction with very small mean. Only the most severely varying case reporting protocols can cause larger information loss, suggesting that in many instances this assertion may not hold. Note that this analysis does not even consider the additional advantages that case data bring in terms of timeliness, and shows the ability of our framework to rank the quality of different data streams.

However, there may be other crucial reasons for preferring to estimate pathogen spread from death data. For example, if extremely little is known about the level of reporting (very limited surveillance capacity might cause insurmountable case reporting fraction uncertainties) or if a death-based reproduction number is itself of interest as a severity indicator20. Our framework can also help inform these discussions by improving the precision of our reasoning about noise. This is exemplified by our EVD analysis, where we could show that the large ifr of the disease translated into death counts being the better data, provided their under-reporting is not large.

As hospitalization curves generally interpolate among the types of noise in case and death data, this might be the best a priori choice of data for inferring transmissibility. Some studies also propose to circumvent these ranking issues by concurrently analyzing multiple data streams32,50. This then opens questions about how each data stream should be weighted in the ensuing estimates. Our framework may also help by quantifying the most informative parts of each contributing stream. A common way of deriving a consensus is to weight individual estimates by their inverse variances51. As the FI defines the best possible inverse variance of estimates, our metrics naturally apply.

While our framework can enhance understanding and quantification of surveillance noise, it has several limitations. First, it depends on renewal model descriptions of epidemics27. These models assume homogeneous mixing and that the generation time distribution of the disease is known. While the inclusion of more realistic network-based mixing may not improve transmissibility estimates52 (and this extra complexity may occlude insights), the generation time assumption may only be ameliorated through the provision of updated, high-quality line-list data34,49. However, our relative metrics in equations (4) and (5) and equations (7) and (8) are mostly robust to generation time distribution misspecifications (and even changes) as they do not depend on the Λt terms (these cancel out).

Further, our analysis is contingent on having estimates of the delays, underascertainment rates and other noise sources within data streams. These may be unavailable or themselves highly unreliable. If at least some information on their uncertainties is available we can propagate these into our metrics by replicating the Monte Carlo approach underlying our case studies. If no estimates are available then we cannot perform any analyses as Rt will not be identifiable. However, our framework can still be of use as a rigorous testbed for examining hypotheses on potential noise sources without extensive simulation.

Recent initiatives have aimed at improving the resolution and completeness of outbreak data7,49. Concurrently, estimating noise sources from both existing and novel data streams is a growing research area18,53. As a result, we expect that our metrics will only increase in practical utility and that concerns around the availability of noise estimates will diminish. We also assume that the time scale t chosen ensures that Rt parameters are independent. This may be invalid but in such instances we can append non-diagonal terms to FI matrices or use our metric as an upper bound.

Finally, we defined the reliability or informativeness of a data stream in terms of minimizing the joint uncertainty of the entire sequence of reproduction numbers \({R}_{1}^{\tau }\). This is known as a D-optimal design36. However, we may instead want to minimize the worst uncertainty among the \({R}_{1}^{\tau }\) (which may better compensate known asymmetries in inferring transmissibility54). Our framework can be reconfigured to tackle such problems by appealing to other design laws. We can solve this specific problem by deriving an E-optimal design, which maximizes the smallest eigenvalue of our FI matrix.

Methods

Renewal models and Fisher information theory

The renewal model27,35 is a popular approach for describing how infections dynamically propagate during the course of an epidemic. The number of new infections at t, It, depends on Rt, which counts the new infections generated per infected individual (on average), and Λt, which measures how many past infections (up to t − 1) will effectively produce new ones. This measure weights past infections by w. We define ws as the probability that it takes s time units for a primary infection to generate a secondary one. The distribution is then \({{{\mathbf{w}}}}={w}_{1}^{\infty }:=\{{w}_{1},\,{w}_{2},\ldots ,{w}_{\infty }\}\).

The statistical relationship between these quantities is commonly modeled as in the following equation, with Pois specifying a Poisson distribution21. This relationship only strictly holds if It is perfectly recorded both in size (no under-reporting) and in time (no delays in reporting).

$${I}_{t} \sim {{{\rm{Pois}}}}\left({\varLambda }_{t}{R}_{t}\right),\quad {\varLambda }_{t}:=\mathop{\sum }\limits_{x=1}^{t-1}{w}_{t-x}{I}_{x}.$$
(9)

However, as infections are rarely observed, It is often approximated by proxies such as reported cases, and w is replaced with the serial interval distribution, which describes the times between the symptom onsets of those cases. Equation (9) has been widely used to model transmission dynamics of many infectious diseases, including COVID-195, influenza55 and EVD48.

A common and important problem in infectious disease epidemiology is the estimation of the latent variable Rt from the incidence curve of infections or some more easily observed proxy. If this time series persists during 1 ≤ t ≤ τ, then we aim to infer the vector of parameters \({R}_{1}^{\tau }:=\{{R}_{t}:1\le t\le \tau \}\) from time series \({I}_{1}^{\tau }:=\{{I}_{t}:1\le t\le \tau \}\) or its proxy (see Results for this more practical inference problem). We assume that time is scaled in units such that Rt can be expected to change (independently) at every t. This may be weekly for COVID-19 or malaria21,56 but monthly for rabies57. Note that w and It must be aggregated, as needed, to match these units. Related branching58 and moving-average models59 feature similar aggregation.

Following the development in refs. 43,55, we solve this inference problem by constructing the incidence log-likelihood function \(\ell ({R}_{1}^{\tau })=\log {\mathbb{P}}({I}_{1}^{\tau }\,| \,{R}_{1}^{\tau })\) as in the following equation with Kτ as some constant that does not depend on any Rt. This involves combining Poisson likelihoods from equation (9) across time units 1 ≤ t ≤ τ as in ref. 21.

$$\ell ({R}_{1}^{\tau })=\mathop{\sum }\limits_{t=1}^{\tau }\left({I}_{t}\log {R}_{t}-{\varLambda }_{t}{R}_{t}\right)+{K}_{\tau }.$$
(10)

We compute the MLE of Rt as \({\hat{R}}_{t}\), which is the maximal solution of \(\frac{\partial \ell ({R}_{1}^{\tau })}{\partial {R}_{t}}=0\). From equation (10) this gives \({\hat{R}}_{t}={I}_{t}{\varLambda }_{t}^{-1}\) (ref. 27). Repeating this for all t we obtain estimates of the complete vector of transmissibility parameters \({R}_{1}^{\tau }\) underlying \({I}_{1}^{\tau }\).
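A minimal sketch of this estimator, with a short hypothetical incidence series and an assumed w, is given below (variable names are ours, not from the paper).

```python
# Sketch of the per-time-unit MLE R_hat_t = I_t / Lambda_t from equation (10).
# The incidence I and distribution w are assumed values for illustration.
import numpy as np

I = np.array([10, 14, 20, 31, 41, 58, 77, 95], dtype=float)  # assumed I_1..I_8
w = np.array([0.2, 0.4, 0.25, 0.1, 0.05])                    # assumed w_1..w_5

tau = len(I)
Lam = np.zeros(tau)
for t in range(1, tau):
    past = I[max(0, t - len(w)):t][::-1]     # I_{t-1}, I_{t-2}, ...
    Lam[t] = np.dot(w[:len(past)], past)     # Lambda_t from equation (9)

R_mle = np.full(tau, np.nan)
R_mle[Lam > 0] = I[Lam > 0] / Lam[Lam > 0]   # MLE undefined when Lambda_t = 0
```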

To quantify the precision (the inverse of the variance, var) around these MLEs or any unbiased estimator of Rt we calculate the FI that \({I}_{1}^{\tau }\) contains about Rt. This is \({{\mathbb{F}}}_{I}({R}_{t}):={\mathbb{E}}\left[-\frac{{\partial }^{2}\ell ({R}_{1}^{\tau })}{\partial {R}_{t}^{2}}\right]\), where expectation \({\mathbb{E}}[.]\) is taken across the data \({I}_{1}^{\tau }\) (hence the subscript I). The FI defines the best (smallest) possible uncertainty asymptotically achievable by any unbiased estimate, \({\tilde{R}}_{t}\). This follows from the Cramér–Rao bound30, which states that \(\,{{\mbox{var}}}({\tilde{R}}_{t})\ge {{\mathbb{F}}}_{I}{({R}_{t})}^{-1}\). The 95% confidence intervals around \({\tilde{R}}_{t}\) converge to \({\tilde{R}}_{t}\pm 1.96\,{{\mathbb{F}}}_{I}{({R}_{t})}^{-\frac{1}{2}}\). The FI also links to the Shannon mutual information that \({I}_{1}^{\tau }\) contains about Rt (these measures are bijective under Gaussian approximations)60,61 and is pivotal to describing both model identifiability and complexity30,33.

Using the Poisson renewal log-likelihood in equation (10) we obtain the FI as the left-hand equality in the following equation. Observe that this depends on the unknown ‘true’ Rt.

$${{\mathbb{F}}}_{I}({R}_{t})={\varLambda }_{t}{R}_{t}^{-1},\quad {{\mathbb{F}}}_{I}(2\sqrt{{R}_{t}})={\varLambda }_{t}.$$
(11)

This reflects the heteroscedasticity of Poisson models, where the estimate mean and variance are co-dependent. We construct a square root transform that uncouples this dependence43, yielding the right-hand formula in equation (11). We can evaluate \({{\mathbb{F}}}_{I}(2\sqrt{{R}_{t}})\) purely from \({I}_{1}^{\tau }\). The result follows from the FI change of variables formula \({{\mathbb{F}}}_{I}({{{{\mathcal{R}}}}}_{t})={{\mathbb{F}}}_{I}({R}_{t}){\left(\frac{\partial {R}_{t}}{\partial {{{{\mathcal{R}}}}}_{t}}\right)}^{2}\) (ref. 30). This transformation has several optimal statistical properties29,37 and so we will commonly work with \({{{{\mathcal{R}}}}}_{t}:=2\sqrt{{R}_{t}}\).
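The practical upshot is that confidence intervals can be built on the transformed scale, where the FI is simply Λt, and then mapped back. The sketch below (our helper, assuming the MLEs and Λt values are already available, for example from the previous sketch) does exactly this.

```python
# Sketch: 95% confidence interval for R_t built on the 2*sqrt(R_t) scale,
# where equation (11) gives FI = Lambda_t, mapped back via R = (scriptR / 2)^2.
import numpy as np

def rt_confidence_interval(R_mle, Lam, z=1.96):
    half = z / np.sqrt(Lam)                              # width on the transformed scale
    lo = np.maximum(2.0 * np.sqrt(R_mle) - half, 0.0) ** 2 / 4.0
    hi = (2.0 * np.sqrt(R_mle) + half) ** 2 / 4.0
    return lo, hi
```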

As we are interested in evaluating the informativeness or reliability of the entire \({I}_{1}^{\tau }\) time series for inferring transmission dynamics we require the total FI it provides for all estimable reproduction numbers, \({R}_{1}^{\tau }\). As we noted above, the inverse of the square root of the FI for a single Rt corresponds to an uncertainty (or confidence) interval. Generalizing this to multiple dimensions yields an uncertainty ellipsoid with volume inversely proportional to the square root of the determinant of the FI matrix33,37. This matrix has diagonals given by \({{\mathbb{F}}}_{I}({R}_{t})\) and off-diagonals defined as \({\mathbb{E}}[-\frac{{\partial }^{2}\ell ({R}_{1}^{\tau })}{\partial {R}_{t}\partial {R}_{s}}]\) for 1 ≤ t, s ≤ τ.

Maximizing this non-negative determinant, which we denote the total information \({\mathbb{T}}({I}_{1}^{\tau })\) from the data \({I}_{1}^{\tau }\), corresponds to what is known as a D-optimal design36. This design minimizes the overall asymptotic uncertainty around estimates of the vector \({R}_{1}^{\tau }\). As the renewal model in equation (9) treats every Rt as independent, off-diagonal terms are 0 and \({\mathbb{T}}({I}_{1}^{\tau })\) is a product of the diagonal FI terms. Transforming \({R}_{t}\to {{{{\mathcal{R}}}}}_{t}\) we then obtain

$${\mathbb{T}}({I}_{1}^{\tau })=\mathop{\prod }\limits_{t=1}^{\tau }\sqrt{{{\mathbb{F}}}_{I}({{{{\mathcal{R}}}}}_{t})}=\mathop{\prod }\limits_{t=1}^{\tau }\sqrt{{\varLambda }_{t}}.$$
(12)

If we work directly in Rt we obtain \(\mathop{\prod }\nolimits_{t = 1}^{\tau }{\varLambda }_{t}^{\frac{1}{2}}{R}_{t}^{-\frac{1}{2}}\) instead. In two dimensions (that is, τ = 2) our ellipsoid becomes an ellipse and equation (12) intuitively means that its area is proportional to a product of lengths \({{\mathbb{F}}}_{I}{({{{{\mathcal{R}}}}}_{1})}^{-\frac{1}{2}}{{\mathbb{F}}}_{I}{({{{{\mathcal{R}}}}}_{2})}^{-\frac{1}{2}}\), which factors in the uncertainty from each estimate.

We will use this recipe of formulating a log-likelihood for \({R}_{1}^{\tau }\) given some data source and then computing the total information, \({\mathbb{T}}(.)\), it provides about these parameters to quantify the reliability of case, death and other \({I}_{1}^{\tau }\) proxies for inferring transmissibility. Comparing data source quality will involve ratios of these total information terms. Metrics such as equation (12) are valuable because they measure the usable information within a time series and also delimit the possible distributions that a model can describe given these data (see refs. 33,62 for more on these ideas, which emerge from information geometry). Transforms such as \({{{{\mathcal{R}}}}}_{t}=2\sqrt{{R}_{t}}\) stabilize these metrics (that is, maximize robustness) to unknown true values29,37.
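In practice this metric is easiest to handle on the log scale. The sketch below (hypothetical Λt values; the comparison assumes constant under-reporting with ρ = 0.5, which scales each FI term) illustrates how total information terms and their ratios might be computed.

```python
# Sketch: total information of equation (12) on the log scale, and a ratio
# comparing a perfectly observed stream with one under constant under-reporting.
import numpy as np

def log_total_information(Lam):
    """log T = 0.5 * sum_t log(Lambda_t), over time units with Lambda_t > 0."""
    Lam = np.asarray(Lam, dtype=float)
    return 0.5 * np.sum(np.log(Lam[Lam > 0]))

Lam = np.array([12.0, 18.0, 26.0, 31.0, 35.0])      # hypothetical Lambda_t values
logT_perfect = log_total_information(Lam)
logT_underrep = log_total_information(0.5 * Lam)     # CONR with rho = 0.5
log_ratio = logT_underrep - logT_perfect             # negative: less informative stream
```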

Epidemic noise sources and surveillance models

We investigate two important and common sources of noise, under-reporting and reporting delay, which limit our ability to precisely monitor \({I}_{1}^{\tau }\), the true time series of new infections. We quantify how much information is lost due to these noise processes by examining how these imperfections degrade \({\mathbb{T}}({I}_{1}^{\tau })\), the total information obtainable from \({I}_{1}^{\tau }\) under perfect (noiseless) surveillance for estimating parameter vector \({R}_{1}^{\tau }\) (equation (12)). Figure 1 illustrates how these two main noise sources individually alter the shape and size of incidence curves.

(1) Under-reporting or underascertainment. Practical surveillance systems generally detect some fraction of the true number of infections occurring at any given t. If this proportion is ρt ≤ 1 then the number of cases observed, Ct, is generally modeled as Ct ~ Bin(It, ρt) (refs. 23,56), where Bin indicates the binomial distribution. The under-reported fraction is 1 − ρt, and binomial thinning of the Poisson model in equation (9) gives Ct ~ Pois(ρtΛtRt). Reporting protocols are defined by choices of ρt. CONR is the simplest and most popular, assuming every ρt = ρ (ref. 21). VARR describes general time-varying protocols where every ρt can differ28.

(2) Reporting delays or latencies. There can be notable lags between an infection and when it is reported13. If δ defines the distribution of these lags, with δx as the probability of a delay of x ≥ 0 time units, then the new cases reported at t, Ct, sum the infections that actually occurred at t and were not delayed together with delayed infections from previous days10. This is commonly modeled as \({C}_{t} \sim {{{\rm{Pois}}}}\left(\mathop{\sum }\nolimits_{x = 0}^{t-1}{\delta }_{x}{\varLambda }_{t-x}{R}_{t-x}\right)\) (refs. 20,28) and means that the true incidence It splits over future times as ~Mult(It, δ), where Mult denotes the multinomial distribution12. The Ct time series is OBNR if we later learn about the past It splits (right censoring); otherwise the data are never reported (NEVR).

We make some standard assumptions8,11,21,28 in incorporating the above noise sources within renewal model frameworks. We only consider stationary delay distributions, that is, δ and any related distributions do not vary with time, and we neglect co-dependences between reporting and transmissibility. Additionally, we assume that these distributions and all reporting or ascertainment fractions, that is, ρt and related parameters, are inferred from other data (for example, contact tracing studies or line lists)12. In the absence of these assumptions \({R}_{1}^{\tau }\) would be non-identifiable and the inference problem ill-defined. In Results we examine how noise sources (1) and (2) in combination limit the information available about epidemic transmissibility.
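To make these two noise sources concrete, the sketch below (all parameters assumed; not the surveillance model of any specific dataset) thins a stand-in infection series binomially and then redistributes the detected cases over a multinomial delay.

```python
# Sketch: apply under-reporting (binomial thinning with fraction rho_t) and a
# reporting delay (multinomial split with distribution delta) to true infections.
# The infection series, rho_t and delta are hypothetical.
import numpy as np

rng = np.random.default_rng(2)

I = rng.poisson(50, size=60)                                # stand-in true infections
rho = np.clip(0.5 + 0.2 * np.sin(np.arange(60) / 7), 0, 1)  # VARR fractions rho_t
delta = np.array([0.5, 0.3, 0.15, 0.05])                    # assumed delay probabilities

detected = rng.binomial(I, rho)                             # Bin(I_t, rho_t) thinning
C = np.zeros(len(I) + len(delta) - 1, dtype=int)
for t, n in enumerate(detected):
    C[t:t + len(delta)] += rng.multinomial(n, delta)        # split detections over delays
C = C[:len(I)]                                              # reported cases within 1..tau
```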

FI derivations for practical data

We derive the FI of parameters \({R}_{1}^{\tau }\) given the case curve \({C}_{1}^{\tau }\) under the model of equation (1). This procedure mirrors that used above to obtain equation (12). We initially assume that reporting delays are OBNR, that is, we eventually learn the source time of cases at a later date. This corresponds to a right censoring that can be compensated for using nowcasting techniques13. Later we prove that this not only defines a practical noise model but also serves as an upper bound on the information available from NEVR delays, where the true timestamps of cases are never resolved. Mathematically, the OBNR assumption lets us decompose the sum in equation (1). We can therefore identify the component of Ct that is informative about Rx. This follows from the statistical relationship \({C}_{t}\,| \,{R}_{x} \sim {{{\rm{Pois}}}}\left({\delta }_{t-x}{\rho }_{x}{\varLambda }_{x}{R}_{x}\right)\).

As we are interested in the total information that \({C}_{1}^{\tau }\) contains about every Rt we collect and sum contributions from every Ct. We can better understand this process by constructing the matrix Q in the following equation, which expands the convolution of the reporting fractions with the delay probabilities over the entire observed time series.

$$Q=\left[\begin{array}{ccccc}{\delta }_{0}{\rho }_{\tau }&{\delta }_{1}{\rho }_{\tau -1}&{\delta }_{2}{\rho }_{\tau -2}&\cdots \,&{\delta }_{\tau -1}{\rho }_{1}\\ 0&{\delta }_{0}{\rho }_{\tau -1}&{\delta }_{1}{\rho }_{\tau -2}&\cdots \,&{\delta }_{\tau -2}{\rho }_{1}\\ \vdots &0&{\delta }_{0}{\rho }_{\tau -2}&\ddots &\vdots \\ \vdots &\vdots &0&\ddots &{\delta }_{1}{\rho }_{1}\\ 0&0&\cdots \,&\cdots \,&{\delta }_{0}{\rho }_{1}\end{array}\right].$$
(13)

We work with the vector \({\mathbf{\upmu}} ={\left[{\mu }_{\tau },{\mu }_{\tau -1},\ldots ,{\mu }_{1}\right]}^{{\mathrm{T}}}\) with μt = ΛtRt and T denoting the transpose operation. Then \(Q{\mathbf{\upmu}} ={\left[{\mathbb{E}}[{C}_{\tau }],{\mathbb{E}}[{C}_{\tau -1}],\ldots ,{\mathbb{E}}[{C}_{1}]\right]}^{{\mathrm{T}}}\) with \({\mathbb{E}}[{C}_{t}]\) as the mean of the reported case incidence at t.

The components of \({C}_{1}^{\tau }\) that contain information about every Rt parameter follow from \({Q}^{{\mathrm{T}}}{\mathbf{\upmu}} ={\left[{\delta }_{0}{\rho }_{\tau }{\mu }_{\tau },\,({\delta }_{0}+{\delta }_{1}){\rho }_{\tau -1}{\mu }_{\tau -1},\ldots ,({\delta }_{0}+\cdots +{\delta }_{\tau -1}){\rho }_{1}{\mu }_{1}\right]}^{{\mathrm{T}}}\). The elements of this vector are Poisson means formed by collecting and summing the components of \({C}_{1}^{\tau }\) that inform about \([{R}_{\tau },{R}_{\tau -1},\ldots ,{R}_{1}]\), respectively. Hence we obtain the key relationship in the following equation with \({F}_{\tau -t}:=\mathop{\sum }\nolimits_{x = 0}^{\tau -t}{\delta }_{x}\) as the cumulative probability distribution of the delay.

$${C}_{1}^{\tau }\,| \,{R}_{t} \sim {{{\rm{Pois}}}}\left({\rho }_{t}{F}_{\tau -t}{\varLambda }_{t}{R}_{t}\right).$$
(14)

The ability to decompose the row or column sums from Q into the Poisson relationships of equation (14) is a consequence of the independence properties of renewal models and the infinite divisibility of Poisson formulations.
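A small numerical check of this structure (all values assumed) is sketched below: Q is built as in equation (13), Qμ returns the expected reported cases and the column sums of Q recover the OBNR Poisson means ρtFτ−tμt of equation (14).

```python
# Sketch: build Q (equation (13)) for assumed delta, rho and mu values, and
# verify that its column sums reproduce the OBNR means of equation (14).
import numpy as np

tau = 5
delta = np.array([0.6, 0.25, 0.1, 0.05, 0.0])    # delta_0..delta_{tau-1}
rho = np.array([0.4, 0.5, 0.6, 0.5, 0.45])       # rho_1..rho_tau
mu = np.array([12.0, 18.0, 25.0, 30.0, 28.0])    # mu_1..mu_tau = Lambda_t * R_t

Q = np.zeros((tau, tau))
for i in range(tau):                              # rows correspond to report times tau..1
    for j in range(i, tau):
        Q[i, j] = delta[j - i] * rho[tau - 1 - j]

mu_rev = mu[::-1]                                 # [mu_tau, ..., mu_1]
expected_C = Q @ mu_rev                           # [E[C_tau], ..., E[C_1]]

F = np.cumsum(delta)                              # F_0..F_{tau-1}
obnr_means = Q.sum(axis=0) * mu_rev               # column sums times mu
assert np.allclose(obnr_means, rho[::-1] * F * mu_rev)  # rho_t * F_{tau-t} * mu_t
```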

Using equation (14) and analogs to Poisson log-likelihood definitions from equation (10) we derive the FI that \({C}_{1}^{\tau }\) contains about Rt as follows:

$${{\mathbb{F}}}_{C}({R}_{t})={\rho }_{t}{F}_{\tau -t}{\varLambda }_{t}{R}_{t}^{-1}.$$
(15)

As in equation (11) we recompute the FI in equation (15) under the transform \({{{{\mathcal{R}}}}}_{t}=2\sqrt{{R}_{t}}\) to obtain \({{\mathbb{F}}}_{C}({{{{\mathcal{R}}}}}_{t})={\rho }_{t}{F}_{\tau -t}{\varLambda }_{t}\). It is clear that under-reporting and delays can substantially reduce our information about instantaneous reproduction numbers. As we might expect, if ρt = 0 (no reports at time unit t) or Fτ−t = 0 (all delays are larger than τ − t) then we have no information on Rt at all from \({C}_{1}^{\tau }\). If reporting is perfect then ρt = 1 and Fτ−t = 1, and \({{\mathbb{F}}}_{C}({R}_{t})\) is equal to the FI from \({I}_{1}^{\tau }\) in equation (11).

The MLE, \({\hat{R}}_{t}\), also follows from equation (14) (see subsections above) as \((\mathop{\sum }\nolimits_{x = t}^{\tau }{C}_{x}\,| \,{R}_{t}){({\rho }_{t}{F}_{\tau -t}{\varLambda }_{t})}^{-1}\), with \({C}_{x}\,| \,{R}_{t}\) as the component of Cx containing information about Rt. By comparison with the MLE under perfect surveillance we see that \((\mathop{\sum }\nolimits_{x = t}^{\tau }{C}_{x}\,| \,{R}_{t}){({\rho }_{t}{F}_{\tau -t})}^{-1}\) is equivalent to applying a nowcasting correction as in refs. 12,13. An important point to make here is that, while such corrections can remove bias, allowing inference despite these noise sources, they cannot improve on the information (in this case equation (15)) inherently available from the data. This is known as the data processing inequality63,64.
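A short sketch of equation (15) and this corrected MLE is below; the inputs (Λt, ρt, the cumulative delay terms and the OBNR case components) are placeholder values for illustration only.

```python
# Sketch: OBNR information (equation (15), transformed scale) and the
# nowcasting-style corrected MLE. All input values below are assumed.
import numpy as np

Lam = np.array([20.0, 26.0, 33.0, 37.0])          # Lambda_1..Lambda_4
rho = np.array([0.5, 0.5, 0.6, 0.6])              # reporting fractions rho_t
F_rem = np.array([1.0, 0.95, 0.85, 0.6])          # F_{tau-t}: less delay mass for recent t
cases_for_t = np.array([22.0, 28.0, 30.0, 16.0])  # OBNR case components sourced at t

R_obnr = cases_for_t / (rho * F_rem * Lam)        # corrected MLE for each R_t
FI_transformed = rho * F_rem * Lam                # F_C(2*sqrt(R_t)) from equation (15)
```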

If we cannot resolve the components of every Ct from equation (1) as \(\mathop{\sum }\nolimits_{x = 1}^{t}{{{\rm{Pois}}}}({\delta }_{t-x}{\rho }_{x}{\varLambda }_{x}{R}_{x})\), then the reporting delay is classed as NEVR (that is, we never uncover case source dates). Hence we know \(Q{\mathbf{\upmu}}\) but not \({Q}^{{\mathrm{T}}}{\mathbf{\upmu}}\). Accordingly, we must use equation (1) to construct an aggregated log-likelihood \(\ell ({R}_{1}^{\tau })=\log {\mathbb{P}}({C}_{1}^{\tau }\,| \,{R}_{1}^{\tau })=\mathop{\sum }\nolimits_{t = 1}^{\tau }\log {\mathbb{P}}({C}_{t}\,| \,{R}_{1}^{t})\). This gives the following equation with the aggregate term \(h({R}_{1}^{t}):=\mathop{\sum }\nolimits_{x = 1}^{t}{\delta }_{t-x}{\rho }_{x}{\varLambda }_{x}{R}_{x}\). We ignore constants that do not depend on any Rt in this likelihood.

$$\ell ({R}_{1}^{\tau })=\mathop{\sum }\limits_{t=1}^{\tau }\left({C}_{t}\log h({R}_{1}^{t})-h({R}_{1}^{t})\right).$$
(16)

For every given Rt we decompose \(h({R}_{1}^{s})\) for s ≥ t into the form \({\delta }_{s-t}{\rho }_{t}{\varLambda }_{t}{R}_{t}+{a}_{s}\), where as collects all terms that are not informative about this specific Rt. Here s ≥ t simply indicates that information about Rt is distributed across later times due to the reporting delays.

We can then obtain the FI contained in \({C}_{1}^{\tau }\) about Rt by computing \({\mathbb{E}}[-\frac{{\partial }^{2}\ell ({R}_{1}^{\tau })}{\partial {R}_{t}^{2}}]\), yielding the following equation (see Supplementary Notes for derivation details), with \({b}_{x}:={a}_{x}{({\delta }_{x-t}{\rho }_{t}{\varLambda }_{t})}^{-1}\).

$${{\mathbb{F}}}_{C}({R}_{t})=\mathop{\sum }\limits_{x=t}^{\tau }{\delta }_{x-t}{\rho }_{t}{\varLambda }_{t}{({R}_{t}+{b}_{x})}^{-1}.$$
(17)

If we could decouple the interactions among the reproduction numbers then the bx terms would disappear and we would recover the expressions derived under OBNR delay types. Since bx is a function of other reproduction numbers, the overall FI matrix for \({R}_{1}^{\tau }\) is not diagonal (there are non-zero terms from evaluating \({\mathbb{E}}[-\frac{{\partial }^{2}\ell ({R}_{1}^{\tau })}{\partial {R}_{t}\partial {R}_{x}}]\)).

However, we find that this matrix can be reduced to a triangular form with determinant equal to the product of terms (across t) in equation (17). We show this for the example scenario of τ = 3 in Supplementary Notes. As a result, the FI term for Rt in equation (17) does behave like and correspond to that in equation (15). Interestingly, as bx ≥ 0, equation (17) yields the revealing inequality \({{\mathbb{F}}}_{C}({R}_{t})\le {\rho }_{t}{F}_{\tau -t}{\varLambda }_{t}{R}_{t}^{-1}\). This proves that OBNR delays upper bound the information available from NEVR delays. Last, we note that robust transforms cannot be applied to remove the dependence of equation (17) on the unknown Rt parameters.
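A quick numerical check of this inequality is sketched below for a single Rt, with all values (delays, reporting fractions, Λt and the true R profile) assumed for illustration and bx defined as in equation (17).

```python
# Sketch: verify that the NEVR information of equation (17) never exceeds the
# OBNR information rho_t * F_{tau-t} * Lambda_t / R_t. All values are assumed.
import numpy as np

tau, t = 4, 1                                    # examine R_t at 0-indexed time t
delta = np.array([0.6, 0.25, 0.1, 0.05])
rho = np.array([0.5, 0.5, 0.6, 0.6])
Lam = np.array([20.0, 26.0, 33.0, 37.0])
R = np.array([1.8, 1.5, 1.1, 0.9])               # hypothetical true R values
F = np.cumsum(delta)

fi_nevr = 0.0
for x in range(t, tau):
    a_x = sum(delta[x - y] * rho[y] * Lam[y] * R[y] for y in range(x + 1) if y != t)
    b_x = a_x / (delta[x - t] * rho[t] * Lam[t])
    fi_nevr += delta[x - t] * rho[t] * Lam[t] / (R[t] + b_x)

fi_obnr = rho[t] * F[tau - 1 - t] * Lam[t] / R[t]   # matches equation (15)
assert fi_nevr <= fi_obnr + 1e-12
```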

The best we can do is evaluate equation (17) at the MLEs \({\hat{R}}_{t}\). These MLEs emerge as the joint roots of the set of coupled score equations \(\frac{\partial \ell ({R}_{1}^{\tau })}{\partial {R}_{t}}=\mathop{\sum }\nolimits_{x = t}^{\tau }\left(\frac{{C}_{x}}{{b}_{x}+{R}_{t}}-{\delta }_{x-t}{\rho }_{t}{\varLambda }_{t}\right)=0\), that is, numerical solutions of equation (18) for all t.

$$\mathop{\sum }\limits_{x=t}^{\tau }{C}_{x}{({\hat{R}}_{t}+{b}_{x})}^{-1}={\rho }_{t}{F}_{\tau -t}{\varLambda }_{t}.$$
(18)

Here sums start at t as they include only time points that contain information about Rt. Expectation-maximization algorithms, such as the deconvolution approaches outlined in ref. 10, are viable means of computing these MLEs or equivalents. Note that the nowcasting methods used to correct for OBNR delays do not help here12 and that for both OBNR and NEVR delays the cumulative probability terms must be aggregated to match chosen time units (for example, if empirical delay distributions are given in days but t is in weeks then Fx sums over 7x days).
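One way to obtain these MLEs, in the spirit of the expectation-maximization deconvolution mentioned above, is sketched below with assumed inputs: each observed Ct is attributed to its possible source times in proportion to the current means, every estimate is updated and the procedure is iterated; at a fixed point the estimates satisfy equation (18), and equation (17) can then be evaluated at them.

```python
# Sketch: expectation-maximization style iteration for the NEVR MLEs solving
# equation (18). The delays, reporting fractions, Lambda_t and cases are assumed.
import numpy as np

tau = 4
delta = np.array([0.6, 0.25, 0.1, 0.05])      # delay probabilities delta_0..delta_3
rho = np.array([0.5, 0.5, 0.6, 0.6])          # reporting fractions rho_1..rho_tau
Lam = np.array([20.0, 26.0, 33.0, 37.0])      # Lambda_1..Lambda_tau
C = np.array([9.0, 12.0, 17.0, 20.0])         # observed (NEVR) cases C_1..C_tau
F = np.cumsum(delta)                          # cumulative delay distribution

R_hat = np.ones(tau)                          # initial guesses
for _ in range(500):
    # lam[x, y]: expected cases reported at x that were sourced at time y
    lam = np.array([[delta[x - y] * rho[y] * Lam[y] * R_hat[y] if x >= y else 0.0
                     for y in range(tau)] for x in range(tau)])
    h = lam.sum(axis=1)                                      # h(R_1^x), mean of C_x
    attributed = (C / h)[:, None] * lam                      # E-step: split each C_x
    R_new = attributed.sum(axis=0) / (rho * Lam * F[::-1])   # M-step
    if np.max(np.abs(R_new - R_hat)) < 1e-10:
        R_hat = R_new
        break
    R_hat = R_new
```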