Differences between the true reproduction number and the apparent reproduction number of an epidemic time series

The time-varying reproduction number $R(t)$ measures the number of new infections per infectious individual and is closely correlated with the time series of infection incidence by definition. The timings of actual infections are rarely known, and analysis of epidemics usually relies on time series data for other outcomes such as symptom onset. A common implicit assumption, when estimating $R(t)$ from an epidemic time series, is that $R(t)$ has the same relationship with these downstream outcomes as it does with the time series of incidence. However, this assumption is unlikely to be valid given that most epidemic time series are not perfect proxies of incidence. Rather they represent convolutions of incidence with uncertain delay distributions. Here we define the apparent time-varying reproduction number, $R_A(t)$, the reproduction number calculated from a downstream epidemic time series and demonstrate how differences between $R_A(t)$ and $R(t)$ depend on the convolution function. The mean of the convolution function sets a time offset between the two signals, whilst the variance of the convolution function introduces a relative distortion between them. We present the convolution functions of epidemic time series that were available during the SARS-CoV-2 pandemic. Infection prevalence, measured by random sampling studies, presents fewer biases than other epidemic time series. Here we show that additionally the mean and variance of its convolution function were similar to that obtained from traditional surveillance based on mass-testing and could be reduced using more frequent testing, or by using stricter thresholds for positivity. Infection prevalence studies continue to be a versatile tool for tracking the temporal trends of $R(t)$, and with additional refinements to their study protocol, will be of even greater utility during any future epidemics or pandemics.


Introduction
Infectious disease epidemics are a major threat to public health.Over the past two decades there has been major outbreaks of SARS [1], MERS [2], influenza [3], Ebola [4], dengue [5] and most recently the emergence of SARS-CoV-2 [6], which resulted in the COVID-19 pandemic.Accurately quantifying the transmission dynamics of infection is crucial for informing decisions about changes to policy on public health interventions.
The reproduction number, the expected number of secondary infections resulting from a typical primary infection, is a key quantity describing a pathogen's transmission dynamics.If public health interventions are to be implemented, R(t) can inform on the magnitude of the interventions required to bring the epidemic under control [7].Analysis of how R(t) changes over time is also important for assessing the impact of interventions including vaccination and social distancing.[8,9].
Mathematically, R(t) is linked to the time series of infection incidence (the rate of new infections) by a renewal process.If the infection incidence and the generation time (the time between a primary and secondary infection) distribution are known, then R(t) can simply be calculated [10].However, the timings of infections are rarely known, and so accurate time series of infection incidence rarely exist.Estimates of R(t) must instead rely on other epidemic time series that are often treated as proxies for the infection incidence [11].
During an epidemic there are often many sources of time series data collected.These can include measurements of the frequency of downstream points in the natural history of infection such as onset of symptoms, hospitalisation, and death [12].During the SARS-CoV-2 pandemic, time series data for infection prevalence (the proportion of people testing positive for the virus) was also available in England [13,14].Infection prevalence time series are often less biased than other epidemic time series, which can be biased by changes in behaviour (health-care seeking behaviour, test-seeking behaviour, etc) or in severity of the virus (e.g, due to vaccination).However, there were concerns that such data would be ill-suited for estimating R(t) due to the presence of long-term shedders -individuals who continue to test positive for a long duration of time, following infection.
In general, an epidemic time series is linked to the time series of infection incidence by a convolution; if the function for the convolution is known then it is possible to estimate the time series of incidence, and from that R(t) [15].However, in many instances the form of this function is not known accurately (or at all) and so R(t) cannot be straightforwardly calculated.In these instances, it is common practice for the epidemic time series to be treated as a reasonable proxy for the infection incidence and estimates of R(t) are obtained by assuming the same renewal process between downstream points in the natural history of infection, for example from onset to onset or from death to death [11].Often this assumption is still made even when an estimate of the convolution function could be made [16][17][18]; despite the existence of computer packages that allow the inclusion of a convolution function when estimating R(t) [19,20].
There are other potential biases in epidemic time series which may influence estimates of R(t).However, biases introduced by assuming the epidemic time series to be a perfect proxy for infection incidence have often been overlooked.Here we investigate the form that these biases take and how they depend on the relationship between an epidemic time series and the infection incidence.We further investigate how estimates of R(t) made during the SARS-CoV-2 pandemic may have been biased by the epidemic time series that were available (onset of symptoms, infection prevalence, hospitalisations and deaths considered), and how data could be better collected and used to minimise these biases.

The relationship between incidence and R(t)
The reproduction number can be written as where η(τ ) describes the rate of secondary infections as a function of the time since a primary infection, τ .Normalising η(τ ) gives the generation time, the probability distribution describing the time between primary and secondary infections.The incidence at time t, I(t) can then be written in terms of the incidence at times less than t: Where R(t) is the reproduction number at time t.Expressed in this way the reproduction number at time t can be calculated directly from the time-series of incidence (if the generation time function is known -which we will assume throughout) through the equation:

The relationship between epidemic time series and R(t)
A general epidemic time-series, J(t), can be written in terms of the incidence at times less than t and a convolution function, f (τ ), that describes the relationship between the two time series: The convolution function will take different forms depending on the epidemic time series.For example, for the time series of deaths due to an infectious disease, f (τ ) would describe the average rate of deaths as a function of the time since infection.Substituting the relationship in equation 3 into equation 5 we obtain: In general there is not a simple relationship between J(t) and R(t).To calculate R(t) from J(t) the convolution function, f (τ ), would need to be known (and of course g(τ )).However, it is often assumed that the same relationship exists between J(t) and R(t), as the one between I(t) and R(t): Here R A (t) is the apparent reproduction number inferred from a general epidemic time series under this incorrect assumption, and is given by: In general R A (t) = R(t), but under certain conditions relationships between the two will hold.

R(t) = constant
If R(t), has been constant for the period of time over which f (τ ) goes to 0 then equation 6 can be written as: Where we have written R(t) as R(t) to make it clear this equation is only valid when R(t) is constant.In this situation the relationship between J(t) and R(t) is the same as the relationship between I(t) and R(t) and we have If the convolution function, f (τ ), is the Dirac-delta function with mean µ, δ(µ − τ ), then J(t) can be written as: R A (t) can then be written as: 3 Results

The effect of convolution on estimates of R(t)
Trends in the apparent reproduction number, R A (t), lagged trends in the actual reproduction number, R(t), in ways that were dependent on the shape of the underlying infection time series.Infection incidence was simulated for two time series of R(t): sinusoidal and square waves.A sinusoidal wave describes gradual changes in R(t); gradual changes in R(t) would be expected over the course of an epidemic (due to the depletion of susceptibles).A square wave describes rapid changes in R(t), which would be expected when some public health interventions (e.g., lockdowns, school closures) are implemented.Epidemic time series and their associated R A (t) were then calculated for different gamma distributed convolution functions (Figure 1).As we would expect, the greater the mean of the convolution function, the greater the lag between R A (t) and R(t).When R(t) changed gradually (simulated as a sine function in time) (Figure 1b) there was a delay between R(t) and R A (t) reaching their respective maximums (Figure 2c).This time delay was approximately equal to the mean of the convolution function.When R(t) underwent a step decrease (figure 1a) there was a delay between R(t) decreasing and R A (t) decreasing (with similar results for a step increase).Measuring the time delay between R(t) decreasing (a step change) and R A (t) falling to under 95% of its maximum value (Figure 2a) we observed a greater delay for convolution functions with greater means and lower standard deviations.
In addition to R A (t) lagging R(t), trends in R A (t) (the shape of the curve Fig. 2 Diagnostic differences between R(t) and R A (t). Diagnostic differences between R(t) and R A (t), for different gamma-distributed convolution functions.Only convolution functions with a mean value greater than their standard deviation are included.(A,B) Differences between R A (t) and R(t), when R(t) describes a step decrease (see Figure 1a, day 50).(A) The number of days it takes for R A (t) to fall below 95% of its maximum value following the step change in R(t).(B) The number of days between R A (t) falling below 95% of its maximum value and falling below 5% of its maximum value.(C,D) Differences between R A (t) and R(t), when R(t) describes a sine wave (see Figure 1b).(C) The number of days between R(t) reaching its maximum value and R A (t) reaching its maximum value.(D) A measure of the difference between the peak values of R(t) and R A (t), defined as 1 it reflects the proportion of the maximum value that R A (t) reaches.
over time) were distorted relative to R(t), with increases in the standard deviation of the convolution function resulting in greater levels of distortion.When R(t) changed gradually (simulated as a sine function in time) the maximum value of R A (t) was lower than the maximum value of R(t).The greater the standard deviation of the convolution function the greater the relative reduction in the maximum value of R A (t) (Figure 2d).When R(t) underwent a sharp decrease (a step change over a single day) R A (t) decreased at a slower rate (Figure 1a).Measuring the time delay between R A (t) falling from 95% of its maximum value to 5% of its maximum value we observed a greater delay when the standard deviation of the convolution function was greater (Figure 2b).

Case study: SARS-CoV-2 epidemic time series
For all epidemic time series considered (that would have been available during the SARS-CoV-2 pandemic) there were clear differences between R A (t) and R(t), during a representative scenario (chosen to resemble trends in R(t) for SARS-CoV-2 in England from March to December 2020) (Figure 3).During periods when R(t) was constant, the value of R A (t) was the same (as expected).
When R(t) changed, R A (t) lagged R(t) and had its shape distorted in a way that was dependent on the shape of the convolution function.
The values of R A (t) calculated from the time series of symptom onset most closely resembled the underlying R(t).The mean and standard deviation of the convolution function for symptom onset was lower than all other time series considered (Figure 3b).This was expected for deaths and hospitalisations, for which the convolution functions were composed of the distributions of time from infection to symptom onset (incubation period) and time from symptom onset to outcome (death/hospitalisation) (see Methods).Interestingly, the peak probability density of the convolution function for infection prevalence was the same as for symptom onset, but due to a long tail in the distribution the mean and standard deviation were greater.

Improving inference using frequent testing
Increasing the frequency of random/asymptomatic testing reduced the mean and standard deviation of the convolution function (Figure 4c).Random testing or asymptomatic testing for SARS-CoV-2 should, in the absence of repeat testing, have a convolution function describing the probability of testing positive following infection.In our simulations, increasing the frequency of testing decreased the mean convolution function, as infections were more likely to be identified earlier.Testing once every three days gave a convolution function with a lower mean than that of symptom onset.Testing every day resulted in a convolution function with mean and standard deviation of approximately 3 days and 1 day respectively.The lower mean and standard deviation of the convolution function resulted in R A (t) estimates more closely resembling R(t) as expected (Figures 4a and 4b).less likely that early-infections or long-term shedders are included within the epidemic time series.This led to substantial decreases in the standard deviation of the convolution function, and smaller decreases in the mean of the convolution function.Accordingly, this led to the shape of R A (t) more closely resembling that of R(t), while the lag between the two signals stayed approximately constant (Figure 5a and 5b).Imposing a stricter threshold reduced the number of individuals classified as positive and therefore the power of any analysis (Figure 5d).

Discussion
During the SARS-CoV-2 pandemic, estimates of the time-varying reproduction number, R(t), were crucial for informing public health interventions.In most counties there were numerous epidemic time series available for estimating R(t).We have described the biases that will be present in estimates of 'R(t)' when it is assumed that these time series are reliable proxies for infection incidence (and not the convolution of infection incidence with some other function).This assumption is commonly made [11], despite methods existing to incorporate estimates of the convolution function into estimates of R(t) [19].We presented the convolution functions representing four different epidemic time series for SARS-CoV-2: date of symptom onset (symptom reporting studies), prevalence (rt-PCR testing of random samples or asymptomatic testing), hospitalisations and deaths.Additionally, the convolution function for casenumbers, obtained from mass testing (available in many countries worldwide), would likely be a linear combination of the convolution functions for prevalence (asymptomatic testing) and symptom onset (symptomatic testing) with an additional delay to test distribution (delay between symptom onset and positive test).
If the exact form of these convolution function was known, then estimates of R(t) could easily be made from the corresponding epidemic time series [15], but this is not always the case.During the early stages of a pandemic there is little epidemic information, and so convolution functions cannot be inferred.During later periods of a pandemic there can still be great uncertainty in estimates of the convolution function.There is also no guarantee that the convolution functions remain constant over time.For example, rates of symptomatic infection can change (effecting mass testing) [21,22], severity can change (effecting hospitalisation and death time series) [23], and even the duration an individual remains test positive can change due to new variants [24], or due to vaccination [25].Thus, there are times when only the apparent reproduction number, R A (t), can be estimated, and so understanding the inherent biases in these estimates is crucial.
Changes in R A (t) lagged changes in R(t), with the duration of the lag controlled by the mean of the convolution function.If timeliness in R(t) estimates are required (e.g., for pandemic surveillance), then this time lag is a major limitation (though estimates could still be useful for retrospective analysis).The temporal trends in R A (t) were distorted relative to R(t), with a greater degree of distortion when variance in the convolution function was greater; If there was no variance in the convolution function (standard deviation=0), R A (t) would exhibit the exact same temporal trends as R(t) (lagged by the mean of the convolution function).This distortion was most pronounced when R(t) changed rapidly, and less pronounced when R(t) changed more gradually.During many epidemics only gradual changes in R(t) are observed (e.g., due to the changing proportion of susceptibility as individuals are infected and develop immunity) and such distortions would be minimal.However, in some instances R(t) may change rapidly, such as when restrictions are implemented [26] -for example lock downs during the SARS-CoV-2 pandemic -or when schools are closed/opened [8].In these instances, there will be a high degree of distortion between R(t) and R A (t).
The convolution function for SARS-CoV-2 symptom onset had the smallest mean and variance, of the four convolution functions considered, which would result in estimates of R A (t) that were the most similar to R(t).However, there are often multiple circulating pathogens which may present similar symptom profiles.To estimate R A (t), it must be determined which individuals reporting symptoms are infected with the relevant pathogen.For this symptomatic testing the exact convolution function would rely on a distribution describing the time between symptom onset and testing (testing delay); this would likely increase the convolution function's mean and variance.To improve estimates of R A (t) the testing delay distribution should be reduced as much as possible.Additionally, the date of symptom onset should be routinely collected; this would reduce the mean and variance of the convolution function (no testing delay component required) and would also allow R(t) to be estimated (once the incubation period has been estimated) as the convolution function would be less likely to vary over time (testing delay can change over time).
Random sampling has been proposed as a method for tracking COVID-19 infections [27], without the biases present in standard pandemic surveillance methods such as mass testing [28].There has been some concern that such studies would be less well suited for estimating R(t), due to the presence of long-term shedders -individuals who continue to test positive for a long duration of time, following infection.Random sampling studies measure the infection prevalence, the proportion of a population that test positive for the virus.We have demonstrated that the convolution function for prevalence of SARS-CoV-2 has a mean and variance only slightly greater than that of symptom onset.In fact, compared to mass-testing, which relies heavily on symptomatic testing, it is likely that the mean and variance of the convolution functions would be comparable, depending on the testing delay distribution.
The convolution function for symptomatic testing can at best match the incubation period of the pathogen.In contrast we have shown that there exist methods for improving the inference of R A (t) based on prevalence studies.Imposing a stricter threshold for positivity can reduce variance of the convolution function (though larger samples may be required as more positive tests would be excluded).It is worth noting that the Ct value model we presented is not a perfect analogue to rt-PCR positivity and so results are informative, but not the exact outcome that would be obtained using a stricter threshold for positivity.Increasing the frequency of testing could also reduce both the mean and variation of the convolution function.We observed that for daily testing the convolution function's mean and variance was smaller than that of the incubation period of SARS-CoV-2.It is highly unlikely that a random sampling study, which is very expensive, could be designed to undertake such high rates of testing.However, the convolution function presented would also be valid for asymptomatic testing.Regular asymptomatic testing was performed by many individuals and businesses throughout the pandemic.Studies or surveillance systems set up around frequent asymptomatic testing could result in estimates of R A (t) that more closely resemble R(t).In the latter stages of the pandemic Lateral Flow Viral Antigen detection devices (LFDs) were regularly used by some individuals.LFDs are less sensitive in general than rt-PCR testing, but they are still highly sensitive at detecting infections with high viral loads [29].This is at times a limitation as less infected individuals will be identified, but the resulting convolution function would likely have even less variation.
We have only considered the biases in R(t) due to the convolution function.There are often many other sources of bias associated with epidemic time series data [28].For example, changing testing rates can bias estimates of R(t) obtained from cases identified through mass testing [30].Additionally, estimates of R(t) also rely on accurate estimates of the generation time distribution [10].The generation time is not always well characterised (especially during the early period of a pandemic) and in some circumstances it can change over time [31].When the generation time is not known (or poorly estimated), the growth rate is regularly used to quantify epidemic growth and decline as an alternative to R(t) [32].However, though the growth rate is independent of generation time, estimating it from a general epidemic time series would introduce a similar set of biases as for R(t) due to the convolution function.

Conclusion
There are clear biases when estimating R(t) from an epidemic time series, under the assumption that the time series is a perfect proxy for infection incidence.By collecting data to inform on the convolution function linking the epidemic time series to incidence, estimates of R(t) could be made without such biases.During periods in which the convolution function cannot be estimated, the apparent reproduction number can and should still be estimated.Despite its biases, it is highly useful for quantifying transmission dynamics and informing public health interventions.To minimise the inherent biases present, more careful consideration should be given into what epidemic data is used and how it can be better collected and manipulated.Additionally, studies in which it is the apparent reproduction number being estimated should state this explicitly, making it clear what possible biases may be present.

Simulating epidemic time series
The time series of R(t) are pre-specified.R(t) is assumed to be 1 for a period of 100 days (days -99 to 0) prior to the simulation for ease of later integral calculations (integration's are all performed over the previous 100 days).The time series of R(t) can then be defined in any way.We define two main time series of R(t) following a square wave and a sine wave, both with a period of 100 days, a maximum value of 1.5 and a minimum value of 0.5.We also define a time series of R(t) that follows approximately the same trend as the estimated R(t) in England from March 2020 to December 2020 [26].With a time series of R(t), the time series for infection incidence can be calculated through equation 3: The generation time is assumed to be gamma distributed with a mean of 6.36 days and a standard deviation of 4.20 days [33].This was chosen to match the generation time of SARS-CoV-2, but as we assume perfect knowledge of the generation time all results are independent of the choice of the generation time distribution used.When calculating the value of I(t) we perform the integral over the previous 100 days.This ensures the integral has fallen to approximately 0. Days -99 to 0 are assumed to have I(t) = 1.General epidemic time series, J(t) are calculated directly from I(t) through equation 5: The integration is again performed over the previous 100 days, ensuring the integral has fallen to approximately 0. f (τ ) is a convolution function which we define over a period of only 100 days.

Calculating the reproduction number
We calculate the apparent reproduction number, R A (t), directly from the general epidemic time series J(t) through equation 8: The integration is again performed over the previous 100 days, ensuring the integral has fallen to approximately 0. As before g(ζ) is the generation time distribution which is known exactly when performing the calculation.

Convolution functions
We define many convolution functions, f (τ ), for converting I(t) into a general epidemic time-series.All convolution functions are defined only for values of τ between 0 and 100.

Gamma-distributed convolution functions
The gamma-distributed convolution functions are defined using the gamma distribution.A gamma distribution defined with scale parameter, k, and shape parameter, θ has mean, kθ and standard deviation, k 1/2 θ.We then define our gamma-distributed convolution functions with mean, µ and standard deviation, σ using the inverse relationships: k = µ 2 /σ 2 and θ = σ 2 /µ.

Convolution functions for SARS-CoV-2 epidemic time series
The convolution function for symptom onset is defined by the incubation period of SARS-CoV-2, which we assume is gamma distributed with mean 6.0 days and standard deviation 3.1 days [34].
The convolution functions for deaths and hospitalisations are calculated using the equation: In this equation f 1 (τ ) describes the incubation period of SARS-CoV-2 (same gamma distribution as above) and f 2 (ζ) describes the distribution of the time between symptom onset and death/hospitalisation.For deaths we assume f 2 (ζ) is gamma distributed with mean 15.0 days and standard deviation 6.9 days [34].For hospitalisations we assume f 2 (ζ) is gamma distributed with mean 7.8 days and standard deviation 6.0 days [35].The parameters used for all distributions are only meant to be informative, there are many other potential values that could have been selected from the literature.The convolution function for prevalence is defined using the probability of testing positive as a function of time since infection.We use the median of the modelled probability of testing positive (as a function of time since infection) estimated in Hellewell et al 2021 [36].

Convolution function for frequent testing
The convolution function for frequent testing is calculated using the distribution describing the probability of testing positive as a function of time since infection (the convolution function for prevalence).When there is frequent testing the convolution function describes the day in which an individual first tests positive only.Writing the probability of testing positive τ days after infection as P (τ ), we can calculate the probability of first testing positive τ days after infection as: P (first positive, τ ) = P (τ ) where N is the total number of tests performed since infection and γ is the number of days between tests.The product in the above equation is calculating the probability of testing negative in all previous tests.

Convolution function for different Ct value thresholds
The convolution function for different Ct value thresholds of positivity was calculated by simulating the trajectory of Ct values over time since infection for 1000 individuals.The proportion of individuals with a Ct value less than a threshold Ct value was then calculated as a function of time since infection.Ct values were simulated using the results and data of Hay et al 2022 [37] (Ct values represent ORF1ab).In the paper an individual's Ct value trajectories can be quantified using three parameters: the minimum Ct value reached, the time taken for Ct value to decrease from the limit of detection (Ct value=40) to its minimum value (linear decrease assumed), and the time taken for Ct value to increase from its minimum value to the limit of detection (linear increase assumed).The parameters for an individual are assumed to be drawn from parameter distributions describing the population as a whole.To simulate 1000 individuals we extracted 1000 individual level parameters from the population parameter distributions estimated for the model fit to the data for

Fig. 1
Fig. 1 Estimates of the time-varying reproduction number from epidemic time series with different gamma distributed convolution functions.(A, B) The actual reproduction number, R(t) (black), and the apparent reproduction number, R A (t) (coloured), for epidemic time series with different gamma distributed convolution functions.(C) The gamma distributed convolution functions used in (A) and (B).

Fig. 3
Fig.3Trends in R(t) and R A (t) for representative epidemic time series during the SARS-CoV-2 pandemic.(A) The actual reproduction number, R(t) (black), and the apparent reproduction number, R A (t) (coloured), for epidemic time series with different convolution functions.R(t) was chosen to resemble a possible trajectory of R(t) in England from May 2020 to December 2020 (B) The probability distribution of the convolution functions for deaths (Green), hospitalisations (Orange), prevalence (Purple), symptom onset (Pink).The mean and standard deviation of all convolution functions are presented in the inset graph.

Fig. 4
Fig.4The effect of increasing testing frequency on estimates of R A (t). (A,B) The actual reproduction number, R(t) (black), and the apparent reproduction number, R A (t) (coloured), for epidemic time series with different convolution functions.(C) The probability distribution of convolution functions for different frequencies of rt-PCR testing.Testing frequency of 'every 100 days' reflects no repeat testing and has the same convolution function as for prevalence in figure3.The mean and standard deviation of all convolution functions are presented in the inset graph.

Imposing a stricter viralFig. 5
Fig.5The effect of imposing stricter Ct thresholds for positivity on estimates of R A (t). (A,B) The actual reproduction number, R(t) (black), and the apparent reproduction number, R A (t) (coloured), for epidemic time series with different convolution functions.(C) The probability distribution of convolution functions for different Ct thresholds of rt-PCR positivity.Note that the convolution function was defined from a model of population Ct values rather that rt-PCR positivity, and so will not match the convolution function for prevalence in figure3.The mean and standard deviation of all convolution functions are presented in the inset graph.(D) The proportion of individual tests classified as positive relative to the number defined as positive with a Ct threshold of 37.