Defining genomic epidemiology thresholds for common-source bacterial outbreaks: a modelling study

Summary Background Epidemiological surveillance relies on microbial strain typing, which defines genomic relatedness among isolates to identify case clusters and their potential sources. Although predefined thresholds are often applied, known outbreak-specific features such as pathogen mutation rate and duration of source contamination are rarely considered. We aimed to develop a hypothesis-based model that estimates genetic distance thresholds and mutation rates for point-source single-strain food or environmental outbreaks. Methods In this modelling study, we developed a forward model to simulate bacterial evolution at a specific mutation rate (μ) over a defined outbreak duration (D). From the distribution of genetic distances expected under the given outbreak parameters and sample isolation dates, we estimated a distance threshold beyond which isolates should not be considered as part of the outbreak. We embedded the model into a Markov Chain Monte Carlo inference framework to estimate the most probable mutation rate or time since source contamination, which are both often imprecisely documented. A simulation study validated the model over realistic durations and mutation rates. We then identified and analysed 16 published datasets of bacterial source-related outbreaks; datasets were included if they were from an identified foodborne outbreak and if whole-genome sequence data and collection dates for the described isolates were available. Findings Analysis of simulated data validated the accuracy of our framework in both discriminating between outbreak and non-outbreak cases and estimating the parameters D and μ from outbreak data. Precision of estimation was much higher for high values of D and μ. Sensitivity of outbreak cases was always very high, and specificity in detecting non-outbreak cases was poor for low mutation rates. For 14 of the 16 outbreaks, the classification of isolates as being outbreak-related or sporadic is consistent with the original dataset. Four of these outbreaks included outliers, which were correctly classified as being beyond the threshold of exclusion estimated by our model, except for one isolate of outbreak 4. For two outbreaks, both foodborne Listeria monocytogenes, conclusions from our model were discordant with published results: in one outbreak two isolates were classified as outliers by our model and in another outbreak our algorithm separated food samples into one cluster and human samples into another, whereas the isolates were initially grouped together based on epidemiological and genetic evidence. Re-estimated values of the duration of outbreak or mutation rate were largely consistent with a priori defined values. However, in several cases the estimated values were higher and improved the fit with the observed genetic distance distribution, suggesting that early outbreak cases are sometimes missed. Interpretation We propose here an evolutionary approach to the single-strain conundrum by estimating the genetic threshold and proposing the most probable cluster of cases for a given outbreak, as determined by its particular epidemiological and microbiological properties. This forward model, applicable to foodborne or environmental-source single point case clusters or outbreaks, is useful for epidemiological surveillance and may inform control measures. Funding European Union Horizon 2020 Research and Innovation Programme.

To address uncertainty regarding the time since initial source contamination and evolutionary rate, we embed our 84 model into a statistical framework to estimate either the duration of the outbreak (D) or its substitution rate (μ)

85
( Figure 1B). We estimate D and µ from the observed pairwise genetic distance matrix by minimizing the distance outbreaks for which parameters were known. L, N and g were fixed as above. Outbreaks generated using of parameters, 20 individual outbreaks were run, leading to 2400 different simulated datasets. For each outbreak,

111
we estimated D and μ separately with three independent MCMC chains.

115
The possible food source of contamination was chocolate mousse. Among 47 cases, 13 isolates were sequenced 116 from patients. One isolate differed by 12 SNPs and another by five SNPs, which was interpreted as at least three 117 distinct strains being present in the food source. We used the authors' high substitution rate and high duration of 118 outbreak for D lit and μ lit .

119
Outbreaks 2 and 3 also involved Salmonella Typhimurium in Australia (2). These two outbreaks occurred between hot bread shop. At least 25 isolates for outbreak 2 and 20 isolates for outbreak 3 were sequenced and available (as 122 well as some sporadic cases that we did not consider in the present paper). We used the same substitution rate for 123 μ lit as for Outbreak 1 and duration of outbreaks D lit were chosen as the duration between the first and the last 124 sample dates (+24H).
Outbreaks 4 (3) and 5 (4) involved Campylobacter jejuni infections caused by contaminated chicken liver pâté in were identified, ST528 (five isolates), ST535 (two isolates) and ST991 (two food isolates); here we focused on the 25 isolates were a mix of outbreak and background isolates, with only 11 isolates ultimately being attributed a shorter D and the highest μ. Estimating the genetic threshold (with the classical 100 simulations) is very fast and 158 takes around 0.1, 0.1, 0.2 and 1.6 seconds for outbreaks 1, 6, 10 and 5 respectively. Nevertheless, estimation of D 159 or μ with one MCMC chain is much longer and takes around 40, 47, 100 and 1500 seconds for the estimation of μ 160 as well as 46, 70, 43, and 300 seconds for the estimation of D for outbreaks 1, 6, 10 and 5, respectively. The higher

161
were the values of D or μ, the longer the estimation was. Especially, μ had a more important impact on execution 162 time.    Figure S10, S14 and S15). For outbreak 12, a better fit was observed with μ estimated compared to Figure S1. Analysis of outbreak 1: Comparison between distance thresholds derived from the original publication on the outbreak, and from the current modelling framework.
In A, C and E are presented the SNP distance distributions: observed distribution (blue), simulated distribution without estimation (green), simulated with the estimated duration of outbreak (dark red) and simulated with the estimated evolutionary rate (red). Error bars represent the interval of prediction at 95% of 100 simulations. Red vertical lines correspond to the derived distance threshold. In B, D and F are presented the resulting singlelinkage clusters according to the derived distance threshold, defined here as the 99 th percentile of the simulated distributions with values corresponding to panels A, C and E, respectively.

Figure S2. Analysis of outbreak 2: Comparison between distance thresholds derived from the original publication on the outbreak, and from the current modelling framework.
In A, C and E are presented the SNP distance distributions: observed distribution (blue), simulated distribution without estimation (green), simulated with the estimated duration of outbreak (dark red) and simulated with the estimated evolutionary rate (red). Error bars represent the interval of prediction at 95% of 100 simulations. Red vertical lines correspond to the derived distance threshold. In B, D and F are presented the resulting singlelinkage clusters according to the derived distance threshold, defined here as the 99 th percentile of the simulated distributions with values corresponding to panels A, C and E, respectively Figure S3. Analysis of outbreak 3: Comparison between distance thresholds derived from the original publication on the outbreak, and from the current modelling framework.
In A, C and E are presented the SNP distance distributions: observed distribution (blue), simulated distribution without estimation (green), simulated with the estimated duration of outbreak (dark red) and simulated with the estimated evolutionary rate (red). Error bars represent the interval of prediction at 95% of 100 simulations. Red vertical lines correspond to the derived distance threshold. In B, D and F are presented the resulting singlelinkage clusters according to the derived distance threshold, defined here as the 99 th percentile of the simulated distributions with values corresponding to panels A, C and E, respectively Figure S4. Analysis of outbreak 4: Comparison between distance thresholds derived from the original publication on the outbreak, and from the current modelling framework.
In A, C and E are presented the SNP distance distributions: observed distribution (blue), simulated distribution without estimation (green), simulated with the estimated duration of outbreak (dark red) and simulated with the estimated evolutionary rate (red). Error bars represent the interval of prediction at 95% of 100 simulations. Red vertical lines correspond to the derived distance threshold. In B, D and F are presented the resulting singlelinkage clusters according to the derived distance threshold, defined here as the 99 th percentile of the simulated distributions with values corresponding to panels A, C and E, respectively.  In A, C and E are presented the SNP distance distributions: observed distribution (blue), simulated distribution without estimation (green), simulated with the estimated duration of outbreak (dark red) and simulated with the estimated evolutionary rate (red). Error bars represent the interval of prediction at 95% of 100 simulations. Red vertical lines correspond to the derived distance threshold. In B, D and F are presented the resulting singlelinkage clusters according to the derived distance threshold, defined here as the 99 th percentile of the simulated distributions with values corresponding to panels A, C and E, respectively.

Figure S8. Analysis of outbreak 8: Comparison between distance thresholds derived from the original publication on the outbreak, and from the current modelling framework.
In A, C and E are presented the cgMLST distance distributions: observed distribution (blue), simulated distribution without estimation (green), simulated with the estimated duration of outbreak (dark red) and simulated with the estimated evolutionary rate (red). Error bars represent the interval of prediction at 95% of 100 simulations. Red vertical lines correspond to the derived distance threshold. In B, D and F are presented the resulting single-linkage clusters according to the derived distance threshold, defined here as the 99 th percentile of the simulated distributions with values corresponding to panels A, C and E, respectively.

Figure S9. Analysis of outbreak 9: Comparison between distance thresholds derived from the original publication on the outbreak, and from the current modelling framework.
In A, C and E are presented the cgMLST distance distributions: observed distribution (blue), simulated distribution without estimation (green), simulated with the estimated duration of outbreak (dark red) and simulated with the estimated evolutionary rate (red). Error bars represent the interval of prediction at 95% of 100 simulations. Red vertical lines correspond to the derived distance threshold. In B, D and F are presented the resulting single-linkage clusters according to the derived distance threshold, defined here as the 99 th percentile of the simulated distributions with values corresponding to panels A, C and E, respectively. Figure S10. Analysis of outbreak 10: Comparison between distance thresholds derived from the original publication on the outbreak, and from the current modelling framework.
In A, C and E are presented the cgMLST distance distributions: observed distribution (blue), simulated distribution without estimation (green), simulated with the estimated duration of outbreak (dark red) and simulated with the estimated evolutionary rate (red). Error bars represent the interval of prediction at 95% of 100 simulations. Red vertical lines correspond to the derived distance threshold. In B, D and F are presented the resulting single-linkage clusters according to the derived distance threshold, defined here as the 99 th percentile of the simulated distributions with values corresponding to panels A, C and E, respectively.

Figure S11. Analysis of outbreak 11: Comparison between distance thresholds derived from the original publication on the outbreak, and from the current modelling framework.
In A, C and E are presented the cgMLST distance distributions: observed distribution (blue), simulated distribution without estimation (green), simulated with the estimated duration of outbreak (dark red) and simulated with the estimated evolutionary rate (red). Error bars represent the interval of prediction at 95% of 100 simulations. Red vertical lines correspond to the derived distance threshold. In B, D and F are presented the resulting single-linkage clusters according to the derived distance threshold, defined here as the 99 th percentile of the simulated distributions with values corresponding to panels A, C and E, respectively.
Figure S12. Analysis of outbreak 12: Comparison between distance thresholds derived from the original publication on the outbreak, and from the current modelling framework.
In A, C and E are presented the cgMLST distance distributions: observed distribution (blue), simulated distribution without estimation (green), simulated with the estimated duration of outbreak (dark red) and simulated with the estimated evolutionary rate (red). Error bars represent the interval of prediction at 95% of 100 simulations. Red vertical lines correspond to the derived distance threshold. In B, D and F are presented the resulting single-linkage clusters according to the derived distance threshold, defined here as the 99 th percentile of the simulated distributions with values corresponding to panels A, C and E, respectively. Figure S13. Analysis of outbreak 13: Comparison between distance thresholds derived from the original publication on the outbreak, and from the current modelling framework.
In A, C and E are presented the cgMLST distance distributions: observed distribution (blue), simulated distribution without estimation (green), simulated with the estimated duration of outbreak (dark red) and simulated with the estimated evolutionary rate (red). Error bars represent the interval of prediction at 95% of 100 simulations. Red vertical lines correspond to the derived distance threshold. In B, D and F are presented the resulting single-linkage clusters according to the derived distance threshold, defined here as the 99 th percentile of the simulated distributions with values corresponding to panels A, C and E, respectively.   In A, C and E are presented the cgMLST distance distributions: observed distribution (blue), simulated distribution without estimation (green), simulated with the estimated duration of outbreak (dark red) and simulated with the estimated evolutionary rate (red). Error bars represent the interval of prediction at 95% of 100 simulations. Red vertical lines correspond to the derived distance threshold. In B, D and F are presented the resulting single-linkage clusters according to the derived distance threshold, defined here as the 99 th percentile of the simulated distributions with values corresponding to panels A, C and E, respectively. Figure S17. D estimation analysis. In A, average best estimation of D of the three MCMC chains for each of the 20 outbreaks. In B, average HPD range of the 20 outbreaks for 10 sampling densities (from 5% to 50%). In C, average distance between real and best estimated D. In B and C, error bars correspond to the 95% prediction interval. Figure S18. μ estimation analysis. In A, average best estimation of μ of the three MCMC chains for each of the 20 outbreaks. In B, average HPD range of the 20 outbreaks for 10 sampling densities (5 to 50%). In C, average distance between real and best estimated μ. In B and C, error bars correspond to the 95% prediction interval.