“Outbreak reconstruction with a slowly evolving multi-host

12 In a multi-host system, understanding host-species contribution to transmission is key 13 to appropriately targeting control and preventive measures. Outbreak reconstruction methods 14 aiming to identify who-infected-whom by combining epidemiological and genetic data could 15 contribute to achieving this goal. However, the majority of these methods remain untested on 16 realistic simulated multi-host data. Mycobacterium bovis is a slowly evolving multi-host 17 pathogen and previous studies on outbreaks involving both cattle and wildlife have identified 18 observation biases. Indeed, contrary to cattle, sampling wildlife is difficult. The aim of our 19 study was to evaluate and compare the performances of three existing outbreak reconstruction 20 methods (seqTrack, outbreaker2 and TransPhylo ) on M. bovis multi-host data simulated with 21 and without biases.

Extending an existing transmission model, we simulated 30 bTB outbreaks involving 23 cattle, badgers and wild boars and defined six sampling schemes mimicking observation biases. 24 We estimated general and specific to multi-host systems epidemiological indicators. We tested 25 four alternative transmission scenarios changing the mutation rate or the composition of the 26 epidemiological system. The reconstruction of who-infected-whom was sensitive to the 27 mutation rate and seqTrack reconstructed prolific super-spreaders. TransPhylo and outbreaker2 28 poorly estimated the contribution of each host-species and could not reconstruct the presence 29 of a dead-end epidemiological host. However, the host-species of cattle (but not badger) index 30 cases was correctly reconstructed by seqTrack and outbreaker2. These two specific indicators 31 improved when considering an observation bias. 32 We found an overall poor performance for the three methods on simulated biased and 33 unbiased bTB data. This seemed partly attributable to the low evolutionary rate characteristic 34 of M. bovis leading to insufficient genetic information, but also to the complexity of the Introduction 54 Over 60% of pathogens can infect more than one host-species [1,2]. This possible 55 contribution of multiple host-species to transmission dynamics complicates disease control and 56 surveillance for these multi-host pathogens, especially when one of the host-species to consider 57 is a free-ranging wildlife species. Indeed, quantifying contribution to transmission in order to 58 select appropriate control measures as well as the implementation of said measures can be from transmission trees, in which each node represents an infected host and these infected hosts 85 are linked by directed edges representing transmission events [21]. Such a reconstruction of 86 who-infected-whom in the outbreak makes it possible to estimate transmission parameters 87 specific to each host-species (such as the number of transmission events due to an individual of 88 a particular host-species), and thus sheds more light on the transmission dynamics within the 89 studied multi-host system. 90 In principle, outbreak (here meaning transmission tree) reconstruction could be based 91 solely on epidemiological data obtained via contact tracing methods (e.g. [22]); however data 92 collected are not always reliable nor detailed enough to enable accurate reconstruction [23]. 93 Therefore, some outbreak reconstruction methods have combined both genomic and 94 epidemiological data in transmission tree inference [24][25][26][27][28][29]. These outbreak reconstruction 95 methods can be divided into two categories according to how genomic data is treated [30], those 96 that consider a link between phylogenetic and transmission trees (generally by annotating 97 branches or internal nodes with infected hosts) [26,28,31,32] and those that solely consider 98 genetic distances [21,25,33]. While some outbreak reconstruction methods were developed to 99 study pathogen transmission within a specific multi-host system (e.g. foot-and-mouth disease 100 [34]), most were developed using the example of a single-host system, e.g. slowly evolving M. 101 tuberculosis [26,31], and more rapidly evolving pathogens like methicillin-resistant 102 Staphylococcus aureus [26,33] or SARS-CoV-1 [25] in a human population. However, the 103 development of outbreak reconstruction methods on single-host systems does not preclude them 104 from yielding insightful results in multi-host systems; for instance Willgert et al. recently 105 reconstructed the transmission history of SARS-CoV-2 in a human-deer system in Iowa (USA) 106 [35]. In a multi-host system, other than correctly reconstructing transmission events between 107 individuals and estimating outbreak size (general epidemiological indicators), we expect 108 outbreak reconstruction methods to allow accurate estimation of host-species contribution to the outbreak and to identify the host-species of the index case (specific multi-host 110 epidemiological indicators). 111 While some outbreak reconstruction methods assume that all cases are known and 112 sampled [21,28,36], others account for the presence of unsampled cases by either allowing the 113 annotation of unsampled hosts in the phylogenetic tree [31] or the presence of intermediary 114 unsampled hosts between two sampled hosts [24,25]. When not all cases are sampled in the 115 outbreak, there exists a difference between the actual outbreak and the transmission tree these 116 methods can aim to reconstruct from sampled sequences. Indeed, even if the outbreak 117 reconstruction method accounts for the presence of unsampled hosts [25,31], these hosts can 118 only be inferred if they have descendant sampled hosts [35] and the transmission tree that can 119 be reconstructed is therefore a subtree induced by the sampling process. 120 The sampling process in a multi-host system that implicates a free-ranging wildlife 121 species can also result in incomplete or even biased data, when observation efforts differ 122 between host-species. For instance, M. bovis wildlife surveillance in France was implemented 123 later than cattle surveillance (2012 vs. 1954) and only investigates bTB infection in badgers, 124 boars, red deer (Cervus elaphus) and roe deer (Capreolus capreolus) [11]. However, 125 estimations of bTB infection rates in red foxes (Vulpes vulpes) have recently been investigated 126 in France and yielded similar results to those found in badgers and wild boars [37]. These 127 sampling biases between host-species could have an important impact on outbreak 128 reconstruction.

129
Our aim was to evaluate and compare performances of existing outbreak reconstruction 130 methods on bTB outbreaks in a multi-host system and study whether these performances were 131 affected by sampling biases. Therefore, we simulated bTB transmission within a multi-host 132 system situated in a previously studied area in the South-West of France. In this area, bTB 133 surveillance has reported M. bovis circulation in cattle, badgers and wild boars [38]. Multiple 134 sampling schemes were implemented to reflect the late implementation of wildlife surveillance 135 (temporal bias) and the fact that not all host-species are surveilled (species bias). In order to 136 evaluate the quality of reconstructed transmission trees, we calculated general as well as 137 specific multi-host epidemiological indicators.  141 We extended an existing model that simulated bTB transmission trees, for the 11 142 genotypes identified, in a badger-cattle system present in a study area in the South-West of 143 France, from January 2007 to January 2020 [39]. We narrowed our study to one of the two 144 genotypes of M. bovis, which were isolated in both wildlife and cattle within our study region.

145
Since infected wild boars have also been detected in this study area [40,41] and our aim was to 146 study a complex multi-host system, we added a wild boar meta-population to the modeled 147 epidemiological system (see details in S1 Appendix). Similarly to the badger population, wild 148 boars could either be susceptible (S) or infected (I) while cattle had an additional latent state 149 (E), when animals could be detected infected but could not transmit the pathogen [39]. 150 Moreover, transmission trees simulated with the original model considered cattle farms 151 and badger social groups as epidemiological units whereas we aimed to reconstruct individual 152 transmission links. Therefore, we extended the model to randomly select infected animals 153 within these groups according to the SEI/SI system dynamics and thus, simulated animal-to-154 animal transmission. The resulting transmission trees are termed below reference transmission 155 trees (terms written in italic are defined in Table 1). 8 156 157 We chose cattle as index cases and bTB spread in the multi-host system was simulated 158 during 13 years. We generated 30 reference transmission trees, in order to investigate various 159 simulated outbreaks while limiting the computational time. These 30 trees had to include less 160 than 500 infected hosts in total, for computational reasons, and at least 15 infected hosts from 161 each host-species, in order to be able to implement sampling schemes. A reference transmission 162 tree corresponded to a list of six variables: identification (id) of infector, id of infected, host-163 species of infector, host-species of infected, date of infection and date of death. 164 We simulated genetic sequences along the reference trees according to a  Kishino-Yano (HKY) substitution model (with transition/transversion ratio parameter, κ) [42], 166 since this substitution model was previously used to study M. bovis phylogenies [12][13][14], as 167 well as a fixed mutation rate (µ). We chose µ equal to 0.0024 substitutions per site per year and 168 κ equal to 5.9. Indeed, these values had been previously estimated on 167 M. bovis sequences 169 (171 SNPs in length) isolated in cattle and wildlife from this study area [12].

170
At t = 0, we considered that the index case was infected by a single sequence randomly 171 selected from the 167 sequences isolated in our study area [12]. Our substitution algorithm was 172 based on the Gillespie approach [43] implemented in the phastSim package [44] (Fig 1). Taking 173 into account the low genetic diversity observed in M. bovis sequences from the same region, 174 we assumed no within-host diversity by considering a single lineage per host but we allowed 175 within-host mutation. 176 We simulated sequences until February 2020. Then, the last simulated sequence was 177 recorded for each host, which corresponded to either the sequence present at the time of removal 178 or in February 2020, for infected hosts not yet removed at the end of the simulation. For each 179 reference transmission tree, we thus obtained a reference set of cases (Table 1) reporting and contact [25]. In this method, probability of transmission is inferred from known  Here, we assumed that generation time and sampling interval nonparametric distributions could 231 be obtained without bias by estimating them from the reference trees, which contained every 232 infected host, timed transmission event between hosts and host sampling time. We selected a 233 chain length of 100,000 iterations, a sampling frequency of 1 in 50 and a burn-in period of 10% (for details on priors used and other arguments see S1 Appendix). We graphically checked for 235 convergence and independence of sampling (Effective Sample Size (ESS) above 200 for each 236 parameter), after estimation using the coda R package v.0.19-4 [48]. When the ESS were lower 237 than 200, we ran an additional 100,000 iterations and then checked the ESS again. This step 238 was repeated until every ESS was above 200.

239
Then, we built the consensus tree, as suggested by the authors, computing the most previously reconstructed phylogenetic tree [31] (for details on phylogenetic reconstruction see 254 S1 Appendix). We assumed that the generation time and sampling interval followed a Gamma 255 distribution and that the mean and standard deviation could be obtained without bias by 256 estimating them from the reference trees using the epitrix R package v.0.2.2 [49]. We selected 257 a number of iterations of 500,000, a sampling frequency of 1 in 50 and a burn-in period of 20%

258
(for details on priors used and other arguments see S1 Appendix). We used the same method as with outbreaker2 to check for convergence and independence of sampling, however we 260 considered a lower threshold for the ESS, 100 for each parameter as suggested by the authors 261 [50]. When the ESS were lower than 100, we ran an additional 500,000 iterations and then 262 checked the ESS again. This step was repeated until convergence and independence of sampling 263 parameters were satisfied or the number of iterations reached 2,500,000, we then discarded the 264 reference trees for which convergence was not obtained in every sampling scheme.   infected pairs (meaning every "id"-"ances" for seqTrack, "from"-"to" for outbreaker2 and 309 "infector_id"-"infected_id" for the transmission matrix estimated from TransPhylo) were one 310 of the correct transmission events or not. For all three methods, we considered super-spreaders to be present in a reconstructed 313 tree when less than 10% of infected hosts were responsible for over 80% of transmission events. from the reconstructed index cases ("id" for whom the "ances" is unknown). For outbreaker2, 324 we considered the host-species of the index case to be the most frequent host-species among 325 cases infected at the earliest date ("to" with the earliest "time").

326
o Outbreak size 327 We evaluated the ability of outbreaker2 and TransPhylo to estimate the size of the 328 outbreak (seqTrack does not estimate outbreak size and was thus excluded for this indicator).

329
The simulated outbreak size was the number of infected hosts present in each reference tree. 330 We calculated the corresponding estimate by dividing the number of sampled hosts in each reconstructed tree with the median of the sampling proportion provided by outbreaker2 and 332 TransPhylo. In addition, we tested if the results for this indicator differed depending on whether 333 we were considering the reconstructible outbreak or the reference tree. Therefore, we also  To test whether the reconstruction of outbreak size and accuracy were influenced by the 374 complexity of the epidemiological system, we then simulated 30 new reference trees of a single- We then analyzed whether asymmetrical roles within the multi-host system influenced 379 the reconstruction of the host-species contributions. With the same protocol (30 reference trees 380 and a low evolutionary rate), we tested a transmission scenario where one of the host-species 381 could be infected but could not play any role in transmission (dead-end epidemiological host). 382 We obtained a multi-host system where wild boars played no part in onward bTB transmission 383 by setting transmission parameters between and from wild boars to 0.  of infected hosts compared to those whose set of sequences and trees converged (S1 Table).

395
Computational time varied greatly between sets of cases (or consensus phylogenetic 396 trees) and reconstruction methods: less than 10 min for all 126 trees reconstructed by seqTrack, 397 from less than 20 min (when only 100,000 iterations were needed) to two hours per tree 398 reconstructed by outbreaker2, and from less than an hour to over 12 hours (for 2,500,000 399 iterations) per tree reconstructed by TransPhylo. Moreover, phylogenetic reconstruction with 400 BEAST2 was needed to implement TransPhylo and computational time also varied between 401 sets of cases: from five hours to two days. In total, the computational time for these 378 (126 402 trees*3 methods) reconstructed trees was around three months.

403
The median proportion of unique sequences in the reference set of cases for which 404 convergence was obtained was 6.1%. The median of the mean transmission divergence was 405 0.19 (S1 Table) and the majority of transmission pairs shared the same sequence (S1 Fig).  Table). reference sampling scheme was set as reference, hence the "-" present on the method line and the 427 absence of the reference sampling scheme. T stands for "temporal bias", S B for "badger bias" and S W 428 for "wild boar bias". T+S B (T+S W ) combined the temporal and the badger (wild boar) bias.  Table). The most frequent host-species responsible for this 436 maximum number of transmission events was cattle (from 57% in the reference sampling 437 scheme to 86% in the combined temporal and wild boars bias). None of the trees reconstructed 438 by the two other methods presented super-spreaders.

Host-species of the index case 440
When all sequences were sampled, the proportion of correctly reconstructed host-441 species of the index case (i.e. cattle) was 76% for trees reconstructed by seqTrack, 81% for 442 outbreaker2 and 57% for TransPhylo (S4 Table). Except when considering the temporal bias 443 alone with the TransPhylo method, a temporal and a badger bias (combined or not) led to an 444 increase in the proportion of correctly reconstructed index cases. In the reference trees, the median number of infected hosts was 245 (S5 Table). Overall, 447 the simulated outbreak size was close to the credible interval estimated by outbreaker2 (Fig 3).  (Table 3).  size was set as the offset and the reference sampling scheme was set as reference. T stands for "temporal 473 bias", S B for "badger bias" and S W for "wild boar bias". T+S B (T+S W ) combined the temporal and the 474 badger (wild boar) bias.

476
The median number of transmission events due to each host-species in the reference 477 trees was 175 for cattle, 24 for badgers and 40 for wild boars (S6 Table).

478
In the reference sampling scheme, the credible interval contained the simulated number  According to the statistical model, the underestimation of the number of reconstructed 491 transmission events due to cattle (Fig 4) was not significant for either method ( transmission events in the reference tree was set as the offset and the reference sampling scheme was 506 set as reference. T stands for "temporal bias", S B for "badger bias" and S W for "wild boar bias". T+S B 507 (T+S W ) combined the temporal and the badger (wild boar) bias.  Table).

513
A higher mutation rate increased markedly the median accuracy for all three methods: considering a higher mutation rate (35 vs. 108, S10 Table).

519
The credible interval contained the simulated outbreak size for 16/21 trees reconstructed 520 by outbreaker2 and in only 4/21 trees reconstructed by TransPhylo, otherwise the outbreak size 521 was overestimated (Fig 5 and S11 Table). For both methods, the credible interval contained the 522 number of transmission events due to each host-species in only 4/21 trees for cattle, 3/21 523 (outbreaker2) and 5/21 (TransPhylo) for badgers and 1/21 for wild boars (Fig 6). Otherwise, 524 cattle contribution was underestimated by TransPhylo (16/21 trees in Fig 6, and S12 Table) and 525 wildlife contribution was overestimated by outbreaker2 (13/21 trees for badgers and wild boars 526 in Fig 6, and S12 Table).  The point corresponds to the number of transmission events due to each host-species in the simulated 533 outbreak. 535 Sequences simulated within a single-host system presented a lower proportion of unique 536 sequences (median: 3.6%) and a lower mean transmission divergence (median: 0.14) (S7 537   Table).

543
The credible interval contained the simulated outbreak size in all 26 trees reconstructed 544 by outbreaker2 and in 16/26 trees reconstructed by TransPhylo, otherwise the outbreak size 545 was overestimated (Fig 7 and S11 Table).  The credible interval never contained the simulated number of transmission events due to 551 wild boars and wild boar contribution was overestimated in all 17 reconstructed trees (Fig 8).  Table).

563
For all transmission scenarios, even the reference multi-host scenario, similar results 564 were obtained when considering the reference tree or the reconstructible outbreak (S2 Appendix 565 and S11-S12 Tables). to a specific method [25,31,52] but to the slowly evolving multi-host pathogen. Moreover, the 575 epidemiological indicators we estimated were also relevant in a multi-host system and not just 576 general performance indicators [53].  The lowest accuracy always being estimated for seqTrack could be due to the fact that 610 this method does not consider a transmission model [21], but simply sampling dates and genetic according to host-species), "the order of ancestries cannot be inferred with certainty" [55].  Other than reconstructing who-infected-whom, outbreak reconstruction can aim to 630 estimate epidemiological indicators, from which practical measures can be directly inferred.

631
The first we studied was the outbreak size, which could by comparison with the number of 632 sampled cases be informative e.g. of the need to increase the sampling effort [35]. Outbreak 633 size estimation was sensitive to sampling biases, the complexity of the epidemiological system and also the mutation rate. The outbreak size was correctly estimated by outbreaker2 but 635 consistently overestimated by TransPhylo, even though we considered the same non- were developed and tested on single-host systems [21,25,31], and not on multi-host systems 667 where each host-species play a different role. While TransPhylo has previously been applied to 668 multi-host systems, a human-deer SARS-CoV-2 system [35] and two badger-cattle bTB 669 systems [57,58], the estimation of host-species contribution to transmission in these systems and while phylogenetic evidence seemed to support multiple human-to-deer spillover events, 673 deer-to-human transmission could not be ruled out [35]. In a badger-cattle system in the South- Mascot [19], previously used to study complex bTB multi-host systems [59].

751
The overall poor performances we obtained for accuracy and host-species contribution, Reference stands for the complex multi-host system where cattle are index cases and wild boars contribute to transmission. High mutation rate is the same scenario as the reference except for the higher mutation rate used to simulate sequences. Single-host stands for the only-cattle scenario. Dead-end host stands for the scenario where wild boars did not contribute to transmission and badger index, the reference scenario with badger as index cases.

S2 Fig. Credible interval of the number of transmission events due to cattle estimated by outbreaker2
and TransPhylo compared (color) to the number in the simulated outbreak (point) according to sampling scheme. T stands for "temporal bias", SB for "badger bias" and SW for "wild boar bias". T+SB (T+SW) combined the temporal and the badger (wild boar) bias.

S3 Fig. Credible interval of the number of transmission events due to badgers estimated by outbreaker2
and TransPhylo compared (color) to the number in the simulated outbreak (point) according to sampling scheme. T stands for "temporal bias", SW for "wild boar bias" and T+SW combined the temporal and the wild boar bias.

S4 Fig. Credible interval of the number of transmission events due to wild boars estimated by outbreaker2
and TransPhylo compared (color) to the number in the simulated outbreak (point) according to sampling scheme. T stands for "temporal bias", SB for "badger bias" and T+SB combined the temporal and the badger bias. S1 Table. Comparison between reference trees that converged in BEAST2 and TransPhylo and those that did not.  Table. Accuracy tested with a Binomial GLM using method as the explanatory variable, according to transmission scenario. S10 Table. Number of trees reconstructed where super-spreaders were present and the maximum number of transmission events a single super-spreader could be responsible for according to method and transmission scenario. S11 Table. Outbreak size with a Negative Binomial GLM using method as the explanatory variable, according to transmission scenario. S12 Table. Host contribution tested with a Negative Binomial GLM using method as the explanatory variable, according to transmission scenario.