The speciation and hybridization history of the genus Salmonella

Bacteria and archaea make up most of natural diversity, but the mechanisms that underlie the origin and maintenance of prokaryotic species are poorly understood. We investigated the speciation history of the genus Salmonella , an ecologically diverse bacterial lineage, within which S. enterica subsp. enterica is responsible for important human food-borne infections. We performed a survey of diversity across a large reference collection using multilocus sequence typing, followed by genome sequencing of distinct lineages. We identified 11 distinct phylogroups, 3 of which were previously undescribed. Strains assigned to S. enterica subsp. salamae are polyphyletic, with two distinct lineages that we designate Salamae A and B. Strains of the subspecies houtenae are subdivided into two groups, Houtenae A and B, and are both related to Selander’s group VII. A phylogroup we designate VIII was previously unknown. A simple binary fission model of speciation cannot explain observed patterns of sequence diversity. In the recent past, there have been large-scale hybridization events involving an unsampled ancestral lineage and three distantly related lineages of the genus that have given rise to Houtenae A, Houtenae B and VII. We found no evidence for ongoing hybridization in the other eight lineages, but detected subtler signals of ancient recombination events. We are unable to fully resolve the speciation history of the genus, which might have involved additional speciation-by-hybridization or multi-way speciation events. Our results imply that traditional models of speciation by binary fission and divergence are not sufficient to account for Salmonella evolution.


Introduction 56
Bacteria and archaea make up most of natural diversity, both in terms of species richness and 57 biological functions [1,2]. However, the mechanisms that underlie the origin and maintenance 58 of prokaryotic species are poorly understood. It is often assumed that there is a single 59 phylogenetic tree representing the relationships amongst prokaryotic taxa, with the branch 60 lengths reflecting divergence times between them. However, bacteria and archaea acquire 61 foreign DNA by homologous and non-homologous recombination and can recombine 62 frequently, including in the Salmonella genus [3][4][5][6][7][8][9]. High recombination rates can maintain 63 genetic cohesion within a species, preventing divergence and speciation from occurring until 64 barriers to gene flow develop. Recombination has been shown in laboratory experiments to be 65 supressed by nucleotide mismatches between donor and recipient [10,11]. This property 66 provides a potential mechanism for speciation. It has been shown by simulation that large 67 effective population sizes and neutral genetic drift can precipitate speciation by increasing the 68 average pairwise divergence between strains, leading to either binary or multi-way speciation 69 events [5,12]. 70 Conversely, distinct new lineages or species can potentially arise almost instantaneously by 71 hybridization of existing distantly related ones. Such large-scale hybridization events can 72 occur at once by recombination of large genomic regions (e.g., [13]), or through multiple 73 exchanges of small chromosomal segments associated with ecological convergence [14]. 74 Therefore, to describe relationships between prokaryotes and understand patterns of species 75 richness and phenotypic diversity, it is important to characterise the process of speciation and 76 gene flow between species, including large-scale hybridization events [15]. 77 Salmonellae are a prominent speciation model, where experimental and genomic studies of 78 recombination and hybridization have been pioneered [4][5][6][7][8][9][10]14]. The genus Salmonella is 79 divided into a number of phylogroups, namely bongori, enterica, salamae, arizonae, 80 diarizonae, houtenae, and indica [16][17][18]. Salmonella bongori has been classified a distinct 81 species [18], while the other phylogroups are considered to be subspecies of a single species, 82 S. enterica. These taxa are further subdivided into serovars based on antigenic variation of 83 flagellins and O-antigen. 84 Members of the genus Salmonella are major pathogens of humans and other warm-blooded 85 animals. Human infections mostly involve S. enterica subspecies enterica, which can cause 86 gastroenteritis, enteric fever and other infections [19,20]. Other S. enterica subspecies, as well 87 2019-02-14 5 as the species S. bongori, are more typically isolated from cold blooded animals or the 88 environment, and are rarely reported from human infections [21]. 89 Here we are concerned with evolutionary relationships rather than taxonomy and we 90 designate phylogroups by names that derived from these subspecies' labels, e.g. Bongori, 91 Arizonae, Diarizonae, etc., with Enterica representing subspecies enterica. We use italicised 92 names such as houtenae to refer to previous subspecies designations, which sometimes differ 93 from our phylogroup assignments. A seventh S. enterica subgroup (group VII) was 94 distinguished based on multilocus enzyme electrophoresis and gene sequencing [22][23][24] [9,23,24,[27][28][29][30][31][32][33][34][35][36][37]. This lack of consensus might reflect technical issues with phylogenetic 100 reconstruction but a more biologically interesting possibility is that the history of Salmonella 101 is not well-characterized by a simple model in which speciation proceeds stepwise by 102 irreversible binary fissions. 103 To test this hypothesis, we sampled the genetic diversity within the little studied groups from 104 cold-blooded hosts and used whole genome sequences from representative isolates of 105 phylogroups to characterize the genetic relationships between them and to infer historical 106 populations splits and gene flow. We show that while a binary fission model of speciation 107 works for some of the values, where each coancestry value corresponds to the number of fragments copied from one 160 genome to another (Figure 2). Secondly, we performed pairwise comparisons of genomes 161 using a gene-by-gene approach. For each pair of genomes, we computed the genetic distances 162 of all shared genes, and the distribution of these distances was plotted as a cumulative curve 163 ( Figure 3). Thirdly, the CHROMOPAINTER analysis was repeated using only nine unrelated 164 genomes: one for each of the 12 phylogroups but excluding VII and Houtenae B due to recent 165 shared ancestry with Houtenae A, and considering Enterica A and B as a single group. Each 166 genome was therefore reconstructed as a mosaic of the other eight unrelated genomes. This 167 allowed us to explore deeper relationships between phylogroups, since when all genomes are 168 included each genome from a phylogroup copies mostly from other genomes of the same 169 phylogroup (Figure 2). The resulting coancestry matrix was plotted as a heatmap (Figure 4). Analysis of accessory genome was performed using the Roary pan-genome pipeline version 176 3.6.2 [47]. Since the draft genomes were very unequally fragmented and synteny information 177 therefore was of variable reliability we used the "don't-split-paralogs" option. The analysis 178 was performed with a protein identity cut-off of 85% and the core genome was defined as 179 genes present in > 99% of the genomes studied. The Pearson correlation between accessory 180 gene content of the genomes were visualised using the R software CORRPLOT package [48]. Based on MLST diversity, 46 genomes were chosen for genome sequencing and compared to 196 27 previously published genome sequences of Enterica, Arizonae, and Bongori (Table S2). A 197 phylogenetic tree based on the genome sequences is shown in Figure 1. This tree implies that 198 S. enterica subsp. salamae is not a monophyletic group but instead forms two lineages with 199 distinct evolutionary histories that we designate Salamae A and Salamae B. Whereas Salamae 200 A contained 138 (88%) of the salamae strains, Salamae B comprised 18 isolates collected 201 from a human (one isolate), a bat (one isolate) or reptiles (16 isolates, including 6 from 202 chameleons). In contrast, 49 (41.5%) of Salamae A isolates were from humans, and only 34 203 (28.8%) were from cold-blooded animals, suggesting important ecological and pathogenic 204 differences between the two Salamae groups. S. enterica subsp. houtenae was also subdivided 205 into two distinct phylogroups, which we have designated Houtenae A and Houtenae B, and 206 which clustered together with group VII on the tree. The genome-wide phylogenetic analysis 207 also uncovers a hitherto unknown phylogroup, labelled VIII, made of strains previously 208 identified as either salamae, diarizonae or of the former Hisingen serotype of S. enterica 209 subsp. enterica [25]. The description of Salamae B, Houtenae B and VIII represent the first 210 novel Salmonella phylogroups described since the identification of group VII by Selander and 211 colleagues more than 25 years ago [24,27]. Our analysis therefore defines 11 phylogroups 212 within Salmonella. The phylogenetic tree also shows further subdivisions at shallower levels, 213 including the division of S. enterica subsp. enterica into Enterica A and Enterica B as 214 previously described [5,9]. Note that the genomes of the present study have been publicly 215 In order to test whether high coancestry between groups might be explained by recent 238 recombination between them, we looked for evidence of sharing of very similar stretches of 239 DNA between pairs of lineages [14] by plotting, for each pairwise comparison, the proportion 240 of genes with divergence below a threshold increasing from 0% to 25% (Figure 3). 241 Consistent with recent recombination between them, Enterica B and Houtenae B showed 242 many more genes with very similar sequences than expected based on their position in the 243 phylogenetic tree, with 20% of the genes of an Enterica B strain having less than 1% 244 divergence to Houtenae B, compared to only 5% between Enterica B and Houtenae A (Figure  245 3A). These divergence curves are also consistent with recent recombination between 246 Houtenae A, Houtenae B and VII. For example, approximately 5% of the VII genome and 6% 247 of Houtenae A has less than 0.1% divergence with Houtenae B (Figure 3B) have inherited the same ancestrally imported stretch will be painted by each other for those 262 stretches. Therefore, we selected a single strain from each phylogroup and performed a 263 distinct chromosome painting analysis. We excluded VII and Houtenae B due to the recent 264 shared ancestry with Houtenae A, and also included only a single representative for both 265 Enterica A and Enterica B. The chromosome painting results (Figure 4) show high coancestry 266 between Bongori and Arizonae and between Indica and Enterica. These relationships can be 267 interpreted using a vertical phylogenetic model, as they agree with a large number of different 268 analyses including ours (Figure 1) that Arizonae is the deepest branching lineage after 269 Bongori and that Indica is a sister group of Enterica [9,24,33,34]. Conversely, the 270 chromosome painting analysis reveals a large number of intransitive relationships (i.e., in 271 which A has elevated coancestry with B and B has high coancestry with C but C does not 272 have high coancestry with A). First, Diarizonae and Arizonae have high coancestry, as do 273 Diarizonae and Salamae B but Salamae B and Arizonae do not (Figure 4) To complement the results above, we used Treemix to infer a history that allows for 285 recombination events in the origins of the phylogroups. Treemix attempts to model the 286 covariance matrix reflecting SNP sharing between strains by assuming a phylogenetic model 287 of divergence via genetic drift, but with a limited number K of mixing events in the history. 288 Our application of Treemix to Salmonella gave results which varied in important details 289 depending on the value of K. Each of the events that were identified at a given value of K had 290 counterparts in the inference performed for higher values, but details of the inferred 291 phylogenetic tree and the location and direction of the hybridization events were not 292 consistent. For example, for K=1 and K=2 Houtenae A and Houtenae B are sister taxa whose 293 common ancestor received genetic material from VII, while for K=3, VII and Houtenae B 294 share a common ancestor, which contributed genetic material to Houtenae A. We present the 295 Treemix results for K=3 (Figure 5) because all of the events inferred are supported by signals 296 identified by chromosome painting and cumulative divergence (Figures 2, 3 and 4). The 297 Treemix results with K=3 imply that Houtenae A, Houtenae B and VII all have hybrid 298 origins. All three of them received DNA from a shared lineage which branched between 299 Arizonae and Diarizonae (black arrowhead, Figure 5), but differ in the remaining source of 300 their ancestry (red arrows, Figure 5), which, according to the Treemix estimates, account for 301 about half of their genome in all three cases (1: ancestor of Arizonae to VII: 0.461; 2: Enterica 302 B to Houtenae B: 0.42; 3: ancestor of VII to houtenae A: 0.49). Note that according to this 303 reconstruction, no pure, or nearly pure, representative of this shared ancestral lineage is 304 present in the sample, a feature which is likely to have contributed largely to the instability of 305 the Treemix analysis and makes all types of evolutionary reconstruction considerably more 306 challenging. 307 The second source for Houtenae B is inferred to be Enterica B (red arrow 2, Figure 5), which 308 is consistent with the results from chromosome painting and of the pairwise distances, as 309 discussed above, and is consistent with recent genetic exchange having taken place. The 310 second source for VII is inferred to branch at the same point as Arizonae does in the tree. The 311 deep position of this ancestry source is supported by the distribution of pairwise distances VII 312 has to shallower branching lineages such as Diarizonae or Salamae A, which are more similar to the distribution found for Arizonae than to that of either Houtenae A or Houtenae B (e.g., 314 Figure 3C). The distribution of distances to Arizonae is similar to that of other shallow-315 branching lineages, suggesting that the recombination was not with Arizonae itself. Finally, 316 the second source for Houtenae A branches next to Salamae A, which is consistent with the 317 reconstructed position of Houtenae A in the phylogenetic tree in Figure 1 and the high 318 coancestry of Houtenae A and Salamae A in Figure 4. However, unlike for Houtenae B, there 319 is no signal of recent recombination of Houtenae A with other lineages in Figure 2. 320 Furthermore, the pairwise distance curves of Salamae A to Houtenae A and Houtenae B are 321 comparable ( Figure 3C). These features imply that there has not been recent recombination 322 between Houtenae A and Salamae A. Instead, they are consistent with the second source that 323 contributed to Houtenae A being an unsampled sister taxa to Salamae A. 324 325

Unequal evolutionary rates of the different taxa 326
One important feature of the phylogenetic tree (Figure 1) is the different branch lengths 327 leading to each phylogroup. This feature might be caused by either unequal substitution rates 328 between lineages or by recombination, which can cause hybrid lineages to branch closer to the 329 root. Evidence for unequal substitution rates comes for example from comparisons with 330 Bongori or Arizonae, which can tentatively be treated as outgroups. Salamae A and Salamae 331 B have smaller genetic distances than other lineages to either (Figure 3D), despite the 332 chromosome painting indicating no evidence of elevated recombination between them. 333 Furthermore, Salamae A and Salamae B show low genetic distances compared to potential 334 sister lineages to all taxa, suggesting that they have substantially lower substitution rates than 335 other groups. Because our reconstruction of Salmonella evolutionary history is incomplete 336 and uncertain, we do not attempt to formally model all of these processes occurring together. 337 338 Accessory genome relationships 339 Accessory genes contribute most to ecological specialization and the pattern of horizontal 340 gene transfer among phylogroups might provide important complementary information 341 regarding functional and ecological correlates of the recombination history that we inferred in 342 this work [9]. We therefore analysed the pan-genome of the dataset, which with a protein 343 identity cut-off of 85% rendered a core genome of 1818 gene clusters and a total pan-genome 344 of 21973 genes. Unfortunately, estimations of the strain relationships based on gene 345 2019-02-14 14 presence/absence and analysis of the shared ancestry revealed that the analyses were strongly 346 affected by the fragmentation of the genomic assemblies (Table S2), as was particularly 347 visible for the highly fragmented Diarizonae genomes ( Figure S3). Analysis of the horizontal 348 gene transfer pattern among phylogroups therefore requires higher quality assemblies and will 349 be the subject of future studies. 350 351

Conclusions 352
We investigated the diversification and hybridization history within Salmonella, a group of 353 prominent public health importance and an early model for microbial speciation and 354 evolutionary studies. By sampling largely in the non-enterica subspecies, we uncovered three 355 novel phylogenetic groups that had not been recognized since the last group, VII, was 356 described in 1991. Our snapshot of diversity within phylogroups of Salmonella implies that 357 recombination among phylogroups is relatively rare at any point in time but that when it 358 happens it can be with distantly related lineages rather than sister taxa and can involve large 359 fractions of the core genome. These events are likely to provide substantial potential for 360 phenotypic innovation but may also entail a great deal of hybrid disgenesis. 361

362
The three hybridization events that we have been able to elucidate with any degree of 363 certainty are ongoing or took place in the recent past and all involved a lineage that is not 364 present in unhybridized form in the dataset. This circumstance makes it challenging to 365 estimate simple properties of the events such as the direction of hybridization and the 366 proportion of genome acquired from each source. We can nevertheless robustly conclude that 367 the hybridization has involved at least three entirely different branches of the Salmonella tree 368 and has led to the formation of three phylogroups, namely Houtenae A, Houtenae B and VII. 369 Interestingly the latter group was inferred to be a 'hybrid' in early MLEE studies [24]. This 370 suggests an interesting question that is likely to be informative about the general nature of 371 species boundaries in bacteria, namely what has happened to make one lineage particularly 372 prone to hybridization in the recent past? 373

374
We see less conclusive but nevertheless still strong evidence for hybridization events in the 375 more distant past. Phylogenetic trees of Salmonella phylogroups are notoriously unstable, 376 including in different analyses we have performed (data not shown). In particular, relationships amongst Salamae A, Salamae B, Diarizonae, Enterica and VIII are difficult to 378 elucidate. The coancestry relationships between these lineages are highly intransitive (Figure  379 4). One possibility is that this intransitivity is due to a complex multi-way speciation event 380 [5], such that there is no true splitting order to infer. However, it may also represent 381 hybridization events after stepwise speciation. The two lineages that branch deeply ( Figure  382 1), namely VIII and Diarizonae, both show evidence of shared ancestry with basal lineages, 383 Bongori and Arizonae, respectively (Figure 4), which is likely to have substantially affected 384 their phylogenetic position. 385

386
The events of recombination inferred in this work explain the difficulties to reconstruct the 387 phylogeny of the genus that have led to multiple distinct hypotheses on the phylogenetic 388 relationships among subspecies. The phylogenetic relationships which do appear to be 389 reasonably certain are that Bongori split from the other phylogroups first, followed by 390 Arizonae and that Indica is a sister group to Enterica. Houtenae A seems to have been a sister 391 taxon of Salamae A, prior to its mixture event. These examples demonstrate that in the right 392 circumstances, phylogenetic signal can be preserved over long evolutionary time periods 393 despite recombination between phylogroups. The problem of reconstructing ancestral 394 hybridization events is a hard one and we do not have the tools or genomes available to 395 reconstruct an entire history with any degree of confidence. 396