Sequential Acquisition of Virulence and Fluoroquinolone Resistance Has Shaped the Evolution of Escherichia coli ST131

ABSTRACT Escherichia coli ST131 is the most frequently isolated fluoroquinolone-resistant (FQR) E. coli clone worldwide and a major cause of urinary tract and bloodstream infections. Although originally identified through its association with the CTX-M-15 extended-spectrum β-lactamase resistance gene, global genomic epidemiology studies have failed to resolve the geographical and temporal origin of the ST131 ancestor. Here, we developed a framework for the reanalysis of publically available genomes from different countries and used this data set to reconstruct the evolutionary steps that led to the emergence of FQR ST131. Using Bayesian estimation, we show that point mutations in chromosomal genes that confer FQR coincide with the first clinical use of fluoroquinolone in 1986 and illustrate the impact of this pivotal event on the rapid population expansion of ST131 worldwide from an apparent origin in North America. Furthermore, we identify virulence factor acquisition events that predate the development of FQR, suggesting that the gain of virulence-associated genes followed by the tandem development of antibiotic resistance primed the successful global dissemination of ST131.

Despite its successful dissemination globally, little information is available about evolution and emergence of ST131. Two recent independent genomic studies demonstrated that ST131 emerged from a single ancestor and that most strains belong to clade C/H30-R (2,5). Notably, we found that recombination accounted for the majority of variation within the ST131 lineage, and recombination events were associated with the positions of MGEs (2). However, despite the number of isolates in both studies, neither resolved the geographical or temporal origin of the ST131 ancestor. In contrast, studies of other large sets of bacteria with geographical or temporal separation have determined accurate dates of divergence of important clades using statistical analyses such as the Bayesian framework implemented in BEAST (Bayesian Evolutionary Analysis by Sampling Trees) (18). For example, Glaser identified tetracycline resistance as the major driver of diversification among the global population of group B streptococci (19). Similarly, a large study of methicillinresistant Staphylococcus aureus (MRSA) was able to date the emergence of an FQR clade to the mid-1980s (20). These studies motivated us to combine data sets from our geographically diverse previous study (2) and from the temporally diverse study by Price et al. (5) to investigate the evolution of ST131 with the highest possible resolution.

RESULTS AND DISCUSSION
Curation of a high-quality ST131 genome sequence data set. We first sought to obtain a high-quality set of data to carry out our analyses. A total of 199 draft Illumina paired-end E. coli ST131 genomes were retrieved from public read data repositories (see Data Set S1 in the supplemental material). Initial phylogenetic analyses of de novo and reference-guided assemblies of all 199 E. coli ST131 genomes indicated that several draft genomes were of low quality. Suboptimal genome data quality could interfere with subsequent phylogenetic analyses and may invalidate conclusions drawn from tree topologies. To ensure that only high-quality sequences were included in our analyses, we removed 14 genome data sets that were determined to be outliers according to at least one of our assembly or mapping metrics, including number of uncalled bases, number of scaffolds, and assembled genome size (see Fig. S1 in the supplemental material). This quality control (QC) filter is broadly applicable to reanalyzing public genomic data from multiple sources.
Phylogenomic analysis of ST131. We next carried out phylogenetic reconstruction using our combined data set of 185 Illumina paired-end sequences, which represented strains from humans (n ϭ 167), animals (n ϭ 15), and other sources (n ϭ 3) isolated from the United States, Canada, New Zealand, Australia, Spain, India, Portugal, Korea, and the United Kingdom between 1967 and 2011 ( Fig. 1; see Data Set S1 in the supplemental material). Sequence read mapping of these 185 high-quality ST131 genomes (and simulated reads from the SE15, EC958, and JJ1886 complete genomes) to the ST131 clade C reference genome E. coli EC958 defined 21,373 substitution single nucleotide polymorphisms (SNPs) that were used to create an unrooted maximum likelihood (ML) phylogeny (see Fig. S2A in the supplemental material). An independent phylogenetic tree produced by the kSNP alignmentfree method was consistent with the overall topology of the ML trees (see Fig. S3 in the supplemental material). Using a Bayesian modeling algorithm, we identified 204 nonoverlapping segments encompassing 1.542 Mb and containing 15,902 substitution SNPs that were introduced into the ST131 lineage by recombination (see Fig. S4 and Data Set S1 in the supplemental material). The length of recombinant sequence is higher than previously reported (2) as the larger data set increases the probability that one strain will have a recombinant fragment not encountered before. The length of the nonrecombinant core ST131 genome is 0.19 Mb less than previously reported (2), encompassing 69.4% of the EC958 chromosome, or approximately 3.55 Mb. However, the proportion of SNPs introduced by recombination (74.4%) is consistent with our previous study and highlights the important role recombination has played in shaping the ST131 lineage (2). Exclusion of these recombinant SNPs from phylogenetic analyses reduced the number of SNPs to 5,471 and resulted in a tree that maintained the original overall topology, albeit with substantially reduced branch lengths and some major within-clade reclustering of strains (see Fig. S2B). Consistent results were achieved using an independent method of recombination detection and removal in- dicating that our tree topologies are not biased by the chosen methodology (see Fig. S5 in the supplemental material). Extending our phylogenomic analyses to include isolates from two large international collections provided a far greater resolution of the evolution within the ST131 lineage (Fig. 2). The global phylogeny of E. coli ST131 separated the strains into three distinct lineages (clades A, B, and C). Congruent with our previous work, strains in clade C were characterized by the fimH30 allele and the FQR-conferring alleles gyrA1AB and parC1aAB (Fig. 2). Notable exceptions were strains JJ2643 and U004 in clade C, which contain the fimH35 allele. This appeared to be due to a recombination event encompassing fimH in these strains and highlights why we have retained a neutral nomenclature (i.e., A, B, and C) for our clade classifications. Likewise, the CTX-M-15 allele is not ubiquitous in all clade C2 strains, making this a more scalable classifica-tion than the H30-Rx designation originally suggested by Price et al. (5) (Fig. 2). In addition to harboring CTX-M-15 genes, clade C2 strains contain more resistance genes in total compared with other ST131 clades (Fig. 3), consistent with colocalization of multiple plasmid-encoded resistance genes (see Data Set S1 in the supplemental material). Although the context of multidrug resistance cassettes can be resolved in some cases from draft genome data from ST131 isolates or transformants (16,17), the full complexity of plasmid-mediated resistance in ST131 requires the generation of more complete genomes as per EC958 and JJ1886 (7,21).
A combined data set enables greater resolution of ST131 subclades. The combined ST131 data set enabled greater resolution of the differences between clade B and C strains. We previously showed that clade C strains can be further segregated into two distinct subclades, C1 and C2 (2). Our new analysis defined five   Branch support was performed by 1,000 bootstrap replicates (see Fig. 2B in the supplemental material). The scale bar indicates the number of substitution SNPs. Taxon labels for clades A, B, and C are colored red, orange, and green, respectively. Seven strains sharing intermediate characteristics between clades B and C are colored pink. Of note, clade A strains were collapsed and the clade A-specific branches shortened for display. Metadata are represented as circles as follows: year of isolation in gray and gradient and geographical region in assorted colors as depicted in the legend. Allelic profiling information is shown as colored strips surrounding the phylogram (from inner to outer) for the fimH, parC, gyrA, and CTX-M genes. Two additional distinctions were made for some fimH variants: "Untypeable" corresponds to a strain with a truncated or missing fimH gene, and "Pseudogene" corresponds to a strain in which fimH is disrupted by an insertion sequence. Clades B0 to B5 and C0 subclades are shown as arcs in the outermost ring, with arrows and dotted lines denoting the division between subclades C1 and C2. discrete subclades in clade B (B1 to B5), each with distinct repertoires of selected marker genes fimH, parC, and gyrA (Fig. 2). Intriguingly, we found that seven strains originating from the United States could be classified as "intermediate" on the basis of their SNP pattern (Tables 1 and 2). These strains showed progressive acquisition of clade C-defining point mutations, with the three isolates closer to clade B (classified B0) and four isolates closer to clade C (classified C0) illustrating the precise evolutionary events leading to the emergence of clade C (Fig. 4). Strains from the five clade B sublineages varied in their parC and gyrA allelic profile, with the vast majority of clade B strains containing allele combinations that are not associated with fluoroquinolone resistance. Additionally, while all strains from subclades B2 and B5 are associated with the United States, and B1 with Spain, strains within subclades B3 and B4 have a more diverse geographic origin. Each subclade showed a distinctive recombination profile (see Fig. S4 and S5 in the supplemental material) and MGE repertoire (see Fig. S6 in the supplemental material), indicative of independent evolutionary trajectories. In contrast, we found that the prevalence of virulence genes is largely conserved across all B subclades, with the absence of several uropathogenic E. coli (UPEC)-specific genes apparent in clade B3 (see Fig. S7 in the supplemental material). By comparison, our investigation of clade C MGEs and other regions of interest (as originally defined in the clade C reference strain EC958) showed a high degree of conservation across clade C, with the exception of the prophage Phi6, the capsule loci, and genomic island GI-selC (see Fig. S8 in the supplemental material). For example, GI-selC is only found in a geographically homogeneous cluster of clade C strains that include EC958 and excludes the reference strain JJ1886 ( Fig. 2; see Fig. S6). Despite the general conservation of gene content within clade C genomes, it is apparent that genomic islands are hot spots for insertions, deletions (indels), and genome rearrangement (see Fig. S6 and Data Set S1 in the supplemental material). EC958 GI-pheV has several small indels relative to JJ1886 GI-pheV, and we have previously shown that the CMY-23 ␤-lactamase gene that confers resistance to third-generation cephalosporins has inserted within the EC958 GI-leuX, whereas the JJ1886 GI-leuX element has a large duplication relative to EC958 (12) (see Data Set S1).
Temporal analysis of ST131 identifies major divergence dates. Our initial studies of ST131 strains collected between 2001 and 2011 showed insufficient temporal depth to robustly date the emergence of clade C (2). By including 91 more strains from Price et al. (5), including 8 that predated 2000 ( Fig. 1), we anticipated that we would be able to resolve this question using existing public data alone. We generated a linear regression of the genetic distance from the root to tip against time for the 172 ST131 isolates within clades B and C using Path-O-Gen (22). This analysis revealed a positive correlation (R 2 ϭ 0.3233, P Ͻ 0.0001) confirming the molecular clock-like signal (see Fig. S8 in the supplemental material). To accurately estimate the date of divergence of clade C from clade B we employed BEAST (18). BEAST analysis rejected the strict clock and favored the uncorrelated log-normal clock model in combination with a Bayesian skyline population model (see Data Set S1 in the supplemental material). A mutation rate of 4.39 ϫ 10 Ϫ7 SNPs per site per year (95% highest posterior density [HPD], 3.58 ϫ 10 Ϫ7 to 5.23 ϫ 10 Ϫ7 ) was calculated, consistent with other large-scale phylogenomic analyses of the E. coli/Shigella lineage (6.0 ϫ 10 Ϫ7 SNPs per site per year [95% HPD, 5.2 ϫ 10 Ϫ7 to 6.7 ϫ 10 Ϫ7 ]) (23) (see Fig. S9 in the supplemental material). Based on this approach, we could estimate the divergence of the last common ancestor of clade B and C strains to have occurred between 1930 and 1958 (Fig. 4A), consistent with the Path-O-Gen prediction (see Fig. S8a). We could estimate the divergence of clade C from clade B to have occurred in 1980 (95% HPD, 1973 to 1986), which was slightly earlier than the Path-O-Gen prediction Clade C1 No. of resistance-associated genes (see Fig. S8a). Importantly, we identified that further diversification of clade C2 from clade C1 dated to 1987 (95% HPD, 1983 to 1992), subsequent to all clade C1 and C2 strains acquiring gyrA1AB and parC1aAB alleles imparting elevated FQR ( Fig. 4A  and 4B). Bayesian skyline plots show a relatively constant population size over several decades, followed by a short recent expansion occurring in the late 1990s and early 2000s and subsequent stabilization (Fig. 4C). Interestingly, this pattern is consistent with the introduction of FQ for clinical use in 1986 (24) and the subsequent stabilization may reflect the improved stewardship of FQ (or its removal from general use). A similar phenomenon was observed for FQR among the members of the MRSA ST22 global phylogeny (20). Remarkably, an identical date was identified in a recent preprint report using 81 ST131 genomes from Price et al. (5), supplemented with~100 newly sequenced ST131 genomes from more geographically dispersed strains (17), highlighting the value of careful analysis of existing data sets. Although acquisition of the CTX-M-15 gene within clade C2 may be a major factor in the diversification of C1 and C2, it is worth emphasizing that this alone does not explain the success of ST131 given that the population expansion identified in our study encompasses both clade C1 and clade C2 strains.
Phylogeography of ST131. To investigate the geographical context underlining the expansion of the multidrug-resistant ST131 O25b:H4 clone from clades B to C, we performed a discrete phylogeographical analysis as implemented in BEAST on the 172 ST131 isolates within clades B and C that included a variety of geographic sources (Fig. 1A) and dates (Fig. 1B). To reduce sampling biases due to the high number of strains isolated from United States, Canada, United Kingdom, and Spain, we performed independent analyses on 10 randomly subsampled data sets containing 85 strains each (see Data Set S1 in the supplemental material). Under Bayesian stochastic search variable selection (BSSVS) and symmetric diffusion models, results systematically supported the United States (74.31%; standard deviation [SD], 12.1%) as the most likely origin of clades B and C (Fig. 5A). The origin of clade C (C0, C1, and C2) was predicted to be associated with either the United States (51.83%; SD, 35.5%) or Canada (45.59%; SD, 36.1%), over all of North America (Fig. 5B). These results are consistent with the observation that the oldest reported ST131 strain was isolated in the United States in 1967 (5), as well as an independent BEAST analysis using a partly overlapping data set (17). Although our resampling approach has minimized bias in strain origin, a data set with greater diversity of strains from different geographical regions and pre-2000 isolation dates would be necessary to rule out a different origin (e.g., current data sets are underrepresented in South America, Africa, and many European and Asian countries). A greater number of strains would also help identify local outbreaks or clusters: with the exception of GI-selC carrying clade C strains from the United Kingdom, which cluster phylogenetically (Fig. 2), we did not observe other significant geographic clustering using this data set alone.
Intermediate strains reveal key MGE acquisitions that define clade C. Overall, excluding intermediate B0 and C0 strains, clade C differs from clade B by only 42 substitution SNPs (Tables 1 and  2). This list included the majority of the 70 clade C-defining SNPs reported in our earlier study (2), but was not identical due to the greater number of recombinant regions identified and removed in the present study. Closer examination of the recombination analysis identified intermediate patterns of recombination, primarily clustered around known MGEs, indicative of stepwise evolution among these intermediate strains (see Fig. S4 and S5 in the supplemental material). Most notably, we could trace the acquisition of GI-pheV and GI-leuX genomic islands to the most recent common ancestor of the C0 strains (C001, JJ2244, G213, and CD306), several years before the acquisition of the FQR mutations that define clade C (Fig. 4). The pheV genomic island acquired by clade C ST131 strains is known to carry the autotransporter genes agn-43 and sat, the ferric aerobactin biosynthesis gene cluster (iucABCD), and its cognate ferrisiderophore receptor gene iutA (3,7). The clade-defining fimH30 allele was acquired by recombination (25), possibly in conjunction with the acquisition of the nearby GI-leuX island; the same time point is also predicted for the acquisition of the ISEc55 insertion sequence within the fimB recombinase gene that we have previously shown to affect the expression of type 1 fimbriae (3, 7). Thus, close scrutiny of these "intermediate" genomes enabled us to trace the acquisition of virulence-associated genes in ST131, which appears to have primed this clone for success prior to the acquisition of FQR mutations in the late 1980s. Further molecular analysis is required to determine the contribution of these elements to ST131 colonization/fitness in the gastrointestinal and urinary tracts. Notably, the role of virulence in the success of this clone may have been underappreciated in a recent report as these particular strains were distributed throughout clades B and C despite their inconsistent parC, gyrA, and fimH alleles (17). The B0 and C0 strains were also   dispersed in our tree that includes recombinant regions (see Fig. S2A in the supplemental material), suggesting that differences in recombination removal may account for these discrepancies. However, we cannot rule out differences in phylogeny due to the particular set of genomes analyzed. Future studies combining more publicly available genome datasets will help to further reveal the precise evolutionary trajectory taken by the globally disseminated ST131 clone.

Conclusions.
Overall, our work highlights how careful reanalysis of publicly available genomic data sets from heterogeneous sources can greatly improve the resolution of evolutionary history. Here, we have characterized the evolution of ST131 with unprecedented detail, from the acquisition of prophages and modification of the O antigen region ca. 1946, to the acquisition of GI-pheV, GI-leuX, fimH30, and ISEc55 around 1980, several years before the acquisition of mutations in gyrA and parC that led to FQR and the acquisition of the clade C2-defining CTX-M-15 ESBL gene. Whereas the development of FQR was accompanied by a large surge in the ST131 population globally, we propose that the acquisition of virulence factors by ST131 was a necessary precursor to this success. These events describe the "perfect storm" for the evolution of a multidrug-resistant pathogen: the acquisition of virulence-associated genes followed by the development of antibiotic resistance.

MATERIALS AND METHODS
Genome data. Two E. coli ST131 strain data sets from previously published work were used in this study, under the designations data set_1 and data set_2 (2,5). Strain names, sources, and available strain metadata are summarized in Data Set S1 in the supplemental material. Data set_1 comprised Illumina 101-bp paired-end genome sequence data from 95 ST131 strains isolated from 2000 to 2011, mostly in Europe and Oceania (2) (study accession no. ERP001354 and ERP004358). Data set_2 comprised Illumina 101-bp (76 samples), 76-bp (19 samples), and 50-bp (10 samples) paired-end whole-genome sequence data from 105 ST131 strains isolated from 1967 to 2011, mostly in North America (5) (study accession no. SRP027327). Additionally, reference strains of 11 published complete genomes were also used: namely, E. coli ST131 strains SE15, JJ1886, and EC958 plus non-ST131 B2 phylogenetic group E. coli strains CFT073, UTI89, E2348, ED1a, 536, S88, and APEC-01 and non-ST131 D phylogenetic group E. coli strain UMN026 (see Data Set S1). E. coli strain NA114 was excluded from the analysis due to poor assembly quality (2,7,26). To integrate reference genome data into our phylogenomic analyses, errorfree 101-bp paired-end Illumina reads were simulated to 60ϫ coverage with an insert size of 340 Ϯ 40 bp as previously described (2). Quality control, de novo genome assembly, and variant detection. Quality control (QC) was performed for all raw read data sets. Briefly, raw reads were analyzed using PRINSEQ v0.20.3 (27) and trimmed with a mean base pair quality score (Q) of Ն20 and a read length of Ն70% of the original read length. Additionally, it was necessary to correct 35 sets of raw read data from data set_2 that had heterogeneous Illumina encoding and/or erroneous paired-end length encoding (see Data Set S1 in the supplemental material). QC and assembly metrics for data set_1 have been previously reported by Petty et al. (2). Lastly, contaminant searches were performed for each sample using Kraken on a subset of 100,000 randomly chosen reads (28).
Quality-filtered Illumina paired-end reads were assembled de novo using Velvet v1.2.07 (29) with k-mer ranges of 45 to 85 for 101-bp reads, 29 to 61 for 76-bp reads, and 29 to 47 for 50-bp reads. An optimal k-mer value for each assembly was selected on the basis of best assembly metrics, including N 50 (i.e., 50% of bases are incorporated in contigs of this length or above), number and size of contigs, number and continuity of uncalled bases, and peak coverage. Contigs that were Ն200 bp at an optimal k-mer were then ordered against E. coli EC958 (7) using Mauve v2.3.1 (30). QC and assembly statistics for data set_2 are summarized in the supplemental material.
Quality-filtered Illumina reads for data set_1 and data set_2, as well as error-free simulated reads of complete genomes, were mapped on the reference strain EC958 using SHRIMP v2.0 (31). Nesoni v0.108 (32) was used to call and annotate substitution-only SNPs, with a consensus cutoff and majority cutoff of 0.90 and 0.70, respectively. SNPs were also determined in parallel using the reference-free k-mer-based approach developed in kSNP v2.0 (33). Default parameters as well as a k-mer value of 19 selected as the optimal value predicted by the kSNP-associated Kchooser script were applied. Exclusion of suboptimal genome data sets. We devised a statistical approach that excluded outliers based on several non-Gaussian metrics that could be determined from mapped and assembled Illumina genome data (see Data Set S1 in the supplemental material). Specifically we examined five metrics: (i) sequence coverage, (ii) number of unmapped bases (in the mapping reference EC958 genome), (iii) number of uncalled bases due to low coverage or mixed-base calls), (iv) the number of scaffolds that were Ն200 bp, and (v) estimated genome size. Metrics i, ii, and iii were based on read mapping data, and metrics iv and v were based on assemblies. Suboptimal genomes were discriminated quantitatively on the basis of metrics iii, iv, and v, and a total of 14 outliers were identified (based on upper and lower cutoffs at the quartile 3 ϩ 1.5 interquartile range and the quartile 1 to 1.5 interquartile range cutoffs, respectively). Metric i did not identify any outliers with low sequence coverage, and outliers with high sequence coverage were not omitted, whereas metric ii did not discriminate any outliers. This additional QC process resulted in the exclusion of genome data from the following strains: CD301, CD436, JJ1901, JJ1996, JJ2007, JJ2041, JJ2050, JJ2243, JJ2441, JJ2555, MH17102, QU300, QUC02, and ZH193. The R scripts used are available on github at https://github-.com/BeatsonLab-MicrobialGenomics/ST131_200_Rscripts. A final data set of 188 ST131 genomes, comprising 185 strains from data set_1 and data set_2 combined, as well as three complete genomes of EC958, JJ1886, and SE15) were chosen for further study after excluding the 14 genomes with suboptimal data quality.
Recombination detection. To avoid distortion of the phylogenetic signal caused by SNPs acquired through recombination, we used the Bayesian clustering algorithm BRATNextGen (34) to detect recombinant regions among the combined data set. Similar to our previous work (2), we used as an input an SNP-based multiple-genome alignment composed of each strain-specific pseudogenome built by integrating the SNPs predicted for each strain to the reference genome of EC958. To help with the identification of underlying clusters of strains, BRATNextGen initially computes a hierarchical clustering tree relative to the proportion of ancestral sequences shared between all strains. A segregation cutoff of 0.12 was then specified to separate each previously identified ST131 clade (clades A, B, and C) and non-ST131 strains into distinct clusters. Recombination was then evaluated within and between each cluster with the convergence approximated using 20 iterations of the learning algorithm. Significance was estimated using 100 permutations with a statistical significance threshold of 0.05. Using the same initial data set, recombination analysis was also carried out using Gubbins (35), an independent method of recombination detection.
Phylogenetic analysis. SNPs identified through reference-based mapping for the 188 ST131 strains were used to build phylogenies using maximum likelihood (ML), prior to and after removal of SNPs associated with recombinant regions. Phylogenetic trees were generated with RAxML v7.2.8 (36) using the general time-reversible (GTR) GAMMA model of among-site rate variation (ASRV), and validated using 1,000 bootstrap repetitions to assess nodal support. Additionally, reference-free k-merbased phylogenetic trees were constructed using kSNP v2.0 with default parameters (33) and genome assemblies as an input. A k-mer value of 19 was selected as the optimal value predicted by the kSNP-associated Kchooser script. All trees were then viewed using Figtree v1.4.0 (37) or EvolView (38), and further compared using the Tanglegram algorithm of Dendroscope v3.2.10 (39), which generates two rectangular phylograms to allow comparison of bifurcating trees.
Bayesian temporal and geographical analysis. Preliminary estimation of the underlying temporal signal of our data was obtained by performing a regression analysis between the root-to-tip genetic distance extracted from the recombination-free maximum likelihood tree, the isolation year, and lineage information for each sequence, as implemented in Path-O-Gen v1.4 (22). To further investigate the divergence of clade C from clade B, we performed a temporal analysis on the 3,779-bp nonre-combinant SNPs of the 172 clade B and C strains using BEAST 2.0 (18), a Bayesian phylogenetic inference software, which can estimate the dating of emergence of distinct lineages. We compared multiple combinations of the molecular clock model (strict, constant relaxed log normal, and exponential relaxed log normal), substitution model (Hasegawa, Kishino, and Yano [HKY] model and GTR), and population size change model (coalescent constant, exponential growth, Bayesian skyline, and extended Bayesian skyline). Markov chain Monte Carlo (MCMC) generations for each analysis were conducted in triplicate for 100 million steps, sampling every 1,000 steps, to ensure convergence. Replicate analyses were then combined with LogCombiner, with a 10% burn-in. The GTR nucleotide substitution model was preferred over the HKY model, and was used with four discrete gamma-distributed rate categories and a default gamma prior distribution of 1. The uncorrelated log normal clock model consistently gave better support based on the Bayes factor and Akaike's information criterion-based (AICM) analyses, compared to a strict clock model. The Bayesian skyline population tree model was chosen as the best-fitting tree model. Maximum clade credibility (MCC) trees reporting mean values with a posterior probability limit set at 0.5 were then created using TreeAnnotator.
In order to adequately investigate the biogeographical history of our ST131 collection, we evaluated potential bias in the geographical origin of strains, which could negatively impact our predictions. Statistical significance of the geographical origin distribution in clade B, C1, and C2 was assessed by chi-square test with Bonferroni correction for multiple comparisons. Overrepresented countries were randomly subsampled down to 15 representative sequences, while countries with fewer than 5 representatives had to be excluded from the analysis (South Korea and Portugal). Overall, we constructed 10 independent randomly subsampled data sets with 85 isolates representing 7 countries, each with 5 to 15 representative sequences. Reconstruction of possible ancestral geographical states was then performed using BEAST 1.8.2 on each subsampled data set. In addition to the previous parameters selected for the temporal analysis, a symmetric substitution model, a Bayesian stochastic search variable selection (BSSVS) model, and a strict clock for discrete locations were chosen for the phylogeographical analysis. MCMC generations were conducted for 100,000,000 steps, sampling every 10,000 steps. MCC trees were then generated using TreeAnnotator for each run with a posterior probability limit set at 0.5. Location posterior probabilities of the most recent common ancestor (MRCA) were then collated for clades B and C and for clade C only.
Genomic comparisons and in silico genotyping. Comparative genomic analyses were conducted using a combination of tools, namely, Artemis, Artemis Comparison Tool (40), and Mauve (30). Graphical representations showing the presence, absence, or variation of mobile genetic elements (MGEs) or other regions of interest, virulence factor genes, and antibiotic resistance genes were carried out using BLASTn and readmapping information as implemented in the SeqFindR visualization tool (41). Regions of interest previously described in the genome of ST131 reference strain EC958 (2, 7) and virulence factors, including autotransporters, fimbriae, iron uptake, toxins, UPEC-specific genes, and other virulence genes, were screened in all ST131 strains with SeqFindR using a cutoff of Ն95% nucleotide identity over the whole length compared to the assembly or the consensus generated from mapping. Additionally, the prevalence of antibiotic resistance-associated genes was also investigated using Srst2 (42) against the ARGannot database, with a minimum depth of 15ϫ read coverage.

FUNDING INFORMATION
This work was supported by a grant from the National Health and Medical Research Council (NHMRC) of Australia (APP1067455). SAB is supported by an NHMRC Career Development Fellowship (APP1090456) and MAS is supported by an NHMRC Senior Research Fellowship (APP1106930).