Recombination of ecologically and evolutionarily significant loci maintains genetic cohesion in the Pseudomonas syringae species complex

Background Pseudomonas syringae is a highly diverse bacterial species complex capable of causing a wide range of serious diseases on numerous agronomically important crops. We examine the evolutionary relationships of 391 agricultural and environmental strains using whole-genome sequencing and evolutionary genomic analyses. Results We describe the phylogenetic distribution of all 77,728 orthologous gene families in the pan-genome, reconstruct the core genome phylogeny using the 2410 core genes, hierarchically cluster the accessory genome, identify the diversity and distribution of type III secretion systems and their effectors, predict ecologically and evolutionary relevant loci, and establish the molecular evolutionary processes operating on gene families. Phylogenetic and recombination analyses reveals that the species complex is subdivided into primary and secondary phylogroups, with the former primarily comprised of agricultural isolates, including all of the well-studied P. syringae strains. In contrast, the secondary phylogroups include numerous environmental isolates. These phylogroups also have levels of genetic diversity typically found among distinct species. An analysis of rates of recombination within and between phylogroups revealed a higher rate of recombination within primary phylogroups than between primary and secondary phylogroups. We also find that “ecologically significant” virulence-associated loci and “evolutionarily significant” loci under positive selection are over-represented among loci that undergo inter-phylogroup genetic exchange. Conclusions While inter-phylogroup recombination occurs relatively rarely, it is an important force maintaining the genetic cohesion of the species complex, particularly among primary phylogroup strains. This level of genetic cohesion, and the shared plant-associated niche, argues for considering the primary phylogroups as a single biological species. Electronic supplementary material The online version of this article (10.1186/s13059-018-1606-y) contains supplementary material, which is available to authorized users.


Introduction
Pseudomonas syringae is a globally significant, gramnegative bacteria that is responsible for causing a wide-spectrum of diseases on many agronomically important crops [1]. However, despite the broad host range of the P. syringae, individual strains are largely considered to be host-specific, causing disease on only a limited range of plant species or cultivars. Furthermore, although the majority of well-characterized strains of P. syringae are pathogens, an increasingly number of isolates have been recovered from non-agricultural habitats that include wild plants, soil, lakes, rainwater, and clouds [2]. The diverse host range, strong host specificity, and ubiquitous distribution of P. syringae strains have made them an excellent model for studying host-pathogen interactions [3][4][5][6].
The taxonomy of P. syringae has changed dramatically over the years [7], and today this diverse group may best be considered a species complex. Species complexes have traditionally been defined as groups of closely related species that are difficult or impossible to distinguish phenotypically, although with microbes this term is more typically applied when recombination between lineages is sufficiently high to blur taxonomic boundaries. Formally, the P. syringae species complex currently includes several closely related plant pathogenic species, including P. amygdali, P. asturiensis, P. avellanae, P. cannabina, P. caricapapayae, P. caspiana, P. cerasi, P. cichorii, P. congelans, P. ficuserectae, P. meliae, P. savastanoi, P. syringae, P. tremae, and P. viridiflava [8,9]. However, some of these species are quite similar at the genetic level, many are not monophyletic [10,11], and distinct names have not historically been assigned based on uniform criteria.
The P. syringae species complex has also been split into approximately 64 pathovars based on host range and pathogenic characteristics, nine genomospecies based on DNA-DNA hybridization assays, and 13 phylogroups based on multilocus sequence and 16S rRNA analyses [12][13][14]. There has also been interest in finding an individual locus that can be used to identify and classify strains in the P. syringae complex. Both the rpoD and cts (also known as gltA) loci have been proposed as useful single locus markers [14,15], and while they are largely concordant with each other and multilocus analyses, they are not perfectly congruent and have relatively low resolution [5,12,13,[16][17][18][19][20][21]. Therefore, while single locus sequence analysis provides a rapid means to discriminate many strains in the P. syringae complex, this approach is not as robust as multilocus sequences analysis, which itself can produce phylogenetic results inconsistent with whole genome phylogenies [21].
Identifying genetic boundaries within and between bacterial species, and the subsequent naming of these groups, provides important insight into fundamental biological processes, as well assisting with "real world" practical decision-making. From the pathologist's perspective, who is concerned with the emergence, spread, and impact of pathogenic clones, understanding natural diversity and population structure is central to determining if a particular strain has the genetic potential to cause a disease on a particular crop variety and the most effective means to control the dissemination of a newly emergent pathogen clone. From a fundamental perspective, understanding natural diversity and population structure provides insight into the ecological and evolutionary pressures that give rise to traits of interest, helps disentangle the roles played by the different evolutionary forces, and identifies specific genes that are required for the success of a strain in a particular ecological context, e.g., host specificity loci.
A significant hurdle to identifying ecologically meaningful genetic boundaries in P. syringae is the lack of correlation between genotypic and phenotypic similarity among strains. While P. syringae strains can be genetically very diverse, there are few if any definitive phenotypic traits that can reliably partition strains into major groups that are congruent with the genetic data [7,14,22]. For example, pathogens causing disease on a single crop are often found in multiple phylogenetic groups [10,13,23,24]. Several non-pathogenic environmental isolates are also closely related to well-established P. syringae pathogens [25,26]. Many of the methods that have been used to classify strains in the P. syringae species complex are thus forced to rely on ad hoc distinctions [27], which can lead to either the artefactual clustering of distinct lineages or splitting of cohesive monophyletic clades [16,28].
The alternative to using ad hoc distinctions or metrics to identify biological groups is to employ a theoretical framework based on evolutionary theory. Species concepts provide a theoretical basis for understanding the evolutionary and ecological forces, such as reproductive isolation, recombination, mutation, selection, and genetic drift, that drive diversification or cohesion of distinct genetic units [5]. Furthermore, unlike ad hoc species delimitation approaches, species concepts can help to define species boundaries for all isolates of a group irrespective of their specific niche or phenotype. In bacteria, the ability to horizontally exchange DNA can be particularly important for limiting the impact of reproductive isolation. Genes, operons, and plasmids can be transferred between strains from distinct lineages through horizontal transfer (HGT), resulting in an influx of genetic material that may or may not be homologous with genetic material already found in that lineage. While non-homologous HGT is critically important for expanding the pan-genome, homologous recombination plays a particularly important role in maintaining genotypic cohesion between lineages as well as breaking down the linkage disequilibrium established through vertical inheritance of de novo mutations.
One class of models that have proven useful for understanding bacterial species are based on the concept of ecotypes. An ecotype is a genetic lineage occupying a defined niche. The basic ecotype model describes how genotypes carrying advantageous mutations arise periodically through mutation and sweep through a population as selection enables them to outcompete other members of the population [29][30][31][32][33]. The extent of spread of these beneficial mutations defines the boundaries of the ecotype. These recurrent selective sweeps, in combination with the accumulation of neutral mutations through genetic drift, purge genetic diversity within distinct populations, while increasing the genetic divergence between ecotypes, ultimately resulting in genetic isolation. When it is sufficiently strong, homologous recombination helps to pump the brakes on this divergence process by transferring beneficial (as well as neutral) variation between distinct ecotypes, thus maintaining genetic cohesion between ecotypes [28,[34][35][36][37][38][39][40][41][42][43]. Ultimately, the ability of recombination to disseminate advantageous mutations among ecotypes defines the ecological boundaries of the species. The strength of homologous recombination relative to the rate of neutral mutation and genetic drift will determine if distinct ecotypes evolve. Any decline in the frequency of homologous recombination between ecotypes, whether due to physical barriers and/or ecological partitioning, will help solidify the genetic isolation between ecotypes and formation of species. Countering this, the transfer of important genes that are critical for the exploitation of a specific niche (e.g., the interaction between a microbe and its host) may prove to be especially important for maintaining genetic cohesion in pathogenic bacterial populations like P. syringae.
Despite its potentially critical importance for defining species boundaries in bacteria, relatively little is known about the genome-wide extent of recombination between strains from different phylogroups of the P. syringae species complex because prior studies have primarily focused on a small set of housekeeping genes in the core genome [13,44,45]. However, we do know that at least some strains of P. syringae undergo relatively high rates of recombination, and this limited sample size of genes suggests that inter-phylogroup homologous recombination is considerably more rare than intra-phylogroup homologous recombination [45]. This could mean that there is no cohesive P. syringae species complex and each phylogroup represents a separate species. Alternatively, it is possible that the majority of inter-phylogroup recombination is occurring in the accessory genome, which would still maintain the genetic cohesion between phylogroups. It is currently not possible to distinguish between these possibilities based only on recombination analyses of a small set of core genes given that most ecologically and evolutionarily relevant genes are in the accessory genome and, by definition, only shared by a subset of strains in the species complex [6,23]. Clearly, a more thorough analysis of the rates of recombination for ecologically and evolutionarily relevant loci in the accessory genome is required to determine whether clear species barriers exist within the P. syringae species complex.
Here, we performed the whole-genome comparative and evolutionary analyses of 391 genomes from the P. syringae complex, including pathogenic isolates from diseased crops and isolates from environmental sources. In total, our collection of whole-genome sequences contains representatives from 11 of the 13 distinct phylogroups, including all seven late-branching canonical phylogroups that we consider to be primary (1, 2, 3, 4, 5, 6, and 10), and four of the six early-branching non-canonical phylogroups that we consider to be secondary (7, 9, 11, and 13) [46]. These strains enabled us to describe the phylogenetic distribution of all orthologous gene families in the pan-genome of the P. syringae species complex, refine the phylogenetic relationships between P. syringae strains using whole-genome data, predict ecologically and evolutionary relevant loci, and evaluate the impact of recombination, selection, and genetic drift on each ortholog family. Taken together, the analyses allowed us to investigate the evolutionary mechanisms that maintain genetic cohesion between P. syringae strains and attain an enhanced understanding of the species barriers that exist in the species complex.

Genome assemblies and annotations
In addition to the 135 publicly available genome assemblies of P. syringae, we performed whole-genome sequencing and assembly on 256 new strains, most of which were obtained from the International Collection of Microorganisms from Plants (ICMP) and other collaborators. The ICMP strains included 62 type and pathotype strains of P. syringae (BioProject Accession: PRJNA292453) [47]. Type strains are the isolates to which the scientific name of that organism is formally attached under the rules of prokaryote nomenclature. Pathotype strains have the additional requirement of displaying the pathogenic characteristics of the specific pathovar (i.e., causing specific disease symptoms on a particular host) [48]. Twenty-two non-P. syringae strains (twelve newly sequenced, ten from public databases) belonging to the Pseudomonas genus were also used as outgroups when required. In total, we analyzed whole-genome assemblies of 391 P. syringae strains representing 11 of the 13 phylogroups in the P. syringae species complex, thus enabling the most comprehensive analyses of the diversity that exists in this species to date (Additional file 1: Figure S1 and Additional file 2).
All whole-genome sequencing performed in this study was accomplished using either the Illumina GAIIx platform, resulting in 36-bp or 75-bp paired-end reads, or the Illumina MiSeq platform, resulting in 152-bp paired-end reads. In sum, we generated between 614,546 to 42,765,634 paired-end reads for each genome, for an average depth of coverage ranging between 15 and 700×. Adapters and low-quality bases were trimmed from the raw reads using Trimmomatic [49], and de novo assembly and quality filtering were performed using CLC Genomics Workbench (CLC Genomics Work Bench 2012). After quality filtering, the final N50 value for each assembly was between 1457 and 316,542 bps, the number of contigs was between from 59 to 5196, and the size of each P. syringae genome was between 5,097,969 and 7,217,414 bps (Additional file 3). These values represent high-quality assemblies that are consistent with the draft genome assemblies that we obtained from public database (Additional file 1: Figure S1, Additional file 2).
De novo gene prediction and annotation was performed on all newly assembled and publicly available genomes using a consensus approach based on Glimmer, Gene-Mark, FragGeneScan, and Prodigal, as implemented by DeNoGAP (Additional file 4; see the "Methods" section) [50][51][52][53][54]. Reliable calls that overlapped by more than 15 bps were merged into a single coding sequence, and all genes were functionally annotated by blasting against the UniProtKB/SwissProtKB database [55]. Gene ontology terms, protein domains, and metabolic pathways were also assigned to each coding sequence using InterProScan [56], while COG categories were assigned by blasting predicted genes against the Cluster of Orthologous Groups (COG) Database [57]. These methods predicted an average of 5491 ± 25.69 (SEM) genes per de novo P. syringae draft assembly (Additional file 2), and in cases where a corresponding annotation was publicly available, the two annotations were largely in agreement. However, among the 135 publicly available genomes, we did predict an additional 29,748 genes, for an average of 220.36 ± 11.81 (SEM) additional genes per genome (Additional file 5). This is likely due to the variable quality of the publicly available genomes.

Evolutionary relationships between strains
Core and accessory genetic content Using all 413 genome assemblies (391 P. syringae, 22 outgroups), we clustered and differentiated homologous families using the DeNoGAP comparative genomics pipeline [50]. The 2,294,719 protein sequences present across all genomes were first clustered into 241,678 HMM families based on the stringent percent identity and alignment coverage thresholds of 70%. Similar HMM families connected via single-linkage clustering (i.e., sharing at least one sequence between the different families) were then combined, resulting in a total of 83,373 homolog families. Finally, these homolog families were split into orthologous and paralogous families using the reciprocal smallest distance approach and the MCL algorithm, resulting in a total of 98,567 ortholog families. Of the 98,567 ortholog families, 77,728 were present in at least one P. syringae strain, representing both the core and accessory genome content of the P. syringae species complex.
Despite the fact that the total number of protein-coding genes in each P. syringae genome is similar, the composition of each genome, with respect to the specific complement of genes, is remarkably divergent. Specifically, we estimate that only 2410 of the 77,728 P. syringae ortholog families (3.10%) are part of the soft-core genome, based on the presence of a given ortholog family in at least 95% of strains. This soft-core genome cutoff is justified by the fact that core genome cutoffs that are overly strict eliminate a number of genuine core ortholog families because of assembly and annotation errors. Indeed, as we incrementally increase the frequency of strains that an ortholog must be present in for it to be considered part of the core genome from 50 to 100%, we find that there is a sharp drop-off in the core genome size at~95% (Additional file 1: Figure S2), representing the point at which we expect a number of genuine core genome ortholog families to be lost due to assembly and annotation errors. The number of orthologs that are part of the hard core genome (present in 100% of strains), for example, is only 124. As more genomes are sampled, we expect the core genome size to decrease incrementally, but that this effect will diminish as a more representative sample of the P. syringae complex is obtained. We asked whether we would expect further declines in the core genome size of P. syringae species if we sampled more genomes using a gene accumulation rarefaction curve, which characterizes the exponential decay of the core genome as each new genome is added to the analysis [58]. The soft-core genome curve plateaus as it approaches the core genome size of 2410 when only approximately 50 genomes have been sampled (Fig. 1a), suggesting that the core genome of the P. syringae species complex would be unlikely to change significantly by sampling more P. syringae genomes.
The small size of the core genome in the P. syringae species complex results in an expansive accessory genome, comprising 75,318 of the 77,728 P. syringae ortholog families (96.90%). Unlike the core genome, the accessory genome is expected to increase as more genomes are sampled until sufficient genomes have been sampled to capture all of the gene content diversity of the species. Only 30,622 (39.40%) of all ortholog families in P. syringae were present in more than one strain, while the remaining 47,106 (60.60%) ortholog families were singletons present in only a single strain. We used the micropan package [59] to assess if the pan-genome of P. syringae is open or closed. A closed pan-genome indicates that sampling of ortholog families has neared saturation, while an open pan-genome indicates that there is still a large pool of as yet undiscovered ortholog families. Micropan estimated a decay parameter (alpha) of 0.43 using Heap's law model [59], which is well below the critical threshold of alpha = 1.0 that distinguishes open from closed genomes. These findings are in agreement with a gene accumulation rarefaction analysis of the accessory genome, which has not plateaued ( Fig. 1b) and demonstrates that each strain introduces~193 new ortholog families into the P. syringae pan-genome. Taken together, these analyses suggest that P. syringae possesses an open pan-genome and that we are likely to continue to identify novel accessory ortholog families as additional P. syringae strains are sampled. However, it is notable that when singletons are excluded from this analysis, we do see a plateau in the gene accumulation rarefaction curve, suggesting that most undiscovered genes are likely not broadly distributed.
We also investigated the core and pan-genome profiles for strains at the level of phylogroup to explore the nature of genome evolution in these distinct monophyletic groups of the P. syringae species complex. As expected, the phylogroup hard core genome size was inversely proportional to the number of strains sampled from the phylogroup. Phylogroups 7, 9, 10, 11, and 13, where fewer than five genomes were analyzed, had particularly large core genomes, but their core genome sizes are expected to drop dramatically as more diverse strains from these phylogroups are sampled (Table 1, Additional file 1: Figure S3). The size of soft-core genomes is more consistent across phylogroups and it appears as though the size of the soft-core genome in several phylogroups is unlikely to dramatically change by sequencing more strains given that their rarefaction curves have begun to plateau. In contrast, the pan-genome sizes vary proportionally to the number of strains analyzed in each phylogroup, with larger phylogroups having considerably larger pan-genomes (Table 1, Additional file 1: Figure S4). This was expected given our observation that each strain introduces nearly 200 novel ortholog families to the P. syringae pan-genome in the cumulative analysis. Although we do not observe that any of these phylogroups have closed pan-genomes (alpha > 1.0), the pan-genome rarefaction curve has begun to plateau in some of the more broadly sampled primary phylogroups, at least in the non-singleton analysis. This suggests that much of the remaining novel genetic content in the P. syringae species complex likely lies in the under sampled phylogroups (7,9,10,11,13) or phylogroups that have yet to be discovered. To conduct a more comprehensive comparative genomics analysis of the P. syringae species complex, future sampling should be focused on these under sampled phylogroups, though there is undoubtedly some undiscovered genetic content in all phylogroups.
Overall, the distribution of ortholog families among P. syringae strains shows that the vast majority of families are either very common or very rare (Additional file 1: Figure S5). This pattern is a strong indicator that the introduction of novel genetic material through horizontal gene transfer is common throughout the P. syringae complex and may explain the expansive accessory genome consisting of mostly singleton orthologs. While a number of these singleton orthologs were functionally A B Fig. 1 Rarefaction curves for the core (a) and accessory (b) genome of P. syringae, as estimated using PanGP. a Families present in 95% (soft-core genome) and 100% (hard core genome) of P. syringae strains exponentially decays as each new genome is added to the analysis. b The total number of gene families identified continues to increase indefinitely as each new genome is added to the analysis when singleton gene families (families that are only present in one strain) are included, suggesting that P. syringae has an open pan-genome annotated, signifying that they are genuine genes, 68.47% of singleton ortholog families were annotated as hypothetical proteins, compared to only 43.83% of other ortholog families (chi-squared test; χ 2 = 1.16 × 10 −4 , df = 1, p < 0.0001). This suggests that these genes may represent a diverse collection of yet unexplored niche-specific genes in P. syringae, although some of these singleton ortholog families are likely the result of annotation errors associated with draft genome sequencing [60].

Phylogenetics
Based on multilocus sequence analysis (MLSA), the P. syringae species complex has currently been separated into 13 distinct phylogroups [14], seven of which we consider to be late-branching "primary" phylogroups (phylogroups 1, 2, 3, 4, 5, 6, and 10) as they are monophyletic and quite genetically distinct from the more divergent early-branching "secondary" phylogroups. The primary phylogroups also include the traditionally recognized diversity of the species complex, and nearly all of the type and pathotype strains. Finally, almost all of the strains in the primary phylogroups carry the canonical P. syringae type III secretion system (discussed below) [46]. The remaining six "secondary" phylogroups (phylogroups 7, 8, 9, 11, 12, and 13) include a number of species not traditionally associated with the P. syringae complex such as P. viridiflava and P. cichorii, and rarely carry a canonical P. syringae type III secretion system. Additionally, many of the strains from the secondary phylogroups have been isolated from environmental (e.g., water and soil) sources, whereas the vast majority of strains from the primary phylogroups were isolated from aerial plants surfaces. We first sought to refine the phylogenetic relationships between strains in the P. syringae species complex using a core genome alignment of the 391 strains analyzed here. The core genome tree was constructed based on a concatenated multiple alignment of the 2410 soft-core genes using FastTree with an SH-TEST branch support cutoff of 70% (Fig. 2a). The core genome tree delineates these 391 strains into distinct clades representing 11 of the 13 phylogroups in the P. syringae species complex. Therefore, our phylogroup assignments agree with those described earlier based on a smaller collection of type strains analyzed by MLSA [12][13][14]. However, the clustering of strains within each phylogenetic group does differ somewhat from earlier MLSA-based phylogenetic analyses [21]. This suggests that some of the more fine-scale phylogenetic relationships were not resolved, or improperly resolved due to recombination in the MLSA analysis, which were performed on a smaller collection of strains and with seven or less MLSA loci. Phylogenetic inferences based on the entire core genome should average out the majority of gene-specific biases that result from the distinct evolutionary histories of individual genes, thus providing a more accurate phylogenetic picture of the clonal relationships in the P. syringae species complex and enhancing our ability to explore phylogenetic relationships within and among phylogroups. We also assessed P. syringae strain relationships based on gene content by hierarchical clustering phylogenetic profiles, which are simply binary vectors describing the presence or absence of each ortholog family in each strain. Hierarchical clustering of the phylogenetic profiles effectively delineated P. syringae strains into their respective phylogroups in most cases (Fig. 2b), but some key differences exist between the gene content and core genome trees. The most obvious case of incongruence between the core genome and gene content trees involves the relationship between phylogroup 2 and phylogroup 10. In the gene content tree, phylogroups 2 and 10 cluster together with all strains from these phylogroups forming a monophyletic group. This branching pattern is inconsistent with the core genome tree, where phylogroup 2 clusters with phylogroups 3 and 6, and phylogroup 10 clusters with phylogroup 5. The clustering of phylogroups 2 and 10 in the gene content tree can be traced back to their shared ortholog content. Strains from phylogroup 10 share an average of 3918 orthologs with strains from phylogroup 2, which is more than they share with any other phylogroup, including phylogroup 5 (3684 orthologs). There are also a number of finer scale differences between the core genome and gene content trees that involve the clustering of strains within each phylogroup. Overall, these examples of phylogenetic discordance between the core genome and gene content trees suggests that while horizontal gene transfer between strains of P. syringae is not sufficiently strong to consistently overwhelm the signal of vertical gene inheritance, recombination events that result in shared genome content between distantly related strains are occurring regularly between strains of the P. syringae species complex [61].

Genetic diversity
The level of divergence between phylogroups, the extremely large accessory genomes, and the diversity of phenotypes within the P. syringae species complex has led some to propose that individual phylogroups or even specific pathovars should be considered incipient or even fully distinct species [4]. For example, Nowell et al. [61] stated that "the three P. syringae phylogroups [phylogroups 1, 2, and 3] are as diverged from each other as other taxa classified as separate species or even genera." Using our expanded whole-genome dataset of P. syringae strains, we tested this hypothesis by quantifying the average genetic divergence between strain pairs within the same phylogroup and between strain pairs from different phylogroups. We then compared these divergence values to the pairwise divergence between three species pairs from the same genus (Aeromonas hydrophila-Aeromonas salmonicida; Neisseria meningitides-Neisseria gonorrhoeae; Pseudomonas aeruginosa-Pseudomonas putida), and one species pair from different genera (Escherichia coli-Salmonella enterica). For P. syringae strains, we calculated average synonymous (Ks) and non-synonymous (Ka) substitution rates across the 2410 core genes using the "SeqinR" package in R [62]. Similarly, we calculated Ks and Ka for the distinct species pairs using 3288 core genes for A. hydrophila-A. salmonicida, 1423 core genes for N. meningitides-N. gonorrhoeae, 1971 core genes for P. aeruginosa-P. putida, and 2688 core genes for E. coli-S. enterica.
As expected, the lowest average Ks and Ka values in P. syringae were obtained when comparing strains within the same phylogroup, and the second lowest values were obtained when comparing strains that were from different primary phylogroups. Comparisons between P. syringae strains from different secondary phylogroups and between strains from primary phylogroups and secondary phylogroups yielded the highest Ks and Ka values, which are comparable to those that we obtained for distinct species (Additional file 1: Figure S6). Specifically, the average Ka values within P. syringae phylogroups were all less than 0.02, and the average Ks values were all less than 0.20. The average Ka values between primary P. syringae phylogroups were between 0.02 and 0.04, and the average Ks values were between 0.30 and 0.60. With one exception, all Ka values between primary and secondary phylogroups, or between separate secondary phylogroups were greater than 0.05 and less than 0.10, while Ks values were between 0.60 and 1.00. In comparison, the Ka values for distinct species were 0.06, 0.15, and 0.06 for A. hydrophila-A. salmonicida, P. aeruginosa-P. putida, and E. coli-S. enterica, respectively, and their Ks values were 0.46, 0.74, and 0.92. The N. meningitides-N. gonorrhoeae pair was an outlier in the distinct species pairs, having a Ka value of 0.02 and a Ks value of 0.14. However, these low Ka and Ks values may be misleading because of rampant recombination between the species in this genus [63,64]. Specifically, approximately 62.70 to 98.40% of core genes in Neisseria are reported to be undergoing recombination and only (See figure on previous page.) Fig. 2 Core (a) and pan (b) genome phylogenies of Pseudomonas syringae strains. The core genome, maximum-likelihood tree was generated from a core genome alignment of the 2410 core genes present in at least 95% of the P. syringae strains analyzed in this study. The pan-genome tree was generated by hierarchical clustering of the gene content in each strain using the Jaccard coefficient method for calculating the distance between strains and the Ward hierarchical clustering method for clustering. Strain phylogroups, hosts of isolation, and whether the strain is a type or pathotype strain are shown outside the tree 1% are under positive selection [65], suggesting that the low Ka values in the genus are due to the elevated recombination rates that distort the molecular clock. In summary, it is clear that most P. syringae strains within the primary phylogroups are considerably more similar than well characterized distinct species pairs. On the other hand, most secondary phylogroups are sufficiently diverged in their core genomes to potentially warrant their separation into distinct species.

Ecologically significant genes
We explored the phylogenetic distribution and diversity of what we refer to as "ecologically significant" ortholog families to better understand how these critical gene families define the ecological niche of the species complex. Specifically, we focused on any gene family previously shown to play a direct role in host-microbe or microbe-microbe interactions, such as toxins, effectors, and resistance factors. These genes included those associated with the type III secretion system (T3SS), type III secreted effectors (T3SEs), phytotoxins, and virulence-associated proteins identified using the Virulence Factors of Pathogenic Bacteria Database (VFDB) [66].
The canonical T-PAI T3SS is widely distributed and is found at very high frequency among strains in the primary phylogroups, but is absent from the majority of strains in the secondary phylogroups ( Fig. 3, Additional file 1: Figure  S8). In contrast, the alternate T-PAI T3SS is only found in three strains, PvrICMP3272 and PvrICMP11296 within phylogroup 3, and PvrICMP19473 within phylogroup 7. These strains all lack the canonical T-PAI T3SS, suggesting that the alternate T-PAI acts as a replacement T3SS in these strains. Although the broad distribution of the canonical T-PAI T3SS in P. syringae pathogens is widely known, it is somewhat surprising that it was also present in all strains from phylogroups 9 and 10 given that these phylogroups consist of non-agricultural, environmental strains. Interestingly, some strains in phylogroup 10 have been reported to cause disease or induce a hypersensitive Fig. 3 Prevalence of different forms of type III secretion systems (T3SSs) and phytotoxin biosynthesis genes in each of the P. syringae phylogroups. A given T3SS was considered present if all full-length, core, structural genes of the T3SS were present in the genome, while phytotoxins were considered present if more than half of the biosynthesis genes for a given phytotoxin were present in the genome response (HR) in plant hosts [14], but phylogroup 9 strains have yet to be associated with any plant hosts [76]. The presence of canonical T-PAI T3SS structural genes in both of these non-agricultural phylogroups may suggest that strains in these phylogroups have the capacity to efficiently deliver effectors and cause disease in plant hosts that have yet to be examined.
Unlike the T-PAI T3SS, the A-PAI and S-PAI T3SSs are only present in a small subset of the P. syringae strains sequenced in this study. The only two homologs for the A(A)-PAI T3SS are found in phylogroup 2c, where they likely function as a replacement for the canonical T-PAI T3SS. Strains from phylogroup 2c have primarily been isolated from phyllosphere of grasses and have been widely described as non-pathogenic. However, past studies have suggested that some of these strains can efficiently deliver effectors into host cells and induce a hypersensitive response [77]. Two closely related A(B)-PAI T3SS homologs were also found in phylogroup 13. However, the A(B)-PAI T3SS in these strains is located in a different genomic region from the A(A)-PAI T3SS in strains from phylogroup 2c. Specifically, strains from phylogroup 2c contain the A-PAI T3SS between a sodium transporter and a recombination-associated protein [74], while in phylogroup 13 the A-PAI T3SS is located between a transcriptional regulator and a lytic murein transglycosylase (Additional file 1: Figure S7). The lack of synteny between the location of the A-PAI T3SS in these two phylogroups suggests that they were independently acquired via horizontal gene transfer [72]. The S-PAI T3SS was also only identified in a small subset of the strains that we sequenced in this study, three of which are part of phylogroup 11, where they are the only T3SS in the genome, and two of which are part of phylogroup 7, where they also contain an R-PAI T3SS (Fig. 3, Additional file 1: Figure S8). Despite lacking the exchangeable and conserved effector loci (EEL and CEL, respectively) regions of the canonical T-PAI T3SS, and containing a 10-kb insertion in the middle of the Hrc/Hrp cluster [70], we expect that these strains will be capable of successfully delivering effectors into some plant hosts.
The R-PAI T3SS, which closely resembles the T3SS found in Rhizobium species [75], is distinguished from other T3SS families based largely on the splitting of the hrcC gene, which codes for an outer membrane secretin protein [75]. Specifically, the hrcC gene is typically split into the hrcC1 and hrcC2 genes, separated by TPR domain (Additional file 1: Figure S7), and in some strains, the hrcC2 gene is split again into two additional fragments. The R-PAI T3SS is found in a large fraction of P. syringae strains from phylogroups 1, 2, 3, 4, 7, and 10 ( Fig. 3, Additional file 1: Figure S8), but it is always present in concert with at least one other type of T3SS in P. syringae strains. All of these strains contain the characteristic split in the hrcC gene, but only seven strains, all from phylogroup 3, also contain a second split in the hrcC2 gene. The similarity in GC-content between the P. syringae R-PAI T3SS genes and the rest of the P. syringae genome [75], the broad distribution of the R-PAI T3SS across P. syringae strains (Additional file 1: Figure S8), and the ability of R-PAI HrcV protein phylogeny to effectively resolve distinct phylogroups (Additional file 1: Figure S9) suggest that the R-PAI T3SS was likely present in the most recent common ancestor of the P. syringae complex. However, there is some disagreement between the inter-phylogroup relationships revealed by the HrcV protein tree and the core genome tree, with phylogroup 2 clustering with phylogroups 4 and 10 instead of phylogroup 3. This suggests that the R-PAI T3SS has also been transferred horizontally between phylogroups during the evolutionary history of the P. syringae species complex. From an evolutionary perspective, the presence of the R-PAI T3SS in such a large number of P. syringae lineages may suggest its selective benefit in nature [5], but the exact function of the R-PAI T3SS has yet to be investigated.

Type III secreted effector proteins (T3SEs)
The role of T3SSs is to deliver T3SEs into host plant cells to subvert the host immune response and promote bacterial growth. Therefore, we also explored the frequency and distribution of known T3SE families across P. syringae strains by blasting representative experimentally validated and predicted T3SEs against our P. syringae genome assemblies [78,79]. We also attempted to identify novel T3SE candidates by searching for the universal N-terminal secretion signal and the hrp-box motif.
The number of known T3SE families per strain varied dramatically, from a minimum of four in strains from phylogroup 9, to a maximum of nearly 50 in some strains from phylogroup 1 ( Fig. 4: Figure S8). By analyzing the distribution of each effector family across P. syringae strains in the primary phylogroups (Fig. 4), we identified three core T3SEs (avrE1, hopAA1, hopAJ2) that were present in some form (full-length ORF or truncated ORFs) in more than 95% of the primary phylogroup strains. Two of these core T3SEs (avrE1 and hopAA1) are found in the CEL of the canonical T-PAI T3SS. In addition, a number of other T3SEs, including a third T3SE from the CEL (hopM1), are also broadly distributed across P. syringae phylogroups (Fig. 4), but did not pass the core genome threshold of 95%. Interestingly, in contrast to the other T3SEs in the CEL, hopN1 is not broadly distributed and is only found in phylogroup 1 strains.
The remaining T3SEs are patchily distributed across the phylogenetic tree and a hierarchical clustering analysis of the total effector content of individual P. syringae strains reveals that strains from the same phylogroup can differ substantially in their T3SE content (Additional file 1: Figure S10A). In the T3SE content tree, phylogroup 6 strains are clustered with phylogroup 1 instead of phylogroup 3, while phylogroup 3 and phylogroup 5 strains are split. Specifically, some phylogroup Prevalence of all known type III secreted effectors (T3SEs) in each of the P. syringae phylogroups analyzed in this study. T3SEs were identified using a tblastn of 1215 experimentally verified or computationally predicted effector sequences from the BEAN 2.0 database and were considered present if a significant hit was found in the genome (E value < 1 −5 ). Gray scaling indicates the prevalence of each T3SE family within the respective phylogroups 3 strains cluster with phylogroup 1 and others cluster with phylogroup 2, while distinct clusters of phylogroup 5 strains are also found on distant regions on the T3SE content tree. Finally, while all secondary phylogroups strains, which contain considerably fewer T3SEs than primary phylogroup strains, cluster separately from primary phylogroups in the T3SE content tree, these phylogroups are often not resolved based on their T3SE contents and are monophyletic with the two low T3SE content strains from phylogroup 2c.
We also performed a separate analysis focusing only on variation in the exchangeable effector locus (EEL) in each of our P. syringae strains, which is known to be located between the tRNA-Leu and hrpK1 genes bordering the hrp/hrc cluster of genes encoding the type III secretion apparatus. An EEL region was identified in all 380 primary phylogroup strains with the exception of the two strains in phylogroup 2c, but was only identified in four out of the 11 secondary phylogroup strains. As expected, the content of the EEL region was highly variable across strains, and a hierarchical clustering analysis of the EEL content revealed that this region does a poor job of resolving even primary phylogroup relationships (Additional file 1: Figure  S10B). For this analysis, we only included the 211 P. syringae strains that contained intact EEL on a single contig. Overall, the patchy distribution of T3SEs across the P. syringae phylogenetic tree, particularly those in the EEL, demonstrates that T3SEs are highly dynamic genes that are acquired and lost with high frequency, presumably in response to host-mediated selection.
In addition to the members of known effector families that we identified in this study, 6264 additional protein sequences from our 391 P. syringae strains contained a characteristic T3SE N-terminal secretion signal and an upstream hrp-box promoter (Additional file 6). We re-annotated these protein sequences using the Gene Ontology and Uniprot databases (Additional file 1: Table S1) and found that 5325 (85.01%) of these putative effectors were either from known T3SE families and were missed in our blast similarity analysis or were sequences associated with the T3SS. The remaining 939 proteins, which come from 282 distinct families, were annotated with a diverse array of predicted functions relating to metabolic processes, protein transport, signal transduction, peptidase activity, and pathogenesis, are candidates for novel T3SEs. However, these proteins may also represent non-effector proteins that are expressed under the control of a hrp-box promoter and have similarity in the N-terminal region to true T3SEs [80,81]. Further computational and experimental verification of these candidate T3SEs will ultimately be required to determine if these are in fact T3SEs. We recommend that the 458 putative T3SEs from 111 families with a hrp-box between 15 and 265 base-pairs from their start codons be prioritized for these studies, as has been suggested previously [82][83][84].

Phytotoxins
Phytotoxins are secondary metabolites that play a non-host-specific role in pathogenesis as well as having generalized antibacterial and antifungal properties [85]. We studied the distribution of eight well-known phytotoxin biosynthesis pathways in P. syringae, including auxin, mangotoxin, syringopeptin, syringolin, syringomycin, tabtoxin, phaseolotoxin, and coronatine by using a protein blast search of their known biosynthesis genes (Fig. 3, Additional file 1: Figure S11). Specifically, we considered phytotoxin pathways present if we identified more than half of the proteins involved in the biosynthetic pathway in a strain. Auxin appears to be the only broadly distributed phytotoxin, as genes for auxin production were found in all strains of the P. syringae species complex, with the exception of PziICMP8959 from phylogroup 4. Genes for the production of phaseolotoxin and coronatine were also found in strains from a number of phylogroups but are still missing from many P. syringae strains. Mangotoxin, syringomycin, and syringopeptin are mostly restricted to two phylogroups, with mangotoxin production being restricted to strains from phylogroups 2 and 11, while syringomycin and syringopeptin production were restricted to strains from phylogroups 2 and 10. Interestingly, there was a perfect overlap between strains that produced syringomycin and strains that produced syringopeptin. Finally, both tabtoxin and syringolin are only produced by a high frequency of strains from a single phylogroup (phylogroups 4 and 2, respectively). Overall, the majority of P. syringae strains only possess genes necessary to produce one or two phytotoxins; however, strains from phylogroup 2 can synthesize as many as five phytotoxins. Interestingly, phylogroup 2 strains harbor fewer T3SE genes, which suggests that phylogroups 2 strains may have evolved a unique strategy to interact with their hosts or associated microbiomes that relies more on generalized toxins as opposed to specialized T3SEs [23,[86][87][88].

Miscellaneous virulence-associated systems
Finally, we performed a search for all putative virulence factors in P. syringae by scanning the proteome of each strain using a BLAST search against the Virulence Factors of Pathogenic Bacteria Database (VFDB) [66]. Eight hundred eighty-five out of 17,807 orthologous protein families that were present in at least five P. syringae strains (4.97%) were identified as predicted virulence factors and were significantly associated with 36 different biological process (FDR p value < 0.05) [89,90]. These pathways included a high frequency of families involved in cellular localization, pathogenesis, flagellar movement, protein secretion, regulation of transport, siderophore biosynthesis, secondary metabolite biosynthesis, and other metabolic processes (Additional file 1: Table S2).

Evolutionarily significant genes
We explored the phylogenetic distribution and diversity of what we refer to as "evolutionarily significant" ortholog families to identify which gene families are significantly impacted by natural selection and recombination. We focused on those gene families showing genetic signatures consistent with positive selection and/or recombination. We were particularly interested in identifying loci which recombine between distinct phylogroups since these have the potential to reinforce the genetic cohesion in this diverse species complex.

Positive selection
We performed a codon-level analysis of natural selection using FUBAR [91] on all 17,807 ortholog families that were present in at least five P. syringae strains to identify families with significant evidence of positive selection at one or more residues (Bayes Empirical Bayes p value ≥ 0.9; dN/ dS > 1). Recombination was accounted for in this analysis by using a partitioned sequence alignment and the corresponding phylogenetic tree from the output of GARD (see below), which identified 1649 ortholog families with signatures of homologous recombination (p ≤ 0.05). A total of 3888 ortholog families had significant evidence of positive selection at one or more codons (21.83%), with 931 of these families (23.95%) coming from the core genome and 2957 (76.05%) coming from the accessory genome. Interestingly, this suggests that there is a significant bias for genes in the core genome to contain individual sites under positive selection (chi-squared test; χ 2 = 5670.60, df = 1, p < 0.0001), despite the fact that overall these genes are constrained by purifying selection and conserved across the P. syringae species complex.

Recombination
We searched for different signatures of homologous recombination in the 17,807 ortholog families that were present in at least five P. syringae strains using four programs: GARD [92], CONSEL [93], GENECONV [94], and PHIPACK [95]. These four methods use different underlying principles to identify recombination. GARD uses genetic algorithms to assess phylogenetic incongruence between sequence segments. CONSEL employs the Shimodaira-Hasegawa test to assess the likelihood of a dataset given one or more trees. GENECONV looks for imbalances in the distribution of polymorphism across a sequence (i.e., clusters of polymorphisms). PHIPACK calculates a pairwise homoplasy index (PHI statistic) based on the classic four gamete test [96] that assesses the minimum number of homoplasies needed to account for the linkage between two sites. Our analysis identified a total of 11,533 (64.77%) ortholog families with signatures of homologous recombination in at least one of these analyses. Specifically, GARD, CONSEL, GENECONV, and PHIPACK identified 1616, 1681, 4433, and 7379 ortholog families respectively (Bonferroni corrected p ≤ 0.05), with relatively little overlap between these packages (Additional file 1: Figure S12). Not surprisingly, those ortholog families that displayed evidence of recombination had significantly greater average lengths (1010.09 bps ± 8.70 (SEM)) than those that did not display evidence of recombination (683.49 bps ± 10.55 (SEM)) (Welch's two sample t test; t = 23.87, df = 14,148, p < 0.0001). This is consistent with the expectation that shorter genes are less likely to be involved in recombination because of their decreased target size and/or the decreased power of analyses of recombination on shorter genes [95,97,98]. We additionally partitioned the GENECONV analysis results into intra-and inter-phylogroups recombination events, demonstrating that ortholog families that recombine within phylogroup (2476; 55.85%) are more common than ortholog families that recombine between phylogroups (1957; 44.15%).
Using all 11,533 ortholog families with signatures of homologous recombination, we first asked whether the well-established negative correlation between the frequency of homologous recombination and evolutionary rate could explain the reduced recombination rate between phylogroups [99,100]. Given the wide range in strain numbers and overall diversity among phylogroups, we normalized the number of recombination events occurring between phylogroups in a number of different ways, including: recombination events per gene per strain, events per gene adjusted by branch length, events per strain adjusted by branch length, and others. The general pattern was the same regardless of the means of normalization, so we report here the analysis after normalizing recombination events per strain adjusted by branch length. Our analysis revealed a significant negative log-linear relationship between normalized recombination frequency and non-synonymous substitution rates (Ka) for strains within the same phylogroup and between different primary phylogroups, as predicted (Linear regression; F = 49.51, df = 30, p < 0.0001, r 2 = 0.6227) (Fig. 5a). A significant negative log-linear relationship was also observed between normalized recombination frequency and synonymous substitution rates (Ks) for the same strain pairs (linear regression; F = 54.53, df = 30, p < 0.0001, r 2 = 0.6451) (Fig. 5b). In contrast, recombination events between strains from different secondary phylogroups and between strains in primary versus secondary phylogroups displayed a significant negative log-linear relationship between normalized recombination frequency and Ka (linear regression; F = 10.58, df = 32, p = 0.0027, r 2 = 0.2485) (Fig. 5a). Again, this relationship was supported by comparisons of normalized recombination frequency with Ks for the same strain pairs (linear regression; F = 11.40, df = 32, p = 0.0019, r 2 = 0.2627) (Fig. 5b). One of the reasons why we might not find a negative relationship between recombination rates and evolutionary rates of more distantly related strains is that other factors, like environmental isolation, are confounding recombination biases that are associated with sequence similarity.
We then applied hierarchical clustering analysis to assess the relationship between phylogroups based on the frequency of recombination between them (Fig. 5c) and identified two distinct clusters. One cluster contains all but one of the primary phylogroups (phylogroup 10), and therefore includes the vast majority of strains that have been isolated from agricultural environments (phylogroups 1, 2, 3, 4, 5, and 6). The second clade contains all of the secondary phylogroups and therefore includes many strains with environmental origins (phylogroups 7, 9, 10, 11, and 13). The only exception to a clean split between primary and secondary phylogroups is phylogroup 10, which clusters with the primary phylogroups in the core genome phylogeny, but clusters with the secondary phylogroups in this analysis. This finding is interesting since two of the three strains from phylogroup 10 in our collection come from environmental sources, while the third was isolated off a non-diseased plant. These results suggest that ecological differences may also play a role in establishing recombination barriers within the P. syringae species complex [101]. While these relationships are robust to different methods of normalizing the number of recombination events, it is important to note that we also have much better sampling of nearly all the primary phylogroups relative to the secondary phylogroups, and therefore, much more confidence in the overall patterns of diversity found in these groups.
Previous studies have also reported significant horizontal gene transfer (HGT) between the P. syringae complex and other bacterial species [61]. Therefore, we A C B Fig. 5 Recombination analysis between P. syringae strains from different phylogroups (PGs). Pairwise phylogroup recombination events were normalized based on the pan-genome size, the number of strains, and the total branch length for each phylogroups pair. a Regression analysis of recombination rates and corresponding non-synonymous substitution rates (Ka). There is a significant negative log linear relationship between recombination rates and Ka for strains within the same phylogroup and between different primary phylogroups (F = 49.51, df = 30, p < 0.0001, r 2 = 0.6227); however, the inverse relationship exists when comparing more distantly related strains from different secondary phylogroups and strains from primary and secondary phylogroups (F = 10.58, df = 32, p = 0.0027, r 2 = 0.2485) b Regression analysis of recombination rates and corresponding synonymous substitution rates (Ks). The same significant negative (F = 54.53, df = 30, p < 0.0001, r 2 = 0.6451) and positive (F = 11.40, df = 32, p = 0.0019, r 2 = 0.2627) log linear relationships were observed for strains within the same phylogroup and between different primary phylogroups, and more distantly related strains from different secondary phylogroups and strains from primary and secondary phylogroups, respectively c Hierarchical clustering of homologous recombination frequency between phylogroups of the P. syringae species complex. Pairwise distances between phylogroups were calculated using the Jaccard coefficient method, based on the normalized pairwise recombination rates. Note that phylogroup 10 (PG10) is a primary phylogroup that is more closely related to phylogroups 1, 2, 3, 4, 5, and 6. Agricultural vs. Environmental labeling indicates that the bulk of the strains in these phylogroups come from these sources performed a blastp search for all protein sequences from all 391 P. syringae genomes (2,176,750 sequences) against the NCBI-GenBank non-redundant protein database to identify candidate genes that have recently undergone cross-species horizontal transfer. Specifically, we considered any protein sequence with a significant match from another species in the first three blast hits to be a candidate for recent cross-species horizontal transfer. This allows us to in minimize false negatives resulting from the best matches being from the query strain or other closely related P. syringae strains that are present in the database. Based on these criteria, we identified 31,409 (1.44%) candidate horizontally transferred genes, and another 55,765 (2.56%) genes with no similarity matches in the non-redundant database. The most common genera involved in the putative horizontal transfer events include Pseudomonas, Xanthomonas, Burkholderia, Klebsiella, Enterobacter, Serratia, Legionella, Pectobacterium, Pantoea, Escherichia, Salmonella, Ralstonia, Azotobacter, Achromobacter, Erwinia, Rhizobium, Bordetella, and Stenotrophomonas (Additional file 1: Figure S13A). After normalizing for the number of strains in each phylogroup, it appears as though three non-agricultural, environmentally isolated phylogroups (in rank order: phylogroups 13, 7, and 11) undergo the most HGT (Additional file 1: Figure S13B). This remains the case for phylogroups 11 and 13 when Pseudomonas is not included as a donor genus, but phylogroup 7 does not appear to undergo higher rates of HGT with non--Pseudomonas donors. In any event, this finding suggests that environmental P. syringae strains may retain more loci obtained via HGT with other bacterial species because of increased opportunities to interact with a more diverse community of microbes, many of which could be unrelated pathogenic strains.

Maintenance of genetic cohesion
In clonally reproducing bacteria, recombination is the only evolutionary process that can counter lineage diversification driven by mutation, genetic drift, and selection, thereby maintaining the overall genetic cohesion of the species. As discussed above, inter-phylogroup recombination occurs less frequently than intra-phylogroup recombination. This relationship is predicted based on the well-established log-linear relationship between sexual isolation (i.e., inverse of the recombination rate) and the level of sequence divergence due to increased difficulty of forming a DNA heterduplex as sequence divergence increases [99]. Despite this, we did find evidence that a considerable proportion of ortholog families participate in inter-phylogroup recombination, which could be an important force for maintaining genetic cohesion in the P. syringae species complex. We therefore wished to know the relationship between inter-phylogroup recombination and ecologically and evolutionarily significant genetic loci. Specifically, we examined whether inter-phylogroup recombination disproportionately occurred at these critical loci. To study this relationship, we focused on 17,807 orthologous gene families present in at least five P. syringae strains so that we could reliably detect signatures of recombination in all families included in the analysis. We then classified all of these families based on whether they display evidence of inter-phylogroup recombination (GENECONV), whether they were identified as ecologically significant (VFDB), and whether they were identified as evolutionarily significant (FUBAR positive selection analysis).
We first asked if there was a higher frequency of ecologically significant, virulence-associated loci among the evolutionarily significant, positively selected loci (Fig. 6a). 23.50% (208) of the 885 virulence-associated ortholog families were found to have a signal of positive selection compared to 21.75% (3680) of the 16,922 non-virulenceassociated ortholog families (chi-squared proportions test; χ 2 = 1.58, df = 1, p = 0.2081), indicating that positive selection is not more likely to operate on virulence-associated loci in general. Second, we asked if inter-phylogroup recombination disproportionately acted on virulence-associated ortholog families (Fig. 6b). 15.25% (135) of the 885 virulence-associated families were found to recombine between phylogroups compared to only 10.77% (1822) of the 16,922 non-virulence-associated families (chi-squared proportions test; χ 2 = 19.08, df = 1, p < 0.0001), indicating that virulence-associated loci are significantly more likely to recombine between phylogroups than non-virulence-associated loci. Third, we asked if inter-phylogroup recombination disproportionately acted on positively selected ortholog families (Fig. 6c). 13.32% (518) of the 3888 positively selected families were found to recombine between phylogroups compared to only 10.34% (1439) of the 13,919 non-positively selected families (chi-squared proportions test; χ 2 = 51.40, df = 1, p < 0.0001), indicating that positively selected loci are also significantly more likely to recombine between phylogroups than non-positively selected loci. Fourth, we asked if inter-phylogroup recombination disproportionately acted on the small set of loci that are both positively selected and virulence-associated (Additional file 7). 20.19% (42) of the 208 positively selected, virulence-associated ortholog families were found to recombine between phylogroups as opposed to 10.88% (1915) of the 17,599 other ortholog families (chi-squared proportions test; χ 2 = 17.86, df = 1, p < 0.0001). This set of orthologs include some of the most widely studied loci associated with host-microbe interactions, including numerous T3SEs, components of the flagellar system (fliC, flg22), phytotoxins, chemotaxis proteins, and an alginate regulatory protein (Additional file 7). We also performed this same suite of analyses focusing exclusively on primary phylogroups (1, 2, 3, 4, 5, and 6) to examine the strength of recombination to maintain genetic cohesion in this cluster of more closely related P. syringae strains. Indeed, although there is still no significant correlation between ecologically and evolutionarily significant genes in the primary phylogroups, the frequency with which both ecologically and evolutionarily significant genes are transferred between primary phylogroups is even greater than it was when we considered all phylogroups (Additional file 1: Figure S14, Additional file 1: Table S3).
Taken together, these results demonstrate that inter-phylogroup recombination disproportionately involves ecologically relevant (virulence-associated) and evolutionarily significant (positively selected) ortholog families in P. syringae. This may be because these families are individually recombined across phylogroups at a higher rate or because the recombination events involving these families are larger and involve multiple ecologically relevant or evolutionarily significant genes. Therefore, while inter-phylogroup recombination may be less common than intra-phylogroup recombination, it plays a critical role in circulating genes important for maintaining the ecological niche of the species complex and thus maintains the genetic cohesion on between all P. syringae strains.

Discussion
In this study, we analyzed the genomes of a diverse collection of 391 P. syringae strains representing 11 of the 13 P. syringae phylogroups to gain insight into the genome dynamics and evolutionary history of the P. syringae species complex. We reveal that P. syringae has a large and diverse pan-genome that will likely continue to expand with the sampling of more strains. We also demonstrate strong concordance at the phylogroup level between the refined core genome and gene content trees of P. syringae strains with a few exceptions, suggesting that while horizontal gene transfer between P. syringae phylogroups is typically insufficient to distort the phylogenetic signal from vertical inheritance of gene content, there are cases where it has distorted relationship among subgroups. Furthermore, by investigating the distribution of ecologically and evolutionary relevant loci in the P. syringae species complex and the rates of intra-and inter-phylogroup recombination of these genes, we also demonstrate that despite its relative rarity, inter-phylogroup recombination is a critical cohesive force that disproportionately facilitates the spread of ecologically and evolutionarily significant loci across P. syringae phylogroups.
Core and accessory genetic content in the P. syringae pan-genome The P. syringae pan-genome is vast and extremely diverse, comprising a total of 77,728 ortholog families. Yet, very few of these ortholog families are present at high frequency in the P. syringae species complex. A rarefaction analysis demonstrates that the composition and size of core genome stabilizes after sampling approximately 50 strains at~2500 genes. This is slightly smaller than estimates from three prior studies that identified core genome sizes of 3397 [23], 3364 [61], and 3157 [102]. However, these prior studies were mostly restricted to the primary phylogroups, and only the Mott et al. study [102] was performed with more than 50 strains. The higher core genome sizes that we observed when analyzing single primary phylogroups are more consistent with these earlier studies, thus supporting the notion that these earlier studies overestimated the core genome size of P. syringae due to insufficient sampling. The P. syringae core genome size is also comparable to the core genome sizes of other pathogenic Proteobacteria, including P. aeruginosa (2503) [103], Erwinia amylovora (3414) [104], and Ralstonia solanacearum (2543) [105]. This raises the possibility that different pathogenic bacteria may have similar core metabolic requirements; however, the extent to which the core genome content is conserved across species will require further investigation. Our analysis further clarifies and expands our understanding of the highly dynamic nature of the P. syringae accessory genome. The gene family size distributions (Additional file 1: Figure S5) suggest that a relatively small number of gene families are found in more than ten strains (16.36%), while the majority of families (60.60%) are only found in a single strain. The pan-genome rarefaction curve (Fig. 1b) demonstrates that the pan-genome of P. syringae remains open after sampling 391 strains and will therefore continue to increase in size as more diverse P. syringae strains are added to the analysis at a rate of~193 new ortholog families for each new strain analyzed. However, the rate at which the pan-genome size will increase will clearly be affected by the phylogroup from which new strains are sampled given that the secondary phylogroups remain severely under sampled. The tendency of gene families to be present in only a single strain is often attributed to a species' ability to acquire novel DNA through horizontal gene transfer [106]. However, the ubiquitous distribution of P. syringae strains across the globe is likely also a key contributor to the diverse gene content of different strains, as many strain-specific genes may be under selection only in specific environments. A large number of the strain-specific gene families that were identified in this study are annotated as hypothetical genes with no similar sequences in any database that we searched, and thus may represent a diverse collection of niche specific genes in P. syringae that are entirely unexplored. However, as we have already acknowledged, it is also important to recognize that some of these strain specific genes may be artifactual due to sequencing and assembly errors [60]. Furthermore, although the P. syringae pan-genome remains open, we believe we have sampled the majority of higher-frequency genes, at least in primary phylogroups, since our rarefaction analysis on non-singleton orthologs did plateau (Fig. 1b).

Phylogenetic relationships and diversity among P. syringae strains
Investigating the relationship between core genome and gene content trees can shed important insight into the lifestyle and evolutionary history of bacterial species. Specifically, strong discordance between core genome and pan-genome trees is suggestive of extensive genomic flux among lineages [107], which obscures the clonal relationship between strains in the gene content tree. For example, genome analyses of core genome and gene content in the marine bacteria Vibrio have shown strong discordance, suggesting extensive horizontal transfer between lineages [108]. However, other species like the marine bacterium Prochlorococcus have concordant core genome and gene content phylogenies [109], suggesting that horizontal transfer has played a lesser role in their evolutionary history.
In P. syringae, the core genome and gene content trees are largely concordant at the level of phylogroups. The one major exception to this concordance is the relationship between phylogroups 2 and 10, which cluster more closely in the gene content tree than they do in the core genome tree. Previous studies have shown that phylogroups 2 and 10 have similar virulence repertoires [21] and that almost all strains from these phylogroups have high ice nucleation activity [14,76,110]. This elevated gene content and phenotypic similarity likely reflects similarity in the lifestyles and ecology of strains from these phylogroups, which may be the result of increased horizontal transfer, convergent evolution, or both.
Indeed, we find that the 2832 gene families that are in the soft-core genome (> 95% of strains) of both phylogroups 2 and 10 are significantly more likely to be evolutionarily significant (chi-squared proportions test; χ 2 = 832.31, df = 1, p < 0.0001) and ecologically significant (chi-squared proportions test; χ 2 = 9.72, df = 1, p = 0.0018) than the remaining 14,975 non-core families. However, gene families in the soft-core genome of phylogroups 2 and 10 are significantly less likely to be involved in inter-phylogroup recombination events than other genes (chi-squared proportions test; χ 2 = 15.22, df = 1, p < 0.0001). This suggests that phylogroups 2 and 10 strains do not exchange more genes than the rest of the P. syringae species complex through recombination. Consequently, convergent evolution likely plays a key role in the increase of shared genes between these two phylogroups. It is nevertheless important to emphasize that the P. syringae core genome and gene content trees are largely concordant at the level of phylogroup, which suggests that although we do find some evidence of genomic flux, the rate of inter-phylogroup horizontal transfer is not sufficient to obscure the phylogenetic signature of vertical gene inheritance.
The P. syringae species complex is unquestionably highly diverse, and claims have been made that the diversity between phylogroups is actually greater than the observed diversity between well-established species [61]. We used the entire soft-core genome alignment to estimate the level of genetic divergence between all phylogroups to explore whether distinct phylogroups do in fact have consistently higher genetic divergence than distinct species pairs (Additional file 1: Figure S6). We determined that average Ka and Ks values among strains in the primary phylogroups were less than the average values between P. aeruginosa and P. putida strains, and E. coli and S. enterica strains. The average among primary phylogroup Ka values was also lower than the average values between strains of A. hydrophila and A. salmonicida, although the Ks values were roughly similar. Estimates of Ka and Ks between N. gonorrhoeae and N. polysaccharea are considerably lower than those of both P. syringae phylogroups and other distinct species pairs, but the Neisseria genus is known to be highly recombinogenic, which can distort evolutionary rates, making this species pair a likely outlier [65]. In contrast, both the average Ka and Ks values obtained when comparing strains between primary and secondary phylogroups or those between different secondary phylogroups are more consistent with the distinct species pairs, with a few exceptions. Overall, these analyses suggest that the primary phylogroups are not excessively divergent relatively to other bacterial species, in contrast to the secondary phylogroups, which may be sufficiently divergent to be considered distinct species.

Phylogenetic distribution ecologically significant genes
A unifying feature among all strains in the P. syringae species complex included in this study is the presence of at least one T3SS. The most common T3SS in the P. syringae species complex is the canonical T-PAI T3SS, and consistent with prior studies, we found that nearly all agriculturally associated strains carry one. In addition, we also found that a number of non-agricultural strains from phylogroups 9 and 10 possess a canonical T-PAI T3SS. These data are consistent with an earlier report of the presence of a canonical T-PAI T3SS in non-agricultural strains from phylogroup 1A [25,26], some of which were shown to cause disease on tomato. Although the host-range of these non-agricultural strains from phylogroups 9 and 10 has yet to be studied experimentally, it raises the interesting possibility that they may be pathogens of wild plant species and act as a reservoir for the recurrent emergence of crop pathogens.
In addition to the canonical T-PAI T3SS, we also found that many P. syringae strains possess an R-PAI T3SS, while the A-PAI and S-PAI T3SSs are found in a small number of strains isolated in discrete phylogroups. The A-PAI and S-PAI T3SSs are always present in the absence of the canonical T-PAI, suggesting that they may serve as a replacement T3SS in a different niche. In contrast, the R-PAI T3SS is always present in concert with at least one other T3SS. Bacteria with multiple T3SSs that have complementary functions have been reported previously [111,112]. For example, Salmonella species contains two different T3SSs known as SPI-1 and SPI-2 [111]. SPI-1 promotes bacterial pathogenicity by facilitating host invasion, while SPI-2 is critical for survival, replication and dissemination of the bacteria after it enters the host cell [113]. This is also not the first study report of the presence of the R-PAI T3SS outside of Rhizobium species. A wide array of symbiotic and non-pathogenic bacteria, including Photorabdus luminescens, Sodalis glossindicus, Pseudomonas fluorescens, and Desulfovibrio vulgaris, have also been reported to harbor the R-PAI T3SS [113]. Although its expression in P. syringae is low and its function outside of Rhizobia remains unclear [75], the broad distribution of this the R-PAI T3SS across P. syringae strains implies that it is likely of functional importance for a number of strains in the complex.
The phylogenetic distribution of the different T3SSs and our phylogenetic analysis of the conserved HrcV protein from all T3SSs also sheds critical light on the evolutionary history of each T3SS in the P. syringae species complex. The broad phylogenetic distribution of the T-PAI T3SS has led some previous studies to conclude that it was present in the most recent common ancestor of the P. syringae species complex [114,115], while others have suggested that the canonical T-PAI may have been acquired after the divergence of the primary and secondary phylogroups [72,76]. Indeed, the patchy distribution among strains in the secondary phylogroups (i.e., found in only 37.50% of secondary phylogroup strains vs. 97.91% for primary phylogroup strains) observed here provides evidence that the canonical T-PAI was acquired after the divergence of the primary and secondary phylogroups. However, acquisition by the common ancestor of all P. syringae and subsequent loss by some secondary phylogroup lineages is also a possibility.
Two additional lines of evidence support the early acquisition of both the T-PAI and the R-PAI T3SSs. First, the genomic region encoding these T3SSs shares the same %GC as the rest of the genome [6,75]. Second, the HrcV genealogies from both the T-PAI and the R-PAI T3SSs are generally congruent with the core genome tree (Fig. 2a, Additional file 1: Figure S8), indicating a common evolutionary history. In contrast, the rarity of the A-PAI and S-PAI T3SSs in the P. syringae complex suggest later horizontal transfer into only a few P. syringae lineages. Specifically, the A-PAI T3SS appears to have been acquired independently in phylogroup 13 and a small group of phylogroup 2 strains (phylogroup 2c), as evidenced by the unique location of the A-PAI T3SS in these two genomes. The S-PAI T3SS, which is most closely related to the T3SS found in Erwinia and Pantoea species, is also present in two distantly related phylogroups (7 and 11) which are reported to be pathogenic on some plants [14].
As shown in previous studies [6,23,116], T3SEs that are delivered by the T3SS are patchily distributed across the P. syringae species complex with a few exceptions. The presence of these T3SEs in only a small but diverse suite of strains suggests that horizontal gene transfer is common in these families and that they are subject to strong diversifying selection. Specifically, T3SEs are known to experience frequent gain/loss events and rapid sequence diversification to obtain new functional capabilities or avoid host immune recognition [23,[117][118][119]. The phylogenetic distribution and diversification of the effectors analyzed in this study suggests that both of these evolutionary forces are at play in a large number of the P. syringae T3SE families. Despite the patchy distribution of most T3SEs, prior studies have identified a set of four core T3SEs, which include avrE1, hopAA1, hopM1, and hopI1 [116,6]. We confirmed this characterization for the avrE1 and hopAA1 families, but the hopM1 and hopI1 effectors are not present in more than 95% of the strains analyzed in this study, even though they are present in the majority of strains from the primary phylogroups. In addition to avrE1 and hopAA1, we also identified a third core T3SE, hopAJ1, and two other T3SE families, hopAN1 and hopJ1, that are present at some frequency in all eleven phylogroups. However, these gene families have more recently been discontinued as T3SE families or reclassified as T3SS helpers because they are not translocated into the host cytoplasm by the T3SS. Finally, using an HMM-modeling approach that searches for the conserved N-terminal secretion signal and the hrp-box promoter of known T3SEs, we have also proposed a new set of novel T3SEs in the P. syringae species complex that are strong candidates for functional assays (Additional file 1: Table S1). However, given prior evidence that several candidate T3SEs that are expressed under the control of hrp-boxes are not translocated [23,80,81], a number of these candidates will likely not be functional T3SEs.
Recombination and genetic cohesion in the P. syringae species complex Recombination plays a significant role in the evolution of bacteria [100,120], and while it can lead to either genetic diversification or homogenization depending on the population structure of the donor and recipient strains, the latter role is particularly important in maintaining genetic cohesion within a species [35,38,120,121]. Previous studies in P. syringae have reported that recombination between phylogroups is relatively rare [13,61,122]. However, these studies were based on analyses of a small set housekeeping genes were performed with a limited collections of strains, so they lacked a sufficient genomic and sampling depth to draw firm conclusions about the extent of recombination across the pan-genome. This is particularly important because it has been suggested that horizontal transfer occurs at a relatively high rate in the accessory genome and has a disproportionate effect on strain adaptation in nature [5,23,61]. Our analysis found a signature of recombination in 11,533 (64.77%) of the 17,807 ortholog families that were present in at least five P. syringae strains. Among the 4433 recombination events identified by GENECONV, 2476 (55.85%) of these events were intra-phylogroup recombination events, while the remaining 1957 (44.15%) were inter-phylogroup recombination events. These findings reaffirm that recombination within phylogroups is more common than recombination between phylogroups, likely as a result of the well-established linear relationship between sequence divergence and the logarithm of the recombination rate [99,100]. However, while sequence similarity appears to be the key factor determining the rate of recombination between relatively closely related strains within the primary phylogroups, our data suggest that recombination between more distantly related strains appears to be governed by other forces (Fig. 5). A particularly intriguing finding is that phylogroup 10 strains cluster with secondary phylogroup strains in terms of their pairwise recombination frequency, despite the fact that phylogroup 10 is a primary phylogroup in the core genome tree (Fig. 2). The major distinction between phylogroup 10 strains and the bulk of the primary phylogroup strains is that, like secondary phylogroup strains, they were isolated from non-agricultural sources. This may indicate that ecology plays a more important role in determining the extent of recombination than sequence similarity, at least for long-distance (e.g., between phylogroup) genetic exchange.
lthough inter-phylogroup recombination is rarer than intra-phylogroup recombination overall, we also used our expanded dataset to explore whether specific evolutionarily and ecologically important gene families more frequently undergo inter-phylogroup recombination than other gene families. For ecologically important genes, we used all virulence associated orthologous gene families that were identified by the VFDB (885/ 17,807; 4.97%). For evolutionarily important genes, we used all orthologous gene families determined to be positively selected at least one site by FUBAR (3888/ 17,807; 21.83%). The analysis showed that both ecologically and evolutionarily important gene families are more likely to recombine between phylogroups than other gene families (Fig. 6). This finding is consistent with the observation that ecologically adaptive genes are successfully transferred at high rates among diverse strains in a species complex [123], and suggests that inter-phylogroup recombination disproportionally spreads ecologically and evolutionarily important genes across phylogroups, which may help maintain genetic cohesion within the P. syringae species complex.
Fundamental evolutionary principles for delimiting P. syringae species There is a long history to the debate over the appropriate way to delimitate species within the P. syringae complex [17], stemming from the use of largely arbitrary and ad hoc species delimitation cutoffs in DNA-DNA hybridization assays, MLST analyses, and pathotype designations [7,17,124,125]. Most of these prior studies have been poorly-powered in terms of both the number of strains and the number of genes analyzed. A more recent study by Gomila et al. has employed comparative genomics approaches to a more diverse collection of 139 P. syringae complex strains to rectify these issues [9]. In this study, the authors suggest the presence of a total of 19 nomenspecies in the P. syringae species complex. However, their analyses do not consider the role of genome-wide recombination in maintaining genetic cohesion between these nomenspecies. Because the current study dramatically increases both the number and diversity of P. syringae strains sampled and considers the role of recombination in creating species barriers between P. syringae strains, we obtain a unique perspective into the ecological and evolutionary forces operating in the P. syringae species complex and suggest that future work to delimit the complex should be founded with consideration of these fundamental evolutionary processes.
From an ecological perspective, species differentiation results from the adaptation of two or more subpopulations to different environments or niches [101,126]. Here, diversifying selection among a few loci that are essential for differential adaptation to alternative environments can drive speciation in the absence of barriers to recombination. There is evidence that this has occurred in P. syringae, given the broad global distribution and diverse disease-causing capabilities of P. syringae strains [1]. Specifically, Monteil et al. show weak ecological differentiation between an agricultural pathogenic P. syringae population and a closely related environmental population of P. syringae, despite there being no barrier to recombination between these populations [26]. However, it is currently unclear what the differentially selected loci in these populations are and whether they have sufficiently diverged to be considered an early speciation event. Furthermore, the lack of correlation between the core genome phylogenetic profile of P. syringae strains and their pathovar designations suggests that there are many different pathways for adaptation to a single host, so ecological differentiation on its own is likely a poor way to speciate the P. syringae species complex [14,23,25,26] . Future studies should focus on expanding the dataset of non-agricultural P. syringae strains so that we can more effectively distinguish and analyze loci that are differentially selected in ecologically divergent strains.
Both sequence clustering and recombination barriers have been used to delimit bacterial species based on evolutionary principles [127]. Yet, even with the growing abundance of genomic data, it is unlikely that any one criterion will adequately resolve species barriers in the P. syringae complex, largely due to the fluid nature of bacterial genomes. Given what we now know about the phylogenetic relationships between strains, the distribution of ecologically and evolutionarily important genes, the disproportionately high rate of inter-phylogroup recombination among ecologically and evolutionarily significant loci, and finally, the common ecology of diverse P. syringae strains, we propose that there is no ecologically or evolutionarily justifiable basis to split the strains of the primary phylogroups of P. syringae into separate species. In fact, P. syringae provides an outstanding example of how recombination, despite being relatively infrequent, maintains genetic cohesion in this very widespread, diverse, and globally significant lineage.

Genome sequencing and assembly
A total of 391 P. syringae strains and 22 outgroup Pseudomonas strains were used in this study (Additional file 2). The genome assemblies and annotations for 145 of these strains were obtained from public sequence databases, including NCBI/GenBank, JGI/IMG-ER, and PATRIC [128][129][130]. The remaining 268 strains were obtained from the International Collection of Microorganisms from Plants (ICMP) and other collaborators, and were sequenced, assembled, and annotated in the Center for the Analysis of Genome Evolution and Function (CAGEF) at the University of Toronto. For these strains, DNA was isolated using the Gentra Puregene Yeast and Bacteria Kit (Qiagen, MD, USA). Purified DNA was then suspended in TE buffer and quantified with the Qubit dsDNA BR Assay kit (ThermoFisher Scientific, NY, USA). Paired-end libraries were generated using the Illumina Nextera XT DNA Library Prep Kit following the manufacturer's instructions (Illumina, CA, USA), with 96-way multiplexed indices and an average insert size of ≈400 bps. All sequencing was performed on either the Illumina MISeq or GAIIx platform using V2 chemistry (300 cycles). Following sequencing, read quality was assessed with FastQC [131] and low-quality bases and adapters were trimmed using Trimmomatic v0. The trimmed paired-end reads for each of the 268 Pseudomonas genomes sequenced at CAGEF were de novo assembled into contigs using the CLC assembly cell v4.2 program from CLCBio (Mode = fb, Distance Mode = ss, Minimum Read Distance = 180, Maximum Read Distance = 250, Minimum Contig Length = 200). All contigs that were less than 200 bps long were then removed from each assembly and the raw reads from each strain were re-mapped to the remaining contigs using clc_mapper. Next, using clc_mapping_info and clc_info, we calculated the read coverage for each contig in each assembly and compared that with the average contig coverage of the genome assembly to identify contigs with atypical coverage (> 2 standard deviations from the average contig coverage). These atypically covered contigs were then compared to the EMBL plasmid sequence database and the GenBank nucleotide database using BLAST and were removed from the assembly if they were not identified as part of a plasmid sequence.
Gene prediction for these 268 draft Pseudomonas assemblies was performed using DeNoGAP [50], which predicts genes based on the combined output of Glimmer, GeneMark, Prodigal, and FragGeneScan [51][52][53][54]132]. For most genes, these algorithms accurately predicted both the start and the stop positions, but in some instances, genes were incomplete (missing appropriate start and/or start codons). In these cases, we extended the gene as a triplet codon until a stop codon was found at both the 5′ and 3′ end. The first Methionine codon downstream from the 5′ stop codon was considered the start codon, while the first 3′ stop codon was considered the stop codon. This approach allowed us to obtain complete coding sequences for a number of incomplete genes, but for others we were unable to predict a start and stop codons due to a contig break or an assembly gap. These and any other genes that contained runs of N's were considered partial genes and were excluded from the final dataset to avoid complications in downstream comparative and evolutionary analyses. Furthermore, complete coding regions that were only predicted by one program and could not be verified by blasting against the UniProtKB/SwissProt database or pass a minimum length cutoff of 100 bps were discarded. The final collection of coding sequences was then sorted by genome location, and any coding regions that overlapped by more than 15 bases were merged into a single sequence.
All complete genes were then annotated using a blastp search of the corresponding protein sequences for each gene against the UniProtKB/SwissProt database with an E value threshold of 1 −5 [55] . The name and/or description of the best hit was assigned to the corresponding protein and proteins that did not have any significant hits were assigned as hypothetical proteins. Gene ontology terms, protein domains, and metabolic pathways were also annotated in each complete gene using Inter-ProScan v5 (E value < 1 −5 ) [56]. Finally, all complete genes were assigned Cluster of Orthologous Group (COG) categories using a blastp search against the COG database (E value < 1 −5 ) [57,133]. However, COG families were only assigned if the protein query had high sequence identity and coverage (> 70%) with at least three sequences in the family.

Ortholog prediction and phylogenetic analysis
We clustered all complete protein sequences from the 413 Pseudomonas genomes described above, which included 391 P. syringae strains representing 11 of the 13 phylogroups, into putative homolog and ortholog families using DeNoGAP [50]. First, all protein sequences from the closed genome of P. syringae DC3000 were used to construct seed HMM families for DeNoGAP [68], using an all-vs-all pairwise protein sequence comparison with phmmer (E value < 1 −10 ) [134]. Proteins that had greater than 70% identity and 70% coverage for both sequences were clustered together using Markov Chain Clustering (MCL) (inflation value = 1.5) [135]. Proteins that did not pass these criteria with any other protein sequence in the HMM database were clustered separately into a new protein family. The protein sequences from the remaining 412 genomes were then iteratively scanned against the reference HMM database as described above, updating the HMM model and database after each iteration. Following the initial clustering of all proteins from the 413 Pseudomonas genomes into putative homolog families, HMM families were grouped into larger families if at least one member of a family shared more than 70% identity with at least one member of another family. Orthologous protein pairs were then extracted from these homolog families using the reciprocal pairwise distance approach and were clustered into ortholog families using MCL (inflation value = 1.5) [135].
Once all gene families had been clustered, we analyzed the pan-genome of P. syringae using a binary presenceabsence matrix for each ortholog family in the 391 P. syringae genomes, where 1's were used to encode presence and 0's were used to encode absence [136]. We assigned all gene families that were present in at least 95% of the P. syringae strains in our dataset to the soft-core genome and all other gene families to the accessory genome. The more lenient cutoff of 95% is justified because it allows us to limit the artificial reduction in the core genome size that occurs because of disrupted or unannotated core genes in some draft genomes (Additional file 1: Figure S2). We then determined whether the pan-genome of P. syringae was opened or closed using the "micropan" R package [59]. Here, a rarefraction curve of the entire pan-genome was computed using 2000 permutations, each of which was computed using a random genome input order. The curve was then fitted to Heap's law model to calculate the average number of unique ortholog clusters observed per genome and determine whether the pan-genome is opened or closed. For the core and pan-genome analyses that were performed for each individual phylogroup, we simply extracted the portion of the pan-genome matrix containing the strains from the desired phylogroup, then removed families that were not present in any of those strains. All subsequent analyses were performed on these extracted sub-matrices with 100 permutations.
The phylogenetic relationships between the 391 P. syringae strains analyzed in this study were explored using both a soft-core genome tree and a pan-genome content tree. For the core genome tree, we multiple aligned the protein sequences from each soft-core ortholog family using Kalign Version 2, which uses the Wu-Manber pattern matching algorithm [137]. We then concatenated these alignments and removed all monomorphic sites from this alignment using an in-house perl script. The core genome maximum likelihood phylogenetic tree was then constructed using FastTree with default parameters [138]. FastTree uses a combination of maximum likelihood nearest-neighbor interchange (NNIs) and minimum evolution subtree-pruning-regrafting (SPRs) methods for constructing phylogenies [138][139][140]. Local branch support values for the topology of the phylogenetic tree were also calculated in FastTree using Shimodaira-Hasegawa (SH) test [141]. For the genetic content tree, we used the shared gene content information from the "micropan" R package to calculate the genetic distance between each strain and generate a pan-genome distance matrix with Jaccard's method. The topological robustness of the gene content tree was tested by performing average linkage hierarchical clustering with 100 bootstraps. This same method was also employed for the T3SE content and exchangeable effector locus trees.

Identification and analysis of ecologically relevant genes
The first set of ecologically relevant genes that we investigated were the genes that constitute the T3SS, a key virulence determinant in pathogenic P. syringae strains. Specifically, we used the core structural genes of different forms of T3SSs, including the canonical tripartite pathogenicity island (T-PAI) T3SS, the atypical pathogenicity island (A-PAI) T3SS, the single pathogenicity island (S-PAI) T3SS, and the Rhizobium-like pathogenicity island (R-PAI) T3SS to explore the distribution of different T3SSs across the P. syringae species complex. To determine if a particular form of T3SS was present in a given strain, we performed a tblastn search for the core structural genes of each T3SS against each P. syringae genome assembly with an E value cutoff of 1 −5 . All core structural genes for each T3SS were downloaded from NCBI GenBank, using P. syringae DC3000 and P. viridiflava PNA3.3a as references for the T-PAI T3SS, P. syringae Psy642 and P. syringae PsyUB246 as references for the A-PAI T3SS, P. viridiflava RMX3.1b as a reference for the S-PAI T3SS, and P. syringae 1448A as a reference for the R-PAI T3SS. We then chose the top hits for each T3SS structural gene in each genome, translated the region into a protein sequence, and confirmed that there were no premature truncations in the sequence. A given T3SS was considered present if all core structural genes for that T3SS were present and not truncated. These presence/absence data were then used to analyze the distribution of different T3SSs across the P. syringae species complex.
The second ecologically relevant genes that we explored were the T3SEs that are delivered into plant hosts by the T3SS. To analyze the distribution of T3SEs across the P. syringae species complex, we predicted known and novel T3SEs using discrete pipelines. For known T3SEs, we performed a tblastn against each P. syringae assembly using a collection of 1215 experimentally verified or computationally predicted effector sequences downloaded from the BEAN 2.0 database (E value < 1 −5 ) [78]. If a significant hit was identified for a T3SE, the region of the best or only hit was extracted from the genome as a putative T3SE. To identify novel T3SEs, we first constructed an HMM-model using known hrp-box motifs from three completely sequenced P. syringae genomes (Pto DC3000, Pph 1448A, and Psy B728A) [68,73,84,142]. These motif sequences were multiple aligned using Kalign2 [137] and the HMM-model was constructed using hmmbuild [134]. The hrp-box HMM model was then scanned against each P. syringae genome assembly using nhmmer with a high E value (10,000) and low bit score (4) threshold, given the likelihood that this model would yield false positives as a result of the short sequence length. Because a number of T3SEs are known to reside in operons, we then inspected the ten genes downstream of each predicted hrp-box motif for a N-terminal secretion signal using EffectiveT3 [143]. If a gene was both a less than 10 genes downstream of a hrp-box and classified as a T3SE based on their N-terminal secretion signal, we considered them putative T3SEs. The effector repertoire of each P. syringae strain was ultimately used to characterize the core and accessory effector profile of the P. syringae species complex.
A third set of ecologically relevant genes that we studied consisted of eight well-characterized phytotoxins of the P. syringae species complex, including coronatine, phaseolotoxin, tabtoxin, mangotoxin, syringolin, syringomycin, syringopeptin, and auxin [79]. To determine if these pathways were present in each genome, we performed a tblastn search (E value < 1 −5 ; percent identity > 0.80) using known proteins that are involved in the synthesis of each phytotoxin against each P. syringae genome assembly. Representative query sequences that are involved in the biosynthesis of each phytotoxin were obtained from GenBank, using strain PtoDC3000 for coronatine, PsyBR2R for tabtoxin, PsyB728A for syringomycin, and PsyUMAF0158 for phaseolotoxin, mangotoxin, syringolin, syringopeptin, and auxin. If significant hits were found in a given genome for more than half the of the biosynthesis genes of a phytotoxin, it was considered present, and if not, the phytotoxin was considered absent. These presence/absence data were ultimately used to study the distribution of phytotoxins across the P. syringae species complex.
Finally, we also identified the complete collection of known virulence factors in each genome using the virulence factor database (VFDB, version R3), a reference database of bacterial protein sequences that contains more than 1798 virulence factors from a total of 932 bacterial strains that represent 75 bacterial genera [66,144,145]. Specifically, we predicted virulence factors in each P. syringae genome by blasting the proteome of the genome against the entire VFDB (E value < 1 −5 ). A protein sequence was considered a virulence factor if a hit was found that had more than 70% identity with a sequence in the VFDB database.

Identification and analysis of evolutionarily significant genes
We classified any orthologous gene families that had one or more sites under positive selection as evolutionarily significant. To identify these ortholog families, we used the Fast Unconstrained Bayesian Approximation (FUBAR) pipeline to measure the ratio of non-synonymous substitution rates to synonymous substitution rates (Ka/Ks) at each site in each ortholog family [91]. The FUBAR pipeline was chosen because in implements a Markov Chain Monte Carlo (MCMC) sampler for inferring sites under positive selection, which makes it more efficient for inferring sites under positive selection in large alignments than other methods and allows us to account for the effects of recombination on signatures of selection [146]. For this analysis, we used the output of the GARD recombination analysis to partition ortholog families into non-recombinant fragments. We then analyzed both the partitioned and un-partitioned datasets using FUBAR with 10 MCMC chains, where the length of each chain was equal to 5,000,000, the burn-in was equal to 2,500,000, the Dirichlet Prior parameter was set to 0.1, and 1000 samples were drawn from each chain. Evolutionarily significant genes were extracted from each genome if they were part of an orthologous family that had one or more sites under positive selection in the partitioned analysis.

Detection of genetic recombination
We searched for signatures of homologous recombination within the P. syringae species complex using GARD [92], CONSEL [93], GENECONV [94], and PHI-PACK [95] in all 17,807 ortholog families that were present in at least five strains. Using only ortholog families that are distributed across a larger collection of strains prevents us from failing to detect recombination in a broad array of families simply because we lack power. First, to generate input alignments for the recombination software, we independently aligned the nucleotide sequences for all ortholog families using translatorX [147], then heuristically removed sequences with a high frequency of gaps using the heuristic algorithm option (t = 50) in MaxAlign [148]. For GARD, we analyzed the codon alignment of each family using default parameters, then parsed significant recombination breakpoints in the GARD results file. For CONSEL, we first constructed a protein tree and corresponding core genome tree for all strains in each ortholog family using FastTree [138]. CONSEL was then used with default settings to calculate and compare the per-site likelihood values for these two trees with the gamma option, and ortholog families that were significantly incongruent were identified as recombinant families. For GENE-CONV, we used a gscale parameter of 1 and otherwise default settings to detect significant signatures of recombination in each family based on the polymorphic sties in the multiple alignment. Lastly, for PHIPACK, we employed default settings to test for signatures of recombination based on the maximum chi-square (MaxChi2), the neighbor similarity score (NSS), and the pairwise homoplasy index (PHI) statistical frameworks [95]. The MaxChi2 method classifies ortholog families as recombining if a non-uniform distribution of sequence differences exists along the alignments. The NSS method classifies recombination when adjacent sites show significant incongruence compared to other sites. The PHI method computes an incompatibility score over a sliding window in the alignment using only parsimoniously informative sites, then calculates a p value for recombination in the alignment by column permutation [95]. In all tests, recombination was considered significant if the p value was less than 0.05 after correcting for multiple comparisons. Ortholog families with significant signatures of recombination in the GARD, CONSEL, GENECONV, and PHIPACK analyses were then combined to estimate recombination rates within the P. syringae species complex, after normalizing for the number of orthologs, the number of strains, and the branch lengths in each phylogroup. We also differentiated between intra-and inter-phylogroup recombination events for recombination events identified by GENECONV using their pairwise recombination rates.
In addition to assessing which gene families appear to be undergoing recombination within and between P. syringae phylogroups, we explored HGT between P. syringae and more distantly related species using a blastp search of all protein sequences in each P. syringae strain against the non-redundant NCBI GenBank database using an E value cutoff of 1 −5 , a percent identity cutoff of 70%, and a percent query coverage cutoff of 70%. The top three blast hits were then extracted for each protein and the results were parsed to retain only matches from non-P. syringae species. Any of these remaining hits were viewed as potential HGT events. Although this approach is unlikely to provide accurate measures of the extent of HGT in the P. syringae species complex, it provides critical information on common donor and/or recipient species that may be sharing a niche and DNA with P. syringae strains.
Estimating relative sequence divergence (Ka/Ks) For each P. syringae strain pair, we used the concatenated soft-core genome alignments to calculate the pairwise rates of non-synonymous (Ka) and synonymous (Ks) substitution using the SeqinR package in R [62]. Average Ka and Ks values were then calculated for all phylogroups and between strains of different phylogroups. For comparison, we also calculated the evolutionary rates of a number of different distinct species pairs, including A.  1, NC_003384.1, NC_003385.1). Here, we identified core genes that were shared by each strain pair using a pairwise protein blast with an E value threshold of 1 −5 , and sequence identity and query coverage cutoffs of 80%. We then aligned these core nucleotide sequences using TranslatorX and MUSCLE, and concatenated the alignments using a custom perl script. The Ka and Ks values for each of these species pairs were calculated using the SeqinR package in R, as was the case with the P. syringae strains.