Genomic characterization and phylogenetic analysis of Salmonella enterica serovar Javiana

Salmonella enterica serovar Javiana is the fourth most reported serovar of laboratory-confirmed human Salmonella infections in the U.S. and in Tennessee (TN). Although Salmonella ser. Javiana is a common cause of human infection, the majority of cases are sporadic in nature rather than outbreak-associated. To better understand Salmonella ser. Javiana microbial population structure in TN, we completed a phylogenetic analysis of 111 Salmonella ser. Javiana clinical isolates from TN collected from Jan. 2017 to Oct. 2018. We identified mobile genetic elements and genes known to confer antibiotic resistance present in the isolates, and performed a pan-genome-wide association study (pan-GWAS) to compare gene content between clades identified in this study. The population structure of TN Salmonella ser. Javiana clinical isolates consisted of three genetic clades: TN clade I (n = 54), TN clade II (n = 4), and TN clade III (n = 48). Using a 5, 10, and 25 hqSNP distance threshold for cluster identification, nine, 12, and 10 potential epidemiologically-relevant clusters were identified, respectively. The majority of genes that were found to be over-represented in specific clades were located in mobile genetic element (MGE) regions, including genes encoding integrases and phage structures (91.5%). Additionally, a large portion of the over-represented genes from TN clade II (44.9%) were located on an 87.5 kb plasmid containing genes encoding a toxin/antitoxin system (ccdAB). Additionally, we completed phylogenetic analyses of global Salmonella ser. Javiana datasets to gain a broader insight into the population structure of this serovar. We found that the global phylogeny consisted of three major clades (one of which all of the TN isolates belonged to) and two cgMLST eBurstGroups (ceBGs) and that the branch length between the two Salmonella ser. Javiana ceBGs (1,423 allelic differences) was comparable to those from other serovars that have been reported as polyphyletic (929–2,850 allelic differences). This study demonstrates the population structure of TN and global Salmonella ser. Javiana isolates, a clinically important Salmonella serovar and can provide guidance for phylogenetic cluster analyses for public health surveillance and response.


INTRODUCTION
were both significantly associated with increased IR in GA and TN. Shaw et al. (2016) found a statistically significant association between increased rates of Salmonella ser. Javiana infection and percentage of housing in rural areas in Georgia. Furthermore, they found a statistically significant association between increased Salmonella ser. Javiana infection rates and the presence of broiler feeding operations in MD (Shaw et al., 2016). Rural areas in GA and MD have a high density of broiler chicken operations and rural areas in TN have a high density of both broiler chicken and cattle operations (Shaw et al., 2016). The higher presence of these operations in rural areas could facilitate transmission of Salmonella ser. Javiana and other serovars in a variety of ways. As these operations are found in high density in these rural areas, they may employ a large number of residents in the area. These employees may be directly exposed to Salmonella occupationally and indirectly expose others in those communities via items like clothes and shoes (Shaw et al., 2016). The high density of these operations could also lead to environmental transmission via contamination of groundwater and surface water with untreated animal waste (Shaw et al., 2016). Shaw et al. (2016) did not find any statistically significant correlations between rurality or presence of broiler, cattle, dairy, or hog operations and IR ratios of Salmonella ser. Javiana in TN.
Though Salmonella ser. Javiana is a prevalent serovar in both the US and TN, little is known about the genomic population structure. The objectives of this study were to retrospectively study isolates of Salmonella ser. Javiana from patients in TN in 2017-2018 in order to identify epidemiologically-relevant trends, determine the genomic population structure, and describe the defining genomic features of TN clades. Additionally, we studied expanded datasets representing global diversity to determine the overall population structure of Salmonella ser. Javiana and to compare it to other Salmonella serovars.

SNP detection and phylogenetic analyses
A reference-free SNP detection analysis was initially performed with the TN isolates to determine the overall population structure free of reference choice bias. The assemblies were analyzed using KSNP3.1 (Gardner, Slezak & Hall, 2015) and the resulting core SNP matrix fasta file was then used to construct a phylogenetic tree in Mega7 (Kumar, Stecher & Tamura, 2016) with 100 bootstrap replicates (Felsenstein, 1985). The evolutionary distances were computed using the number of differences method (Nei & Kumar, 2000) and the evolutionary history was inferred using the Neighbor-Joining method (Saitou & Nei, 1987). The final tree was visualized and annotated using iTOL (Letunic & Bork, 2016). Isolates that weren't serovar Javiana (based on SeqSero results) and were very divergent based on the KSNP analysis were removed from the analysis. TN clades (defined as groups of three or more isolates that were all within 500 SNPs of each other) were identified. Next, reference-based hqSNP analyses were performed for each TN clade independently to determine high-resolution SNP differences between isolates. For the hqSNP analyses, an appropriate internal reference genome assembly (with adequate assembly quality and expected assembly size and G+C content) for each clade was identified (SRS2420927 for TN clade I, SRS2822480 for TN clade II, and SRS3010019 for TN clade  III). Additionally, the Salmonella enterica subsp. enterica serovar Javiana str. CFSAN001992 assembly (GCF_000341425.1) was downloaded from the NCBI RefSeq database for use as an external and closed reference genome. The hqSNP analyses were performed, both with the internal and external references and for the 111 isolates together and for each TN clade individually. For each analysis, high quality single nucleotide polymorphisms (hqSNPs) were identified using the CFSAN SNP Pipeline v1.0.1 (Davis et al., 2015). The resulting hqSNP matrix fasta files were then used to construct phylogenetic trees as described above. The matrices were sorted and clustered using the hclust function (gtools package) in R studio. For the individual clade analyses using internal references, clusters of two or more related isolates were identified at hqSNP distance threshold levels of 5, 10 and 25; isolation date and other epidemiological information were not considered.

Genome annotation and pan-GWAS
TN isolate assemblies were annotated using Prokka v1.14-dev (Seemann, 2014) and RASTtk (Brettin et al., 2015). A pan-genome-wide association study (pan-GWAS) was performed to compare gene content among the isolates using Roary v3.12.0 (with Prokka annotation output files, previously described, used as input files) (Page et al., 2015) and statistical analysis was done using Scoary v1.6.16 (with the following arguments: -c I B BH PW EPW P -p 0.05 -e 100) (Brynildsrud et al., 2016) to identify genes or markers associated with inclusion in each TN clade. Genes predominantly present or absent among isolates in each clade were identified. Genes or clusters of genes (loci) were considered significantly associated with a clade if they had a Bonferroni-corrected P-value <0.05. From the pan-GWAS, the positively and negatively associated genes were classified by having an Odds Ratio of >1 and <1, respectively. To further analyze the results in a genomic context, loci that were adjacent or located in close proximity were combined into a single region.

Global phylogenetic analysis
The Salmonella database (Alikhan et al., 2018) on EnteroBase (Zhou et al., 2019) was queried for isolates with ''human'' listed as source niche in the strain metadata and ''Javiana'' listed as serovar in the strain metadata or experimental data (SISTR1 (Yoshida et al., 2016) or SeqSero2 (Zhang et al., 2019)). Only strains with country (and state for strains from the United States) included in the metadata were retained. A cgMLST + HierCC minimal spanning tree (RapidNJ algorithm) was created with GrapeTree (Zhou et al., 2018) on EnteroBase using all of the resulting strains. Strains that were likely not Javiana (conflicting serovar designations and distant on the tree) were removed from the dataset, leaving 466 strains (Data S2). The dataset was further refined to select representative strain(s) for each HC100 level cluster (≤100 cgMLST allelic differences). For each HC100 cluster, a single strain from each country/state was retained. If there were more than one strain from a country/state, the representative strain was chosen based on assembly quality (N50; and if N50 values were identical or similar, coverage and number of contigs were also considered). At least one TN strain representing each HC100 cluster (if available) was chosen to be included in the final dataset of genomes representing global diversity of of Salmonella ser. Javiana clinical isolates. The final dataset consisted of 162 strains: 29 TN isolates (11 from TN clade I, 3 from TN clade II, 10 from TN clade III, and the five isolates that didn't fall into the main clades), 43 strains from other states in the US, and 90 isolates from other countries (Data S2). Collection date years ranged from 2002 to 2020. Assemblies for the non-TN strains were downloaded from Enterobase. All assemblies were analyzed using KSNP3.1 (Gardner, Slezak & Hall, 2015) and the resulting core SNP matrix fasta file was then used to construct a phylogenetic tree in Mega7 (Kumar, Stecher & Tamura, 2016) with 100 bootstrap replicates (Felsenstein, 1985). The evolutionary distances were computed using the number of differences method (Nei & Kumar, 2000) and the evolutionary history was inferred using the Neighbor-Joining method (Saitou & Nei, 1987). The final tree was visualized and annotated using iTOL (Letunic & Bork, 2016).

Comparison to polyphyletic serovars
Eight strain datasets were created: one for serovar Javiana and one each for other serovars that have been reported as polyphyletic (Derby, Kentucky, Mississippi, Montevideo, Newport, Saintpaul, and Senftenberg). The Salmonella database (Alikhan et al., 2018) on EnteroBase (Zhou et al., 2019) was queried for isolates with the specified serovar listed as serovar in the experimental data (SISTR1 (Yoshida et al., 2016) or SeqSero2 (Zhang et al., 2019). For each, a cgMLST + HierCC minimal spanning tree (RapidNJ algorithm) was created. Strains that were likely not the serovar of interest (conflicting serovar designations and distant on the tree) were removed from the datasets. The final datasets (Data S3) were used to create cgMLST + HierCC minimal spanning trees (improved minimal spanning tree algorithm, MSTree V2) using GrapeTree (Zhou et al., 2018) on EnteroBase. The branch lengths between the cgMLST eBurstGroups (ceBGs) of the other polyphyletic serovars were used for comparison to the branch length between the two Javiana ceBGs. ceBG designations associated with each serovar were retrieved from the EnteroBase documentation (EnteroBase Team, 2018).

Figure 1 Unrooted neighbor-joining KSNP tree of Tennessee clinical Salmonella ser. Javiana isolates.
Tree was constructed based on core SNPs determined by KSNP3 (Gardner, Slezak & Hall, 2015). The optimal tree with the sum of branch length of 5,916.1 is shown. TN clades I (highlighted in purple), II (green), and III (blue) are indicated. The percentage of replicate trees in which the associated taxa clustered together in the bootstrap test are indicated below branches. The tree is drawn to scale, with branch lengths (above branches) representing the number of base differences at core SNP positions per isolate (SNP distance). The analysis involved 112 isolates and 5,870 total SNP positions.

Tennessee Salmonella ser. Javiana population structure
This analysis included a diverse set of 111 Salmonella ser. Javiana clinical isolates from TN (Data S1). On average, the assembled genomes from this study had 74.6x coverage, contained 71.7 contigs (34.8 contigs ≥ 1 kb), were 46.45 kb in length, and had 52.11% GC content (Data S1). Based on the KSNP analysis, the Salmonella ser. Javiana isolates from TN displayed a population structure with three main clades ( Fig. 1). TN clade I contained 54 isolates, TN clade II contained four, TN clade III contained 48 isolates, and five isolates didn't fall into the main clades ( Fig. 1 Fig. 2; Data S1). The most common PFGE patterns were JGGX01.0012 (n = 18), JGGX01.0065 (n = 17), and JGGX01.0072 (n = 10). TN clade I isolates represented 28 different PFGE patterns, with the most common being JGGX01.0012 (n = 18). TN clade II isolates represented three different PFGE patterns. TN clade III isolates represented

High-quality single nucleotide polymorphism (hqSNP) analysis for cluster detection
Using a 5, 10, and 25 hqSNP distance threshold for cluster identification, 9, 12, and 10 potential clusters were identified, respectively (Data S4). The number of clusters decreases from thresholds of 10 to 25, as with the larger threshold, some of the clusters contain multiple subclusters identified at the lower threshold. Within TN clade I, five clusters were identified at each hqSNP distance thresholds of 5 and 10, and four clusters at 25 (Data S4).
Only one cluster was identified at each of the distance thresholds for TN clade II (Data S4). Within TN clade III, three clusters were identified at a distance threshold of 5 hqSNPs, six clusters at 10 hqSNPs, and five clusters at 25 hqSNPs (Data S4).
To evaluate the effects of reference choice and isolate diversity, we ran our hqSNP analyses on the entire TN dataset (111 isolates) and on the three clades individually and with both internal draft assemblies and an external closed assembly as reference genomes. When all isolates were analyzed together using different reference genomes, the average hqSNP distance was lowest when using the internal reference from TN clade I and highest when using the external closed genome and the average percentage of reads mapped differed by up to 1.57% (Table 1 and Data S4). It should be noted that the closed external reference assembly we used (GCF_000341425.1) was most closely related to TN clade II in our original KSNP analysis (Fig. 1) and not representative of the overall population (at the time that this analysis was performed, there was only one closed Salmonella ser. Javiana genome available on NCBI RefSeq). For all three clades, regardless of the reference genome used, hqSNP distances were higher when they were analyzed independently, with the differences being only slightly higher for clades I and III. The average hqSNP distances for clades I and II were lower when using the internal references, but were slightly higher for TN clade III. For all three clades, the average percent reads mapped was higher when using the internal reference genome than the external reference genome, which is to be expected.

Epidemiological trends
The TN Salmonella ser. Javiana isolates were sourced from patients with an average age of 40.0 (range of 1 month to 90 years; standard deviation of 29.1). The highest incidence was in patients ≤4 (6.64 per 100,000) and ≥85 (4.16) years-old. Previous studies have reported that Salmonella ser. Javiana infections are more prevalent in infants and young children than for other serovars (Jones et al., 2008;Shaw et al., 2016;Srikantiah et al., 2004). Overall, 51.4% of isolates were from male patients (incidence of 1.73) and 46.8% from female (incidence of 1.50; Table 2). TN clades I and II contained more isolates from male patients than female, 57.4% and 75%, respectively. Conversely, 52.1% of TN clade III isolates were from female patients.
Most of the TN Salmonella ser. Javiana isolates were taken from stool samples (84.7%, n = 94), followed by urine (6.3%, n = 7) and blood (5.4%, n = 6) ( Table 2). The portion of  isolates taken from blood samples exceeds the 2.8% invasive disease outcome (defined by isolation from blood, cerebrospinal fluid, bone or joint fluid, or another sterile site; does not include urine, wound, abscess cultures) reported for this serovar in FoodNet states (Jones et al., 2008). Of the urine isolates, most (n = 6) belonged to TN clade III and all were from females, with an average patient age of 55.1 (standard deviation of 26.2). All of the isolates recovered from blood samples belonged to TN clades I (n = 3) and III (n = 3) and most were from males (n = 4), with an average patient age of 71.8 (standard deviation of 15.2). The majority of isolates recovered from extraintestinal sites were collected from elderly patients, which may indicate a correlation between invasive infection and age.

Geographical and temporal distribution
A geographical distribution can be seen for the TN isolates, with 65.8% (n = 73) isolated in counties in the western region of TN (Table 2; Data S1). In contrast, only 17.1% (n = 19) and 16.2% (n = 18) were isolated in counties in east and middle TN, respectively. Incorporating county population data, the west region had an IR of 4.69 clinical isolates per 100,000 population, while the east had an IR of 0.79 and the middle had an IR of 0.64, with an overall IR of 1.62 per 100,000 for the state (Fig. S2). Three counties had noticeably higher IR: Madison with 31.0, Crockett with 27.9, and Carroll with 25.0. As was the trend for all isolates, the majority of TN clade I and III isolates originated in the west region (59.3% and 83.3%, respectively; Table 2). However, a sizable amount of TN clade I isolates also originated in the east region (25.9%). Additionally, a temporal distribution was also clear, with 68.5% (n = 76) of isolates collected in July through September (Table 2; Data S1). This trend was also seen within the three TN clades.

Identification of mobile genetic elements (MGEs)
All of the TN isolates were examined for plasmids and 32 putative plasmids (19 unique) were identified in 30 isolates (Table 3). They ranged in size from 23 to 108 kb and included replicon types (a plasmid typing scheme based on replication control regions (Carattoli et al., 2014)) IncFIB, IncFII, IncI1, IncN3, and IncX4. In the representative subset of TN isolates (n = 31), Phaster predicted an average of 8.90 [range of 5-13] prophage regions per isolate (2.29 intact, 4.94 incomplete, and 1.68 questionable) (Data S5). The TN clade I isolates had the highest number of predicted prophage regions (average of 9.64 and range of 8-11), followed by TN clade II (average of 8.25 and range of 8-9) and TN clade III (average of 8.09 and range of 5-10). Most identified virulence genes were present in all of the TN isolates analyzed (Data S6), including the three genes (cdtB, pltA, and pltB) encoding the subunits of the cytolethal distending toxin (CDT) (Miller et al., 2018). Only the TN clade II isolates contained the pefC and pefD genes, which were on plasmid-associated contigs, that are part of the pef (plasmid-encoded fimbriae) operon and associated with fimbrial adherence (Bäumler et al., 1996). Some of the isolates (two from TN clade I and three that were not part of a clade) contained genes from the saf (Salmonella atypical fimbria) operon on putative plasmidassociated contigs (Folkesson et al., 1999). One isolate (SRS2442415, no clade) contained 11 genes associated with the yersiniabactin iron uptake system on a plasmid-associated contig (Carniel, 2001).

Identification of antibiotic resistance genes
All 111 TN Salmonella ser. Javiana isolates analyzed in the present study contained the aac(6 )-Iaa gene, which has been previously reported to confer aminoglycoside resistance (Shaw et al., 1993). One isolate (SRS2783476; TN clade I) contained the aph(3 )-Ia and sul3 genes on a contig (contig 32) associated with a putative plasmid. The aph(3 )-Ia gene has been found to confer resistance to aminoglycosides (Shaw et al., 1993) and sul3 gene has been shown to confer resistance to sulfonamides/sulfones through antibiotic target replacement and has been shown to be associated with resistance to sulfamethoxazole (Perreten & Boerlin, 2003). Additionally, one isolate (SRS2628542; no clade) contained the qnrB19 gene, which has been shown to confer resistance to fluoroquinolones through physical protection of the antibiotic target (Correia et al., 2017). The qnr gene is plasmid-associated and has been linked with reduced susceptibility to ciprofloxacin (Casas et al., 2016;Crump et al., 2015;Redgrave et al., 2014). It is unclear if this gene in SRS2628542 is on a plasmid, as it is located on a small (703 bp) contig and no plasmids were predicted in this isolate. However, the gene showed 100% identity over only 72.6% of the alignment with the reference gene (accession EU432277), so it may not be functionally capable of conferring the quinolone resistance phenotype.

Clade-enriched genes
A pan-genome analysis of the TN isolates revealed that the core genome (genes contained in ≥99% of isolates) consisted of 4,022 genes and the accessory genome consisted of 3,920 genes ( Table 1). The difference in gene content between the identified clades were mostly found in mobile genetic elements (91.5%).
TN clade I isolates had a core genome of 4,106 genes and an accessory genome of 2,513 genes (Table 1). This clade has a much larger accessory genome than the other two clades identified in this study. This is likely due to the large variety of mobile genetic elements (e.g., plasmids, prophages) present in isolates from this clade, which is also reflected in the much larger number of PFGE patterns (n = 28) displayed by these isolates as compared to other clades (n = 3 for TN clade II and n = 11 for TN clade III). The pan-GWAS revealed 153 loci (genes or groups of genes) to be significantly associated with inclusion in this clade (54 positively associated and 99 negatively associated) (Table 1 and Data S7). These loci consisted of 338 total genes (94 positively associated and 243 negatively associated) ( Table 1). The positively associated loci are found in 29 distinct genomic regions and the majority of the genes over-represented in TN clade I (73 genes) were located in eight prophage regions (Table 4).
Twelve of the overrepresented genes in TN clade I correspond to pathogenicity-related protein families (as identified by PathogenFinder): DNA damage-inducible protein I, phenolic acid decarboxylase subunit D, small toxic polypeptide LdrD, PTS system fructosespecific EIIB'BC component, PTS system mannose/fructose/sorbose family IID component, prepilin peptidase dependent protein A precursor, phage DNA binding protein, and other hypothetical proteins. Overexpression of ldrD, which is part of a chromosomal toxinantitoxin gene system (Alix & Blanc-Potard, 2009;Kawano et al., 2002), is toxic to the cell and leads to growth inhibition and rapid cell killing (Fozo et al., 2008;Kawano et al., 2002). ldrD homologs have not been found in plasmids, but may be involved in cellular response to environmental stress (Kawano et al., 2002). Prepilin peptidase dependent Table 4 Summary of Scoary results. Summary of genome regions positively associated with inclusion in TN clades I, II, and III, as determined by Scoary (Brynildsrud et al., 2016). Loci that are adjacent or located in close proximity were combined into a single region. Full results are in Data S7. Notes. protein A precursor is known to be plasmid-associated (Raspoet et al., 2019;Zhang, Lory & Donnenberg, 1994) and is involved in processing of the major pilus subunit (Filloux, Michel & Bally, 1998;Zhang, Lory & Donnenberg, 1994). Other genes of interest enriched in isolates from TN clade I include those encoding autotransporter adhesin SadA, which is associated with pathogenesis, and virulence protein MsgA, which is involved with survival within macrophage (Skyberg, Logue & Nolan, 2006). Mezal, Stefanova & Khan (2013) identified msgA virulence gene in 7 (out of 50) Salmonella ser. Javiana isolates, all of which were clinical. TN clade II isolates had a core genome of 4,290 genes and an accessory genome of 322 genes ( Table 1). The pan-GWAS revealed 22 loci to be significantly associated with inclusion in this clade (16 positively associated and 6 negatively associated) (Table 1 and Data S7). These loci consisted of 221 total genes (207 positively associated and 14 negatively associated) ( Table 1). The positively associated loci are found in 17 distinct genomic regions and many of the genes over-represented in TN clade II (93 genes) were located in the 87.5 kb IncFII type plasmid identified in SRS2922480 (Table 4 and Fig. S3). Among the over-represented genes contained on this plasmid are ccdAB, which are part of a toxin/antitoxin system. This system contributes to stability of the plasmid through post-segregational killing (killing new cells that do not inherit a plasmid copy during cell division) (Van Melderen, 2001). Additionally, 72 over-represented genes were located in four predicted prophage regions and 33 in three other putative MGE regions (indicated by gene annotations, clustering of genes, and/or close proximity to predicted prophage regions; Table 4). One of the overrepresented genes, alpha-xylosidase, corresponds to a pathogenicity-related protein family (as identified by PathogenFinder). VFDB identified pefC and pefD, fimbrial adherence determinants, on the plasmid (contig 13); sinH, a nonfrimbrial adherence determinant; and pipB, a TTSS-2 translocated effector. TN clade III isolates had a core genome of 4,115 genes and an accessory genome of 889 genes (Table 1). The pan-GWAS revealed 155 loci to be significantly associated with inclusion in this clade (101 positively associated and 54 negatively associated) (Table 1 and Data S7). These loci contained 332 total genes (238 positively associated and 94 negatively associated) ( Table 1). The positively associated loci are found in 29 distinct genomic regions and the majority of the genes over-represented in TN clade III were located in four predicted prophage regions (88 genes) or nine other putative MGE regions (134 genes) ( Table 4). Four of the overrepresented genes correspond to pathogenicity-related protein families (as identified by PathogenFinder: arginine/lysine/ornithine decarboxylase) and other hypothetical proteins.

Global population structure
The KSNP analysis of the diverse set of global clinical Salmonella ser. Javiana strains revealed three major clades ( Fig. 3; Fig. S1). Major clade I contained 107 strains, including TN isolates (from TN clades I, II, and III and the five isolates that didn't fall into the main clades). Major clade II contained 23 strains and major clade III contained 31 strains. Strains from major clades I and II belong to the 590 cgMLST eBurstGroup (ceBG) and strains Tree was constructed based on core SNPs determined by KSNP3 (Gardner, Slezak & Hall, 2015). The optimal tree with the sum of branch length of 31,777.6 is shown. The percentage of replicate trees in which the associated taxa clustered together in the bootstrap test that are ≤0.8 are represented by branch color (maximum as green, midpoint as yellow, and minimum as red). The tree is drawn to scale, with branch lengths (above branches) representing the number of base differences at core SNP positions per isolate (SNP distance). The analysis involved 161 isolates and 30,657 total SNP positions. The three major clades are labeled. HC900 (ceBG) clusters are indicated (590 is not shaded and 204 is shaded in gray). TN isolates belonging to TN clades I, II, and III from our original analysis (Fig. 1) are highlighted in purple, green, and blue, respectively. A standard tree with additional metadata can be found in the supplemental files (Fig.  S1).

DISCUSSION
The main objective of this study was to use WGS data to retrospectively examine the population structure of Salmonella ser. Javiana, both from a local (state of TN) and global perspective. The phylogenetic analysis of the 111 Salmonella ser. Javiana isolates from TN revealed a population structure with three main clades, with the majority of isolates found in TN clades I and III. Research published on the population structure of this serovar is limited. One comparable study, conducted by Mezal, Stefanova & Khan (2013) used PFGE to assess the relatedness of 50 Salmonella ser. Javiana isolates from food, environmental, and clinical sources. They found that the isolates represented 34 distinct PFGE patterns and grouped into five clusters of two or more isolates; the 30 clinical isolates represented 23 distinct PFGE patterns (compared to the 111 TN clinical isolates in the present study representing 47 distinct PFGE patterns) and spanned all five clusters. The diversity of PFGE patterns suggested that differences in genome content between Salmonella ser.
Javiana isolates are common. In this study, we found that differences in gene content between the TN clades were mostly attributed to mobile genetic elements (e.g., prophage regions and plasmids), with TN clade I exhibiting the highest level of accessory genome diversity.
The phylogenetic analysis of the diverse set of global clinical Salmonella ser. Javiana strains revealed three major clades ( Fig. 3; Fig. S1). Major clade I contained most of the strains, including all of the TN isolates. This indicates that the population of this serovar in TN represents only a portion of the global genomic diversity. Strains from major clades I and II belong to the 590 cgMLST eBurstGroup (ceBG) and strains from major clade III belong to the 204 ceBG. ceBGs are equivalent to eBurstGroups (eBGs; in legacy 7-gene MLST), which have been shown to correspond to serovar designations (EnteroBase Team, 2018;Zhou et al., 2019). Typically, monophyletic serovar isolates will belong to a single eBG, while polyphyletic serovar isolates will belong to multiple eBGs (Achtman et al., 2012;Alikhan et al., 2018;Banerji et al., 2020).The observation that the global clinical Salmonella ser. Javiana isolates consisted of multiple major clusters and two ceBGs suggests that this serovar may be polyphyletic. Ashton et al. (2016) characterized serovars in lineage 3 of S. enterica subspecies I (which includes serovars Bredeney, Chester, Javiana, Montevideo, Oranienburg, and Poona) as polyphyletic and containing multiple eBGs. The branch length between the two ceBG clusters on the ser. Javiana tree (1,423 allelic differences) was comparable to the branch lengths between ceBGs on the other serovar trees (929-2,850 allelic differences) ( Fig. 4 and Fig. S2). Based on this comparison, Salmonella ser. Javiana may be considered a polyphyletic serovar, although this depends on the branch length cutoff that is applied.
As WGS is becoming more commonly used for public health applications (e.g., cluster detection and outbreak investigation), it is important to understand genomic population structure of surveilled disease-causing microorganisms, specifically at the serovar level for Salmonella. Genomic distance thresholds (based on hqSNP or allelic distances) are an important factor used for identifying potential disease clusters of public health importance, but other factors are typically considered, including isolation date, number of isolates, and epidemiological data. In the present study, we found that using different hqSNP distance thresholds for cluster identification resulted in different numbers of potential clusters and associated isolates (Data S4). The selected threshold for cluster detection should be empirically determined so that it is larger than typical inter-genomic distances between outbreak strains, but smaller than typical inter-genomic distances between outbreak and background (non-outbreak) isolates. Inter-genomic SNP distances among Salmonella outbreak strains are typically small (in the 2 to 12 SNP range), but in some cases can be quite large (up to 249 SNPs) and likely vary from serovar to serovar (Leekitcharoenphon et al., 2014). Isolates from zoonotic or prolonged (e.g., persistent contamination from production environments) outbreaks will likely have larger genomic distances and outbreaks with very large genomic distances are typically polyclonal events (Besser et al., 2019). For Salmonella, the CDC uses a working definition of ≥3 cases within a 60-day period with ≤10 cgMLST allele differences, with ∼2 cases that have ≤5 allele differences (Besser et al., 2019). Thresholds can have impacts on epidemiological investigations; if they are set too low, isolates belonging to the same outbreak event may be mistakenly excluded from the cluster or separated into different clusters and, if they are set too high, background isolates may be inadvertently included in the cluster, making epidemiological investigations difficult, particularly source attribution. In the present study, as the hqSNP distance threshold was increased, the number of included isolates also increased. Increases in numbers and/or sizes of potential clusters may impact the ability of public health departments to further investigate them due to resource constraints. Thresholds may also need to be adjusted based on the timeline of the suspected outbreak (lower for short-term and higher for prolonged outbreaks). As we move forward with using WGS for routine surveillance and cluster detection of this serovar, more clusters may be successfully detected and investigated. In turn, this will provide information on typical genomic distances that can be used to establish and evaluate an appropriate serovar-specific threshold for cluster detection.
Another important consideration when using hqSNP calling analyses for epidemiological cluster detection is whether polyphyletic serovars or those with genetically diverse clades should be analyzed together or if each clade should be analyzed independently. An additional consideration is the choice of reference genome. These choices can affect the percentage of reads mapped to the reference genome and, in turn, the results of the analysis (primarily, hqSNP distances). Better performance (i.e., higher read mapping) would be expected when using closed genomes as references for hqSNP calling. However, some research has shown that using closed vs draft genomes as references have limited impact on hqSNP calling phylogeny reconstruction (Jagadeesan et al., 2019;Portmann et al., 2018). In the current study, we still achieved a level of high-quality mapping (>95%, as recommended by Katz et al., 2017; Table 1) when using draft genomes as references. As these types of studies are performed, representatives from each clade should be selected for long-read sequencing to establish high-quality reference genomes that can be used to further evaluate hqSNP distances. Additionally, when analyzing the TN isolates together or each clade separately and with internal or external reference genomes, similar levels of performance were achieved. This is likely due to the lack of diversity in the core genome of the isolates and the fact that the majority of the gene content differences between isolates from each clade were attributed to MGEs (plasmids and prophage regions). Commonly used hqSNP pipelines filter out SNPs that are found in close proximity and/or mask phage regions (Katz et al., 2017;Strain et al., 2019) and only SNPs present in genomic regions shared between isolates and the reference genome are identified in the analysis. When viewed from a global context, the TN isolates were all part of a single major clade and a single ceBG, which may also explain the similar levels of performance seen with the different analysis strategies.
Notable geographical and temporal patterns were observed for the Salmonella ser. Javiana isolates from TN. The geographical distribution within the state (most isolates from patients in counties in the western region; Table 2 and Fig. S3) is consistent with other reported data (Centers for Disease Control and Prevention, 2013;Mukherjee et al., 2020). This geographical distribution may be associated with the higher percentage of fresh forested/scrub-shrub wetlands in these west TN counties (Huang et al., 2017). A similar geographical distribution has been described in GA, with Salmonella ser. Javiana cases occurring more frequently in the southern part of state (Clarkson et al., 2010). Despite this, Harris et al. (2018) were unable to isolate Salmonella ser. Javiana from storm runoff or irrigation ponds used by fresh produce growers in South Georgia even though this is a high incidence area. The temporal distribution (most isolates collected July-September; Table 2) is in accordance with the notable seasonality of this serovar reported elsewhere (Clarkson et al., 2010;Srikantiah et al., 2004).
All 111 TN Salmonella ser. Javiana isolates analyzed in the present study contained the aac(6 )-Iaa gene, which is associated with aminoglycoside resistance (Shaw et al., 1993). However, there is evidence that this gene is cryptic and no longer confers phenotypic aminoglycoside resistance (Leon et al., 2018;Salipante & Hall, 2003), which is consistent with the low prevalence of phenotypic resistance to amikacin and gentamicin (0.04%) in U.S. clinical Salmonella ser. Javiana isolates (Centers for Disease Control and Prevention, 2019b). This finding highlights the complexity of antimicrobial resistance. The three other antibiotic resistance genes identified in this study (aph(3 )-Ia, sul3, and qnrB19) were each only present in a single isolate. The low prevalence of these three genes is consistent with the low prevalence of phenotypic resistance to gentamicin and kanamycin (0.12%), sulfamethoxazole/sulfisoxazole (0.63%), trimethoprim-sulfamethoxazole (0.21%), and ciprofloxacin (0%) seen in U.S. clinical Salmonella ser. Javiana isolates (Centers for Disease Control and Prevention, 2019b). Additionally, the hypothesis that the qnrB19 gene may not be functional is further supported by the fact that phenotypic ciprofloxacin resistance has not been reported in Salmonella ser. Javiana clinical isolates (Centers for Disease Control and Prevention, 2019b). As aminoglycosides are not typically used to treat Salmonella infections, the presence of the aac(6 )-Iaa and aph(3 )-Ia genes is of little clinical significance. Overall, these data show a low prevalence of genes associated antibiotic resistance in Salmonella ser. Javiana from TN. However, antibiotic susceptibility testing would need to be performed on these isolates to confirm.

CONCLUSIONS
This study demonstrates the population structure of Salmonella ser. Javiana in Tennessee and globally. As this is a clinically important Salmonella serovar, understanding the phylogeny can provide guidance for phylogenetic analyses and cluster detection for public health surveillance and response. We show that Salmonella ser. Javiana clinical isolates from TN show geospatial and temporal distribution, with most isolates originating from the western part of the state and during the summer months (July, August, and September). Based on the results of the pan-GWAS, it is clear that MGEs (namely plasmids and prophage regions) in the genome account for most of the differences in gene content between the three main clades of this serovar. This is noteworthy, as clinically-relevant genes (like ABR-conferring or virulence-related genes) can be found in these regions and they could potentially be used for isolate characterization. Additionally, we found that when performing hqSNP analysis for epidemiological cluster detection with the TN isolates, it is not necessary to first divide the isolates into clades, as we found this only minimally increases the SNP differences between isolates; however the TN isolates all belonged to a single global major clade and single ceBG, so this may only be applicable to less diverse populations. Further research should include clinical Salmonella ser. Javiana isolates and associated metadata from other states to obtain a more complete representation of the population structure of and epidemiological information about Salmonella ser. Javiana in the United States and an analysis of disease severity and gene content could assist in the identification of genes that may be involved in virulence. Another research direction would be to include isolates from other sources (e.g., environmental, animal, food) in a phylogenetic analysis, which may expand our understanding of the population structure and including isolates with diverse isolation sources may provide insight into source attribution and potential recommendations to prevent morbidity.

ADDITIONAL INFORMATION AND DECLARATIONS Funding
This work was supported by Epidemiology and Laboratory Capacity for Infectious Diseases (ELC) grant 6 NU50CK000528-01 funded by the Centers for Disease Control and Prevention (CDC) and multistate project S1077 ''Enhancing Microbial Food Safety