Recurrent mobilization of ancestral and novel variants of the chromosomal di-hydrofolate reductase gene drives the emergence of clinical resistance to trimethoprim

Trimethoprim is a synthetic antibacterial agent that targets folate biosynthesis by competitively binding to the di-hydrofolate reductase enzyme (DHFR). Trimethoprim is often administered synergistically with sulfonamide, another chemotherapeutic agent targeting the di-hydro-pteroate synthase (DHPS) enzyme in the same pathway. Clinical resistance to both drugs is widespread and mediated by enzyme variants capable of performing their biological function without binding to these drugs. These mutant enzymes were assumed to have arisen after the discovery of these synthetic drugs, but recent work has shown that genes conferring resistance to sulfonamide were present in the bacterial pangenome millions of years ago. Here we apply phylogenetics and comparative genomics methods to study the largest family of mobile trimethoprim resistance genes (dfrA). We show that most of the dfrA genes identified to date map to two large clades that likely arose from independent mobilization events. In contrast to sulfonamide resistance (sul) genes, we find evidence of recurrent mobilization in dfrA genes. Phylogenetic evidence allows us to identify novel dfrA genes in the emerging pathogen Acinetobacter baumannii, and we confirm their resistance phenotype in vitro. We also identify a cluster of dfrA homologs in cryptic plasmid and phage genomes, but we show that these enzymes do not confer resistance to trimethoprim. Our methods also allow us to pinpoint the chromosomal origin of previously reported dfrA genes, and we show that many of these ancient chromosomal genes also confer resistance to trimethoprim. Our work reveals that trimethoprim resistance predated the clinical use of this chemotherapeutic agent, but that novel mutations have likely also arisen and become mobilized following its widespread use within and outside the clinic. This work hence confirms that resistance to novel drugs may already be present in the bacterial pangenome, and stresses the importance of rapid mobilization as a fundamental element in the emergence and global spread of resistance determinants. 3. Impact statement Antibiotic resistance is a pressing and global phenomenon. It is well-established that resistance to conventional antibiotics emerged millions of years ago in either antibiotic-producing bacteria or their competitors. Resistance to synthetic chemotherapeutic agents cannot be explained by this paradigm, since these drugs are not naturally produced. Resistance is hence assumed to have evolved rapidly following the clinical introduction of these drugs. Recently we showed that resistance to one such drug, sulfonamide, evolved not recently, but millions of years ago, suggesting that the diversity of bacterial genomes may well contain genes conferring resistance to drugs yet to be developed. Here we analyze the origin of resistance to trimethoprim, another chemotherapeutic agent developed in the 1960’s. Using phylogenetic methods, we identify new variants of the trimethoprim resistance genes that had not previously been reported, and we trace the chromosomal origins for a number of already known resistance variants. Our results show that resistance to trimethoprim is very diverse and has originated both from recent mutations and from preexisting ancient variants. These results stress the importance of gene mobilization mechanisms as the main drivers of the current antibiotic resistance phenomenon. 4. Data summary - The scripts used for data collection and analysis can be obtained at the GitHub ErillLab repository (https://github.com/ErillLab/). - The Bayesian phylogenetic tree can be visualitzed online on iTOL (https://itol.embl.de/tree/855674159451585133078) [1]. - The authors confirm all other supporting data has been provided within the article or through supplementary data files.


Abstract
Trimethoprim is a synthetic antibacterial agent that targets folate biosynthesis by competitively binding to the di-hydrofolate reductase enzyme (DHFR). Trimethoprim is often administered 30 synergistically with sulfonamide, another chemotherapeutic agent targeting the di-hydro-pteroate synthase (DHPS) enzyme in the same pathway. Clinical resistance to both drugs is widespread and 32 mediated by enzyme variants capable of performing their biological function without binding to these drugs. These mutant enzymes were assumed to have arisen after the discovery of these 34 synthetic drugs, but recent work has shown that genes conferring resistance to sulfonamide were present in the bacterial pangenome millions of years ago. Here we apply phylogenetics and 36 comparative genomics methods to study the largest family of mobile trimethoprim resistance genes (dfrA). We show that most of the dfrA genes identified to date map to two large clades that likely 38 arose from independent mobilization events. In contrast to sulfonamide resistance (sul) genes, we find evidence of recurrent mobilization in dfrA genes. Phylogenetic evidence allows us to identify 40 novel dfrA genes in the emerging pathogen Acinetobacter baumannii, and we confirm their resistance phenotype in vitro. We also identify a cluster of dfrA homologs in cryptic plasmid and 42 phage genomes, but we show that these enzymes do not confer resistance to trimethoprim. Our methods also allow us to pinpoint the chromosomal origin of previously reported dfrA genes, and we 44 show that many of these ancient chromosomal genes also confer resistance to trimethoprim. Our work reveals that trimethoprim resistance predated the clinical use of this chemotherapeutic agent, 46 but that novel mutations have likely also arisen and become mobilized following its widespread use within and outside the clinic. This work hence confirms that resistance to novel drugs may already be 48 present in the bacterial pangenome, and stresses the importance of rapid mobilization as a fundamental element in the emergence and global spread of resistance determinants. 50 52 Antibiotic resistance is a pressing and global phenomenon. It is well-established that resistance to conventional antibiotics emerged millions of years ago in either antibiotic-producing bacteria or 54 their competitors. Resistance to synthetic chemotherapeutic agents cannot be explained by this paradigm, since these drugs are not naturally produced. Resistance is hence assumed to have 56 evolved rapidly following the clinical introduction of these drugs. Recently we showed that resistance to one such drug, sulfonamide, evolved not recently, but millions of years ago, suggesting 58 that the diversity of bacterial genomes may well contain genes conferring resistance to drugs yet to be developed. Here we analyze the origin of resistance to trimethoprim, another chemotherapeutic 60 agent developed in the 1960's. Using phylogenetic methods, we identify new variants of the trimethoprim resistance genes that had not previously been reported, and we trace the 62 chromosomal origins for a number of already known resistance variants. Our results show that resistance to trimethoprim is very diverse and has originated both from recent mutations and from 64 preexisting ancient variants. These results stress the importance of gene mobilization mechanisms as the main drivers of the current antibiotic resistance phenomenon. 66

Data summary
-The scripts used for data collection and analysis can be obtained at the GitHub ErillLab repository 68 (https://github.com/ErillLab/).
-The authors confirm all other supporting data has been provided within the article or through 72 supplementary data files. 74

Introduction
Bacterial resistance to antibacterial agents remains an increasingly challenging and global problem in 76 modern healthcare [2,3]. Bacterial cells display a diverse array of mechanisms to cope with exposure to antibacterial compounds, including modification or overexpression of the antibacterial 78 target, efflux or reduction of antibacterial uptake, and the use of alternate pathways [4]. Constant exposure to non-lethal concentrations of antibacterial agents may lead to the selection of partial 80 resistance to antibiotics over relatively short timespans [5], and this evolution may be hastened by simultaneous exposure to multiple antibacterials [6]. However, the rapid proliferation of multidrug 82 resistant nosocomial pathogens in the last fifty years has not been driven by the independent evolution of resistance traits, but through the extensive dissemination of mobile genetic elements 84 carrying resistance genes [4,7]. It is widely accepted that most genes conferring resistance to antibiotics present in pathogenic bacteria were obtained by successive lateral gene transfer of 86 homologs that originally evolved in the microbes that produce the antibiotic or in their natural competitors [7,8]. The high plasticity of bacterial genomes, enabled by a large repertoire of mobile 88 genetic elements, and the availability of a large pool of ancient antibiotic resistance determinants hence set the stage for the rapid proliferation of antibiotic resistance, giving rise to multi-resistant 90 clinical strains just a few years after the commercial introduction of antibiotics [7]. binding to the di-hydrofolate reductase (DHFR) enzyme, and hence inhibiting the production of tetrahydrofolic acid [13,14]. Hence, the synergistic use of trimethoprim and sulfonamides was 104 expected to have a potent bactericidal effect by producing a serial blockade on the tetrahydrofolic acid pathway [12,15]. 106 Unlike antibiotics, chemotherapeutic agents are not produced by natural organisms, yet resistance 108 to these novel drugs arose quickly after their mass-production and it is today pervasive among clinical isolates [7]. In the case of sulfonamides and trimethoprim, which are usually administered in 110 tandem, resistance via chromosomal mutations to both chemotherapeutics was reported soon after their clinical introduction [13]. Chromosomal resistance to sulfonamides can occur through 112 mutations yielding increased production of PABA [16]  In spite of their frequent co-occurrence on mobile genetic elements, there are significant differences 130 between the mobile genes conferring resistance to sulfonamides (sul genes) and trimethoprim (dfr genes). To date, only three sul gene classes have been described in clinical isolates [21], whereas 132 more than 30 different dfr genes have been reported in clinically-relevant strains [22]. Trimethoprim resistance (dfr) genes have been further classified into two families (dfrA and dfrB). These two 134 families encode evolutionarily unrelated proteins of markedly different sizes. Sequence similarity indicates that dfrA genes are homologs of the chromosomally-encoded folA genes, whereas dfrB 136 genes are functional analogs of unknown origin [23, 24]. Most dfrA genes follow a standard naming convention consisting of dfrA followed by a numerical value indicating their discovery rank order. 138 However, several dfrA genes first identified in Gram-positive bacteria, and thought at the time to be unrelated to the Gram-negative dfrA genes, were originally named following an alphabetical 140 convention (dfrC-K). The disparity in genetic diversity among sulfonamide and trimethoprim mobile resistance determinants is suggestive of different evolutionary processes leading to the onset and 142 spread of resistance to these two chemotherapeutic agents [13]. It was suggested that resistance to sulfonamide had arisen in a few isolated organisms and rapidly spread upon the introduction of sulfa 144 drugs, whereas trimethoprim resistance had independently evolved, and had been subsequently mobilized, multiple times [13]. 146 Recently, we examined the origins of sul genes through comparative genomics, phylogenetic analysis 148 and in vitro assays [25]. We identified a well-defined mutational signature in sul-encoded proteins with respect to chromosomally encoded folP genes, and we used this conserved motif to map the 150 origins of sul genes in bacterial chromosomes. Our work revealed that the three groups of sul genes identified in clinical isolates originated in the Leptospiraceae and were transferred to the 152 Rhodobiaceae more than 500 million years ago. These two ancient resistant determinants were later independently mobilized, and rapidly disseminated following the commercial introduction of sulfa 154 drugs. By tracing the phylogenetic lineage of sul genes and demonstrating that these two bacterial families were resistant to sulfonamides long before their discovery and clinical use, our work 156 indicated that resistance to novel drugs could very well preexist, and be ready for mobilization, within the vast bacterial pangenome. Here we apply similar methods to examine the phylogenetic 158 relationships among dfrA and chromosomally-encoded folA genes. Our aim is to shed light on the evolutionary processes giving rise to mobile trimethoprim resistance genes. Our work illustrates 160 significant similarities and differences in the processes leading to the emergence and spread of trimethoprim and sulfonamide resistance determinants, reveals previously unreported clusters of 162 dfrA genes, and suggests that systematic analyses of the bacterial pangenome may of use in the design of novel antibacterials. 164

Sequence data collection
To identify homologs of DfrA proteins, we first compiled a panel of Dfr proteins reported in the 168 literature. Dfr proteins belong to two distinct families of unrelated sequences (DfrA and DfrB; Figure  S1). We mapped these sequences to PFAM models of DfrA (PF00186) and DfrB (PF06442) using 170 HMMER (hmmscan), and we discarded sequences mapping to the DfrB family, retaining only DfrA proteins for analysis (Table S1; Data 1). We further excluded redundant DfrA sequences (identity 172 >90%) using T-COFFEE seq_reformat command [26], and used the resulting non-redundant panel to identify DfrA homologs in protein records associated with NCBI GenBank/RefSeq sequences 174 corresponding to mobile genetic elements. These were defined as sequences containing the keywords "plasmid", "integron" or "transposon" in their title, belonging to complete genome 176 records [27,28]. BLASTP hits matching stringent e-value (<1e -20 ) and query coverage (>75%) thresholds were added to the panel if non-redundant (identity <90% with respect to existing panel 178 members), and their classification as mobile elements was validated by assessing that they contained at least one gene coding for an integrase, transposase or plasmid replication protein, as 180 determined by HMMER (hmmscan, e-value<1e -05 ) with reference PFAM models ( Factor (PSRF) be within 0.005 of 1. Based on summarization results, the burn-in was set at 20% of iterations. A consensus tree was generated with the half-compat option and visualized using the 216 GGTREE R library [39]. Ancestral state reconstruction of a single binary trait (mobile/chromosomal) was performed with BayesTraits version 3.0.2 [40]. Known states at tree tips were labeled, and 218 ancestral states were reconstructed using the Multistate and Maximum Likelihood (ML) settings. 220

DNA Techniques and In vitro Trimethoprim Susceptibility Assay
With the exception of the Ralstonia solanacearum GMI1000 (Marc Valls; Center for Research in 222 Agricultural Genomics) and E. coli K-12 (CGSC5073) folA genes, which were amplified from genomic DNA, dfrA and folA homologs were adapted to E. coli codon usage, synthesized (ATG:biosynthetics 224 GmbH) and then subcloned into a dephosphorylated pUA1108 vector [41] using NdeI and BamHI double digest (New England Biolabs) when possible. Genes with internal restriction sites for any of 226 these two enzymes were subcloned into the same vector using the HIFI DNA Assembly kit (New England Biolabs) following the manufacturer's protocol. Oligonucleotides used in this work are listed 228 in Table S3. All constructs were validated by sequencing (Macrogen) prior to their transformation into E. coli K-12 (CGSC5073). The minimal inhibitory concentration (MIC) for trimethoprim (Sigma-230 Aldrich) for strains of E. coli K-12 (CGSC5073) carrying different versions of pUA1108 encoding folA or dfrA homologs was determined following the Clinical and Laboratory Standards Institute (CLSI) 232 guidelines using microdilution tests in Mueller-Hinton broth (Merck) [42]. All MIC assays were performed in triplicate. Colonies were grown on Luria-Bertani (LB) agar for 18 h and then suspended 234 in sterile 0.9% NaCl solution to a McFarland 0.5 turbidity level. Suspensions were then diluted at 10 -2 in Mueller-Hinton (MH) broth, and 50 μl (5x10 4

Sequence analysis
To assess whether the identified chromosomal gene associated with a mobile dfrA gene is the 242 canonical folA gene for the genus, and not the product of a subsequent recombination of the mobile dfrA gene into the chromosome, we computed the pair-wise amino acid identity among the 244 products of all chromosomal folA homologs and then compared this distribution with the pair-wise amino acid identity of the putative origin versus the chromosomal folA homologs. We used a one-246 sided Mann-Whitney U test to determine if the two distributions were significantly different. To analyze the %GC content relationship between sul/dfrA genes and their host chromosomes, we used 248 pre-compiled panels of sequences for non-redundant Sul [25] and DfrA homologs to search protein records associated with NCBI GenBank/RefSeq sequences of mobile genetic elements. The %GC 250 content of the corresponding sul and dfrA genes, as well as the overall %GC content in both the mobile genetic element and the chromosome of the species harboring it, were computed with 252 custom Python scripts. To analyze whether mobile dfrA genes with %GC content close to their hosts' genomes are more similar to the hosts' folA genes than expected if dfrA-host associations were 254 arbitrary, we performed a permutation test comparing the mean pairwise alignment distance between DfrA genes and host-encoded FolA genes. We randomly permuted DfrA-host assignments 256 1,000 times and computed the corresponding p-value as the rank of the non-permuted mean pairwise alignment distance. The input files (Data 2) and scripts (Data 3) used for data collection and 258 analysis are available on public repositories. To explore the phylogenetic relationship of trimethoprim resistance determinants with their chromosomally-encoded folA counterparts, we used a non-redundant panel (<90% pairwise identity) 264 of protein sequences encoded by reported dfr genes (Table S1) to detect putative DHFR homologs in sequenced mobile elements. We mapped all these reference sequences to the PFAM models for 266 DfrA and DfrB (Table S4) [45]. We discarded sequences associated with the dfrB gene family, and retained for analysis non-redundant sequences mapping to the clades defined by dfrA genes 268 reported in the Proteobacteria and by dfrDEFGK genes associated with Gram-positive bacteria. For convenience, and in accordance with recent reports on dfr nomenclature [46], we hereinafter 270 designate these two groups (dfrA and dfrDEFGK) with the umbrella term dfrA. These reference mobile DfrA homologs were then used to detect putative chromosomally-encoded homologs in 272 complete bacterial genomes using TBLASTN. These sequences were combined with FolA homologs sampled from representative genomes of all bacterial orders with complete genome assemblies, and 274 of each bacterial family within the Proteobacteria (Table S5; Data 4). The resulting multiple sequence alignment was used to perform Bayesian phylogenetic analysis of FolA/DfrA sequences. 276 The phylogenetic tree shown in Figure 1 showcases the genetic diversity of dfrA genes, which 278 encompass sequences with pairwise sequence identities ranging from 99% to 20% (Table S6). The resulting phylogeny also reveals that the vast majority (70.7%) of known DfrA sequences map to two 280 well-supported (>0.8 posterior probability), distinct clades that likely arose from two different ancestors.   previously reported and putative dfrA genes, we examined the %GC content of dfrA (935) and sul (408) gene homologs identified in this analysis with respect to the genome %GC content of the host 302 species harboring these mobile resistance genes.

304
The results shown in Figure 2 (top panel) and Table S8 (Data 5) reveal that dfrA genes tend to align with host genome %GC content (Pearson ⍴=0.56), whereas sul genes display a two-tiered 306 distribution of %GC content that is essentially independent of host genome %GC (Pearson ⍴=0.14).
Available dfrA and sul sequences are dominated by variants of a known dfrA and sul genes that have 308 been isolated predominantly in a select group of bacterial hosts (Figure 2; bottom panel). To correct for this skew, we filtered dfrA sequences based on the amino acid identity (<90%) of their encoded 310 proteins. This filtering resulted in a significantly smaller number of non-redundant representative dfrA (63) and sul (4) genes (Table S9). The four representative sul genes correspond to one exemplar 312 of the sul1 and sul2 families, and two exemplars of the sul3 family. Among representative dfrA genes, fourteen map to the first clade (Clade 1) of Figure 1 and four to the second clade (Clade 2). 314 The correlation of dfrA genes with host genome %GC increases significantly (Pearson ⍴=0.78) when considering only non-redundant representative dfrA sequences. The fact that the %GC content of 316 representative dfrA sequences aligns well with their host genome %GC could suggest that %GC content in dfrA genes has been ameliorated to match the host's. Alternatively, it could indicate that 318 the mobile dfrA gene originated via mobilization of a chromosomal folA gene from a bacterium in the same clade as the current host. The later scenario posits that, besides %GC content similarity, 320 representative dfrA genes should also encode proteins with significant sequence similarity to their hosts' FolA protein.
We performed a permutation test to analyze whether representative dfrA gene 322 products show significant similarity with their hosts' FolA protein (Table S10). Our results indicate that this is the case (p<0.001), suggesting that most mobile dfrA genes are still associated with 324 species from the same clade they originated in.

326
The filtering of dfrA sequences based on amino acid identity generates three large clusters (represented by dfrA12, dfrA5 and dfrA1, and belonging to Clade 1 from Figure 1) containing more 328 than 100 genes with identity larger than 90%. The dfrA genes in these clusters show a distribution of %GC content that is essentially independent of the host genome %GC, as in the case of sul genes 330 ( Figure 2; top panel), and their products show no significant sequence similarity with the hosts' FolA (permutation test p>0.1). This indicates that the dfrA genes in these large clusters have spread 332 across distantly related bacterial clades, primarily through their association with sul-containing integron-based transposable elements that are widely disseminated among clinically relevant 334 bacteria [49,50]. This analysis also brings to the fore the presence of multiple dfrA cluster representatives, with widely divergent %GC, on narrow bands of host genome %GC content. These 336 bands correspond to E. coli (50.7% GC) and Klebsiella pneumoniae (57.4% GC) isolates, which are heavily oversampled in the dataset (Figure 2, bottom panel). The marked divergence in %GC content 338 (E. coli: 9.55% SD 6.15%, K. pneumoniae: 6.93% SD 4.63%) and amino acid sequence identity (E. coli: 38.57% SD 15.24%, K. pneumoniae: 38.39% SD 14.02%; Table S11; Figure S2) among these 340 representative dfrA genes suggests that they originated via mobilization from a diverse set of chromosomal backgrounds. 342 The fact that the %GC of dfrA genes aligns with their host genome's %GC and that DfrA proteins 344 display higher sequence identity when aligned to their host genome FolA proteins than to other FolA proteins strongly supports the notion that dfrA genes have been mobilized multiple times within 346 different bacterial clades [13]. In a few instances, typified by the large dfrA clusters illustrated in Figure 2, dfrA genes have been captured by highly efficient mobile elements and dispersed widely 348 across unrelated groups of bacteria [49]. These mobile elements often harbor sul genes, which also display a host-independent %GC distribution. Many of the dfrA genes identified here are associated 350 with clinical isolates. The divergent %GC content and amino acid identity of these dfrA genes indicates that pathogenic bacteria have obtained dfrA genes on multiple occasions and from 352 different sources, highlighting the ability of mobilized resistance determinants to reach clinicallyrelevant pathogens [17, 51]. 354

identified through phylogenetic methods
The phylogenetic tree in Figure 1 includes reported DfrA proteins and their putative homologs, as 358 well as FolA proteins identified via TBLASTN as putative DfrA homologs or sampled uniformly across bacterial clades. The inferred phylogeny also reveals several groups of previously unreported mobile 360 DHFR homologs that form well-supported clades in association with chromosomal FolA proteins. These FolA proteins could hence constitute the chromosomal origins of the associated mobile DHFR 362 homologs, and provide insights into the emergence and dissemination of trimethoprim resistance genes. To determine whether these mobile DHFR homologs did confer resistance to trimethoprim, 364 we cloned a subset of the genes encoding them and we performed broth microdilution assays to determine the minimal inhibitory concentration (MIC) of trimethoprim. The results, shown in Table  366 1, reveal that most of the mobile DHFR homologs identified here do confer significant resistance to trimethoprim. The sole exception is AQW32254. Close inspection revealed that this DHFR homolog is 368 encoded by a megaplasmid (1.2 Mb) from a Ralstonia isolate, and that this is the only DHFR homolog present in its complete genome. We hence determined that this DHFR homolog was a bona fide FolA 370 protein and not a mobile DHFR homolog, and we did not consider as putative dfrA genes all other DHFR homologs identified in megaplasmids (> 400 kbp). 372 Two remaining clades of novel mobile DHFR homologs from clinically-relevant bacteria associated 374 with chromosomal FolA proteins were shown to confer resistance to trimethoprim on E. coli (Table  1). To investigate whether the sequence determinants conferring resistance had originated in the 376 associated chromosomal background, we cloned the most closely related chromosomal folA gene as well as gene encoding an additional DHFR homolog from the same genus and performed broth 378 microdilution assays to determine the MIC of trimethoprim. We also performed ancestral state reconstruction of the molecule encoding the DHFR homologs (chromosomal/mobile trait) (Table  380 S12). The combined results of Table 1 and Figure 1 reveal different patterns of trimethoprim resistance acquisition. KMV08986 is a DHFR homolog harbored by a conjugative plasmid from an 382 Acinetobacter baumannii clinical isolate. Its most closely related chromosomally-encoded DHFR homolog is the FolA protein of Flavobacterium branchiophilum, which confers resistance to 384 trimethoprim (Table 1).

386
To ascertain whether this chromosomally-encoded DHFR homolog was encoded by a bona fide folA gene, instead of a mobile dfrA gene that integrated into the chromosome, we compared the genus-388 wide distribution of pairwise alignment distances between FolA proteins to the pairwise distance of the identified homolog versus all other FolA proteins in the genus. The F. branchiophilum FolA 390 sequence is significantly different from other Flavobacterium FolA sequences (Mann Whitney U p<0.05; Table S13), raising the possibility that this chromosomal gene could be in fact a recombined 392 mobile dfrA gene. However, phylogenetic analysis with a broader representation of Flavobacterium sequences ( Figure S3) confirms the well-supported branching of F. branchiophilum FolA with other 394 Flavobacterium species FolA proteins, and comparative genomics analysis reveals that the genetic neighborhood of the chromosomal folA gene is conserved in the Flavobacterium genus ( Figure S4). 396 Furthermore, the FolA protein of a prototypical genus member, Flavobacterium faecale, also confers resistance to trimethoprim on E. coli (Table 1). These results indicate that the FolA protein was likely 398 resistant to trimethoprim in the ancestor of extant Flavobacterium species, which diverged more than 50 million years ago [52]. The branching of KMV08986 in the reconstructed phylogeny and the 400 associated ancestral state reconstruction indicates that this mobile DHFR homolog likely originated via mobilization of a chromosomal folA gene within the Bacteroidetes phylum. The encoded FolA 402 protein was likely resistant to trimethoprim, but the exact donor species remains to be elucidated.

404
In contrast to Flavobacterium proteins, the Acinetobacter schindleri FolA protein does not confer resistance to trimethoprim, in agreement with previous reports of A. schindleri susceptibility to 406 trimethoprim [53], and with the well-established susceptibility of A. baumannii FolA to trimethoprim [54,55]. The A. schindleri FolA protein is closely related to three mobile DHFR homologs conferring 408 resistance to trimethoprim and harbored by A. baumannii (WP_031380727, WP_034702334) and Acinetobacter defluvii (WP_004729503) clinical and environmental isolates. These mobile DHFR 410 homologs branch within a well-supported clade of chromosomal Acinetobacter FolA proteins, as supported by ancestral state reconstruction ( Figure 1; Table S12). The trimethoprim susceptibility of 412 Acinetobacter chromosomal folA genes and the phylogenetic placement of these DHFR homologs hence indicates that the observed resistance to trimethoprim was acquired immediately prior to or 414 after mobilization from an Acinetobacter chromosomal background. This is supported by the observation that these mobile DHFR homologs confer different levels of resistance to trimethoprim 416 ( Table 1), and that the largest MIC correlates with the location of the DHFR homolog on a plasmid harboring multiple antibiotic resistance determinants ( Figure S5). This suggests that these DHFR 418 homologs have acquired mutations conferring heightened resistance to trimethoprim in parallel to their broader dissemination on multi-resistant mobile elements. Based on their validated 420 trimethoprim resistance phenotype and their level of sequence identity versus previously reported DfrA proteins (<95%; DHPS, yielding a non-productive sulfonamide-bound di-hydropterin. For sulfonamides, therefore, it is the PABA-to-sulfonamide ratio that limits the production of di-hydropteroate from a limited pool 440 of pteridine di-phosphate, and this cannot be altered via overexpression of DHPS [61]. Conversely, trimethoprim overexpression can provide partial resistance to trimethoprim, and mutations 442 enhancing DHFR expression have been reported to be the first to appear in directed evolution experiments [58]. The ability to obtain partial resistance through overexpression may hence provide 444 a stepping stone for the gradual accumulation and refinement of mutations conferring substantial resistance with little fitness cost, and hence facilitate the development of trimethoprim resistance 446 [57,58].

Trimethoprim resistance in chromosomally-encoded folA genes
Besides uncovering novel dfrA genes, the phylogenetic analysis in Figure 1 also identifies several 450 chromosomal folA genes associated with previously reported dfrA genes. Two of these chromosomal folA genes have already been reported in the literature as putative origins of dfrA genes, and their 452 identification here provides some degree of validation for the phylogenetic approach implemented in this work. The putative chromosomal origin for Staphylococcus aureus Tn4003 S1-DHFR has been 454 identified as the chromosomally-encoded dfrC gene (Staphylococcus epidermidis) and reported to be susceptible to trimethoprim [62]. The Enterococcus faecalis dfrE gene, identical to the 456 chromosomally-encoded folA gene of E. faecalis, was reported to confer moderate resistance to trimethoprim in E. coli, but only when cloned in a multicopy plasmid, which could easily result in 458 overexpression-mediated resistance [61,63].

460
To ascertain whether the chromosomal folA genes found here to be associated with other known dfrA genes (dfrA20, dfrA26 and the dfrDGK cluster) confer resistance to trimethoprim, we performed 462 broth microdilution assays to determine the MIC of trimethoprim on these chromosomally-encoded FolA proteins and on another FolA protein from the same genus. In all cases, both related FolA 464 proteins confer resistance to trimethoprim ( Table 1). The most closely associated chromosomal folA genes are not significantly different from other folA genes in their respective genera (Mann Whitney 466 U p>0.05; Table S13), as reflected also by substantial conservation of the folA genomic neighborhood ( Figure S4). Together, these data indicate that resistance to trimethoprim was present on the 468 ancestor of these genera. The dfrA26 gene was identified on a K. pneumoniae clinical isolate and its most closely associated chromosomal folA gene is a member of the Alcanivorax genus. The 470 branching pattern of dfrA26 within this clade and ancestral state reconstruction results (Figure 1; Table S12) suggest that it arose via mobilization of a chromosomal folA gene from the Alcalinivorax 472 genus. The dfrDGK genes have been reported, respectively, in E. faecalis, Enterococcus faecium and S. aureus, and ancestral state reconstruction results indicate that these mobile dfrA genes originated 474 through mobilization of a member of closely-related the Bacillus genus, members of which have been reported to be naturally resistant to trimethoprim [64]. In both cases, therefore, the 476 phylogenetic evidence and the similarity in %GC content among chromosomal and mobile genes ( Figure 1, Table S15) point towards a mobilization event that has to date remained circumscribed to 478 related genera. Conversely, the dfrA20 gene was identified on Pasteurella multocida isolate, yet the chromosomal folA gene most closely associated to it is encoded by Fluviicola taffensis, a 480 Bacteroidetes, hence suggesting a much more distant mobilization event (Figure 1, Table S15). In all three cases, however, we find evidence that preexisting resistant folA genes can be readily mobilized 482 from both close (e.g. dfrDGK) or distant (e.g. dfrA20) species.
The resistance to trimethoprim reported here for the chromosomal folA genes of two different genera of Bacteroidetes, two distinct Alcanivorax species and a Bacillus strain underscores the deep 486 ancestry of chromosomal mutations yielding resistance to trimethoprim. The folA genes of Flavobacterium and Fluviicola were shown here to confer resistance to trimethoprim. These two 488 genera are thought to have diverged more than 500 milion years and define major lineages within the Flavobacteriales, suggesting that resistance to trimethoprim emerged in an ancestor of this 490 bacterial order. It is worth noting that several of the chromosomal folA genes shown here to be associated with mobile DHFR homologs (Alcanivorax, Flavobacterium and Fluviicola) appear to be 492 resistant at the genus level and correspond to genera of aquatic bacteria. This parallels our recent identification of soil and subterranean water bacteria as the likely originators of clinical sulfonamide 494 resistance genes [25], and suggests that the intensive use of trimethoprim/sulfamethoxazole in agriculture, aquaculture and animal husbandry in the last fifty years may have acted as a trigger for 496 the selection and mobilization of preexisting folA and folP genes conferring resistance to trimethoprim and sulfonamides. Conversely, trimethoprim susceptible chromosomal folA genes 498 found here to be associated with dfrA genes belong to clinically-relevant genera (Staphylococcus and Acinetobacter) that may have been under more direct trimethoprim pressure. This suggests that 500 among relatively isolated bacterial populations frequent exposure to high levels of trimethoprim may trigger the mobilization of spontaneous folA mutants, whereas longer term exposure to sub-502 lethal doses of trimethoprim in ecological rich habitats might instead rely predominantly on the mobilization of naturally resistant folA genes ( Figure 3). 504

506
Our phylogenetic analysis also identifies a well-defined clade of Enterobacteriaceae cryptic plasmids derived from Salmonella phage SSU5 and encoding DHFR homologs [65][66][67][68]. Genes coding for DHFR 508 homologs occur frequently in many bacteriophage families, often in tandem with thymidylate synthase genes [69], but their functional role has not been fully elucidated. We performed broth 510 microdilution assays to determine the MIC of trimethoprim of E. coli O104:H4 DHFR (AFS59762). This phage-encoded DHFR does not confer resistance to trimethoprim ( Table 1). The high sequence 512 identity and neighborhood conservation among the DHFR enzymes encoded by these Enterobacteriaceae cryptic plasmids and phages (Table S16, Figure S4) therefore suggests that all 514 these DHFR enzymes are susceptible to trimethoprim.

516
Bacteriophages can transfer substantial amounts of genetic material via generalized transduction, and their potential as reservoirs of antibiotic resistance determinants has gained increased attention 518 with the advent of metagenomics [70,71]. However, recent studies have shown that many potential resistant determinants encoded by phages do not confer resistance against their putative targets. 520 Furthermore, only a small proportion of complete phage genomes contain putative antibiotic resistance genes [72]. Enzymes participating in the folate biosynthesis pathway, however, are 522 relatively frequent in phage genomes. These include homologs of the folP gene encoding DHPS, of the thyX gene encoding flavin-dependent thymidylate synthase [73-75] and, predominantly, 524 homologs of the folA gene encoding DHFR often found in tandem with the thyA gene encoding type 1 thymidylate synthase [69]. 526 Early work on Enterobacteria phage T4 showed that the phage-encoded thyA and folA gene products 528 are functional and participate also in the phage baseplate structure [76], and thyX has been shown to be functional in a number of phages [73][74][75]. It has been proposed that these genes help 530 bacteriophages overcome shortages in the deoxynucleotide pool during replication, but their potential in conferring resistance to sulfonamides or trimethoprim remains largely unexplored. The 532 detection here of DHFR homologs in Enterobacteriaceae cryptic plasmids and phages, and the subsequent assessment of their trimethoprim susceptibility, reinforces the notion that these genes 534 have been functionally co-opted by phages principally for deoxynucleotide synthesis. Nonetheless, these genes may still confer partial trimethoprim resistance as a byproduct of folA overexpression, 536 as recently reported for Stenotrophomonas maltophilia phage DLP4 [77].

Conclusions
Recent work has shown that resistance to sulfonamide, a synthetic chemotherapeutic agent, can be 540 present in the bacterial pangenome well before their discovery. Here we have used a combination of in silico and in vitro techniques to identify novel trimethoprim resistance genes and to identify 542 chromosomal folA genes that are strongly associated with novel and previously reported dfrA genes. We find that most of the chromosomal folA genes associated with mobile dfrA genes confer 544 resistance to trimethoprim, but we detect cases of novel mutations being rapidly mobilized. Our work hence shows that the observations from sulfonamide resistance extend to trimethoprim, with 546 generalized chromosomal resistance determinants predating the origin of several genera and several clusters of resistance genes disseminated broadly among clinical isolates. Moreover, this work also 548 reveals that, unlike sulfonamides, resistance to trimethoprim is relatively easy to generate and frequently associated with species from the same clade it originated in. The identification of ancient 550 resistance determinants for two synthetic chemotherapeutic agents strongly suggests that resistance to any novel drugs is likely to be already present in the bacterial pangenome. Systematic 552 screening of existing natural variants could therefore provide the means to preemptively identify derivatives presenting widely-distributed natural resistance determinants and, conversely, to 554 engineer derivatives that circumvent most, if not all, natural resistant variants.     Figure   808 1, Figure 2 and  Figure S1 -Multiple sequence alignment including all reported mobile DHFR proteins. DfrB protein sequences are highlighted in yellow.

Figures and tables
822 Figure S2 -Pairwise percent amino acid identity and %GC difference between aligned representative DfrA protein sequences harbored by mobile genetic elements of E. coli and K. pneumoniae.  836   Table S1 -List of dfrA genes reported in the literature.  840 Table S4 - Table S4 -HMMER mapping between the product of dfrA genes and PFAM models corresponding to DfrA and DfrB.

Supplementary Tables
842 Table S5 -List of accession numbers for chromosomal and mobile sequences containing DfrA/FolA-encoding genes used in this work. The species (chromosomal) or dfrA index (mobile), the nucleotide and protein accession numbers are provided.
844 Table S6 -Protein identity values between a set of non-redundant (<90% identity) previously reported DHFR proteins resulting from pairwise alignments using the Needleman-Wunsch algorithm.
846 Table S7 -%GC content for previously reported dfrA genes and their host genome.

852
Table S10 -Pairwise percent amino acid identity between aligned representative mobile DfrA protein sequences and the sequence of the chromosomal FolA protein for the species harboring the mobile genetic element encoding the dfrA gene.

854
Table S11 -Pairwise percent amino acid identity and %GC difference between aligned representative DfrA protein sequences harbored by mobile genetic elements of E. coli and K. pneumoniae.
Table S12 -Maximum-likelihood estimates for ancestral states in the phylogenetic tree of FolA/DfrA homologs reported in Figure 1.
858 Table S13 -Pair-wise percent identities among all species in the genera under study and with the putative chromosomal origin for different dfrA genes. The results (p-value) of the Mann-Whitney U test comparing intra-genus pair-wise identities 860 versus the chromosomal origin are reported for each dfrA gene. 864 Table S15 -Comparison of %GC content between dfrA genes and putative chromosomal folA origins. The average coding %GC content of donor genomes is also provided.

866
Table S16 -Protein sequence identity values from pairwise alignments among cryptic plasmid-and phage-encoded DHFR homologs.