Systematic analysis of REBASE identifies numerous Type I restriction-modification systems that contain duplicated, variable hsdS specificity genes that randomly switch methyltransferase specificity by recombination

N6-adenine DNA methyltransferases associated with some Type I and Type III restriction-modification (R-M) systems are able to randomly switch expression by variation in the length of locus-encoded simple sequence repeats (SSRs). SSR tract-length variation causes ON/OFF switching of methyltransferase expression, resulting in genome-wide methylation differences, and global changes in gene expression. These epigenetic regulatory systems are called phasevarions, phase-variable regulons, and are widespread in bacteria. A distinct switching system has also been described in Type I R-M systems, based on recombination-driven changes in hsdS genes, which dictate the DNA target site. In order to determine the prevalence of recombination-driven phasevarions, we generated a program called RecombinationRepeatSearch to interrogate REBASE and identify the presence and number of inverted repeats of hsdS downstream of Type I R-M loci. We report that 5.9% of Type I R-M systems have duplicated variable hsdS genes containing inverted repeats capable of phase-variation. We report the presence of these systems in the major pathogens Enterococcus faecalis and Listeria monocytogenes, which will have important implications for pathogenesis and vaccine development. These data suggest that in addition to SSR-driven phasevarions, many bacteria have independently evolved phase-variable Type I R-M systems via recombination between multiple, variable hsdS genes. Importance Many bacterial species contain DNA methyltransferases that have random on/off switching of expression. These systems called phasevarions (phase-variable regulons) control the expression of multiple genes by global methylation changes. In every previously characterised phasevarion, genes involved in pathobiology, antibiotic resistance, and potential vaccine candidates are randomly varied in their expression, commensurate with methyltransferase switching. A systematic study to determine the extent of phasevarions controlled by invertible Type I R-M systems has never before been performed. Understanding how bacteria regulate genes is key to the study of physiology, virulence, and vaccine development; therefore it is critical to identify and characterize phase-variable methyltransferases controlling phasevarions.


Introduction 58
Phase variation is the high frequency, random and reversible switching of gene expression (1). 59 Many host-adapted bacterial pathogens posses surface features such as iron acquisition systems (2,60 3), pili (4), adhesins (5, 6), and lipooligosaccharide (7, 8) that undergo phase-variable ON-OFF 61 switching of expression by variation in the length of locus encoded simple sequence repeats (SSRs) 62 (1). Variations in SSRs result in the encoded gene being in-frame and expressed (ON), or due to a 63 frameshift downstream of the SSR tract, out-of-frame and not expressed (OFF). Several bacterial 64 pathogens also contain well characterised cytoplasmic N 6 -adenine DNA methyltransferases, that are 65 part of restriction-modification (R-M) systems, that exhibit phase-variable expression. We recently 66 characterised the distribution of SSR tracts in Type III mod genes and Type I hsdS, hsdM, and hsdR 67 genes in the REBASE database of restriction-modification (R-M) systems, and demonstrated that 68 17.4% of all Type III mod genes (9), and 10% of all Type I R-M systems contain SSRs that are 69 capable of undergoing phase-variable expression. Phase variation of methyltransferase expression 70 leads to genome-wide methylation differences, which can result in differential regulation of 71 multiple genes in systems known as phasevarions (phase-variable regulon). Phasevarions controlled 72 by ON-OFF switching of Type III mod genes has been well-characterised in a number of host-73 adapted bacterial pathogens, such as Haemophilus influenzae (10, 11), Neisseria spp. (12), 74 Helicobacter pylori (13), Moraxella catarrhalis (14,15), and Kingella kingae (16) (reviewed in 75 (17)). Although we have recently demonstrated that almost 10% of Type I R-M systems contain 76 SSRs, and can potentially undergo phase variation, to-date phase-variable expression of Type I R-M 77 systems has only been demonstrated in two species: an hsdM gene switches ON-OFF via SSRs 78 changes in non-typeable Haemophilus influenzae (NTHi) (7,18), and an hsdS gene phase varies due 79 to SSRs alterations in Neisseria gonorrhoeae (19). The hsdS gene in N gonorrhoeae, encoding the 80 NgoAV Type I system, contains a G [n] SSR tract, with variation in the length of this tract resulting 81 in either a full length or a truncated HsdS protein being produced, rather than an ON-OFF switch 82 seen with the hsdM gene in NTHi and Type III mod genes. The full length and truncated HsdS 83 proteins produced from phase variation of the NgoAV system have differing methyltransferase 84 specificities (19). 85 86 Type I hsdS genes can also undergo phase-variation by recombination between inverted repeats 87 (IRs) encoded in multiple variable copies of hsdS genes encoded in the Type I R-M locus (20) and 88 reviewed in (21) ( Figure 1A). These systems have been named 'inverting' Type I loci, as they 89 phase-vary via 'inversions' between the IRs located in the multiple variable hsdS genes. The 90 generation of sequence variation by shuffling between multiple protein variants through inverted 91 that are 15bp, 85bp, and 33bp long, encoded within multiple variable hsdS genes. Therefore, setting 127 our minimum length criteria for an IR at 20bp means any IRs detected are above the length shown 128 previously to result in homologous recombination between variable hsdS genes. 129

130
We carried out our search for inverted repeats using a bespoke perl script (irepeat.upstream.pl), 131 which we have made available at https://github.com/GuoChengying-7824/type_I. This script was 132 also implemented as a simple, easy-to-use server called 'RecombinationRepeatSearch', which can 133 be found at https://sparks-lab.org/server/recombinationrepeatsearch/. This software allows a user to 134 input any gene or DNA sequence (e.g., an hsdS gene) and by providing the relevant upstream and 135 downstream DNA sequence (e.g., the hsdS gene plus 30kb upstream and downstream as a single 136 sequence), the software is able to locate regions containing inverted repeats (see Figure 2). 137

138
Our analysis showed that of the 3683 hsdS genes containing at least one IR, many hsdS genes had 139 more than one downstream IR, and so were counted twice (for an hsdS gene with two downstream 140 inverted repeats), three times (for an hsdS gene with three downstream inverted repeats), and so on. 141 Therefore, in order to determine the number of individual hsdS genes with at least one downstream 142 shuffle between four different HsdS proteins (26). These findings serve as a 'positive control' for 179 our search methodology, in that it is able to identify systems previously shown to contain IRs and to 180 be phase-variable by ad-hoc searches. 181 182 Our search confirms the presence of inverting Type I R-M systems with downstream IRs identified 183 previously. For example, we show that 7 out of 15 strains of P. gingivalis with an annotated 184 genome in REBASE contain hsdS genes with IRs located within 30kb, and 2 out of 7 strains of T. 185 forsythia contain annotated hsdS genes where IRs are present within 30kb (27). Our analysis of 186 these regions confirmed the IRs to be present in a second, variable hsdS gene that is part of the 187 same Type I R-M locus, and which we class as an inverting, i.e., a phase-variable Type I locus. 188 Using these systems as an example, and based on previous work with the SpnIII system in S.

Three major veterinary pathogens contain Type I R-M systems containing duplicated variable 198 hsdS loci 199
Many species contained a high prevalence strains with hsdS genes with downstream IRs, and with 200 these IRs located within a separate, variable hsdS genes that were part of the same Type I locus 201 containing the hsdS gene under study. For example, we identified Type I R-M systems with 202 multiple hsdS genes in two major veterinary pathogens, in addition to the one identified in S. suis 203 ( Figure  haemolytica showed that these systems also contain a gene encoding a recombinase/integrase, and 210 additional genes encoding proteins unknown function ( Figure 3A). In addition, our survey 211 demonstrated that 24 out of 42 S. suis strains analysed contain an inverting Type I system, 212 confirming our earlier observation that the Type I system in this species is not present in all strains, 213 but conserved within a virulent lineage that causes zoonotic infections (26). In all three of these 214 veterinary pathogens, two IRs are present in a second distinct hsdS gene (hsdS′) immediately 215 downstream of the hsdS understudy, and part of the same Type I R-M locus (Figure 1). 216 Examination of the location of each pair of IRs present in these two hsdS genes demonstrated they 217 occurupstream of the 5′-TRD , and between the 5′-TRD and 3′-TRD ( Figure 1, Figure 3). The 218 presence of multiple IRs that are in a second variable hsdS gene (hsdS′) immediately downstream of 219 the hsdS gene under study is highly indicative that these hsdS genes undergo inversions, i.e., they 220 are phase-variable. 221

222
We cloned and over-expressed two hsdS alleles, alleles A and B, of the Type I inverting system that 223 we found in S. suis (26) in order to solve the methyltransferase specificity of the Type I 224 methyltransferases containing these HsdS proteins. We have used this approach extensively with 225 Type III mod genes in order to solve specificity (5, 9), with the same site observed using the native 226 protein using genomic DNA from the actual species and the over-expressed protein in E. coli (26). 227 We only expressed HsdS alleles A and B as we do not observe any strains of S. suis with annotated 228 genomes where either allele C or allele D ( Figure 3B) is present in the hsdS expressed locus 229 immediately downstream of the hsdM (26). This approach demonstrated that allele A methylates the 230 sequence CC m6 AN (8) CTT, and allele B methylates the sequence CC m6 AN (6) DNH (D = A, G, or T; H 231 = A, C, or T; N = any nucleotide). This is consistent with allele A and allele B sharing the same 5′-232 TRD (giving the same half recognition sequence of CCA), but a different 3′-TRD (giving different 233 half recognition sequences of CTT, and DNH, respectively) ( Figure 3B). Solving the specificity of 234 the two most common alleles found in the expressed hsdS locus of this phase-variable system (26)  235 provides valuable information required to fully characterise the gene expression differences that 236 result from the phase-variation of this system. 237

R-M system that appears to be associated with virulence 240
Our analysis shows that an inverting Type I R shows that strains containing this system cluster in specific clades. This data suggests that selection 247 and expansion of strains containing this system is occurring, with a possible association between 248 this system and with strains that persist in fish and chickens (37). Analysis of the phenotypes 249 regulated by this system may have an impact on vaccine and pathogenesis studies of this important 250 human and veterinary pathogen. 251 252

The nosocomial, antibiotic-resistant pathogen Enterococcus faecalis contains a highly diverse 253 phase variable Type I R-M locus that is widely distributed. 254
We identify a Type I R-M system containing multiple variable hsdS loci containing IRs present in 255 Enterococcus faecalis, a multidrug-resistant, nosocomial pathogen of major medical importance. 256 This system has been previously noted to occur in a single strain of E. faecalis (21) This is the first time, to our knowledge, that a systematic study has been carried out to identify Type 271 I R-M systems that contain inverted repeats that are capable of mediating phase-variable expression, 272 and thereby potentially control phasevarions. A previous study demonstrated that 273 integrases/recombinases with high homology to the integrase present in the SpnD39III locus (20)  274 were widespread in the bacterial domain (21). In order to carry out our systematic analysis, we 275 designed software to specifically search for inverted repeats in DNA (code available at 276 https://github.com/GuoChengying-7824/type_I), and applied strict selection criteria so that we only 277 identified inverted DNA repeats that are longer than those that have previously been shown to result 278 in homologous recombination between variable hsdS genes (20). We limited the distance away 279 from the hsdS locus understudy (30kb) in order to only identify distinct 'inverting' Type I R-M 280 systems. We have made this software available as a user-friendly server 281 (RecombinationRepeatSearch; https://sparks-lab.org/server/recombinationrepeatsearch/), which 282 allows the user to search any DNA sequence for inverted repeat regions. 283 284 By limiting our selection criteria (100% IR identity; minimum IR length of 20bp; 30kb window 285 upstream and downstream each hsdS), we have likely missed some Type I loci that are 'inverting'; 286 for example, we will miss any IRs that are <20bp, and we would not detect any hsdS containing IRs 287 that are over 30kb away. However, we would argue that hsdS genes located over 30kb away from 288 each other would not comprise a single 'inverting' Type I hsd locus, and that the recombination of 289 these separate hsdS genes may not control phasevarions. We also identified a small number of large 290 (>200bp) IRs present within 30kb of annotated hsdS genes, but a manual examination of these 291 systems revealed that the IRs are not present in a second hsdS gene. 292

293
Our systematic analysis of REBASE identified Type I loci containing multiple hsdS genes where 294 we detect IRs in a range of commensal organisms such as Bacteroides fragilis and multiple 295 Ruminococcus species, in environmental bacterial species such as Leuconostoc mesenteroides, and 296 in a number of Lactobacillus species that are important to the biotechnology and food production 297 imdustries (Supplementary Data 3). This reflects our previous studies where we observed simple 298 sequence repeats that mediate phase-variation in multiple Type I (38) and Type III 299 methyltransferase genes (9) present in a variety of commensal and environmental organisms. One 300 obvious reason for generating diversity in methyltransferase specificity is that it will increase 301 resistance to bacteriophage. However, in every case where a methyltransferase has been 302 demonstrated to phase-vary, it has also been shown to comprise a phasevarion; therefore in addition 303 to improving survival when exposed to bacteriophage, phase-variable methyltransferases are also 304 likely to increase the phenotypic diversity present in a bacterial population, providing bacteria that 305 encode them an extra contingency strategy to deal with changing environmental conditions. It will 306 be interesting to determine how such plasticity of gene expression would be advantageous in a 307 changing environment that cannot be dealt with via conventional "sense and respond" gene 308 regulation strategies (1) containing simple sequence repeats (9), and a Type I system containing variable hsdS loci where 322 IRs are present (this study; Figure 3A). We predict that this inverting Type I system switches 323 between four separate hsdS genes, and therefore results in four different methyltransferase 324 specificities. This means that there are a total of sixteen different combinations of methyltransferase 325 activity potentially present in a population of A. pleuropneumoniae. Therefore it is critical to 326 determine the genes and proteins that are part of the phasevarions in these species, although this 327 will not be a simple task due the breadth and diversity of the variable methyltransferases present in 328 these organisms. 329

330
In summary, we identify that 5.9% of Type I R-M systems contain duplicated variable hsdS genes 331 containing inverted repeats, are likely to phase vary, and consequently comprise a phasevarion. A 332 broad range of bacterial species encode these systems. Our previous work showed that 2% of Type I 333 hsdM and 7.9% of Type I hsdS genes contain SSRs (38). Together with our findings in this study, 334 this means that 15.8% of all Type I systems are capable of phase-variable expression. In addition, 335 previous studies have shown that 17.4% of Type III methyltransferases contain SSRs (9) and 336 therefore capable of phase-varying. That approximately the same percentage of two independent 337 DNA methyltransferase systems have evolved the ability to phase-vary in expression demonstrates 338 that generating variation via switching of methyltransferase expression is a widespread strategy 339 used by bacteria, and that this method of increasing diversity has evolved independently multiple 340 times. The study of phasevarions is not only key to vaccine development against pathogenic 341 bacteria that contain them, but necessary to understand gene expression and regulation in the 342 bacterial domain. mapping to a 500 bp reverse fragment, we further scanned the fragment length between 500 and 359 1500 bp. This process is implemented by a perl script (irepeat.upstream.pl) located at 360 https://github.com/GuoChengying-7824/type_I. We also established this software as a server called 361  four-way switch occurring in Streptococcus suis. S. suis contains a Type I locus containing 560 duplicated variable hsdS loci containing inverted repeats (SSU1271-SSU1274 in S. suis strain 561 P1/7). As illustrated in Figure 1, each hsdS gene is made up of separate 5 (red and white) and 3 562 (blue and green) TRDs. Inverted repeats are present before the 5 TRD (grey) and between the 5 563 and 3 TRDs (yellow). Each TRD recognises a different 3bp DNA sequence, giving rise to 4 564 separate HsdS proteins that are predicted to methylate four different DNA sequences dependent on 565 the TRDs present. We have solved the specifity of allele A (5TRD-1 [red] + 3TRD-1 [blue]) and 566 allele B (5TRD-1 [red] + 3TRD-2 [green]) . 5TRD-1 (red) recognises CCA, 3TRD-1 (blue) 567 recognises CTT, 3TRD-2 (green) recognises DNH. D = A, G, or T; N = any nucleotide; H = A, C, 568 or T. XXX = the recognition motif is undetermined. 569