Modularity and diversity of target selectors in Tn7 transposons

Summary To spread, transposons must integrate into target sites without disruption of essential genes while avoiding host defense systems. Tn7-like transposons employ multiple mechanisms for target-site selection, including protein-guided targeting and, in CRISPR-associated transposons (CASTs), RNA-guided targeting. Combining phylogenomic and structural analyses, we conducted a broad survey of target selectors, revealing diverse mechanisms used by Tn7 to recognize target sites, including previously uncharacterized target-selector proteins found in newly discovered transposable elements (TEs). We experimentally characterized a CAST I-D system and a Tn6022-like transposon that uses TnsF, which contains an inactivated tyrosine recombinase domain, to target the comM gene. Additionally, we identified a non-Tn7 transposon, Tsy, encoding a homolog of TnsF with an active tyrosine recombinase domain, which we show also inserts into comM. Our findings show that Tn7 transposons employ modular architecture and co-opt target selectors from various sources to optimize target selection and drive transposon spread.

In brief Faure et al. use phylogenomic and structural approaches to systematically survey transposon target selectors, revealing a modular architecture that promotes transposon spread. Through this work, they identified a new target selector, TnsF, and a previously uncharacterized transposable element, Tsy.

INTRODUCTION
Transposable elements (TEs) are DNA sequences that can move within and across genomes, employ diverse molecular mechanisms to achieve mobility, and exhibit a broad range of targeting specificities. 1 Where a TE integrates is critical for its survival, and various strategies have evolved to select target sites, both for homing and for jumping to a mobile genetic element (MGE). For example, members of the Tn7 group of prokaryotic DNA transposons recognize (1) a highly conserved sequence of an essential gene to guide integration to a safe locus for homing and (2) a particular DNA conformation (which is agnostic to sequence) to guide integration to mobile elements at replication forks. [2][3][4] These two modes of target recognition are carried out by two dedicated proteins, TnsD (sequence-specific) and TnsE (structure-specific) (Figure 1A). In addition to these two target selectors, Tn7 encodes the heterocomplex TnsA/TnsB, which recognizes the ends of the transposon (TnsB) and excises the transposon (TnsA and TnsB), and TnsC, the central hub component that coordinates transpososome assembly with target-site selection. 5 TnsC recognizes TnsD bound to the attachment site, recruits the transpososome, and directs its integration. 6
In addition to these canonical modes of target-site selection, several groups of Tn7-like transposons have co-opted CRISPR-Cas systems, enabling RNA-guided transposition. These CRISPR-associated transposons (CASTs) target MGEs using matching spacers encoded in the CRISPR array and Cas-effector components coupled with the small protein TniQ, a homolog of TnsD. [7][8][9][10][11][12] The CASTs target homing sites in two alternative modes, either by RNA-guided transposition or through TnsD, similarly to the canonical Tn7-like transposons. 12 The CASTs appear to have evolved as a result of the recruitment of CRISPR-Cas effector modules by Tn7-like transposons on multiple independent occasions. 7,13 Specifically, different groups of Tn7-like transposons acquired subtype I-B (at least twice, independently), subtype I-F, and subtype V-K effectors. In each of these cases, the CRISPR effector module acquired by the TE retained the ability to recognize and bind the target DNA but lost the capacity of typical CRISPR systems to cleave it. In the case of type I CRISPR effectors, the elimination of the cleavage activity results from the loss of the Cas3 helicase-nuclease, whereas in the case of subtype V-K, it is due to the mutation of the catalytic amino acids in the active site of the RuvC-like nuclease. 8,9
The existence of these distinct modes of target-site selection by Tn7-like transposons suggests a high degree of flexibility that maximizes their spread and highlights the utility of multiple, functionally orthogonal target-site selectors. Given this flexibility, comprehensive identification of target selectors is challenging.
We used a combination of phylogenomic and structural analyses to discover target selectors. Among these candidate target selectors, we identified and characterized three TE systems: a distinct CAST subtype I-D, a Tn7-like transposon that uses a protein we denoted TnsF as a target selector, and a previously unreported TE we name Tsy. Our results expand the understanding of target selection by Tn7-like transposons, reveal the structural features linked to RNA-guided or protein-guided modes of transposition in CAST systems, characterize the modular architecture of Tn7 target selectors, and discover a distinct target-selector partner co-opted from a previously undescribed non-Tn7 TE.

Figure 1. (A) ... TnsD is a sequence-specific target selector and binds an attachment site in the bacterial genome to recruit TnsC (orange) and the transposon for insertion. (B) Pipeline for discovery of novel target selectors. Sequence databases were mined for Tn7 component seeds and searched for genomic co-localization of these seeds. The genomic neighborhoods of the detected loci were annotated, with a focus on cas effectors and genes that appear to be operonized with tnsC and tniQ/tnsD. (C) Locus architectures of known systems and novel systems identified in this study. Mu (muA and muB) and IS21 (istA and istB) encode relatives of TnsB and TnsC. IS21 has not been reported to be associated with a target selector. Tn7 encodes various target selectors including TniQ/TnsD, TnsE, and Cas effectors (Cas12k, Cascade I-F, and Cascade I-B, which all partner with TniQ), the latter of which constitute CAST systems. We found a novel CAST system containing Cascade I-D and a novel target selector we named TnsF. We also found a TnsF-like target selector in a distinct non-Tn7 transposon.

RESULTS
TnsC phylogeny reveals the diversity of target selectors in Tn7-like transposons
Known target-selector proteins exhibit a wide range of diversity, but they all use TnsC to bridge target selection with transposition. We therefore used phylogenomic analysis of TnsC, which is the most prominently conserved protein among the Tn7-encoded proteins, as the framework to investigate the diversity of target selectors and search for new ones (Figures 1B and 1C). We selected 80,028 Tn7-like loci identified in publicly available prokaryotic datasets from NCBI (National Center for Biotechnology Information), JGI (Joint Genome Institute), and MG-RAST (Metagenomic Rapid Annotation using Subsystem Technology), which together included about 1.6 × 10^6 bacterial and archaeal genomes (STAR Methods), and from these extracted a representative set of TnsC for phylogenetic analysis.
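The co-localization idea behind this search can be illustrated with a minimal Python sketch. It is not the pipeline described in STAR Methods: the gene records, the seed list, and the 10-kb window are hypothetical values chosen only to show how a tnsC anchor and its neighboring seed genes might be grouped into a candidate locus.

```python
from collections import defaultdict

# Hypothetical seed names and window size, chosen for illustration only.
SEEDS = {"tnsB", "tnsC", "tniQ"}
WINDOW_BP = 10_000


def colocalized_loci(genes, seeds=SEEDS, window=WINDOW_BP):
    """Return (contig, tnsC_start, sorted seed names) for each tnsC gene whose
    neighborhood (+/- window bp) contains at least one other seed gene.

    genes: iterable of (contig, start, end, name) tuples.
    """
    by_contig = defaultdict(list)
    for contig, start, end, name in genes:
        by_contig[contig].append((start, end, name))

    loci = []
    for contig, records in by_contig.items():
        for start, end, name in records:
            if name != "tnsC":
                continue
            # Collect seed genes overlapping the window around this tnsC.
            nearby = {
                n
                for s, e, n in records
                if n in seeds and s <= end + window and e >= start - window
            }
            if len(nearby) > 1:  # tnsC itself plus at least one other seed
                loci.append((contig, start, sorted(nearby)))
    return loci


example_genes = [
    ("contig1", 1000, 2500, "tnsB"),
    ("contig1", 2600, 4200, "tnsC"),
    ("contig1", 4300, 5400, "tniQ"),
    ("contig1", 90000, 91000, "tnsB"),  # too far from tnsC to count
    ("contig2", 500, 2100, "tnsC"),     # isolated tnsC, not reported
]
print(colocalized_loci(example_genes))
```

Anchoring on tnsC mirrors the choice in the text to use TnsC as the framework; a real search would work from homology hits rather than gene names.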
To build the phylogenetic tree, 6,384 TnsC homologs were selected. The tree included two main clades, one consisting of MuB, the TnsC homolog from the transposable phage Mu (MuTn), and the other of TnsC from Tn7-like transposons (Figure 2). To explore the diversity of target selectors, we first mapped the known ones (tniQ/tnsD, tnsE, and cas genes from CAST systems) on the tree; tniQ/tnsD is ubiquitous in the Tn7 branch and is represented either by a single gene (2,905 loci) or as tandem genes (1,048 loci). These tandems consist of either two tniQ genes, or tniQ and tnsD, or two tnsD genes, which we collectively refer to as dual tniQ-tnsD. In contrast to the ubiquity of tniQ/tnsD, tnsE is more restricted in its spread and is present in transposons closely related to the canonical E. coli Tn7 and in the more distantly related group of Tn6022 transposons (379 loci total) (Figure 2). CAST systems are spread around the tree and generally grouped according to their subtypes. However, as noted previously, CAST I-B is represented in two distinct clades (1 and 2), suggesting independent capture by two distinct transposons. 12 We made similar observations for CAST I-F: Tn7017, a CAST I-F variant that harbors a dual tniQ-tnsD and uses TnsD for protein-mediated homing, 14 belongs to a branch distant from other CAST I-Fs, which use a dedicated spacer for RNA-guided transposition. 10 This branch consists of transposons encoding tnsC and dual tniQ-tnsD, but mostly lacking Cascade I-F, suggesting multiple gains or losses of Cascade I-F.

Identification of CAST I-D
To identify potential distinct CAST systems, we searched for cas genes encoding CRISPR-effector components (see STAR Methods) in the vicinity of tnsC and mapped the detected cas genes onto the tree (Figures 1 and 2). We identified 234 groups of loci harboring at least one of these cas genes. Manual examination of tnsC tree branches bearing cas genes showed that several of these genes are part of the cargo and are unlikely to be involved in transposition 15 or belong to already reported CAST systems. However, we identified one group of loci in the Tn7 clade in branches closely related to CAST I-B2 that encodes transposase components closely similar to those of I-B2 PmcCAST, with ~50% sequence identity between TnsABs and TnsCs and ~30% sequence identity between the dual TniQ-TnsDs (Figure S1A). These loci, however, encode Cascade I-D rather than Cascade I-B and thus comprise a distinct CAST variety.
To experimentally characterize CAST I-D, we chose a locus from the cyanobacterium Cyanothece sp. PCC 7425, CyCAST, which encodes a complete subtype I-D CRISPR-Cas system encompassing the adaptation module (Cas1, Cas2, and Cas4), the Cas6 processing nuclease, and the Cascade complex, along with TniQ and TnsD (Figure 3A). Unlike other known CASTs, the CRISPR-Cas system of CyCAST appears to be fully functional (that is, competent for both adaptation and interference), based on the conservation of catalytic residues in the HD-nuclease domain of Cas10d. This suggests a recent acquisition of a CRISPR system that is not yet fully domesticated by Tn7 (Figure S1B). Similar Cas10d proteins are also encoded in loci that lack Tn7 components in the vicinity (Figure S1B). Manual identification of the CyCAST boundaries indicated potential attachment sites in a tRNA gene, similar to the attachment site of CAST I-B2 systems (Figure S1A).
We expressed CyCAST heterologously in E. coli and tested for activity. Using previously established assays, 8,9,12 we determined that CyCAST exhibits a GTT protospacer-adjacent motif (PAM) preference, as observed in subtype I-D CRISPR-Cas systems 16 ( Figure S1C). We detected mainly unidirectional left-end (LE) and cargo right-end (RE) insertions within a 70 to 80-bp window downstream of the protospacer on a target plasmid (pTarget) ( Figure S1D).
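As a rough illustration of how the PAM preference and the insertion window relate to a target sequence, the following Python sketch scans a sequence for a protospacer preceded by GTT and reports the window 70 to 80 bp downstream. The function name, the 0-based coordinates, the placement of the PAM immediately 5' of the protospacer, and the choice to measure the window from the 3' end of the protospacer are all assumptions made for this example, not details taken from the assays.

```python
def insertion_windows(target, protospacer, pam="GTT", window=(70, 80)):
    """Return (protospacer_start, window_start, window_end) for every
    protospacer match that is immediately preceded by the PAM.

    Coordinates are 0-based. The window is measured from the 3' end of
    the protospacer, which is an assumption of this sketch.
    """
    hits = []
    pos = target.find(protospacer)
    while pos != -1:
        # The PAM must sit directly upstream of the protospacer.
        has_pam = pos >= len(pam) and target[pos - len(pam):pos] == pam
        if has_pam:
            end = pos + len(protospacer)
            hits.append((pos, end + window[0], end + window[1]))
        pos = target.find(protospacer, pos + 1)
    return hits


# Toy sequences; the protospacer is hypothetical.
with_pam = "AAAAAGTT" + "ACGTACGTAC" + "C" * 100
without_pam = "AAAAAGCA" + "ACGTACGTAC" + "C" * 100
print(insertion_windows(with_pam, "ACGTACGTAC"))     # one hit
print(insertion_windows(without_pam, "ACGTACGTAC"))  # no hit: PAM absent
```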
To clarify the roles of TniQ and TnsD in RNA-guided insertion, we deleted either TniQ, TnsD, or both and checked for activity. The full CyCAST system with both TniQ and TnsD showed RNA-guided transposition at an insertion frequency of 0.001% (Figure 3B). Deletion of TnsD boosted RNA-guided transposition about 2.5-fold, to 0.0025%, suggesting that TnsD partially inhibits this activity. Elimination of TniQ abolished RNA-guided insertion activity, but perhaps unexpectedly, in the absence of both TniQ and TnsD, a low level of RNA-guided transposition was detected (0.0004%). Thus, CAST I-D retains some basal RNA-guided transposition activity in the absence of a protein-target selector, in contrast to the previously characterized CAST systems, suggesting that TnsC may recognize Cascade at the insertion site (Figure S1E).

Figure 2. Rings around the tree show the presence of a particular gene or feature in the vicinity of tnsC within the genomic contig. From inner to outer ring: tnsA is shown in yellow; tnsAB fusion is shown in light blue; presence of tnsB operonized with the central tnsC (representative of the leaf) is shown in orange; presence of an additional distinct tnsC operonized with tnsB in the vicinity is shown in dark blue; tniQ/tnsD is shown in dark green and the presence of a second tniQ/tnsD (tniQ 2) in pink, with the size of the ring bar proportional to protein size in both cases; tnsE is shown in dark red; cas effectors and cas6 genes are shown in purple; the presence of a gene operonized with the central tnsC is shown in green. Various known transposons are annotated around the tree, including known CAST systems. Red boxes highlight areas of interest. The Mu clade corresponds to the left branch harboring a conserved gene operonized with MuB (homologous to tnsC). This gene is part of the Mu phage genome. By subtraction, the Tn7 clade corresponds to the remaining clade and is characterized by the presence of TniQ/TnsD.
We also tested CRISPR-independent transposition (homing) of CyCAST. To this end, we cloned into the pTarget plasmid a leucine tRNA gene from Cyanothece sp. PCC 7425 to serve as a homing site. Homing transposition was observed when TnsAB, TnsC, and TnsD were expressed in the absence of TniQ (0.008%); the presence of TniQ drastically diminished this activity, but transposition was still detectable (0.0006%) (Figure 3C). Homing transposition occurs around 30-33 nt downstream of the end of the tRNA homing site, resembling the insertion site of CyCAST in the genome of Cyanothece sp. PCC 7425 (Figure S1F). Thus, CyCAST exhibits dual modes of transposition that rely on different target selectors, namely, the small TniQ protein for RNA-guided transposition and the larger TnsD for protein-mediated homing. 12,14

Comparison of TniQ and TnsD reveals a modular architecture of target selectors
The presence of dual tniQ-tnsD genes in a variety of Tn7 loci motivated us to examine these proteins in greater detail to gain additional insight into their roles in target selection. Mechanistic studies of CAST systems have shown that TniQ, which is smaller than TnsD, partners with Cascade to mediate RNA-guided transposition, whereas TnsD mediates protein-guided transposition (similar to its role in E. coli Tn7). 12 We therefore first focused on TniQ and TnsD from CASTs and non-CAST Tn7-like transposons to determine how they function in these two capacities. We employed structural prediction using AlphaFold2 (AF2) 17-19 to compare the domain organizations of TniQ and TnsD variants encoded by E. coli Tn7, CAST I-B, CAST I-D, and Tn7017 (the dual TniQ-TnsD CAST I-F) (Figures 4A and S2A). From the structural models, we identified a common core of about 300 amino acids (aa) that consists of a helical domain (Hel1), a zinc finger (ZF), a connector helix, and another helical domain (Hel2).
Even within this core region, however, there are notable differences among the TniQ and TnsD proteins from different transposons ( Figure S2A). TniQ from Tn6677 CAST I-F interacts with Cas6 as well as Cas7 and the guide RNA through a loop in the Hel2 domain, 20 suggesting that Hel2 provides a bridge to Cascade and that the structural diversity of Hel2 translates into compatibility with distinct Cascades. By contrast, TniQ from CAST V-K, which associates with the Cas12k effector rather than Cascade, contains only the Hel1 and ZF domains, 21,22 indicating a different mode of interaction between the target-selector components and highlighting the flexibility of TniQ as an adaptor between target selection and transposase machineries.
The CAST TniQ proteins consist largely of the core region, whereas the CAST and Tn7 TnsD proteins have diverse, long C-terminal extensions, which might confer target-DNA recognition to enable protein-guided transposition (Figure 4A). However, some of these regions share similarities that might reflect overlapping functions. For example, the TnsDs of Tn7 and CAST I-B, both of which home to the same site (the glmS gene), 2,12,23 share common C-terminal domain architectures, which might indicate that this C-terminal region is involved in the recognition of the attachment site. We also detected similar structures and domain architectures between the C-terminal regions of TnsDs from CAST I-B2 and CAST I-D, suggesting that they also recognize similar attachment sites.

Identification of additional target selectors
The diversity of the interactions between TniQ and Cas effectors raises the possibility that the TniQ core acquired the ability to bind other target-selector proteins as well. Furthermore, these findings suggest that the presence of a small TniQ lacking a large C-terminal extension is a general hallmark of Tn7-like transposons that employ additional target selectors. We therefore expanded our analysis of TniQ and TnsD beyond the CAST systems to systematically analyze their diversity and identify potential partners of TniQ. We extracted the sequences of all TniQ and TnsD proteins encoded in the vicinity of the tnsC homologs in the Tn7 clade (5,072 altogether). We found 2,905 loci encoding a single TniQ or TnsD, 998 loci encoding dual TniQ-TnsD proteins, and 50 loci encoding more than 2 TniQ-TnsDs (33 loci with 3, 14 loci with 4, 2 loci with 5, and 1 locus with 6) (Figures 1B and S2B). Loci with more than a single TniQ or TnsD arise either from co-occurring Tn7-like transposons or from a tniQ/tnsD gene split into partial genes. The size distribution of TniQ and TnsD proteins falls into four bins, suggesting selection driven by particular size constraints (Figure S2C), but TniQ and TnsD seem to represent two ends of a continuum, spanned by proteins containing extensions of variable length (Figure S2C). This variation makes it difficult to predict, for some of these proteins, whether they bind directly to attachment sites or require a target-selector partner. We refer to such proteins as TniQ/TnsD, reflecting this uncertainty. In dual TniQ-TnsD loci, one of these proteins is usually larger than 400 aa, suggesting direct target selection, whereas the second one is substantially smaller (Figure S2C). Such an architecture could provide transposons with two distinct options for target selection: direct, via TnsD, and indirect, via TniQ interacting with an additional target selector.
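The locus counts quoted above can be cross-checked with a few lines of Python: multiplying each locus count by its copy number recovers the 5,072 proteins, and the dual and higher-copy loci together give the 1,048 tandem loci mentioned earlier. The `likely_role` helper is a hypothetical illustration of the 400-aa rule of thumb only; as noted in the text, protein sizes form a continuum, so a single threshold cannot classify every protein.

```python
# Locus counts from the text, keyed by number of TniQ/TnsD genes per locus.
loci_by_copy_number = {1: 2905, 2: 998, 3: 33, 4: 14, 5: 2, 6: 1}

total_loci = sum(loci_by_copy_number.values())
tandem_loci = sum(n for copies, n in loci_by_copy_number.items() if copies >= 2)
more_than_two = sum(n for copies, n in loci_by_copy_number.items() if copies > 2)
total_proteins = sum(copies * n for copies, n in loci_by_copy_number.items())

print(total_loci, tandem_loci, more_than_two, total_proteins)


def likely_role(length_aa, threshold=400):
    """Hypothetical size heuristic: proteins above the threshold are treated
    as TnsD-like (direct target binding), the rest as TniQ-like (likely
    dependent on a partner target selector)."""
    return "TnsD-like" if length_aa > threshold else "TniQ-like"


print(likely_role(369), likely_role(520))
```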
Analysis of the phylogenetic tree of TniQ/TnsD built using the multiple alignment of the core regions (made from 4,916 proteins passing alignment filters; see STAR Methods) indicates that the dual TniQ-TnsD arrangement is polyphyletic (i.e., it emerged independently on multiple occasions via duplication of a single tniQ/tnsD [Figure S2B]), with the monophyly of the dual TniQ-TnsD (i.e., evolution via a single duplication of an ancestral TniQ/TnsD) compellingly ruled out (p value = 2.4 × 10^-236). The ancestral Tn7-like transposons likely encoded a single TniQ/TnsD protein, and the dual TniQ-TnsD configurations apparently evolved by in situ duplication of a single tniQ/tnsD (Figure S2B), followed by neofunctionalization. Such duplication might maintain compatibility with the other transposase components while opening the possibility of evolving different modes of target selection. Consistent with the independent duplication scenario, we identified only a few examples of dual TniQ-TnsD loci containing distantly related proteins, whereas many such loci would be expected under an alternative scenario involving the exchange of tniQ/tnsD genes between different transposons (Figures S2B and S2D).
Apart from the dual TniQ-TnsD loci, numerous Tn7-like transposons encompass TniQ together with another, unrelated target selector, such as TnsE, the plasmid-target selector (Figure S2B). We identified two distinct branches encoding TniQ and TnsE (Figures S3A and S3B). Although these TnsEs are highly divergent in sequence (less than 10% sequence identity), they are predicted to form closely similar structures (Figure S3C). The C-terminal domain of TnsE binds dsDNA via a unique fold, 24 but the function of the N-terminal region has not been explored. Using structural prediction and structural mining 25 of the N-terminal region, we found that it folds into two single-strand binding (SSB) domains related to PriB (Figure S3D), a component of the bacterial primosome, which can bind both ssDNA and ssRNA and is involved in restarting replication at the fork. 26-28 Such structural similarity with PriB suggests that TnsE might have been co-opted from a system that functions at the replication fork. The domain architecture of TnsE includes a dsDNA-binding domain and a domain predicted to bind ssRNA or ssDNA, suggesting that it targets the lagging strand of replication. Thus, TnsE is likely to specifically target replication forks of conjugative plasmids, further highlighting the remarkable diversity of target selectors co-opted by Tn7-like transposons.
To search for additional target selectors, we focused on genes operonized with TniQ/TnsD and conserved in multiple nodes of the tree (see STAR Methods and Figure S2D). The most common candidate gene encodes an uncharacterized protein of 498 aa and forms a putative operon with a gene encoding a short TniQ (369 aa) in the Tn6022 family of Tn7-like transposons (121 groups of loci). This gene was previously annotated as orf3 or tniE, 29,30,31 but we denote it TnsF (Figure S4A). AF2 prediction of the TnsF structure revealed a distinct domain architecture including an N-terminal region containing multiple ZFs in the first 199 aa, whereas the remaining ~300 aa exhibit significant structural similarity to the N- and C-terminal domains of the tyrosine recombinase superfamily member XerH (PDB: 5jk0 32) (Dali score 3.9) (Figures 4B and S4B). However, the chamber holding the tyrosine catalytic site is missing in TnsF (Figure S4B). Tyrosine recombinases typically contain an N-terminal DNA-binding domain (CB domain 32) and a C-terminal catalytic domain (CAT) and dimerize or tetramerize on DNA during site-specific recombination. Both the N- and C-terminal domains of XerH interact with DNA in the crystal structure (Figure S4B). TnsF is predicted to contain tandem domains (designated CB1 and CB2), which are structurally similar and, by inference, homologous to the CB of XerH (Figure S4B), suggesting that these domains may impart the ability to interact with DNA. A β sheet at the C-terminal region of XerH maps to the last 70 aa of TnsF (designated partial CAT or pCAT) (Figure 4B) and corresponds to the DNA-binding region of the recombinase domain of XerH (Figure S4B) but lacks the helix K. 33 In XerH, the helix K contributes substantially to its interaction with DNA, suggesting that the pCAT domain of TnsF lost the DNA-binding capacity and might instead interact with TniQ. Indeed, based on an AF2 multimer model, TnsF is predicted to interact with TniQ through its C-terminal region (Figure 4B).
Additionally, we found one case where the N-terminal region of TnsF is fused to the C-terminal region of TniQ (GenBank: SCZ64694). This fusion protein lacks the entire CAT but contains an additional CB-like domain within the linker between TniQ and TnsF.
The Tn6022 transposons also encode tnsE on the opposite strand, suggesting they can jump to conjugative plasmids, whereas TniQ and TnsF are likely involved in target selection within the bacterial chromosome. Together, these data suggest that TnsF is a distinct target selector and that TniQ of Tn6022 serves as a hub that bridges TnsF, which binds directly to the attachment site, with the transposition machinery.

TnsF is essential for Tn6022 transposition and interacts with TniQ
To experimentally test the predicted target-selector function of TnsF, we focused on the Tn6022 transposon. Tn6022 encodes TnsA, TnsB, TnsC, TniQ, TnsF, and TnsE and is inserted in the comM gene, which encodes a protein containing an AAA+ ATPase domain and a Mg chelatase domain 34,35 (Figure 5A; Data S1). We reconstituted Acinetobacter johnsonii Tn6022 (hereafter, AjTn6022) in E. coli. We determined the ends of the transposon (see STAR Methods) and cloned the left and right ends into a pDonor plasmid with a kanamycin-resistance gene as cargo. We also cloned a 100-bp fragment (50 bp upstream and 50 bp downstream of the insertion site) of the AjcomM gene into a pTarget plasmid. These plasmids were co-electroporated into E. coli with a pHelper plasmid (bearing tnsA, tnsB, tnsC, tniQ, and tnsF). To determine the structure of the insertion, we performed long-read, amplification-free nanopore sequencing. We found simple insertions (60.9% of insertions) and co-integrate insertions (39.1% of insertions) (Figure S4C), and we confirmed the presence of target-site duplications (TSDs), a signature of Tn7-like transposition (Figure S4D). To determine whether all Tns proteins including TnsF are essential for transposition into the comM gene, we generated pHelper variants lacking each of the Tns proteins and repeated the transposition assay. AjTn6022 achieves transposition at 2.4% efficiency, and removal of any Tns protein (A, B, C, Q, or F) impaired transposition, as quantified by droplet digital PCR (ddPCR) (Figure 5B).
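A target-site duplication can be recognized computationally as a short sequence that both ends the host flank upstream of the inserted element and begins the host flank downstream of it. The Python sketch below shows one simple way to detect this; the function name, the flank sequences, and the 10-bp search limit are hypothetical and do not describe the analysis used for Figure S4D.

```python
def find_tsd(upstream_flank, downstream_flank, max_len=10):
    """Return the longest sequence (up to max_len bp) that ends the upstream
    flank and also begins the downstream flank, or "" if there is none.

    upstream_flank: host sequence immediately 5' of the inserted element.
    downstream_flank: host sequence immediately 3' of the inserted element.
    """
    longest = min(max_len, len(upstream_flank), len(downstream_flank))
    for n in range(longest, 0, -1):
        if upstream_flank[-n:] == downstream_flank[:n]:
            return upstream_flank[-n:]
    return ""


# Toy flanks with a 5-bp duplication (TTAGC) on either side of the insert.
print(find_tsd("GGATCCTTAGC", "TTAGCAAGCTT"))
# Flanks with no shared junction sequence.
print(find_tsd("GGATCC", "AAGCTT"))
```

Because a single shared base occurs by chance about one time in four, a practical filter would require a minimum TSD length before calling a duplication.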
To test our prediction that TnsF and TniQ interact, we purified TnsF and TniQ proteins from AjTn6022 and performed pull-down assays. We showed that, in vitro, TnsF interacts specifically with AjTniQ but not with TniQ proteins from ShCAST, AvCAST, PmcCAST, CyCAST, or Tn7017 (Figure 5C; Data S2). The docking model of AjTnsF:AjTniQ predicts an interaction between the C-terminal helix of TniQ and the pCAT region of TnsF via hydrophobic contacts and salt bridges (Figure 5D). TniQ proteins from other systems lack this helical region, which may explain why they do not interact with AjTnsF. Mutations designed to disrupt the predicted salt bridges (TnsF E427K/D476K and TniQ K348D/K361D) abrogated or substantially reduced transposition (Figure 5D), supporting the hypothesis that this Aj-specific region of TniQ is important for the interaction with TnsF.

Identification of Tsy, a non-Tn7 transposon that uses TnsF for target selection
We next searched for TnsF homologs in genomic databases and identified 1,099 nonredundant TnsF homologs (STAR Methods; Data S1), including Tn6022 TnsF and a homolog of TnsF (referred to as TnsF-like protein) containing a predicted active CAT (Figures S4A and S4B). Although we also detected more distant structural homologs of TnsF containing adjacent CB and CAT domains, these proteins lacked the ZF-containing N-terminal regions, so we did not include them in the further analysis (Figure S4E). We built a phylogenetic tree of the TnsF and TnsF-like proteins (Figure S4F) and mapped on it the conserved genes located in the vicinity of tnsF and tnsF-like genes. Tn7 TnsF forms a distinct clade and apparently evolved from TnsF-like proteins with an active tyrosine recombinase CAT. The genomic neighborhood of these TnsF-like proteins lacks Tn7 components but instead includes upstream genes encoding a tyrosine recombinase (YRec) and a small helix-turn-helix (HTH)-domain protein as well as a downstream gene encoding a GIY-YIG nuclease (present only in the branch close to Tn7) (Figures 6A and S4F). Although we could not detect inverted repeats or any canonical ends in these loci, we noticed the presence of comM fragments, namely, the 5′-terminal portion located upstream of YRec and the 3′-terminal portion located downstream of the putative transposon (Figure 6A). The downstream portion of the comM gene is in some cases located after several additional genes, which likely represent transposon cargo. These features suggest that this locus is a distinct transposon and that the TnsF-like protein recognizes comM, similarly to TnsF of Tn6022.
To determine the function of these potentially enzymatically active TnsF-like proteins, we experimentally characterized this mobile element, which we designate Tsy (target selector based on tyrosine recombinase). We reconstituted the system from Zoogloea sp. LCSB751 (hereafter, ZooTsy) in E. coli. To assess ZooTsy transposition, we cloned 135 bp of transposon end1 (comM 5′-terminal portion) and 39 bp of end2 (comM 3′-terminal portion) with 12 bp of homology arm extensions into an R6K origin pDonor plasmid with a kanamycin-resistance gene as cargo (Figures S5A and S5B). We also cloned a Tsy attachment site (a 100-bp fragment with 50 bp upstream and 50 bp downstream of the Zoogloea sp. comM insertion site) into a pTarget plasmid (Figures S5A and S5B). These plasmids were co-electroporated into E. coli with a pHelper plasmid (bearing YRec, HTH, tnsF, and the GIY-YIG nuclease, nuc). We detected transposition into pTarget and observed circular intermediates (CIs) derived from pDonor by PCR in a YRec-dependent manner (Figure S5A), as previously demonstrated for various transposons encoding tyrosine recombinases. 34 To confirm these findings, we established an assay to isolate the CI derived from the pDonor and confirm its structure. We constructed a derivative pDonor that contained the ColE1 origin of replication and kanamycin-resistance gene as cargo as well as lacZα but no other origin of replication. Upon circularization, this pDonor will lose lacZα and can be isolated from white colonies by traditional blue-white screening (Figure S5A). Using this assay, we obtained white colonies (98% of the total) in a pHelper-dependent manner after retransformation with the extracted plasmids and successfully isolated a smaller plasmid that had lost the 0.6-kb backbone region of the pDonor (Figure S5A; Data S3).
We confirmed the smaller plasmid as a CI by nanopore long-read sequencing and observed the connected end2 (...AATCCCAGTC) and end1 (AAGTTCTGAT...) junction by Sanger sequencing ( Figure S5A; Data S3). To determine the structure of the ZooTsy insertions, we performed nanopore long-read sequencing and found simple insertions (62.3% of total insertions) ( Figures S5C and S5D).
To narrow down the requirements for transposition, we generated variants of the cargo with serial deletions of end1 from 135 to 0 bp, finding that truncation of this end gradually decreased the rate of simple insertions to zero (Figure S5E). By contrast, only 20 bp at end2 are required for transposition (Figure S5E). By systematically combining these optimized parameters, we found that 12 bp of homology arm 1 (hom1), 135-bp end1, and 20-bp end2 are sufficient for the transposition of ZooTsy (Figure S5F).
To determine the genetic requirements for ZooTsy transpositions in E. coli, we constructed a series of pHelper plasmids with deletions of each gene. ZooTsy achieves transposition at 5.0% efficiency, and removal of any component (YRec, HTH, TnsF, or GIY-YIG nuclease) from the system impaired transposition, as quantified by ddPCR for the upstream-end1 junction formation ( Figures 6B and S5G). The GIY-YIG nuclease, however, is not essential for transposition, which is supported by the identification of Tsy relatives lacking this component ( Figure S4E). By contrast, we found that the tyrosine recombinase catalytic activities of both YRec and TnsF are essential for transposition ( Figure S5H).

TnsF targets a conserved region of comM
Both AjTn6022 and ZooTsy use TnsF to target comM of their respective hosts, although the directionality of insertion is different, suggesting different modes of insertion. ComM has been reported to facilitate recombination of sequences acquired by transformation, 35 and disruption of the comM gene by transposon insertion could inactivate this functionality, limiting transformation of other MGEs. To explore comM targeting in greater depth, we sought to compare TnsF-mediated targeting of this gene by AjTnsF and ZooTnsF. Using our heterologous E. coli expression systems, we looked for AjTn6022 and ZooTsy targeting of genomic comM, finding that they can both target E. coli endogenous comM in addition to their respective comM sites on pTarget, highlighting the broad recognition of comM by both TnsFs (Figures 6C and 6D). To determine the specificity of comM gene targeting by TnsF, we performed tagmentation-based tag integration site sequencing (TTISS). 36 For AjTn6022, in the absence of pTarget, we observed that 96.7% of insertions were at the E. coli endogenous comM locus; in the presence of pTarget, 56.6% of insertions were on pTarget and 40.8% were on the genomic comM (97.4% on target in total) (Figure S6A). For ZooTsy, we observed similarly high levels of specificity, indicating that TnsF is highly selective for comM (Figure S6B). We confirmed that both TnsF proteins bind 200-bp dsDNA fragments corresponding to their respective comM target sequences (Figures 6E and 6F).
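The on-target percentages reported above amount to binning each mapped insertion by whether it falls inside a target interval. The following Python sketch shows that bookkeeping under stated assumptions: the replicon names, the target coordinates, and the synthetic list of insertion sites are all hypothetical and were constructed so that the output reproduces the 56.6% and 40.8% split quoted in the text; it is not the TTISS analysis itself.

```python
from collections import Counter


def insertion_summary(sites, targets):
    """Return the percentage of insertions per target label.

    sites: list of (replicon, position) tuples for mapped insertions.
    targets: dict mapping label -> (replicon, start, end), inclusive.
    Insertions outside every target interval are labeled "off-target".
    """
    counts = Counter()
    for replicon, pos in sites:
        label = "off-target"
        for name, (t_replicon, start, end) in targets.items():
            if replicon == t_replicon and start <= pos <= end:
                label = name
                break
        counts[label] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: 100 * n / total for label, n in counts.items()}


# Hypothetical coordinates and a synthetic set of 1,000 insertions.
targets = {
    "pTarget comM": ("pTarget", 1000, 1100),
    "genomic comM": ("chromosome", 500000, 501500),
}
sites = (
    [("pTarget", 1050)] * 566
    + [("chromosome", 500700)] * 408
    + [("chromosome", 42)] * 26
)
summary = insertion_summary(sites, targets)
on_target = summary["pTarget comM"] + summary["genomic comM"]
print(summary, round(on_target, 1))
```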
To further characterize the attachment sites of these proteins, we constructed additional pTarget variants with different lengths of comM gene fragments to map the target-site specificity of both TnsFs at a greater resolution. For AjTn6022, deletion of the 40-50 bp of either upstream or downstream sequences of the insertion site substantially reduced transposition, indicating that the insertion site was located within the TnsF attachment site ( Figure S6C). For Tsy, we found the attachment site is within a 40-bp upstream region from the insertion site (Figure S6D). Mapping these refined target sites on the respective comM genes showed that they overlap with a conserved 10-bp region within the Walker B motif (a highly conserved ATPase motif 37 ) of the ComM protein ( Figure 6G). We confirmed that this 10-bp region is necessary for both AjTnsF and ZooTnsF binding (Figures S7A and S7B). Furthermore, we found that mutating this region abolished the transposition activity of both AjTn6022 and ZooTsy (Figures S7C and S7D). Targeting the Walker B motif of comM might provide a natural conserved anchor for TEs to spread across species, paralleling Tn7 targeting of the catalytic site in glmS. Together, these results demonstrate that TnsF is a target selector related to the tyrosine recombinase family and involved in the transposition of at least two distinct groups of transposons.

DISCUSSION
Target-site selection is a crucial step in the life cycle of TEs because insertion of the element can dramatically affect the fitness of both the host and the TE depending on the insertion site. 38,39 Moreover, the choice of target site is critical for the TE to spread horizontally. Here, we expand the current understanding of target-site selection mechanisms, identifying previously uncharacterized target-site selectors and a distinct family of TEs. Together, our results reveal the modular architecture used by Tn7-like transposons to bridge target-site selection with transposase activity, providing TEs with maximal targeting flexibility (Figure 7). This flexibility is manifested in the various proteins that Tn7-like transposons have co-opted and adopted for target-site selection. Although the CRISPR systems in most CASTs have lost their ancestral interference and nuclease functions, the CyCAST system we describe here has an interference component and a Cas10d protein with an active HD-nuclease domain. Very recently, other CAST I-D systems have been reported, but these lack Cas3d, and Cas10d is inactivated. 40 Thus, CyCAST seems to represent an evolutionary intermediate where the CRISPR system has been co-opted by the TE but has not fully lost its native function. Along similar lines, we found that the Tn7-like transposon Tn6022 has co-opted TnsF, a catalytically inactive derivative of a tyrosine-recombinase-containing protein, whereas the Tsy transposon we describe here encodes the apparent ancestor of TnsF, a catalytically active recombinase.
Our finding that additional enzymes have been co-opted as target selectors by Tn7-like transposons raises the question of how these recruited proteins evolve target-site selection capacity. In Tn7, to recruit the transposase machinery to the target site, TnsD binds to and induces a local distortion in the target DNA, which then attracts TnsC. Artificial induction of such DNA distortion has been shown to attract TnsC independently of TnsD. 23 [...] as TnsF. Although a canonical tyrosine recombinase would entirely cover the distorted DNA, precluding TnsC access, the distinct architecture of TnsF, with two different DNA-binding domains, might bend the target DNA while still allowing partial or full access for TnsC. Thus, proteins that provide more efficient target selection could supersede the target-selecting role of the C-terminal region of TnsD, ultimately leading to their domestication and loss of their native enzymatic activity (Figure 7). CAST I-D and the TnsF homolog in the Tsy system are examples of apparent intermediate stages on the evolutionary path to domestication. We also detected numerous loci encoding a TniQ that is too short to enable target selection and no other identifiable target-selector partner. Such elements might recruit target selectors in trans, perhaps representing the initial step in the evolution of new target selectors or an even greater flexibility in target-selector recruitment.
[Figure legend fragment] ddPCR experiments were performed with three biological replicates. All data points are shown with an error bar showing standard deviation, and statistical significance was assessed by t test. *p < 0.05; ***p < 0.001; ****p < 0.0001; n.s., not significant. See also Figures S5-S7.
Limitations of the study
Given the apparent fast evolution of target-selector proteins, our sequence-based mining might have limited the scope of our analyses. The recent advances in protein-structure prediction now enable structure-based mining, which could yield candidates beyond those reported here. Indeed, although a sequence-based search did not identify TnsF homologs in other systems, a structural-mining approach might shed light on the potential origins of target selectors, as exemplified by the relationship between TnsE and PriB (Figure S3). These types of evolutionary and structural analyses, combined with further study of the mechanisms of TnsF and Tsy, will shed light on this distinct mode of target-site selection. The identification of distinct target selectors described here highlights the remarkable plasticity of the insertion machinery of Tn7-like transposons, but further research will likely reveal additional mechanisms of transposon targeting.

STAR★METHODS
Detailed methods are provided in the online version of this paper and include the following:
[Figure legend fragment] Evolutionary scenarios for various Tn7-like transposons with distinct modes of target selection. Locus architecture is shown on the left and mechanics of target-site selection on the right. (1) An ancestral Tn7-like transposon might have used TnsD for site-specific target selection and a DNA-bending protein or complex (e.g., Cas effector, transcription factor, or tyrosine recombinase) in trans as a second mode of target-site selection. These DNA-bending proteins would create a distortion in the DNA that TnsC would recognize, albeit with a low efficiency. Gene duplication produced a second copy of TnsD. (2) Neofunctionalization of the second copy of TnsD yielded TniQ, which evolved to optimize the interaction between a trans DNA-bending target selector and the target site. The trans system could also be captured by the transposon as cargo, as was the case with CRISPR-Cas systems. (3) Further domestication of the target selectors would then occur, eventually leading to the loss of the native function of the system (e.g., CASTs I-B and I-F), fusion to TniQ generating a distinct TnsD (e.g., TniQ-TnsF fusion), or adaptation of the system for dual modes of transposition as in CAST V-K, which relies entirely on the CRISPR system for both homing and jumping. Pink indicates DNA-binding function; green indicates TniQ core; blue indicates native function of DNA-binding system.

INCLUSION AND DIVERSITY
One or more of the authors of this paper self-identifies as an underrepresented ethnic minority in their field of research or within their geographical location. One or more of the authors of this paper self-identifies as a gender minority in their field of research.

RESOURCE AVAILABILITY
Lead contact
Further information and requests for resources and reagents should be directed to and will be fulfilled by the lead contact, Feng Zhang (zhang@broadinstitute.org).

Materials availability
Plasmids generated in this study have been deposited to Addgene (Data S3).
Data and code availability
• All Illumina NGS and Oxford Nanopore Technologies (ONT) sequencing data generated from this publication have been deposited and are publicly available as of the date of publication. Accession numbers are listed in the key resources table.
• All original code for transposition junction NGS reads analysis has been deposited to GitHub and Zenodo. DOIs are listed in the key resources table.
• Any additional information required to reanalyze the data reported in this paper is available from the lead contact upon request.

METHOD DETAILS
Identification of Tn7-like transposons
HMMs of Tn7 proteins (TnsA (PF08721.12, PF08722.12), TnsB (PF00665.27), TnsC (PF05621.12, PF11426.9, PF13401.7), TnsD (PF06527.12, PF15978.6), and TnsE (PF18623.2)) were used to search for homologs with hmmsearch 43 (using the gathering-threshold option, --cut_ga) within predicted protein sequences derived from publicly available microbial contigs in the NCBI GenBank and WGS databases, the JGI database (projects with stated permission to use), and the MG-RAST database 44 (all frozen in November 2020). The full database encompasses 521,828,662 contigs in total (contigs greater than 1.5 kb), covering 6,932,321,054,498 bp of genomic DNA. Of these, 1,617,895 contigs have detectable rpoB genes (detected with hmmsearch using the TIGR02013.1 profile), suggesting that the sampled diversity corresponds to roughly 1.6 million genomes. 45,46 Loci were built by mapping the location of the hits onto the contig and aggregating hits when they are no further than 20 kb from each other. Loci were selected if they satisfied the following criteria: (i) at least 2 hit genes homologous to 2 distinct Tn7 components, (ii) 2 of the hit genes are in a putative operon, which is operationally defined as 2 codirectional ORFs separated by less than 50 bp of non-coding sequence, and (iii) the hit genes are less than 3 kb from the contig boundary (to remove likely incomplete transposons). Using these criteria, 80,028 loci were obtained, from which we extracted and translated 86,517 TnsC homologs. We clustered these homologs at 80% sequence identity over 75% of the protein length (coverage) using MMSeqs2 (v. 12-113e3) 47 and obtained 7,789 TnsC homolog representatives.
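The locus-building step above can be illustrated with a minimal sketch. It is not the authors' code: the `Hit` record, the function names, and the reading of "no further than 20 kb from each other" as the gap between the end of the current locus and the start of the next hit are all assumptions made for illustration. Only criterion (i) is shown; the operon and contig-boundary criteria are omitted.

```python
from collections import defaultdict
from typing import NamedTuple


class Hit(NamedTuple):
    """Hypothetical record for one hmmsearch hit mapped onto a contig."""
    contig: str
    start: int       # start coordinate on the contig
    end: int         # end coordinate on the contig
    component: str   # e.g. "TnsA", "TnsB", "TnsC"


def build_loci(hits, max_gap=20_000):
    """Group hits per contig; open a new locus when the gap exceeds max_gap.

    Assumption: the gap is measured from the furthest end coordinate of the
    current locus to the start of the next hit.
    """
    by_contig = defaultdict(list)
    for hit in hits:
        by_contig[hit.contig].append(hit)

    loci = []
    for contig_hits in by_contig.values():
        contig_hits.sort(key=lambda h: h.start)
        current = [contig_hits[0]]
        for hit in contig_hits[1:]:
            if hit.start - max(h.end for h in current) <= max_gap:
                current.append(hit)
            else:
                loci.append(current)
                current = [hit]
        loci.append(current)
    return loci


def has_two_components(locus):
    """Criterion (i): hits to at least two distinct Tn7 components."""
    return len({h.component for h in locus}) >= 2
```

In this sketch, a contig with TnsA and TnsB hits 100 bp apart and a TnsC hit 46 kb further downstream yields two loci, of which only the first passes criterion (i).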

Construction of the phylogenetic tree of TnsC homologs
The protein sequences of representative TnsC homologs were aligned using a method described previously. 48 Briefly, the protein sequences were clustered at 50% identity and cluster members were extracted and aligned using MUSCLE (version 5). 49 An all-versus-all comparison of the multiple sequence alignments (MSAs) was computed using HHsearch. 50 An unweighted pair group method with arithmetic mean (UPGMA) dendrogram was constructed using the HHsearch similarity scores. The dendrogram was used to guide the iterative pairwise alignment of cluster MSAs using HHalign. 50,51 Clusters were discarded if they could not be aligned using this approach, leaving 6,988 sequences in the single final alignment. The alignment was first filtered to remove sites with conserved gaps using trimal version 1.2 with the option gappyout. 52 Finally, aligned sequences in which the Walker A and B motifs were not aligned correctly (any gap in one of the positions) were discarded. This led to an alignment of 6,384 sequences that was used for this study. The alignment was input into FastTree2 53 with the Whelan-Goldman model of amino acid evolution and gamma-distributed site rates. The tree was visualized and annotated using the Interactive Tree of Life (iTOL). 54

Annotation of the genomic neighborhoods of TnsC homologs
Representative TnsC homologs were mapped on the respective genomic contigs and genes within 50 kb were extracted. These genes were translated and annotated for specific genes of interest, including Cas effectors (Cascade and RAMP components and single-protein effectors from Class 2 CRISPR-Cas systems) and Cas6, using the profiles extracted from DefenseFinder 55 and hmmsearch; all hits with a score of 25 or greater were selected. Hits were mapped onto leaves of the TnsC homolog tree and subsequently the TniQ homolog tree and displayed with iTOL (Figures 2 and S2).
As TniQ/TnsD can be extremely divergent and therefore not always detected by hmmsearch, a profile-profile comparison was performed using HHalign software, and the TniQ PF06527 profile was used to annotate distant TniQ homologs. TniQ hits were selected if they had an HHsearch probability >= 80. Given the abundance of TniQ, false-positive detection of TnsC (AAA-ATPases among cargo), and the occasional co-occurrence of multiple Tn7 transposons, it is challenging to know which tniQ/tnsD is associated with which tnsC. For a given tnsC, we defined an association with tniQ/tnsD(s) if no other tnsC homologs were found in the vicinity. If there was a tnsC homolog closer than the given tnsC, we associated tniQ/tnsD to it only if there was a tnsB gene operonized with this closer tnsC. In total, 5,072 tniQ/tnsD were found to be associated with TnsC homolog representatives. The presence of these tniQ/tnsD was mapped onto the TnsC homolog tree. When multiple tniQ/tnsD were found in the vicinity, the two largest were selected, and the second selected tniQ/tnsD was mapped on an additional barplot indicating the size of the second protein in aa. tnsE genes were detected in relatives of E. coli Tn7 but also weakly in Tn6022-like transposons, where tnsE is encoded far from the Tn core components (in contrast to E. coli Tn7 relatives, where tnsE is part of a full operon encompassing all Tn core components and tnsD). Manual inspection indicated that these remote TnsE homologs are encoded near the left end of the transposon, far from the other transposon components. Structural comparison of the models predicted by Alphafold2 18 (Figure S3) confirmed that these hits are TnsE relatives. These remote TnsE homologs were used as a new seed to annotate additional TnsEs using blastp on all translated genes in the vicinity of tnsC. In total, 552 TnsE were detected, and their presence was mapped on the TnsC homolog tree and on the TniQ homolog tree and displayed using the iTOL framework.
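The tniQ/tnsD-to-tnsC association rule described above is sketched below under one possible reading. The function name, the single-coordinate gene positions, and the fallback are hypothetical: the sketch assumes that a closer tnsC lacking an operonic tnsB is treated as a probable false positive, so the tniQ/tnsD stays with the given tnsC.

```python
def assign_tniq(tniq_pos, given_tnsc_pos, other_tnscs):
    """Return the position of the tnsC that a tniQ/tnsD gene is assigned to.

    other_tnscs: list of (position, has_operonic_tnsB) tuples for any other
    tnsC homologs in the vicinity. Positions are single coordinates
    (e.g. gene midpoints), a simplification made for this sketch.
    """
    given_dist = abs(tniq_pos - given_tnsc_pos)

    # Keep only closer tnsC homologs that are backed by an operonic tnsB.
    closer = [
        (abs(tniq_pos - pos), pos)
        for pos, has_tnsb in other_tnscs
        if has_tnsb and abs(tniq_pos - pos) < given_dist
    ]
    if closer:
        return min(closer)[1]   # nearest supported competitor wins
    return given_tnsc_pos       # no (supported) competitor: keep given tnsC
```

A nearby AAA-ATPase misannotated as TnsC, with no adjacent tnsB, would not capture the tniQ/tnsD in this sketch.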
Protein structure prediction and analysis
All structural models were built using Alphafold2 (AF2) software under the colabfold framework installed locally. 17-19 Multiple sequence alignments were constructed using colabfold_search on the colabfold database that includes Uniref and environmental protein sequences. Alignments were input into AF2, and three models were generated with 35 recycles. All models were examined using the PyMOL framework (The PyMOL Molecular Graphics System, Version 1.2, Schrödinger, LLC), mapping the predicted local distance difference test (pLDDT), 18,56 a local measure of prediction confidence, on each residue. Regions of the proteins with pLDDT less than 50 were not considered. Protein docking prediction was performed using AF2 with the multimeric model. 17 Results of protein docking were analyzed by examining the predicted aligned error (PAE) matrix 18 and visualizing the interaction area in PyMOL. The spatial distributions of specific chemical interactions found in protein-protein interactions 57,58 were analyzed using PyMOL to validate models when PAE was weak. Predicted structures with high confidence (typically average pLDDT > 70) were considered for downstream analysis. Searches for structural similarity were performed using DALI software version 5 25 against PDB50 (non-redundant at 50% sequence identity) and a custom database made from the EBI Alphafold2 database. This database contains Alphafold2 models of Uniref50 47,59 extracted from the EBI Alphafold2 database (https://alphafold.ebi.ac.uk/), restricted to models having at least 30 aa with a pLDDT greater than 50 (hereafter called AF2DB50). Hits with Z-scores greater than 5 were retained, and every hit was manually verified by building a structural alignment in PyMOL editing mode.
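The pLDDT-based confidence filters above can be written as three small predicates. This is an illustrative sketch rather than the authors' pipeline: the function names are hypothetical, per-residue pLDDT scores are assumed to be available as a list of floats, and the "at least 30 aa" criterion is read as a total residue count, not a contiguous stretch.

```python
from statistics import mean


def confident_residues(plddt, cutoff=50.0):
    """Indices of residues kept for analysis (pLDDT not below the cutoff)."""
    return [i for i, score in enumerate(plddt) if score >= cutoff]


def is_high_confidence(plddt, mean_cutoff=70.0):
    """Model-level filter: average pLDDT above the cutoff."""
    return mean(plddt) > mean_cutoff


def passes_af2db50_filter(plddt, min_residues=30, cutoff=50.0):
    """Database filter: at least min_residues residues with pLDDT > cutoff.

    Assumption: residues are counted in total, not as one contiguous run.
    """
    return sum(score > cutoff for score in plddt) >= min_residues
```

A 50-residue model with a disordered 10-residue tail scoring 40 and a core scoring 90 has a mean pLDDT of 80, so it would pass all three filters with the tail masked out.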
Phylogenetic analysis of TniQ
The 5,072 tniQ/tnsD detected in the vicinity of the Tn7 tnsC homolog representatives were translated and aligned using MUSCLE version 5 with the super5 option to cluster the sequences. Alignments were then created for each cluster and merged into a single alignment. The alignment was first filtered to remove sites with conserved gaps using trimal version 1.2 with the option gappyout. As TniQ/TnsD can harbor very divergent C-terminal regions in terms of both sequence and size, the alignment was restricted to the core region of TniQ. To determine the core (roughly corresponding to the first 300 aa), the structures of all dual TniQ-TnsD CAST systems were predicted with AF2 and manually aligned structurally (as described above). Core positions were mapped to the sequence alignment and the downstream regions (C-terminal region) were trimmed out. Finally, to filter out misaligned TniQ cores, aligned sequences in which the first CxxC motif of the zinc finger was not well aligned (any gap in one of the positions) were discarded. The 4,916 remaining aligned sequences were used to build a tree using FastTree2 with the Whelan-Goldman model of amino acid evolution and gamma-distributed site rates. The presence of cas effector genes and cas6 as well as tnsE in the vicinity was shown as for the TnsC tree. Genes operonized with tniQ/tnsD (see Analysis of candidate partners of TniQ) were also displayed (light green ring). To investigate the origin of dual TniQ-TnsD, a visual approach and a statistical approach were used. For the visual approach, a connector was drawn between dual TniQ-TnsD leaves of the tree and colored based on a rainbow gradient spread across all leaves from the left to the right of the tree. Each connector has a uniform color, taken from the color assigned to its left-most leaf. If the color of a connector matches the gradient color at both the starting leaf (left) and the arriving leaf (right), the proteins are closely related. Conversely, any contrast between the colors indicates that the proteins are not closely related. For the statistical approach, the branch distances of all dual TniQ-TnsD were extracted and compared with 1,000 random branch distances involving non-dual TniQ-TnsD across the tree. Branch distances were calculated using the Phylo package from the Biopython library. 60 These distances were compared via a t test using the SciPy Python library version 1.0. 61
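The statistical comparison above rests on a two-sample t statistic. The source used SciPy and does not state which t-test variant was applied, so this standard-library sketch assumes Welch's unequal-variance form and omits the p-value. The function name is hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Welch's t statistic for two independent samples.

    Assumption: unequal variances (Welch); the source only says "T-Test".
    Each sample needs at least two values with non-zero overall variance.
    """
    se_a = variance(sample_a) / len(sample_a)   # squared standard error, A
    se_b = variance(sample_b) / len(sample_b)   # squared standard error, B
    return (mean(sample_a) - mean(sample_b)) / sqrt(se_a + se_b)
```

Here `sample_a` would hold the branch distances between dual TniQ-TnsD pairs and `sample_b` the 1,000 randomly drawn distances. With SciPy installed, `scipy.stats.ttest_ind(sample_a, sample_b, equal_var=False)` computes the same statistic together with a p-value.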

Analysis of candidate partners of TniQ
Genes operonized with tniQ/tnsD were extracted and translated if they were not related to any Tn7 components (TnsA, TnsB, TnsC, TniQ/TnsD, or TnsE). The resulting 782 proteins were clustered at 30% sequence identity and 30% sequence coverage using MMSeqs2 (v. 12-113e3). 47 The members of each cluster were aligned using mafft-linsi. 62 For each alignment, secondary structures were predicted with psipred version 2.6 63 to ensure compatibility with hhpred, and an HMM profile was built using hhmake. 51 Each hhpred profile was compared to the Pfam protein domain database (preformatted for hhpred and available at https://wwwuser.gwdg.de/~compbiol/data/hhsuite/databases/hhsuite_dbs/) using hhsearch. Hits with hhsearch probability >= 90 were considered. Candidates operonized with tniQ/tnsD were mapped onto the tree (light green ring). The 6 largest groups of candidates were selected based on the conservation of the operonized genes across several adjacent leaves in the TniQ tree and analyzed further. For each candidate, we performed profile analysis (using the hhpred webserver) to assess its potential function and structural docking with TnsC and TniQ/TnsD using Alphafold multimer to test for potential interactions (interaction with TnsC could highlight a novel target selector independent from TniQ/TnsD, whereas interaction with TniQ/TnsD could suggest a partner target selector).

TnsF analysis
TnsF from Acinetobacter johnsonii Tn6022 (hereafter AjTn6022 and AjTnsF) was chosen as the representative for computational and experimental analysis. HHpred was used to annotate the domain architecture of AjTnsF and detected multiple zinc fingers in the N-terminal region (positions 1-180 hit a LIM zinc-binding domain-containing protein, hhprob=97.23) and similarity to tyrosine recombinases in the C-terminal region (positions 328-493 hit the site-specific recombinase IntI4, hhprob=99.14). A structural model of TnsF was obtained using AF2 and split into 3 domains defined by long linkers connecting globular regions and a long N-terminal region encoding several zinc fingers. Each domain was used as a seed for a structural similarity search using DALI software across PDB50 (as described above). Top hits were inspected manually with PyMOL.

Mining of TnsF and phylogenetic analysis of TnsF homologs
To search for TnsF relatives, AjTnsF was used as a seed for a psiblast 64 search for 3 iterations on the NCBI NR database (in August 2022). 22,419 protein hits were extracted and clustered at 80% of sequence identity and 70% of coverage with MMSeqs2 (v. 12-113e3). A blastall comparison 65 was performed, and e-values associated with each comparison were input into CLANS software 66 to cluster hits according to their e-values and draw a graph network representation. Several clusters (point density connected and close to each other) are also connected to each other. AjTn6022 TnsF was mapped onto the graph to identify the cluster to which it belongs in order to define the Tn6022 TnsF cluster. The Tn6022 TnsF cluster connected to another cluster from which several members were extracted and mapped onto genomic contigs. Genomic comparison between these contigs and Tn6022 contigs reveals a distinct system with partial comM surrounding the system and with no apparent Tn7 components but other genes operonized with tnsF. Hhpred webserver 51 was used to annotate these genes, revealing the presence of a gene encoding a tyrosine recombinase (yrec), a gene encoding a helix turn helix domain (hth), and a gene encoding a GIY-YIG nuclease. The system was named transposon using Target Selector based on tyrosine (Y) recombinase (Tsy) based on the components operonized with tnsF. Hits belonging to the CLANS Tn6022 TnsF cluster and the adjacent cluster harboring the Tsy TnsF were extracted and aligned using MUSCLE version 5 with the super5 algorithm. The alignment of 1,095 protein sequences was further trimmed using trimal version 1.2 (gappyout option) and was input to FastTree2 with the Whelan-Goldman models of amino acid evolution and gamma-distributed site rates. The tree was visualized and annotated using the interactive tree of life (itol). 
tnsF genes were mapped to genomic contigs, and genes in the vicinity (20 kb) were extracted, translated, and further clustered at 30% of sequence identity retaining 50% of coverage using MMSeqs2 (v. 12-113e3), and each cluster was converted into HMM profile using hhmake and compared to the profile pfam database using HHsearch. The top populated clusters were Tn7 components (tnsA, tnsB, tnsC, and tniQ/tnsD) and candidates operonized with Tsy TnsF (yrec, GIY-YIG). The presence of tniQ/tnsD as a marker of Tn7, and the presence of the yrec and the GIY-YIG nuclease were mapped on the TnsF tree as distinct rings. Split comM genes were extracted and translated from AjTn6022 and one Tsy locus and used as seeds to search for the full protein version of Mg chelatase for Tn6022 and Tsy using blastp. The closest Mg chelatase was selected for each of the two systems and used as a seed to detect comM pieces in the nucleotide vicinity of each locus using tblastn. The presence of a comM hit is indicated as a ring on the TnsF tree. Inspection of the tree shows Tn6022 TnsF is monophyletic (branch support = 0.976). A TnsF from Tsy was extracted from Zoogloea sp. LCSB751 (ZooTnsF) and the structure was predicted using AF2. Structural comparison between AjTnsF and ZooTnsF was performed manually using a PyMOL framework. The structural similarity search was done using DALI on AF2DB50 (as described above). Hits with a zscore greater than 8 were inspected manually to search for tandem CB+CAT domain architecture using PyMOL.

Determination of transposon ends for Tn6022 and Tsy
Transposon ends for AjTn6022 were determined using Geneious by searching for at least one distinct cluster of short repeats (12 nt with 3 mismatches maximum, repeated at least twice in each end) that surround the transposon components, including tnsA, tnsB, tnsC, tniQ, tnsF, and tnsE. Exact end boundaries were then adjusted manually based on local alignment of the clustered repeat regions and a search for target-site duplications. Transposon ends for ZooTsy were determined by prediction based on previous findings about YRec combined with experimental validation (Figure S5). YRec usually works as a dimer to recognize a region with two DNA motifs (each bound by the CB domain of one monomer) and cleaves the middle region between these motifs during recombination. 67 Based on this, we reasoned that in an excision scenario where the cleavage site for excision is at the edge of the partial comM gene, one motif should be located within the comM gene while the other motif should be located downstream of comM in the end of the transposon, which would place the cleavage site at the transition between the comM gene and the end region. To test this, we first cloned the ends of ZooTsy: a 135-bp end1 (the region upstream of YRec extending to the border of the 5′-terminal portion of comM) and a 39-bp end2 (the region from the end of the cargo extending to the border of the 3′-terminal portion of comM). We then performed transposition assays, initially testing five different extensions (called homology arms (hom): 100-, 50-, 25-, 12-, 0-bp) for each end with the comM sequence upstream of end1 (hom1) and downstream of end2 (hom2) to determine whether comM itself encoded a motif for end recognition. Based on these results, we concluded that 12 bp are required for hom1 (the requirement for hom2 was inconclusive). We then further refined this initial construct (end1: 135 bp, end2: 39 bp, hom1: 12 bp, and hom2: 12 bp) to determine the minimal requirements for transposition (Figure S5F).
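The repeat criterion used for the AjTn6022 ends (12-nt repeats with at most 3 mismatches) can be illustrated with a brute-force scan. The source used Geneious, so this is only a stand-in sketch with hypothetical function names. It considers direct repeats on one strand and does not handle reverse-complement repeats or the clustering and manual boundary adjustment described above.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def repeated_kmers(seq, k=12, max_mismatch=3):
    """Return (i, j) start positions of non-overlapping k-mer pairs that
    differ at no more than max_mismatch positions (direct repeats only)."""
    pairs = []
    n_starts = len(seq) - k + 1
    for i in range(n_starts):
        for j in range(i + k, n_starts):   # j >= i + k avoids overlap
            if hamming(seq[i:i + k], seq[j:j + k]) <= max_mismatch:
                pairs.append((i, j))
    return pairs
```

The scan is quadratic in sequence length, which is acceptable for candidate end regions of a few hundred base pairs but not for whole contigs.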