Novel molecular requirements for CRISPR RNA-guided transposition

ABSTRACT
CRISPR-associated transposases (CASTs) direct DNA integration downstream of target sites using the RNA-guided DNA binding activity of nuclease-deficient CRISPR-Cas systems. Transposition relies on several key protein-protein and protein-DNA interactions, but little is known about the explicit sequence requirements governing efficient transposon DNA integration activity. Here, we exploit pooled library screening and high-throughput sequencing to reveal novel sequence determinants during transposition by the Type I-F Vibrio cholerae CAST system (VchCAST). On the donor DNA, large transposon end libraries revealed binding site nucleotide preferences for the TnsB transposase, as well as an additional conserved region that encoded a consensus binding site for integration host factor (IHF). Remarkably, we found that VchCAST requires IHF for efficient transposition, thus revealing a novel cellular factor involved in CRISPR-associated transpososome assembly. On the target DNA, we uncovered preferred sequence motifs at the integration site that explained previously observed heterogeneity with single-base pair resolution. Finally, we exploited our library data to design modified transposon variants that enable in-frame protein tagging. Collectively, our results provide new clues about the assembly and architecture of the paired-end complex formed between TnsB and the transposon DNA, and inform the design of custom payload sequences for genome engineering applications with CAST systems.


INTRODUCTION
Transposons are pervasive genetic elements capable of mobilizing between distinct genetic contexts using various targeting pathways, which serve as a potent force for genome evolution in all domains of life (1-3). Despite their abundance and sheer diversity, most transposons share a common feature in the presence of inverted repeat sequences that dictate the boundaries of the mobile element (4). These terminal transposon sequences are referred to as the left and right ends and typically encode one or more binding sites recognized by oligomeric transposase proteins through specific protein-DNA interactions.
DNA transposons (Class II) may be classified by their hallmark transposase domain, leading to major classes of DD(E/D) transposons, serine transposons, tyrosine transposons and Y1/Y2-transposons (5). DD(E/D)-family transposable elements encompass all known examples of 'cut-and-paste' DNA transposons, which are excised from the donor site and inserted into the target site. This reaction relies on mechanistic and enzymatic symmetry, with transposase subunits executing identical chemical steps on both transposon ends that involve coordinated nucleophilic attacks during strand cleavage and joining (4). The transposase binding sites themselves, however, are not always positioned symmetrically. Whereas some transposons encode symmetrically positioned transposase binding sites within identical left and right ends, including Tc1/mariner-family transposons such as Mos1 (6), other elements encode asymmetrically positioned binding sites within distinct left and right ends, including Hermes, piggyBac, Bacteriophage Mu, P elements, and Tn7-family transposons (7-11). For the well-characterized Tn7 transposon, the asymmetric transposon ends facilitate strict control over integration orientation (7,12), allowing one transposon end to be preferentially integrated adjacent to a target site, but the underlying mechanistic basis explaining this preference is unresolved.
Tn7-family transposons exhibit a modular nature in that they encode conserved core transposition proteins (TnsA, TnsB and TnsC) but diverse target site selection components. TnsB is a DDE-family integrase that recognizes and binds the transposon ends, catalyzes 3' cleavage of both transferred strands at the donor site, and catalyzes the transesterification reaction at the target site. TnsA is an endonuclease protein that forms a heteromeric complex with TnsB and cleaves the 5' end of the non-transferred strand, collectively allowing for full excision of the transposon from the donor site (13,14). TnsC is an AAA+ ATPase protein that communicates between the transposase module and the targeting component (15,16). Tn7-like transposons encode a sequence-specific DNA-binding protein as their targeting module, often of the TniQ family and referred to as TnsD, which directs transposition to a safe-harbor locus such as glmS, comM, yhiN and parE (17-22). In contrast, CRISPR-associated transposons (CASTs) use nuclease-deficient CRISPR-Cas systems to catalyze programmable, RNA-guided DNA transposition (18,23-27).
We previously reported RNA-guided transposition activity for VchCAST, a Tn7-like transposon from Vibrio cholerae (also referred to as Tn6677), which mediates efficient and highly specific DNA integration in an Escherichia coli heterologous host (23). VchCAST encodes a Type I-F CRISPR-Cas system that specifies integration sites through RNA-guided DNA targeting by a multi-subunit complex called Cascade (23,28). Importantly, Cascade binds DNA in complex with an accessory transposition protein, TniQ, which ultimately recruits TnsC and the heteromeric TnsAB transposase, in complex with the donor DNA, to assemble the catalytically active transpososome (Figure 1A) (28,29). DNA integration occurs roughly 50 bp downstream of the R-loop formed by TniQ-Cascade, but the preferred transposon insertion site varies across separate targets, with the dominant integration position ranging between 46 and 52 bp downstream of the target site (23,30). The sequence determinants underlying these heterogeneous integration distances remained enigmatic in our previous work (23,30). Additionally, the role and requirement for multiple putative TnsB binding sites within both transposon ends are unclear, limiting efforts to further engineer VchCAST as a DNA integration tool. Like other Tn7-family transposons, the transposon left and right ends encode asymmetrically positioned TnsB binding sites, which may also relate to the biased orientation with which transposon insertions occur (7,31).
Here, we employ library-based experiments in combination with high-throughput sequencing to investigate DNA sequence requirements during RNA-guided transposition by VchCAST. By individually mutating both transposon ends and measuring resulting DNA integration activity, we revealed sequence requirements of transposase binding sites that also mediate transposase-transposon cognate specificity. Interestingly, our results indicated that the relative positioning of each transposase binding site plays a crucial role in defining the proper architecture of the transpososome complex, with spacing patterns that correspond to the helical pitch of double-stranded DNA. These mutational data also revealed the importance of an integration host factor (IHF) binding site within the left transposon end, and subsequent genetic knockout and rescue experiments confirmed the role of IHF in stimulating transposition efficiency in E. coli. Finally, we uncovered sequence preferences at the site of integration, and we exploited our mutagenesis data to rationally engineer the transposon right end to enable in-frame tagging of endogenous protein-coding genes. Collectively, this work expands our understanding of both protein and DNA sequence requirements of Tn7-like transposons, reveals insights into the architecture of the transpososome complex, and provides new knowledge to inform the design of custom transposon sequences for genome engineering applications.

Cloning, testing, and analysis of pooled pDonor libraries
Donor plasmid (pDonor) libraries were generated by cloning 1549 transposon left end variants or 1849 transposon right end variants into a donor plasmid, which was cotransformed with an effector plasmid (pEffector) that directed transposition into the E. coli genome (schematized in Figure 1D). Each transposon end variant was associated with a unique 10-bp barcode that was used to uniquely identify variants in our sequencing approach, which relied on sequencing the starting plasmid libraries (input) and integrated products from genomic DNA (output) by NGS to determine the representation of each library member before and after transposition. To sequence the output, we independently amplified integration events in the T-RL and T-LR orientations using a cargo-specific primer flanking the transposon end and a genomic primer either upstream or downstream of the integration site. We wrote custom Python scripts to compare each library member's representation in the output to its representation in the input, allowing us to calculate the relative transposition efficiency of our custom transposon end variants.
To clone the transposon donor libraries, we first generated library variants as 200-nt single-stranded pooled oligos (Twist Bioscience). All variants are listed in Supplementary Tables S3 and S4. We wrote custom MATLAB scripts to automate the design of our substitution and truncation variants. The remainder of variants were designed by hand in spreadsheets. 1 ng of oligoarray library DNA was amplified by PCR for 12 cycles in 40 μl reactions using Q5 High-Fidelity DNA Polymerase (NEB) and primers specific to the right or left end library, in order to add restriction enzyme digestion sites. All plasmids used in this study are listed in Supplementary Table S1, and oligos are listed in Supplementary Table S2. Amplicons were cleaned up and eluted in 45 μl Milli-Q H2O (QIAquick PCR Purification Kit). As the backbone vector, we used a plasmid encoding a 775-bp mini-transposon, delineated by 147 bp of the native transposon left end and 75 bp of the native transposon right end, on a pUC57 backbone. The backbone vector and library insert amplicons were digested (AscI and SapI for the right end library, and NcoI and NotI for the left end library) at 37 °C for 1 h, gel purified, and ligated in 20 μl reactions with T4 DNA Ligase (NEB) at 25 °C for 30 min. Ligation reactions were cleaned up and eluted in 10 μl Milli-Q H2O (MinElute PCR Purification Kit), and then used to transform electrocompetent NEB 10-beta cells in five individual electroporation reactions according to the manufacturer's protocol. After recovery (37 °C for 1 h), transformed cells were plated on large 245 mm × 245 mm bioassay plates containing LB-agar with 100 μg/ml carbenicillin. Plates were scraped to collect cells, and plasmid DNA was isolated using the QIAGEN Plasmid Midi Kit.
Nucleic Acids Research, 2023, Vol. 51, No. 9 4521
Transposition experiments were performed in E. coli BL21(DE3) cells. pEffector encoded a CRISPR array (repeat-spacer-repeat), a native tniQ-cas8-cas7-cas6 operon, and a native tnsA-tnsB-tnsC operon, all under the control of a single T7 promoter on a pCDFDuet-1 backbone (30). 2 μl of DNA solution containing 200 ng of pDonor and pEffector in equimolar amounts was used to co-transform electrocompetent cells according to the manufacturer's protocol (Sigma-Aldrich). Four transformations were performed for each sample, and following recovery at 37 °C for 1 h, each transformation was plated on a large bioassay plate containing LB-agar with 100 μg/ml spectinomycin, 100 μg/ml carbenicillin, and 0.1 mM IPTG. Cells were grown at 37 °C for 18 h. Thousands of colonies were scraped from each plate, and genomic DNA was extracted using the Wizard Genomic DNA Purification Kit (Promega).
Next-generation sequencing (NGS) amplicons were prepared by PCR amplification using Q5 High-Fidelity DNA Polymerase (NEB). 250 ng of template DNA was amplified in 15 cycles during the PCR1 step. PCR1 samples were diluted 20-fold and amplified in 10 cycles during the PCR2 step. PCR1 primer pairs contained one pDonor backbone-specific primer and one transposon-specific primer (input library), or one genomic target-specific primer and one transposon-specific primer (output library). PCR amplicons were resolved by 2% agarose gel electrophoresis and gel-purified (QIAGEN Gel Extraction Kit). Libraries were quantified by qPCR using the NEBNext Library Quant Kit for Illumina (NEB). Sequencing for both input and output libraries was performed using a NextSeq Mid or High Output Kit with 150 cycles (Illumina). Additionally, the input libraries were sequenced with a paired-end run using a MiSeq Reagent Kit v2 with 500 cycles (Illumina).
NGS data analysis was performed using custom Python scripts. Demultiplexed reads were filtered to remove reads that did not contain a perfect match to the 19-bp primer binding sequence at the 3'-terminus of the transposon end. Then, the 10-bp sequence directly downstream of the primer binding sequence was extracted, which encodes a barcode that uniquely identifies each transposon end variant. The number of reads containing each library member barcode was counted. If a read did not contain a barcode that matched a library member barcode, it was discarded. The barcode counts were summed across two NGS runs using the same PCR2 samples for the input libraries. Two biologically independent replicates were performed for the output libraries. The relative abundance of each library member was then determined by dividing the barcode count of each library member by the total number of barcode counts. The fold-change between the output and input libraries was calculated by dividing the relative abundance of each library member in the output library by its relative abundance in the input library. This fold-change was then normalized by dividing the fold-change of each library member by the average fold-change of four wildtype library members that contained identical transposon ends but unique barcodes.
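The counting and normalization steps described above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' scripts: the function names, the in-memory list of read strings, and the use of `str.find` to locate the primer binding sequence are all assumptions made for the example.

```python
from collections import Counter


def count_barcodes(reads, primer_site, known_barcodes, barcode_len=10):
    """Count library barcodes found directly downstream of the primer site.

    Reads lacking a perfect match to `primer_site`, or whose barcode is not
    a library member, are discarded (hypothetical helper, for illustration).
    """
    counts = Counter()
    for read in reads:
        idx = read.find(primer_site)
        if idx == -1:
            continue  # no perfect primer-site match
        start = idx + len(primer_site)
        barcode = read[start:start + barcode_len]
        if barcode in known_barcodes:
            counts[barcode] += 1
    return counts


def relative_abundance(counts):
    """Divide each barcode count by the total number of barcode counts."""
    total = sum(counts.values())
    return {bc: n / total for bc, n in counts.items()}


def normalized_fold_change(input_counts, output_counts, wt_barcodes):
    """Output/input fold-change, normalized to the mean wild-type fold-change.

    Assumes every barcode in `input_counts` has a non-zero count.
    """
    inp = relative_abundance(input_counts)
    out = relative_abundance(output_counts)
    fold = {bc: out.get(bc, 0.0) / inp[bc] for bc in inp}
    wt_mean = sum(fold[bc] for bc in wt_barcodes) / len(wt_barcodes)
    return {bc: fc / wt_mean for bc, fc in fold.items()}
```

Under this scheme a variant with a normalized value of 1 integrates as efficiently as the average wild-type library member, and values below 1 indicate reduced relative integration.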
One source of experimental noise in our approach came from PCR recombination (32), in which barcodes became uncoupled from their associated transposon end variants during PCR amplification. PCR amplification has been previously observed to be a source of uncoupling between paired elements during lentiviral production, which can confound analysis in CRISPR screens (32). We quantified the frequency of uncoupling by performing long-read Illumina sequencing (MiSeq, 500-cycles) to sequence both the barcode and full-length transposon end, and found that 64.4% of the right end and 40.9% of the left end barcodes were coupled to their correct transposon end sequence (Supplementary Figure S1C). However, uncoupled reads mapped to a diverse pool of sequences, with the most abundant incorrect sequence for each library member representing only a low percentage of total reads (Supplementary Figure S1D). Indeed, on average, the most abundant incorrect sequence for a given library member was only 2.8% and 9.0% for the right and left end libraries, respectively. These data therefore indicate that uncoupling events did not largely affect the ability to calculate relative integration efficiencies for each library member.
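As a rough illustration of how barcode uncoupling could be quantified from paired barcode and transposon end reads, the sketch below computes, for each barcode, the fraction of reads coupled to the expected end sequence and the fraction taken by the single most abundant incorrect sequence. The data layout (an iterable of `(barcode, end_sequence)` tuples and a barcode-to-sequence dictionary) and the function name are assumptions, not the authors' implementation.

```python
from collections import Counter, defaultdict


def coupling_stats(read_pairs, expected):
    """Summarize barcode/end-sequence coupling per library member.

    read_pairs: iterable of (barcode, end_sequence) tuples (hypothetical layout).
    expected:   dict mapping each barcode to its designed end sequence.
    Returns {barcode: (fraction_correct, fraction_top_incorrect)}.
    """
    by_barcode = defaultdict(Counter)
    for barcode, end_seq in read_pairs:
        if barcode in expected:
            by_barcode[barcode][end_seq] += 1

    stats = {}
    for barcode, seqs in by_barcode.items():
        total = sum(seqs.values())
        correct = seqs.get(expected[barcode], 0)
        # Counts of every sequence that differs from the designed one
        incorrect = [n for s, n in seqs.items() if s != expected[barcode]]
        stats[barcode] = (correct / total, max(incorrect, default=0) / total)
    return stats
```

A low second value relative to the first means that uncoupled reads are spread over many different sequences, which is the situation the paragraph above describes.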
Sequence logos were generated with WebLogo 3.7.4. The VchCAST sequence logo in Figure 2B was generated from the six predicted TnsB binding sites.

Cloning, testing and analysis of pooled pTarget libraries
pTarget libraries were designed to include an 8-bp degenerate sequence positioned 42 bp downstream of one of two potential target sites, as schematized in Figure 3B. Integration was directed to one of the two target sites flanking the degenerate sequence by a single plasmid (pSPIN) encoding both the donor molecule and transposition machinery under the control of a T7 promoter, on a pCDF backbone [described in (33)]. To generate insert DNA for cloning the pTarget libraries, two partially overlapping oligos (oSL2241 and oSL2245, Supplementary Table S2) were annealed by heating to 95 °C for 2 min and then cooling to room temperature. Annealed DNA was treated with DNA Polymerase I, Large (Klenow) Fragment (NEB) in 40 μl reactions and incubated at 37 °C for 30 min, then gel-purified (QIAGEN Gel Extraction Kit). Double-stranded insert DNA and vector backbone were digested with BamHI and AvrII (37 °C, 1 h); the digested insert was cleaned up (MinElute PCR Purification Kit) and the digested backbone was gel-purified. Backbone and insert were ligated with T4 DNA Ligase (NEB), and ligation reactions were used to transform electrocompetent NEB 10-beta cells in four individual electroporation reactions according to the manufacturer's protocol. After recovery (37 °C for 1 h), cells were plated on large bioassay plates containing LB-agar with 50 μg/ml kanamycin. Thousands of colonies were scraped from each plate, and plasmid DNA was isolated using the QIAGEN Plasmid Midi Kit. Plasmid DNA was further purified by mixing with Mag-Bind TotalPure NGS Beads (Omega) at a vol:vol ratio of 0.60× and extracting the supernatant to remove contaminating fragments smaller than ~450 bp.
2 μl of DNA solution containing 200 ng of pTarget and pSPIN at equal mass amounts was used to co-transform electrocompetent E. coli BL21(DE3) cells according to the manufacturer's protocol (Sigma-Aldrich). Three transformations were performed and plated on large bioassay plates containing LB-agar with 100 μg/ml spectinomycin and 50 μg/ml kanamycin. Thousands of colonies were scraped from each plate, and plasmid DNA was isolated using the QIAGEN Plasmid Midi Kit.
Integration into pTarget yielded a larger plasmid than the starting input plasmid. To isolate the larger plasmid, we performed a digestion step that facilitated resolution of the integrated and unintegrated bands on an agarose gel, for extraction of the larger integrated plasmid. We performed this digestion step on both input and output libraries, digesting with NcoI-HF (37 °C for 1 h) and running them on a 0.7% agarose gel. The products were gel-purified (QIAGEN Gel Extraction Kit) and eluted in 15 μl EB in a MinElute Column (QIAGEN). 6.5 μl of cleaned-up DNA was used in each PCR1 amplification with Q5 High-Fidelity DNA Polymerase (NEB) for 15 cycles. PCR1 samples were diluted 20-fold and amplified in 10 cycles for PCR2. PCR1 primer pairs contained pTarget backbone-specific primers flanking a 45-bp region encompassing the degenerate sequence. Sequencing was performed with a paired-end run using a NextSeq High Output Kit with 150 cycles (Illumina).
NGS data analysis was performed using a custom Python script. Demultiplexed reads were filtered to remove reads that did not contain a perfect match to the 34- to 35-bp sequence upstream of the degenerate sequence for any i5 reads, or to the 45- to 46-bp sequence for any i7 reads (35 bp and 46 bp were used for reads that were amplified from primers containing an additional nucleotide, which were used in PCR1 to generate cluster diversity during sequencing). For all reads that passed filtering, the 8-bp degenerate sequence was extracted and counted. The integration distance was determined in the output libraries by examining the i5 read sequence at an integration distance of 43 to 56 bp downstream of each target for the presence of the transposon right or left end sequence (20 nt of each end). The degenerate sequence was then extracted from either or both of the i5 and i7 reads, depending on the integration position. The degenerate sequence counts were summed across the two primer pairs. The relative abundance was determined by dividing the degenerate sequence count by the total number of degenerate sequence counts. Finally, the fold-change between the output and input libraries was calculated by dividing the relative abundance of each degenerate sequence at each integration position in the output library by its relative abundance in the input library, and then log2-transformed.
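The final enrichment calculation can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' script: the function name and data layout are hypothetical, the output relative abundance is computed against the total count across all integration positions (the text does not specify the denominator), and sequences absent from the input are skipped to avoid division by zero.

```python
import math


def log2_enrichment(input_counts, output_counts):
    """log2(output relative abundance / input relative abundance).

    input_counts:  dict mapping degenerate sequence -> read count.
    output_counts: dict mapping (integration_position, sequence) -> read count.
    Returns a dict keyed by (integration_position, sequence).
    """
    in_total = sum(input_counts.values())
    out_total = sum(output_counts.values())  # assumption: pooled over positions
    result = {}
    for (position, seq), n in output_counts.items():
        if input_counts.get(seq, 0) == 0:
            continue  # sequence not observed in the input library
        out_ra = n / out_total
        in_ra = input_counts[seq] / in_total
        result[(position, seq)] = math.log2(out_ra / in_ra)
    return result
```

A value of 2 in this scheme corresponds to the four-fold enrichment threshold used when building the sequence logos.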
Sequence logos were generated with WebLogo 3.7.4. The preferred integration site logos in Supplementary Figure S3A were generated from all degenerate sequences that were enriched four-fold in the integrated products compared to the input. The overall preferred integration site logos in Figure 3C and Supplementary Figure S3E were generated by first applying the minimum threshold of four-fold enrichment in the integrated products compared to the input, and then selecting nucleotides from the top 5000 enriched sequences across all integration positions. We selected nucleotides from the top 5000 sequences from each library, yielding a total of 10 000 nucleotides at each position.

Endogenous protein tagging experiments
All VchCAST constructs were subcloned from pEffector and pDonor as described previously, using a combination of inverse (around-the-horn) PCR, Gibson assembly, restriction digestion-ligation, and ligation of hybridized oligonucleotides (23,30). pEffector encodes a CRISPR array (repeat-spacer-repeat), a native tniQ-cas8-cas7-cas6 operon, and a native tnsA-tnsB-tnsC operon, all under the control of a single T7 promoter on a pCDFDuet-1 backbone (30). Donor plasmids (pDonor) were designed to encode a mini-transposon (mini-Tn) with a wild-type 147-bp transposon left end and 57-bp linker-coding right end variant, on a pUC19 backbone. For endogenous protein tagging experiments, superfolder GFP (sfGFP) lacking a ribosome binding site (RBS) and start codon was cloned into the mini-Tn cargo region, and the mini-Tn was further cloned into a temperature-sensitive pSIM6 backbone.
Linker functionality constructs were designed to encode sfGFP with an extended 32-amino acid (aa) loop region between the 10th and 11th β-strands, under the control of a single T7 promoter, as described by Feng and colleagues (34). Linker variants encoding 18-19 aa were subcloned into the 32-aa loop region as follows. An entry vector was generated on a pCOLADuet-1 (pCOLA) vector harboring sfGFP, such that the 11th β-strand (GFP11) was replaced by the aforementioned extended 32-aa loop (34). Fragments encoding transposon right end linker variants and GFP11 were then amplified by conventional PCR and inserted into the extended loop region of the entry vector downstream of β-strands 1-10 (GFP1-10), such that total length of the loop remained constant at 32 aa.
To perform linker functionality assays, chemically competent E. coli BL21(DE3) cells were co-transformed with T7-controlled sfGFP linker functionality constructs (pCOLA) and an equimolar amount of empty pUC19 vector. Negative control transformants harbored both unfused sfGFP1-10 and sfGFP11 fragments on separate pCOLA and pUC19 backbones, respectively, or harbored isolated sfGFP fragments. Transformed cells were plated on LB-agar plates with antibiotic selection (100 μg/ml carbenicillin, 50 μg/ml kanamycin), and single colonies were used to inoculate 200 μl of LB medium (100 μg/ml carbenicillin, 50 μg/ml kanamycin, 0.1 mM IPTG) in a 96-well optical-bottom plate. The optical density at 600 nm (OD600) was measured every 10 min, in parallel with the fluorescence signal for sfGFP, using a Synergy Neo2 microplate reader (Biotek) while shaking at 37 °C for 15 h. To derive normalized fluorescence intensities (NFI), all measured fluorescence intensities were divided by their corresponding OD600 values across all time points.
Transposition experiments were performed by transforming chemically competent E. coli BL21(DE3) cells harboring pEffector plasmids with pDonor plasmids by heat shock at 42 °C for 30 s, followed by recovery in fresh LB medium. Recovery was performed at 30 °C for 1.5 h for temperature-sensitive pDonor plasmids, and 37 °C for 1 h for all other pDonor plasmids. Transformants were isolated on LB-agar plates containing the proper antibiotics and inducer (100 μg/ml carbenicillin, 100 μg/ml spectinomycin, 0.1 mM IPTG). After 43 h growth at 30 °C for temperature-sensitive pDonor plasmids, and 18 h growth at 37 °C for all other pDonor plasmids, samples were prepared for downstream qPCR analysis of integration efficiency or colony PCR identification of integration events.
For qPCR quantification, colonies were scraped from plates and resuspended in LB medium, and cell lysates were prepared for qPCR as described by Klompe and colleagues (23). Pairs of transposon- and target DNA-specific primers were designed to amplify fragments from integrated transposition products at the expected loci in either of two possible orientations. In parallel, a separate pair of genome-specific primers was designed to amplify an E. coli reference gene (rssA) for normalization purposes. qPCR reactions (10 μl) contained 5 μl of SsoAdvanced Universal SYBR Green Supermix (BioRad), 1 μl Milli-Q H2O, 2 μl of 2.5 μM primers, and 2 μl of hundredfold-diluted cell lysate. Reactions were prepared in 384-well clear/white PCR plates (BioRad), and measurements were obtained in a CFX384 Real-Time PCR Detection System (BioRad). Transposition efficiency was calculated for each orientation as $2^{\Delta Cq}$, in which $\Delta Cq$ is the Cq difference between the experimental and control reactions. All measurements presented were determined from three independent biological replicates.
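The efficiency calculation is a single exponentiation, sketched below. The sign convention is an assumption: the difference is taken as the reference-gene Cq minus the integration-product Cq, so that an integration product detected later than the reference yields an efficiency below 1. The function name is hypothetical, and the formula implicitly assumes perfect doubling per cycle.

```python
def transposition_efficiency(cq_integration, cq_reference):
    """Return 2**(delta Cq) for one integration orientation.

    cq_integration: Cq of the transposon/target junction amplicon.
    cq_reference:   Cq of the reference gene amplicon (rssA in the text).
    Assumed convention: delta Cq = cq_reference - cq_integration.
    """
    delta_cq = cq_reference - cq_integration
    return 2.0 ** delta_cq


# Example: a junction amplicon detected one cycle after the reference
# corresponds to an efficiency of 0.5 (50%) under this convention.
efficiency = transposition_efficiency(cq_integration=21.0, cq_reference=20.0)
```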
For colony PCR identification of integration events, colonies were scraped from plates after transposition assays, resuspended in fresh LB medium, and re-streaked on LB-agar plates with the appropriate antibiotics and without IPTG inducer. To generate lysates, individual colonies were each transferred to 10 μl of Milli-Q H2O, followed by incubation at 95 °C for 2 min and centrifugation at 4000 × g for 5 min to pellet cell debris. Pairs of transposon- and target DNA-specific primers were designed to amplify fragments from integrated transposition products in the expected locus and orientation. In parallel, a separate pair of genome-specific primers was designed to amplify an E. coli reference gene (rssA) and determine whether the crude lysates were sufficiently dilute to allow successful amplification of the integrated transposition product. To verify in-frame integration events, amplicons of the expected length were excised after gel electrophoresis, purified by the Gel Extraction Kit (QIAGEN), and sent for Sanger sequencing (GENEWIZ).
Fluorescence microscopy experiments were performed as follows. A pEffector plasmid was designed to C-terminally tag the native E. coli msrB gene by integrating a mini-Tn encoding a linker variant (ORF2a) and sfGFP cargo in-frame with the coding sequence, thereby interrupting the endogenous stop codon. Transposition experiments were performed as described above by transforming chemically competent E. coli BL21(DE3) cells harboring pEffector plasmids with temperature-sensitive pDonor plasmids. Colonies were then scraped and resuspended in fresh LB medium. Resuspensions were diluted and re-streaked on double antibiotic LB-agar plates lacking IPTG (100 μg/ml carbenicillin, 100 μg/ml spectinomycin). After overnight growth on solid medium at 30 °C, individual colonies were used to inoculate liquid cultures (100 μg/ml spectinomycin) for overnight heat-curing at 37 °C, followed by replica plating on single and double antibiotic plates to isolate heat-cured samples. In tandem, colony PCR and Sanger sequencing (GENEWIZ) were performed to identify colonies with in-frame transposition products as described above. On the day of imaging, 500 μl of saturated overnight cultures was transferred to 5 ml of fresh LB medium with the appropriate antibiotics. Aliquots of the newly inoculated cultures were removed around the stationary or mid-log phases and immobilized on glass slides coated with partially dehydrated aqueous 1% agarose-TAE pads. Immediately after immobilization, fluorescence microscopy was performed with a Nikon ECLIPSE 80i microscope using a 100× oil immersion objective, which was equipped with a Spot CCD camera and SpotAdvance software. All images were processed in ImageJ by normalizing background fluorescence.

Generating and testing E. coli knockout mutants
E. coli genomic knockouts of ihfA, ihfB, ycbG, hupA, hupB, hns and fis were generated using Lambda Red recombineering, as previously described (35). Knockouts were designed to replace each gene with a kanamycin resistance cassette, which was amplified by PCR with Q5 High-Fidelity DNA Polymerase (NEB) using primers that contained 50-nt homology arms to the knockout gene locus. PCR amplicons were resolved on a 1% agarose gel and gel-purified, eluting with 40 μl Milli-Q H2O (QIAGEN Gel Extraction Kit). Electrocompetent E. coli BL21(DE3) cells were prepared containing a temperature-sensitive plasmid that encodes the Lambda Red machinery under the control of a temperature-sensitive promoter (pSIM6). Protein expression from the temperature-sensitive promoter was induced by incubating cells at 42 °C for 25 min immediately prior to electrocompetent cell preparation. 300-600 ng of each insert was used to transform cells via electroporation (2 kV, 200 Ω, 25 μF), and cells were recovered overnight at 30 °C by shaking in 3 ml of SOC media. After recovery, 250 μl of culture was spread on 100 mm standard plates (LB-agar with 50 μg/ml kanamycin) and grown overnight at 30 °C. Kanamycin-resistant colonies were isolated, and the genomic knock-in was confirmed by PCR amplification and Sanger sequencing (GENEWIZ) using primer pairs flanking the knock-in locus.
VchCAST transposition experiments in E. coli knockout strains were performed by first preparing chemically competent WT and mutant cells and then transforming these strains with a single plasmid (pSPIN), which encodes the donor molecule and the native transposition machinery under the control of a T7 promoter and a crRNA targeting the lacZ genomic locus, on a pCDF backbone. After transformation by heat shock, cells were plated onto LB-agar with 100 μg/ml spectinomycin and 0.1 mM IPTG to induce protein expression, and incubated at 37 °C for 18 h. Hundreds of colonies were scraped from each plate, and integration efficiencies were quantified using the same qPCR assay described for the endogenous protein tagging experiments. Transposition experiments for other Type I-F homologs were performed as in the VchCAST experiments, except that the concentration of IPTG was reduced to 0.01 mM to mitigate toxicity.
Experiments that tested protein expression conditions in WT and ΔIHF cells were performed as described in the VchCAST transposition experiments. Promoters were varied from constitutive promoters (J23119, J23101) to inducible promoters (T7), for which different concentrations of IPTG were also tested.
For the complementation experiments, cells were co-transformed with pSPIN and a rescue plasmid (pRescue) that encoded both E. coli ihfA and ihfB under the control of separate T7 promoters on a pACYC backbone, and plated onto LB-agar with 100 μg/ml spectinomycin, 25 μg/ml chloramphenicol, and 0.1 mM IPTG to induce protein expression. Cells were incubated at 37 °C for 18 h, before colonies were scraped from each plate and integration efficiencies in both orientations were measured by qPCR.
To test DNA donor molecules with symmetric transposon ends, we cloned mutant pDonor encoding two right or two left transposon ends, and measured integration efficiency by co-transforming pDonor with pEffector under the control of a T7 promoter on a pCDF backbone. Cells were plated onto LB-agar with 100 μg/ml spectinomycin, 100 μg/ml carbenicillin, and 0.1 mM IPTG and incubated at 37 °C for 18 h, before colonies were scraped from each plate and integration efficiencies in both orientations were measured by qPCR.

E. coli Tn7 transposition experiments and NGS analysis
To measure the integration efficiencies and distance distributions of E. coli Tn7 in WT and E. coli mutant cells, we cloned genomic primer binding sites into the mini-Tn cargo of a single plasmid for Tn7 transposition, which encoded a native tnsA-tnsB-tnsC-tnsD operon under the control of a constitutive pJ23119 promoter, on a pCDF backbone. The genomic primer binding sites were cloned adjacent to the transposon left and right ends such that the NGS amplicon length would be the same for unintegrated products and integrated products in either orientation (schematized in Supplementary Figure S7A). To quantify integration efficiencies using qPCR, we used primer pairs designed to amplify integrated products in both orientations, with one primer adjacent to the right transposon end and a second primer either upstream or downstream of the integration site.
To quantify integration efficiencies by NGS, we amplified genomic DNA using a single primer pair with one primer complementary to the genomic primer binding site and the second primer complementary to the 3'-end of the glmS locus. Genomic DNA was extracted using the Wizard Genomic DNA Purification Kit (Promega). 250 ng of genomic DNA was used in each PCR1 amplification with Q5 High-Fidelity DNA Polymerase (NEB) for 15 cycles. PCR1 samples were diluted 20-fold and amplified in 10 cycles for PCR2. Sequencing was performed with a paired-end run using a NextSeq High Output Kit with 150-cycles (Illumina).
NGS data analysis was performed using a custom Python script. Demultiplexed reads were filtered to remove reads that did not contain a perfect match to the first 65 bp of the expected sequence resulting either from non-integrated genomic products or from integration events spanning 0 to 30 bp downstream of the glmS locus, and we then counted the number of reads matching each of these possible products.
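As a rough illustration of the read filtering and counting described above, the following Python sketch assigns each read to an expected product by exact match over its first 65 bp. It is not the custom script used in this study; the function name `count_products`, the product labels, and the dictionary-based input format are assumptions made for illustration only.

```python
from collections import Counter

PREFIX_LEN = 65  # reads must match the first 65 bp of an expected product exactly


def count_products(reads, expected, prefix_len=PREFIX_LEN):
    """Count reads whose first `prefix_len` bases match an expected product.

    reads:    iterable of read sequences (strings).
    expected: dict mapping a hypothetical product label (e.g. "unintegrated",
              "dist_5") to its expected sequence of at least `prefix_len` bp.
    Returns (Counter of reads per label, number of discarded reads).
    """
    prefix_to_label = {}
    for label, seq in expected.items():
        prefix = seq[:prefix_len].upper()
        if len(prefix) < prefix_len:
            raise ValueError(f"expected sequence for {label!r} is too short")
        if prefix in prefix_to_label:
            # Assumption: expected products are distinguishable within the prefix.
            raise ValueError(f"{label!r} shares its prefix with another product")
        prefix_to_label[prefix] = label

    counts = Counter({label: 0 for label in expected})
    discarded = 0
    for read in reads:
        label = prefix_to_label.get(read[:prefix_len].upper())
        if label is None:
            discarded += 1  # no perfect match: read is filtered out
        else:
            counts[label] += 1
    return counts, discarded
```

Dividing the count for each integration product by the total number of retained reads would then give an approximate efficiency and distance distribution, under the assumption that amplicons of equal length amplify comparably.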

A pooled library approach to investigate transposon end sequence requirements
We set out to mutagenize the transposon left and right end sequences of V. cholerae Tn6677 (VchCAST) using large pooled oligoarray libraries, building off our previous study of the VchCAST system (23). Starting with a minimal pDonor design that directed efficient genomic integration in  We assigned each variant a unique 10-bp barcode located between the transposon end variant and the cargo, obviating the requirement to sequence across the entire transposon end to identify each variant. Each library also included four wildtype (WT) variants associated with unique barcodes, which we used for downstream validation of our experimental setup and to approximate the relative integration efficiency of other library members. Libraries were then synthesized as single-stranded oligos, cloned into a minitransposon donor (pDonor), and carefully characterized using next-generation sequencing (NGS), which demonstrated that all members were represented in the input sample for both transposon left and right end libraries (Supplementary Figure S1B-E).
We performed transposition experiments by transforming E. coli BL21(DE3) cells expressing the transposition machinery with pDonor encoding either the left end or right end library, amplifying successful genomic integration products in both orientations via junction PCR (Figure 1D), and subjecting PCR products to NGS analysis. An enrichment score was then calculated for each variant, revealing a wide range of integration efficiencies, with most library members exhibiting diminished integration relative to the four WT samples (Supplementary Figure S1E). Finally, we used enrichment scores of the WT library members for normalization, yielding a score for each variant that represented its relative activity. To validate our approach, we performed two biological replicates for each library transposition experiment and found strong concordance between both datasets, especially in the dominant orientation, 'T-RL,' in which the transposon was integrated in the right-left ('RL') orientation relative to the target site (Supplementary Figure S1F). Importantly, we also determined the
In previous work, we tested truncations of the transposon ends by cloning and testing transposon end variants individually (23). To explore the strength and verify the robustness of our pooled library approach, we generated a similar but vastly expanded panel of truncations by sequentially mutating the transposon end sequences, effectively creating end truncations, albeit without a change in overall mini-transposon size (Figure 1E and Supplementary Figure S1A). A single pooled transposition experiment then confirmed that, to facilitate efficient integration, a minimum of three transposase (TnsB) binding sites are required within the left end but only two are required within the right end. These findings are consistent with previous literature and add information at single-bp resolution to the minimal transposon end sequences for efficient integration (23).
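The WT-normalized enrichment scoring described earlier in this section can be illustrated with a short Python sketch. The exact formula is not restated here, so the definition below (output read fraction divided by input read fraction, then divided by the mean of the WT barcodes) is an assumption, as are the function name `relative_activity` and the optional `pseudocount` parameter.

```python
def relative_activity(input_counts, output_counts, wt_barcodes, pseudocount=0.0):
    """Return a WT-normalized enrichment score for every barcode.

    input_counts:  dict barcode -> read count in the input (pDonor) library.
    output_counts: dict barcode -> read count among integration products.
    wt_barcodes:   barcodes assigned to the wild-type library members.
    pseudocount:   optional value added to every count (assumption; guards
                   against variants with zero reads).
    """
    n_in = sum(input_counts.values())
    n_out = sum(output_counts.values())

    enrichment = {}
    for barcode, c_in in input_counts.items():
        f_in = (c_in + pseudocount) / n_in
        f_out = (output_counts.get(barcode, 0) + pseudocount) / n_out
        # Raises ZeroDivisionError if a variant is absent from the input and
        # no pseudocount is used; all members are assumed to be represented.
        enrichment[barcode] = f_out / f_in

    # Normalize to the mean enrichment of the WT members.
    wt_mean = sum(enrichment[b] for b in wt_barcodes) / len(wt_barcodes)
    return {barcode: e / wt_mean for barcode, e in enrichment.items()}
```

Under this definition a score of 1 corresponds to WT-like activity, and a score of 0.2 corresponds to a five-fold reduction relative to the WT members of the same library.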

Transposase activity depends on specific sequence requirements
TnsB is integral to the mobilization of Tn7-like transposons, in that it catalyzes the excision and integration chemistry while also conferring sequence specificity for the transposon ends through recognition of repetitive sequence elements known as TnsB binding sites (TBSs) (7,14,36). Sequence analysis of the native VchCAST ends revealed three conserved TBSs in both the left and right ends (Figure 2A, B and Supplementary Figure S2A) (23), and we verified these sequence requirements by examining a mutational panel at single-bp resolution (Figure 2C and Supplementary Figure S2B). This dataset revealed that individual TBS point mutations can affect efficiency, particularly for positions 1, 6-9 and 12-14 (of which all but position 9 are completely conserved across TnsB binding sites), but are not critical for integration. This lenient sequence requirement is in line with recently published cryo-EM structures of DNA-bound TnsB from Tn7 and Type V-K CAST systems, which revealed that many protein-DNA interactions occur with the phosphodiester backbone rather than specific nucleobases (37-39).
Nucleic Acids Research, 2023, Vol. 51, No. 9 4527
Experiments with E. coli Tn7 showed that the internal TBSs are occupied before the more terminal sites (7). Even though the majority of bases within the six TBSs of VchCAST are conserved (Figure 2B and Supplementary Figure S2A), we wondered if the existing differences might be biologically important, perhaps by enforcing a specific assembly pathway. To test this hypothesis, we examined all possible combinations of TBSs for the left and right ends, which we defined as L1-L3 and R1-R3 (Supplementary Figure S2C). For both VchCAST ends, site 1 displayed the strongest TBS preference and preferred the L1, L3 or R1 sequence, whereas site 2 preferred L1, R1 or R2; site 3 exhibited the weakest TBS preference but favored L3. We observed a preference for R1 in the first position on the left end, and a preference for L1 in the first position on the right end, suggesting that transposition might be favored when the terminal end sequences are identical (whether based on equal affinity or otherwise).
Apart from regulating transposition frequency, TBS sequence identity could also explain the propensity of a given CAST system to cross-react with related transposon substrates (17). We previously showed that VchCAST could efficiently mobilize mini-transposon substrates from three homologous CAST systems, but not from Tn7002. To determine which Tn7002 sequences were incompatible with mobilization by VchCAST machinery, we designed chimeric transposon ends that contain parts of both the VchCAST and Tn7002 transposon ends (Figure 2D). The data revealed that chimeric left ends allowed for near-WT integration efficiencies whereas chimeric right ends drastically decreased integration efficiency, likely due to the deleterious presence of a cytidine at position 9 of R1-R3 (Figure 2D). These data thus demonstrate that TBS sequence identity imparts specific constraints on the substrate recognition of a transposase for its cognate transposon DNA.
Finally, we sought to investigate the conserved positioning of TBSs within the transposon ends, after hypothesizing that the specific distance between TBSs might facilitate proper assembly of transposase subunits within a paired-end complex (PEC) (17). After testing a panel of variants in which the length between TBSs was systematically varied (Figure 2E and Supplementary Figure S2D), we found that even single-bp perturbations caused drastic changes in integration efficiency. Additionally, we detected an intriguing pattern of increasing and decreasing integration efficiencies at roughly 10-bp intervals, suggesting that the three-dimensional positioning of transposase proteins on helical DNA is important for transposition.
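The link between a roughly 10-bp periodicity and helical geometry can be made concrete with a small calculation. The Python sketch below assumes a helical repeat of about 10.5 bp per turn for B-form DNA and computes how far a binding site rotates around the helix axis when the spacing is changed; the function names are illustrative, and the calculation makes no claim about the actual geometry of the paired-end complex.

```python
BP_PER_TURN = 10.5  # assumed helical repeat of B-form DNA in solution


def helical_phase_shift(delta_bp, bp_per_turn=BP_PER_TURN):
    """Rotation (degrees, in [0, 360)) caused by changing a spacer by delta_bp."""
    return (delta_bp * 360.0 / bp_per_turn) % 360.0


def angular_distance_from_wt(delta_bp, bp_per_turn=BP_PER_TURN):
    """Smallest rotation (0 to 180 degrees) between the shifted site and its WT face."""
    phase = helical_phase_shift(delta_bp, bp_per_turn)
    return min(phase, 360.0 - phase)


if __name__ == "__main__":
    # Print the rotational offset for insertions and deletions of up to 12 bp.
    for delta in range(-12, 13):
        print(f"{delta:+3d} bp -> {angular_distance_from_wt(delta):6.1f} degrees")
```

Under this assumption, a 1-bp change already rotates the downstream site by about 34 degrees, a 5-bp change places it on nearly the opposite face of the helix, and changes of 10 or 11 bp return it close to the original face, which would be consistent with efficiencies that rise and fall at roughly 10-bp intervals.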
Together, these data highlight the impact of TBS mutations and TBS sequence positioning on transposition, and provide clues about how TBSs may have evolved to direct efficient assembly of synaptic paired-end complexes.

Transposase sequence preferences influence integration site patterns
In our previous work, we showed that VchCAST integration patterns differed in subtle but reproducible ways between distinct genomic target sites (23,30). Since integration is the result of both RNA-guided DNA targeting and transposase-mediated DNA integration, we wondered which DNA sequences and protein machineries were responsible for the heterogeneity in integration products. First, we used deep sequencing to compare integration site patterns for four endogenous E. coli target sequences, designated 4-7, either at their native genomic location or on an ectopic target plasmid (Figure 3A). Integration site patterns were notably distinct between the four targets but were highly consistent between genomic and plasmid contexts, suggesting that these patterns are dependent on local sequence alone and independent of other factors such as DNA replication or local transcription. Next, to disentangle contributions of the 32-bp target sequence (complementary to the crRNA guide) from the downstream region including the integration site, we tested target plasmids that contained chimeras of the four target regions (Figure 3A). Remarkably, integration patterns for these chimeric substrates closely mirrored the patterns observed for the non-chimeric substrates when the 'downstream region' was kept constant, clearly indicating that the sequence identity of the 32-bp target region alone does not modulate selection of the integration site.
We hypothesized that, like other transposases, TnsB might exhibit local sequence preferences immediately at the site of DNA insertion, and that these preferences could explain the observed heterogeneity in integration site patterns (40). To test this possibility, we generated a target plasmid (pTarget) library encoding two target sequences, designated 'Target A' and 'Target B,' flanking an 8-bp degenerate sequence, such that integration events directed by a crRNA matching either target would lead to insertion into the degenerate 8-mer sequence (Figure 3B). We sequenced the target plasmids before and after transposition and compared the representation of integration site sequences to determine which sequences were enriched after transposition. These analyses revealed striking nucleotide preferences at conserved positions relative to the integration site (Figure 3C and Supplementary Figure S3A). Specifically, there were clear biases for a YWR motif within the central three nucleotides of the target-site duplication (TSD), as well as a preference for D (A, T or G) and H (A, T or C) at the -3 and +3 positions relative to the TSD, respectively. Similar TSD preferences were previously observed for the Type V-K ShCAST system (24), suggesting that they may be broadly applicable to TnsB-family transposases.
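To make the IUPAC notation above concrete, the following Python sketch scores a candidate insertion site against the preferred pattern. It assumes a 5-bp TSD and interprets the -3 and +3 positions as the third base upstream and downstream of the TSD, respectively, giving an 11-bp window; this coordinate convention and the names `SITE_PATTERN` and `preference_score` are illustrative assumptions rather than definitions taken from the analysis itself.

```python
# IUPAC nucleotide codes used in the text.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "Y": "CT",    # pyrimidine
    "W": "AT",    # weak
    "R": "AG",    # purine
    "D": "AGT",   # not C
    "H": "ACT",   # not G
    "N": "ACGT",  # any base
}

# Assumed 11-bp window: 3 bp upstream, 5-bp TSD, 3 bp downstream.
#   index 0     = position -3 (D)
#   index 3..7  = TSD, with YWR at its central three nucleotides
#   index 10    = position +3 (H)
SITE_PATTERN = "DNN" + "NYWRN" + "NNH"
CONSTRAINED = [i for i, code in enumerate(SITE_PATTERN) if code != "N"]


def preference_score(window):
    """Number of the five constrained positions that match the preference."""
    window = window.upper()
    if len(window) != len(SITE_PATTERN):
        raise ValueError(f"window must be {len(SITE_PATTERN)} bp long")
    return sum(window[i] in IUPAC[SITE_PATTERN[i]] for i in CONSTRAINED)


def is_preferred_site(window):
    """True if every constrained position matches the preferred pattern."""
    return preference_score(window) == len(CONSTRAINED)
```

Sliding such a window across the region downstream of a target would rank candidate insertion positions by how many preferred nucleotides they carry. This is a simple count and not the enrichment-based weighting derived from the library data.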
To further explore the deterministic role of the preferred motif within the TSD, we plotted the distribution of reads containing a central 5'-CWG-3' motif at different positions within the degenerate sequence. We focused on this motif because it favored a more unimodal distribution for the integration site by avoiding a centrally-preferred A or T nucleotide flanking the W. We found that this motif was indeed predictive of the preferred integration site distance that was sampled by VchCAST ( Figure 3D). We extended this observation by plotting the distribution of reads containing multiple 5'-CWG-3' motifs within the integration site and found that two copies of this preferred motif within the integration site conferred a bimodal distribution, wherein there were not one but two preferred integration sites within the degenerate sequence (Supplementary Figure S3B). Finally, we examined the integration site distribution of previously targeted locations (23) and found that they corresponded to the preferred sequence motifs determined in our library experiment (Supplementary Figure S3C). Indeed, the dominant integration distance(s) always encoded more preferred motifs within the TSD relative to the motifs found at neighboring positions.
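The grouping of reads by the position of the 5'-CWG-3' motif can be sketched as follows. This Python example locates every CWG occurrence in a degenerate sequence and tabulates integration distances for reads that carry exactly one motif; the `(degenerate_seq, integration_distance)` record format and the function names are assumptions for illustration.

```python
import re
from collections import Counter, defaultdict

CWG = re.compile(r"C[AT]G")  # W = A or T


def cwg_positions(seq):
    """0-based start positions of every 5'-CWG-3' motif in seq."""
    return [m.start() for m in CWG.finditer(seq.upper())]


def distance_distribution(records):
    """Tabulate integration distances by motif position.

    records: iterable of (degenerate_seq, integration_distance) pairs
             (hypothetical format). Only sequences with exactly one CWG
             motif are used, so each read maps to a single motif position.
    Returns dict: motif position -> Counter of integration distances.
    """
    by_position = defaultdict(Counter)
    for seq, distance in records:
        positions = cwg_positions(seq)
        if len(positions) == 1:
            by_position[positions[0]][distance] += 1
    return by_position
```

Sequences for which `cwg_positions` returns two or more positions could be analyzed separately to look for the bimodal distributions described above.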
The two distinct crRNAs and corresponding target sites on pTarget yielded consistent sequence preferences for both the TSD and ±3-bp positions (Supplementary Figure S3A), but we were surprised to find that the preferred integration distance was shifted by 1 bp when comparing the two (Supplementary Figure S3D). We suspected that this difference could be due to sequence preferences at the ±3-bp position that fell outside the degenerate sequence, and indeed, when we examined the sequences flanking the 8-mer library, we found that the downstream target (Target B) contained a disfavored nucleotide in the -3-bp position for insertions that would occur with the 49-bp distance (Supplementary Figure S3E). Interestingly, the role for these positions in modulating transposition behavior is supported by two recent structures of TnsB from Type V-K ShCAST bound to strand-transfer DNA (38,39), which revealed residue K290 of both terminal TnsB protomers in close proximity to the ±3-bp position outside of the target site duplication.

Role of boundary sequences and right end internal features on DNA integration
We next focused our attention on additional sequence features at the outermost edges of mini-transposon substrates. VchCAST and many other Tn7-like transposons encode an 8-bp terminal end immediately adjacent to the first transposase binding site, with the terminal TG dinucleotide highly conserved among a broad spectrum of transposons including IS3, Tn7, Mu and even retrotransposons (41-44). Integration data with library variants that featured mutations within these terminal residues revealed that positions 1-3, but not 4-8, were critical for efficient transposition (Supplementary Figure S4B). This result is consistent with the DNA-bound cryo-EM structure of TnsB from a Type V-K CAST system, in which base-specific interactions were observed for the terminal TG dinucleotide (38), and with experiments indicating that these terminal dinucleotides are important for the formation of a stable Mu transpososome complex (42,45). Sequences beyond the terminal TG are also acted upon during excision of Tn7-like transposons, since the endonuclease TnsA cleaves the 5' ends of the donor DNA 3-bp outside the transposon end boundaries (46). Also, the sequences flanking the donor impact transposition efficiency of TnsB from a Type V-K CAST (47). These observations suggested the possibility that the sequence context of the transposon donor itself might play a role in efficient transposition. However, library variants with mutations in the 5-bp sequence flanking the mini-transposon were integrated with equivalent efficiencies (Supplementary Figure S4A), indicating that transposition machinery does not exhibit sequence specificity within this region.
To investigate whether the spacing between the terminal TG dinucleotide and the first TBS mattered, we tested variants that modulated the distance between the 8-bp terminal end and TBS1 (Supplementary Figure S4C). Adding a single base pair in either the left or right end still allowed for efficient transposition, whereas transposition was completely ablated with the removal of 1 bp or addition of 2 bp, indicating tight control over this spacing. Interestingly, larger bp additions or deletions between the TG dinucleotide and first TBS were in some cases also permitted, but always with a concomitant shift in the transposon boundary that was actually mobilized and integrated at the target site (Supplementary Figure S4C); in all cases, transposition still required a terminal TG. These data therefore suggest that the essential feature within the terminal end sequence is the TG dinucleotide, and that the ∼8-bp spacing between this dinucleotide and the first TBS is critical for efficient transposition.
We also further investigated the importance of a palindromic sequence found 97-107 bp from the transposon right end boundary. Previous work suggested that this sequence might affect integration orientation, possibly by promoting transcription of the tnsABC operon, which would be consistent with empirical expression data and the AT-richness of the transposon end (48). To test this possibility, we mutated the sequence and found that it was not the palindromic nature but the sequence of only one arm of the palindrome (PB) that was sufficient to shift the orientation bias away from T-RL (Supplementary Figure S4D, E). We also included bona fide constitutive promoters in place of the palindromic sequence and found that promoters directing transcription inwards (towards the cargo) did not impact integration orientation, whereas promoters directed outwards (across the right end) shifted the orientation preference towards T-LR, perhaps by antagonizing stable assembly of TnsB selectively at the right end (Supplementary Figure S4F). These data highlight the role of this right end sequence region on integration orientation, which should be considered when designing custom cargo sequences.

Endogenous protein tagging with rationally engineered right ends
The left and right end sequences are critical for transposon DNA recognition and excision/integration, and transposition products therefore necessarily include these sequences as 'scars' at the site of insertion. We sought to exploit this feature and use our new knowledge of the mutability of the transposon ends to convert these scars into functional sequences that encode amino acid linkers for downstream protein tagging applications. We focused on the shorter right end, starting with a minimal 57-bp sequence, and observed that stop codons were present in all three possible open reading frames (ORFs) for the WT sequence (Figure 4A) (23). When we tested a library of rationally designed right end variants that replaced stop codons and codons encoding bulky and/or charged amino acids (Supplementary Figure S5A), we identified numerous candidates for each possible ORF that maintained near-wild-type integration efficiency (Supplementary Figure S5B). After validating library data by testing individual linker variants for genomic integration in E. coli (Figure 4B), we next set up a fluorescence-based assay to test for functionality of the encoded amino acid linkers.
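The check for stop codons in each reading frame, which motivated the linker design above, can be sketched in a few lines of Python. The example scans only the strand that is supplied, so the sequence is assumed to be given in the direction of translation across the transposon end; the function names are illustrative, and no actual transposon end sequence is reproduced here.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def stop_codons_by_frame(seq):
    """Map each reading frame (0, 1, 2) to the start positions of its stop codons.

    seq is read 5' to 3' in the direction of translation (assumption).
    """
    seq = seq.upper()
    hits = {}
    for frame in range(3):
        hits[frame] = [
            i for i in range(frame, len(seq) - 2, 3)
            if seq[i:i + 3] in STOP_CODONS
        ]
    return hits


def open_frames(seq):
    """Reading frames that contain no stop codon and could encode a linker."""
    return [frame for frame, hits in stop_codons_by_frame(seq).items() if not hits]
```

A candidate right end variant would be usable as an in-frame linker only in the frames returned by `open_frames`, and it would still need to be tested for integration efficiency, since removing a stop codon may disrupt sequence features required for transposition.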
GFP naturally consists of 11 β-strands that are connected by small loop regions, and a prior study demonstrated that the loop region between the 10th and 11th β-strand can be extended with novel linker sequences while still allowing for proper folding and fluorescence of the variant GFP protein (34). We cloned selected transposon right end variants into the loop region between β-strands 10 and 11 and measured GFP fluorescence intensity after expression of each construct, which revealed a subset of variants that were fully functional (Figure 4C and Supplementary Figure S5C). Next, we selected the endogenous E. coli gene msrB for C-terminal tagging in a proof-of-concept experiment (Figure 4D). msrB encodes the enzyme MsrB, a methionine sulfoxide reductase (49,50), which has been fluorescently tagged and imaged in an endogenous context by others (51) and contains a PAM sequence that allows for DNA insertion within the msrB stop codon, providing an ideal target for this initial trial. After generating a pDonor construct that encodes a right end linker variant with an adjacent, in-frame GFP gene lacking a promoter or start codon, we performed transposition experiments and used Sanger sequencing to verify that integration interrupted the endogenous stop codon while placing the linker and GFP sequence directly in-frame. Proper expression of MsrB-GFP fusion proteins was analyzed by fluorescence microscopy of cells that received either the WT transposon right end or the linker variant, demonstrating that only the modified right end variant elicited the expected cellular fluorescence (Figure 4E and Supplementary Figure S5D). Finally, to confirm that GFP was translationally fused to MsrB, we performed an anti-GFP western blot and found that GFP was not detected in the WT transposon end fusion but was detected at the expected size in the modified linker variant (Figure 4F).
Together, these data provide the basis for new genome engineering tools that allow for facile, endogenous protein tagging with single-bp control.

Integration host factor (IHF) binds the left transposon end to stimulate transposition
Closer inspection of the transposon left end mutational data revealed a sequence between the two terminal TnsB binding sites (TBSs) that, when mutated, led to reproducible transposition defects (Figure 5A). We noticed that the corresponding DNA sequence perfectly matched a consensus binding sequence for integration host factor (IHF) (52,53), a heterodimeric nucleoid-associated protein (NAP) that binds to the consensus sequence 5'-WATCARNNNNTTR-3' and induces a DNA bend of more than 160° (54). First identified as a host factor for bacteriophage integration, IHF is also involved in diverse cellular activities including chromosome replication initiation, transcriptional regulation, and various site-specific recombination pathways (55-57). This observation suggested the intriguing possibility that IHF might also play a role in RNA-guided transposition by CAST systems.
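A search for the IHF consensus quoted above can be sketched by translating the IUPAC string into a regular expression and scanning both strands. The Python example below is a minimal illustration, assuming that only perfect matches to 5'-WATCARNNNNTTR-3' are of interest; the function names are hypothetical, and genuine IHF sites often deviate from the consensus, so the absence of a match does not rule out binding.

```python
import re

IHF_CONSENSUS = "WATCARNNNNTTR"
IUPAC_REGEX = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "W": "[AT]", "R": "[AG]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def iupac_to_regex(pattern):
    """Translate an IUPAC pattern into a regular expression string."""
    return "".join(IUPAC_REGEX[code] for code in pattern)


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def find_ihf_sites(seq, consensus=IHF_CONSENSUS):
    """Return sorted (start, strand) tuples for perfect consensus matches.

    start is the 0-based coordinate of the leftmost base of the site on the
    supplied (top) strand; strand is "+" or "-". A lookahead is used so that
    overlapping matches are all reported.
    """
    seq = seq.upper()
    pattern = re.compile("(?=" + iupac_to_regex(consensus) + ")")
    hits = [(m.start(), "+") for m in pattern.finditer(seq)]
    rc = reverse_complement(seq)
    hits += [
        (len(seq) - m.start() - len(consensus), "-")
        for m in pattern.finditer(rc)
    ]
    return sorted(hits)
```

Applying `find_ihf_sites` to the region between the first two TBSs of each homologous left end would be one simple way to screen for conservation of the site, complementary to a multiple sequence alignment.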
To test whether the IHF binding site in the left transposon end functions to promote transposition, we first generated IHF knockout strains by mutating either ihfA or ihfB, and then measured integration efficiency with WT VchCAST. Deletion of either ihfA or ihfB decreased integration efficiency in the mutant strains by ∼20-fold (Figure 5B), and this effect was completely rescued when we introduced a plasmid encoding recombinant ihfA or ihfB, confirming the IHF knockouts as causative genetic perturbations (Figure 5B). Interestingly, the reduction in integration efficiency was sensitive to vector design and expression conditions, as integration was less dependent on IHF when the donor DNA was encoded on a separate plasmid from the transposition machinery compared to when the donor DNA was encoded on the same plasmid as the transposition machinery (Supplementary Figure S6A). This sensitivity to vector design may be due to differences in the expression of transposition proteins. Even though cells were always grown for 18 h after transformation, when a separate plasmid was used to express the transposition machinery ('pEffector + pDonor' conditions), cells already contained the effector plasmid before they were transformed with donor DNA. This longer time for effector proteins to be expressed may have increased transposition efficiency in ΔIHF cells for these conditions. When we selectively mutated the conserved IHF binding site residues of a transposon donor, we found that transposition efficiency decreased (Figure 5C). Moreover, sensitivity to IHF was dependent on the presence of an intact IHF binding site, since the loss of IHF in cells containing a mutant binding site did not cause an additional decrease in transposition efficiency. In other words, our results indicate that mutating the IHF binding site is epistatic to the loss of IHF. These experiments indicate that IHF binds the left transposon end to stimulate RNA-guided transposition.
We next wondered whether the IHF requirement was conserved across diverse I-F CAST systems, taking advantage of the twenty homologous systems that we recently described (17). Visual examination of the transposon left ends revealed a highly conserved IHF binding site across all homologs (Figure 5D, E), and aligning the sequence between the first two TBSs using Clustal Omega also revealed the binding site consensus as a conserved feature (Supplementary Figure S6B). To test whether IHF stimulated transposition for these systems, we performed experiments in WT and ΔIHF cells for five other systems and found that only two (Tn7000 and Tn7014) showed a strong IHF dependence (Figure 5F). These data suggest that the IHF dependence may not be conserved across all I-F CAST systems.
Given the involvement of IHF and, more generally, the importance of donor/target DNA supercoiling and topology for other mobile elements (58,59), we decided to test whether other E. coli NAPs might play a role in transposition. We generated knockout strains of 5 additional NAP genes (ycbG, hupA, hupB, hns, and fis), which play architectural roles in DNA compaction and organization and affect a variety of cellular processes such as transcription, replication, recombination, repair and SOS response (60-62). We measured integration efficiency within these mutant backgrounds and found that only the loss of fis affected transposition, decreasing integration efficiency by 2-fold (Supplementary Figure S6F). When we tested the same cohort of NAP knockouts for transposition with the prototypic Tn7 system, IHF had no effect whereas Fis again influenced integration efficiency, though with a ∼4-fold increase in the knockout strain (Supplementary Figure S7B). Fis (factor for inversion stimulation) plays diverse roles in altering DNA topology, mediating DNA inversions, and regulating gene expression (63-65); these varied roles, and the lack of a clearly defined consensus sequence, make it difficult to know how Fis impacts transposition in either system, or whether changes in integration efficiency might instead be indirect effects. Interestingly, our amplicon-sequencing detection approach for E. coli Tn7 transposition (Supplementary Figure S7A) also yielded new information about the nature of DNA integration products for the well-studied TnsABCD pathway. Whereas prior studies identified a single integration site downstream of the essential glmS gene (66-68), our higher-throughput analyses were able to uncover additional integration events that sampled a wider sequence space, including rare but reproducible transposition products in the less-common T-LR orientation (Supplementary Figure S7C). These findings highlight the value of deep sequencing to thoroughly and unbiasedly query the range of potential integration products for a given transposable element.
Finally, we decided to investigate whether IHF might also bias the orientation of transposon integration for CAST systems, since the IHF binding site is uniquely present within the transposon left end. After testing bidirectional transposition for two CAST systems in both a WT and a ΔIHF strain of E. coli, we found that although the loss of IHF did not affect orientation preference for VchCAST, its loss reversed the dominant orientation for Tn7000 from T-RL to T-LR (Supplementary Figure S6C). This result raises the intriguing possibility that IHF may be involved in establishing a transpososome architecture that controls the directionality of DNA insertions. Previous work with the prototypic Tn7 system found that transposon substrates with two right ends were competent for integration whereas two left ends were not (12), and we wondered whether a symmetric VchCAST donor with two right ends would similarly be competent for transposition while also eliminating IHF dependency. In agreement with this hypothesis, the loss of IHF had no impact on transposition with a substrate containing two transposon right ends, which was integrated without orientation bias, while a substrate containing two left ends exhibited severely reduced integration efficiency that retained a dependence on IHF (Supplementary Figure S6D, E). Overall, our data support a model (Figure 5G) in which IHF binds the region between TBSs L1 and L2 to bend the transposon left end and drive DNA integration, akin to the proposed role of HU in Mu transposition (12). This model is also similar to the role of IHF in CRISPR adaptation, during which IHF binds and bends the leader sequence of the CRISPR array to recruit the Cas1-Cas2 integrase and drive the specificity of leader-proximal integrations (69-71).

DISCUSSION
RNA-guided DNA integration by CRISPR-associated transposons depends on diverse, sequence-specific nucleic acid determinants. Focusing on VchCAST, a highly efficient and accurate CAST system derived from Vibrio cholerae (also known as Tn6677) (23,30), we employed high-throughput screening methods to systematically investigate and characterize these sequence requirements in this study. We first determined the minimal transposon sequences needed for robust activity and validated the importance of each transposase binding site (TBS) found within both left and right ends. Interestingly, our data revealed a broad degree of tolerance to mutagenesis of individual TBSs, a feature corroborated by recent TnsB transposase-DNA structures that show interactions mainly with the DNA backbone rather than specific nucleobases (37-39). The presence of multiple binding sites within each transposon end might allow for cumulative specificity and affinity, and likely plays a role in regulating transposition frequency. Our results furthermore suggest that the asymmetric nature of the two transposon ends controls the idiosyncratic preferences of a given element for integrating in one orientation over another.
One limitation of our experimental setup for the transposon end libraries is that we could not directly compare relative integration orientation within the same NGS libraries, since integration events were amplified independently in the T-RL and T-LR orientations. Instead, we inferred approximate integration efficiencies by comparing the enrichment scores of transposon end variants to those of wildtype variants within the same library. We also note that our strategy involved separate mutagenesis of either the left end or right end, but not both transposon ends simultaneously. Lastly, we stress that all transposition assays with pDonor libraries were performed heterologously in E. coli under overexpression conditions, and thus subtleties of transposon end recognition and binding that depend on regulated TnsB expression levels may be obscured.
We uncovered additional regions within the transposon ends that drastically affect integration efficiencies, including a sensitive region within the left end that ultimately revealed a conserved binding site for integration host factor (IHF). Transposition assays with perturbations of the IHF binding site, and in E. coli strains lacking IHF, demonstrated that IHF is critical for efficient transposition of VchCAST and some, but not all, homologous Type I-F CAST systems, at least under the conditions we tested. Systems that were insensitive may still exploit IHF to increase transposition in native environments, where other transposition components may not be as abundant as in our overexpression setup, or these systems may have evolved to bypass this molecular requirement altogether. For VchCAST, where the effect is clear, we propose that IHF is important for the proper quaternary organization of the transpososome, given the role that IHF plays in bending its bound DNA (54,57). This hypothesis is further supported by transposon end variants containing alternate spacing between the TBSs, which revealed a conserved periodicity that is consistent with the helical nature of double-stranded DNA. It is striking that, although Type I-F CASTs rely on a multitude of transposon-encoded genes, diverse DNA sequence determinants, and potential additional host-encoded factors, heterologous assays in E. coli with twenty CASTs from a range of gammaproteobacteria revealed active transposition for all (17). How and why mobile genetic elements would evolve dependencies on host-specific factors are questions that encourage further research into the regulation of transposition and the search for additional accessory factors (72), especially in native host organisms.
We also analyzed sequence biases at the site of integration and found a clear preference for insertions into sites containing a central 5'-YWR-3' motif, with additional nucleotide preferences 3 bp upstream and downstream of the TSD. Interestingly, these are the same regions that appear to make direct contacts with the TnsB transposase from a Type V-K CAST (38). Remarkably, by projecting this new information onto the integration site patterns we previously obtained for a panel of genomic target sites in E. coli, we were able to explain the observed product heterogeneity, thus enabling guide RNA selection with high predictability for integration products at single-bp resolution. Finally, we exploited our dataset on transposon end mutability and integration site preference to design modified transposon variants that enabled in-frame tagging of endogenous protein-coding genes. In a proof-of-concept experiment, we tagged the endogenous E. coli MsrB protein with GFP using a modified short transposon right end and an in-frame gfp gene within the transposon cargo, and similar efforts should enable in-frame tagging in other cell types, where transposon end 'scars' are converted into functional sequence modifications.
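As an illustration of how the 5'-YWR-3' preference could be checked for a candidate site, the sketch below tests the central three nucleotides of a target-site duplication against the motif using standard IUPAC codes (Y = C or T, W = A or T, R = A or G). The function names and example sequences are hypothetical, the code assumes an odd-length TSD so that a central triplet is defined, and it ignores the additional flanking preferences mentioned above.

```python
# Standard IUPAC nucleotide codes needed for this example.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "Y": "CT", "W": "AT", "R": "AG", "N": "ACGT",
}

def matches_iupac(seq, motif):
    """Return True if seq matches the IUPAC motif position by position."""
    if len(seq) != len(motif):
        return False
    return all(base in IUPAC[code] for base, code in zip(seq.upper(), motif.upper()))

def central_triplet(tsd):
    """Return the middle three nucleotides of an odd-length TSD sequence."""
    if len(tsd) < 3 or len(tsd) % 2 == 0:
        raise ValueError("expected an odd-length TSD of at least 3 nt")
    mid = len(tsd) // 2
    return tsd[mid - 1 : mid + 2]

def has_central_ywr(tsd):
    """Check whether the centre of the TSD matches 5'-YWR-3'."""
    return matches_iupac(central_triplet(tsd), "YWR")

# Hypothetical candidate TSD sequences, written 5' to 3'.
candidates = ["GCAGT", "GGAGT", "ATTAC"]
preferred = [tsd for tsd in candidates if has_central_ywr(tsd)]
```

A filter like this only flags whether a candidate site carries the preferred central motif; it does not predict integration efficiency, which in the data also depends on the flanking positions and on the target site itself.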
Our work demonstrates the power of combining rationally designed libraries with deep sequencing approaches. We reveal new insights into the molecular mechanism of RNA-guided transposition while also building a register, at single-bp resolution, of which bases can and cannot be mutated for engineering purposes. We envision combining these insights with future structural data to enable opportunities for rational design of hyperactive transposon end sequences that improve integration activity in other cellular contexts. Collectively, these insights inform both the biology and application potential of CAST systems.

DATA AVAILABILITY
High-throughput sequencing data are available at the National Center for Biotechnology Information (NCBI) Sequence Read Archive (BioProject Accession: PRJNA919078). Custom scripts used for analyses of high-throughput sequencing data are available at GitHub (https://github.com/sternberglab/Walker_Klompe_etal_2023) and on Zenodo (DOI 10.5281/zenodo.7776252). Datasets generated and analyzed in the current study are available from the corresponding authors on reasonable request.