A Simplified and Efficient Method for Himar-1 Transposon Sequencing in Bacteria, Demonstrated by Creation and Analysis of a Saturated Transposon-Mutant Library in Mycobacterium abscessus

Transposon insertion sequencing is a powerful tool, but many researchers are discouraged by the apparent technical complexity of preparing the genomic library for deep sequencing and by the complicated computational analysis needed for insertion site identification. Our proposed method makes the preparation of the library easy and straightforward, relying on well-known molecular biology techniques. In addition, the results obtained from the deep sequencing are easily analyzed in terms of transposon insertion site identification, placing library preparation and analysis within the reach of more researchers in the microbiology community, including those with less computational and bioinformatic resources and experience. This is demonstrated by analysis of the most saturated Tn-mutant library created to date in the emerging pathogen Mycobacterium abscessus.


INTRODUCTION R andom transposon mutagenesis is a powerful tool in bacterial genetics research.
Many systems use the Himar-1 (Mariner) transposon, where the only sequence requirement for transposition is a TA dinucleotide sequence. Determination of the exact location of the transposon insertion is labor-intensive, and several methods were developed to accomplish it. One method, highly effective for the Himar-1 transposon, was developed several years ago (1,2) and is based on introduction of a restriction enzyme (MmeI) recognition site into the inverted repeats (IR) of the transposon, without compromising transposition efficacy. As MmeI digests 20 bp laterally to the recognition site, digesting the genome with it creates a fragment containing the transposon and IR, flanked by 12-bp-long sequences on either side of the TA that was the target of transposition (i.e., the insertion site), with additional two overhang bases. By ligating a mix of 16 different adapters (to accommodate every possible combination of the 2-base overhang) with specialized Illumina-compatible sequences to the over-hang bases on either side of the transposon and performing PCR and then Illuminabased deep sequencing, the 14 bp immediately adjacent to the insertion site (16 bp in total-TA plus 14 variable bases) can be identified and used to pinpoint the insertion site. This technique has been successfully used in many transposon library analyses of various bacteria (3)(4)(5). However, the technical steps of creating the genomic library for deep sequencing are not simple and necessitate the design of multiple oligonucleotides (to form at least 16 adaptors, and, if tagging is desired, then each tag needs its own 16-bp adaptors) and their hybridization to form the multiple different adaptors to be used as a mixture and are dependent on successful end-to-end ligation of the adaptors to the transposon-containing fragment. Also, the final product of 16 bp (TA plus N 14 ) does not always allow unequivocal identification of the insertion site, especially, but not only, in organisms with large genomes and a high number of TA sites.
Here, we describe a method that simplifies the procedure of genomic library creation, abolishing the need for multiple-adaptor design and the dependence on relatively inefficient end-to-end ligation. In addition, this method allows the identification of a longer, 26-bp sequence (with the insertion of TA in the middle Fig. 1), reducing the number of insertions with equivocal identification by a factor of 1.1 to 7.3 (mean ϭ 2.05) ( Table 1), depending on the organism used.

RESULTS
After digestion of the target genome by MmeI, rather than running the digest on an agarose gel (see Discussion for other options described in the literature), isolating the gel area with the expected transposon-containing fragment (by size), cleaning it, and ligating to the adaptor mixture, we directly blunt the overhang ends by using T4 DNA polymerase (with 3¡5 exonuclease activity). The exonuclease activity is highly efficient, does not necessitate the addition of deoxynucleoside triphosphates (dNTPs), and is done at 12°C for 15 min. The reaction mixture is then cleaned on a DNA purification column, and T4 DNA ligase is added (with the correct buffer) for self-ligation of all the fragments in the mixture (as all would be blunt-ended at that time point). Again, self-ligation is highly efficient even for short incubations at room temperature. Following this, the mixture is cleaned again on a similar column and subjected to PCR with outward-facing primers (referred to here as "outward-looking PCR") located inside the transposon at anywhere between 50 and 100 bp from the inverted repeat (see Fig. 4). These primers have Illumina compatible ends. Tagging of the primers between the transposon-binding part and the Illumina ends can be done by tagging using 4 or 6 bp. The resulting fragments are 250 to 400 bp long, with Illumina-compatible ends, the first and last 50 to 100 bp of the transposon, both IR, and a 26-bp sequence which is composed of a TA and the two 12-bp sequences (on either side of the transposon). Of note, the primers can be designed to bind closer to or even immediately before the IR, thus making the final product shorter, with possibly better sequencing quality. However, they should not bind the IR itself, as this may result in products that originate from the same primer and that thus have the same Illumina adaptor. After the reads have been obtained from the Illumina procedure, the nucleotides in the sequence (the FASTA files) have to be reorganized to recreate the original order (an easy computational step; Fig. 1, bottom right) and then mapped to the reference genome. As the sequences are composed of 26 bp (N 12 ϩTAϩN 12 ) rather than 16 bp, a larger proportion of the sequences can be mapped to their genomic location, with many fewer ambiguous sequences that cannot be traced.
The technical differences between the present method ( Fig. 1, left) and our proposed method (Fig. 1, right) are shown. The in silico rearrangement of the 26 bp obtained in the proposed method is also shown. The code details of this rearrangement and of the step of mapping to the genome are available at https://github.com/MarkBio/ In-silico-Analysis-Transposon-Sequencing. Table 1 shows the expected improvement in the identification of the Tn site (the reduction in the number of sequences that cannot be equivocally mapped) from the use of the proposed method (generation of 26-bp reads) compared to the present method (with TA plus 14 bp), in 30 different bacteria.
T r a n s p o s o n IR IR CS2 CS1 The list of bacteria analyzed is composed of several important pathogens, widely used model organisms, and several additional organisms, chosen either randomly or on the basis of unique characteristics such as having the largest known bacterial genome (Sorangium cellulosum). Results of statistical analysis of the improvement in the Tn site identification rate are shown in Fig. 2 (Mann-Whitney U-test, P ϭ 0.002), and the improvement is directly correlated with the number of TA sites in the genome (n ϭ 30; r ϭ 0.5418; P ϭ 0.002) (Fig. 3). However, the increase in the number of TA sites explains only about 30% of the variation in the identification rate ratio of our method compared to the present system (R 2 ϭ 0.29, Fig. 3). The improvement in the identification rate is also highly dependent on the abundance of repetitions in the genome and is therefore difficult to predict simply on the basis of GC content, number of TA dinucleotides, or genome length. For example, Salmonella enterica and Shigella flexneri are very similar in genome size, GC%, and number of TAs, but the improvement seen with the proposed system compared to the present one was much more pronounced in S. enterica (ϫ1.47 versus ϫ1.1). In another example, Streptococcus agalactiae is relatively similar to Staphylococcus aureus in the parameters mentioned above, but the improvement in S. aureus was ϫ2.18 compared to ϫ1.2 in S. agalactiae.

TABLE 1
Thirty bacterial organisms were analyzed in silico to examine how many Illumina-generated sequences would not be amenable to definitive mapping to the genome (i.e., would have more than one possible mapped location), if the genomic library is prepared in the current system (TA plus N 14 ) or the proposed system (N 12 plus TA plus N 12 ), assuming full saturation of the transposon library a Validation of the proposed method in Mycobacterium abscessus Tn-mutant library. Specifically, we validated our proposed method by analyzing a M. abscessus ATCC 19977 Himar-1 transposon library (smooth colony morphology), where the antibiotic selection marker in the transposon was zeocin. As the gene conferring zeocin resistance contains an MmeI recognition site, we introduced a silent mutation abolishing the site, with no effect on the amino acid sequence. Previous transposon libraries created in M. abscessus were based on kanamycin selection and used a Himar-1 transposon-albeit without the modification enabling MmeI digestion and highthroughput analysis (6).
We first opted to characterize genes where inactivation by the transposon caused an S¡R colony morphology transition, as this transition is highly associated with pathogenesis (7). Whereas some of the gene inactivation events responsible for this (such as inactivation of genes encoding the glycopeptidolipid [GPL] complex) are well characterized, others remain unknown (7). We therefore infected M. abscessus ATCC 19977 with a mycobacteriophage carrying our novel transposon (zeocin selection and MmeIcompatible inverted repeats) and picked approximately 200 rough, zeocin-resistant colonies (representing no more than 200 separate transposition events, but probably fewer, due to the same event occurring two or more times). These 200 colonies were pooled, genomic DNA (gDNA) was extracted, and the genomic library was prepared (see Materials and Methods). PCR was performed on the self-ligated fragments with the primers shown in Fig. 4A, and the 400-bp PCR product is shown in Fig. 4B. Illumina- based, mass parallel sequencing was performed using a kit providing a minimum of 250-bp reads. The experiment was repeated four independent times to test for bias and reproducibility, and the details are shown in Table 2. These sequences were mapped to the annotated ATCC 19977 genome, including both the positive and negative strands (https://mycobrowser.epfl.ch/). Each of the sequences was mapped to a single location, with no equivocal Tn site (TA dinucleotide) identification. (Code details are available at https://github.com/MarkBio/In-silico-Analysis-Transposon-Sequencing.) To reduce the incidence of false-positive results, we continued the analysis only for those sequences that came up in both the R1 and R2 analyses. As shown in the table, we found a total of 89 unique TA insertions (79 in coding regions and an additional 10 in noncoding regions). The list of identified genes is shown in Table 3 (for interrupted coding regions) and Table 4 (for insertions located in noncoding regions) and represents genes/areas where the transposon's effect was seen as a switch in morphology from smooth to rough, a key step in the pathogenesis of M. abscessus. Whereas some of the affected genes (such as MAB_4098 and MAB_4099) are known to be involved in the S¡R transition, others represent novel findings that we intend to explore in the near future.

FIG 3
Comparison of the improvement ratio as a function of the number of TA sites. In all analyzed organisms, the proposed system performed better than the current one (the ratio is always above 1), and the degree of improvement was mostly associated with the total number of TA dinucleotides in the genome (R 2 ϭ 0.29). We then continued to perform preliminary analysis of the full mutant library, to demonstrate the effectiveness of the proposed method. The procedure of creation and analysis of Tn-mutant libraries in M. abscessus is notoriously difficult, and the most comprehensive one created to date contained ϳ6,000 kanamycin-resistant mutants (6). After pooling all the zeocin-resistant mutants, we performed the procedure twice, with the results shown in Table 5. As seen, the library was found to contain at least 8,008 unique mutants, with at least 2,889 genes (of a total of 4,992 annotated genes in the M. abscessus chromosome) hit by the transposon. This therefore represents the most highly saturated Tn-mutant library in M. abscessus created to date. The preparation of the genome for the sequencing as described here was done within 2 days. The in silico analysis was performed as described here, with simple identification of the insertion site and no ambiguously identified sequences. Complete lists of the genes (and noncoding regions) hit and of the exact gene annotation and number of hits per gene are shown in Tables S1 and S2 in the supplemental material.

DISCUSSION
We present a simplified procedure for generating the genomic library for Illuminabased deep sequencing from transposon-mutant pools. The proposed method does not necessitate the design of multiple oligonucleotides and their hybridization and is not dependent on end-to-end ligation. The ease and simplicity of the proposed method does not come at the cost of lower yield-instead, the sequence generated from a library prepared in this method is ϳ50% longer, reducing the number of insertion sites that are equivocally mapped (i.e., the number whose location remains undetermined after analysis).
In our opinion, the main advantages of our proposed method are the simplicity and ease of performance, relying on well-known and inexpensive molecular biology techniques. Other Tn-seq techniques have been successfully used for several years now and are definitely of great value. However, we think that use of the proposed technique would be advantageous overall compared to the existing techniques. Compared to the original description of the MmeI-based method (1), our proposal does not necessitate the same use of at least 16 different oligonucleotides (and even more if barcoding is needed), their hybridization, and end-to end ligation of short fragments. Although the method described in reference 1 does not fully mandate running the gDNA digest on a gel and isolating only the fragments that are close to the size of the transposon, this step makes the procedure more specific, as otherwise a vast majority of the heterogeneous primers in the PCR (intended to bind the area next to the MmeI digestion site) would bind nonrelated fragments, making the PCR less specific. In our proposed system, both primers bind inside the transposon, making the reaction more specific even without isolation of the digest by size. Other researchers (3) also proposed modification to the original protocol. In the procedure that they described, they first enriched the junction areas by unidirectional PCR using biotinylated beads, enabling pulldown of the junction-containing fragments specifically. However, that method requires two sequential PCRs, one performed with specialized, bead-affinity primers a The analysis of 200 rough colony morphology clones, isolated from a transposon-mutant library, was repeated four independent times. The procedure used was as follows: (i) gDNA extraction from pooled rough mutants, (ii) MmeI digestion, (iii) blunting, (iv) self-ligation, and (v) PCR and NGS. The analysis resulted in a total of 89 hits (unique TA insertions). The details of the sequencing and analyses and the combined number of identified insertion sites (representing a disrupted gene or affected region between genes) are shown. and the other with the same mixture of multiple primers as that described in reference 1. It is therefore not certain that the method is technically simpler than that described in reference 1.
Our proposed method belongs to the "circularization" methods, as opposed to the "linear PCR" methods. Other circularization methods have been described before. Some involve shearing of the gDNA (mostly by sonication) (8) and then repairing the ends (by a Klenow reactions), self-ligation, and a PCR in which both primers bind inside the transposon (like our proposal). Shearing by sonication is not always simple, and the end repair results depend not only on the presence of an efficient 3=¡5= exonuclease (that needs no dNTPs) but also on that of the much less efficient and dNTP-dependent 5=¡3= polymerase. Another option for self-circularization is the digestion of the genome by one enzyme (or by several, if they all produce compatible ends) that does not digest inside the transposon, followed by self-ligation and PCR from inside the transposon outward. Both these options allow longer stretches of DNA sequences to be identified, which may be an advantage. However, by their nature they produce PCR product that are not of uniform size, ranging from ϳ50 to thousands of nucleotides-thus introducing substantial size preference bias during the PCR. Some PCR fragments may even be too long to be synthesized, precluding the identification of the insertion.
There are three other steps in the genomic preparation of the library where bias may theoretically be introduced. The first is the blunting step. If, for example, the ends with CC-3= are blunted much less efficiently than the ends with AA-3=, then bias may be introduced, as fragments with CC-3= do not undergo self-ligation. However, we are not aware of such a quality of the blunting reaction/enzyme. In the second step, if the ligase  were more effective in ligating blunt ends of one sort or another, this would introduce a similar bias. However, again, we are not aware of such limitations of the T4 DNA ligase that we used. In the third step, all the PCR products from each clone are identical, except the variable 26-bp stretch in the middle. If that stretch is highly problematic for a PCR (unusual secondary structures, highly repetitive sequence, etc.), then the clone may be underrepresented. Such bias would also appear in the "previous" method, as a partially similar stretch needs to be amplified. Additionally, the chances of having such a PCR-unfavorable sequence, considering the modern PCR enzymes available, appear to us to be extremely small. Using our method, we were eventually unable to match a substantial number of the 26-bp sequence reads generated in the sequencing process to the genome. This was due to mismatch reads introduced during either the outward-looking PCR (producing the 400-bp fragment for the sequencing) or the Illumina procedure itself. For amplification, we used a robust but intermediate-fidelity polymerase. The number of mismatches could potentially be reduced by using a high-fidelity enzyme, as well as by designing the primers to bind at locations closer to (but not directly to) the IR (thus shortening the product and improving the reads). Another factor is that we performed 30 PCR cycles, whereas some resources suggest using fewer (between 12 and 16 cycles). However, these "wrong" sequences are not a source of "false-positive" reads, as we had an eventual "demand" that every mapped read should have a counterpart read from the second direction (every read in R1 mapped to the genome has to appear in R2 and has to map to the complementary strain). In a definitive trial of the system performed on a highly saturated library in M. abscessus, two "runs" consisting of 4.5 million reads each were sufficient to identify over 8,000 Tn mutants, despite what may appear as a low percentage of mapping.
Overall, from the moment one has high quality gDNA until the final 400-bp-long PCR product is ready to be sent for deep sequencing, the procedure can be completed in just under 6 h (assuming a 1-h ligation).
Our Tn-mutant library in M. abscessus is the most saturated library created to date, and the genetic data are presented in a comprehensive way. The analysis of this library, using the presented method, was undertaken in two "steps" of 48 h each, excluding the deep sequencing itself (performed by a commercial company).
In summary, we propose here a technique that is simple to perform-and simple to understand-for the preparation of the genomic library of Himar-1 transposons. Moreover, this simplicity does not come at the cost of lesser performance-in contrast, the results obtained are better (in terms of length of sequence and hence of mapping to genome) than previously published methods. We thus think this that method will both make Tn-seq simpler for those who already know how to do it and make it accessible for the large fractions of biologists who were discouraged by the apparent complexity of, and the preparations needed for, the previous methods.

MATERIALS AND METHODS
To compare the performance of the "present" system to that of the proposed one, calculating the number of unique compared to ambiguous "hits" for an N 14 -plus-TA system or an N 12 -TA-N 12 system was a The same library analyzed previously only for rough mutants ( Table 2) was analyzed for all the resulting mutants. The analysis was done two independent times. The procedure used was as follows: (i) gDNA extraction from pooled rough mutants, (ii) MmeI digestion, (iii) blunting, (iv) self-ligation, and (v) PCR and NGS. A total of 8,008 unique TA sites were hit, representing 2,889 genes (out of 4,992 genes in M. abscessus ATCC 19977). A total of 1,499 TA sites were hit in what are annotated as noncoding regions. A total of 6,508 unique TA sites in 2,889 genes were hit in coding regions.
done by the use of a code that can be found at https://github.com/MarkBio/In-silico-Analysis-Transposon -Sequencing. Creation of the genomic library in M. abscessus transposon mutants. Transposition of M. abscessus ATCC 19977 was done using a mycobacteriophage with a newly designed transposon that carries a zeocin selection marker. Bacteria were plated on 7H10 plates with 50 g/ml zeocin. For the "limited," R¡S transition analysis, 200 rough-appearing colonies were pooled and grown overnight in 7H9/oleic acid-albumin-dextrose-catalase (OADC)/glycerol/Tween media. High-quality gDNA was extracted using lysozyme and phenol-chloroform-isoamyl alcohol (IAA). DNAs (15 g) were digested by the use of 30 units of MmeI (NEB [New England Biolabs]) in a total volume of 300 l in the presence of 32 mM SAM (S-adenosyl methionine) for 2 h at 37°C. After purification on a column (Macherey-Nagel; Nocleospin [reference 740609.250]), half of the digested DNA was blunted using T4 DNA polymerase (NEB) at 12°C for 15 min. The mixture was cleaned on a similar column and eluted in 50 l H 2 O. T4 DNA ligase (NEB) was added to all the elute volume in the appropriate buffer, and the ligation reaction mixture was left overnight at room temperature and was finally purified on a column into 50 l H 2 O. Of that elution, a 2-l volume was used as the template in a PCR with Illumina-compatible primers, using GoTaq Green master mix polymerase (Promega) (30 PCR cycles; elongation for 30 s; annealing temperature, 63°C).
For the analysis of the full library, the same procedure was repeated, but with pooling of all zeocin-resistant colonies.
Parallel sequencing on an Illumina platform. Parallel sequencing was performed at the nextgeneration sequencing (NGS) facility of Hy-Laboratories Ltd. Per the procedure of the sequencing facility, an additional PCR step consisting of 10 cycles was performed on each sample. However, researchers performing their own deep sequencing may forgo this step, as long as the primary PCR was performed with Illumina-compatible primers. Products were checked by Qubit and by Tapestation. Sequencing was done in Illumina MiSeq sequencer, using a MiSeq V2 kit for 500 cycles to generate 2 ϫ 250 paired-end reads.
Extracting the actual genetic sequence from the mass-sequencing data and mapping to the M. abscessus genome. As noted above, the code details can be found at https://github.com/MarkBio/In -silico-Analysis-Transposon-Sequencing. Raw data from the sequencing can be found at that link as well, through a Dropbox link.
Data availability. Accession numbers for the sequences of the 30 bacterial organisms that were analyzed in silico are listed in Table 1.

SUPPLEMENTAL MATERIAL
Supplemental material is available online only.