Characterization of Three Novel SINE Families with Unusual Features in Helicoverpa armigera

Although more than 120 families of short interspersed nuclear elements (SINEs) have been isolated from eukaryotic genomes, little is known about SINEs in insects. Here, we characterize three novel SINEs from the cotton bollworm, Helicoverpa armigera. Two of them, HaSE1 and HaSE2, share a similar 5′-structure, including a tRNA-related region immediately followed by a conserved central domain. The 3′-tail of HaSE1 is significantly similar to that of one LINE retrotransposon element, HaRTE1.1, in the H. armigera genome. The 3′-region of HaSE2 shows high identity with one mariner-like element in H. armigera. The third family, termed HaSE3, is a 5S rRNA-derived SINE that shares both the body part and the 3′-tail with HaSE1, and thus may represent the first example of a chimera generated by recombination between a 5S rRNA-derived and a tRNA-derived SINE in insect species. Further database searches revealed the presence of these SINEs in several other related insect species, but not in the silkworm, Bombyx mori, indicating a relatively narrow distribution of these SINEs in Lepidopterans. In addition, we found a copy of HaSE2 in a GenBank EST entry for the cotton aphid, Aphis gossypii, suggesting the occurrence of horizontal transfer.


Introduction
Transposable elements (TEs) form a substantial fraction of eukaryotic genomes and are categorized based on their mode of transposition as class-I elements or retrotransposons and class-II elements or DNA transposons [1]. Copy and paste retrotransposons replicate via an RNA intermediate, which is reverse transcribed prior to its reintegration into the genome, whereas cut and paste DNA transposons move through a DNA intermediate. Retrotransposons are the most widespread and enriched class of eukaryotic transposable elements, and can usually be classified into several groups by their replication strategy and structure, including long terminal repeat (LTR) elements, long interspersed nuclear elements (LINEs), and short interspersed nuclear elements (SINEs) [2].
SINEs are nonautonomous repetitive elements with a length of 80-500 bp and depend on LINE elements for their amplification [3]. SINEs can be categorized into families based on sequence similarity and into subfamilies based on the presence of diagnostic nucleotides and/or deletions [3]. Most families are derived from tRNA genes [4], while several are from 7SL RNA [4][5][6][7][8] or 5S rRNA [9][10][11]. A typical SINE is composed of three distinct regions: a 5′-terminal RNA-related region (the head), which contains an internal RNA polymerase III (Pol III) promoter; an RNA-unrelated region (the body); and the 3′-tail region, which is recognized by the reverse transcriptase (RT) of autonomous partner LINEs during retrotransposition [3].
The majority of SINEs have long been regarded as "selfish DNA parasites" or "junk DNA". However, in recent years, striking evidence has accumulated indicating that SINEs are involved in molecular evolution and gene functionality by generating regulatory elements for gene expression, alternative splicing, mRNA polyadenylation, and even functional RNA genes [12][13][14][15]. For example, more than 5% of the alternatively spliced internal exons in the human genome are derived from Alu SINEs [13]. Furthermore, due to the irreversible and random nature of their insertions, SINEs provide excellent molecular markers for phylogenetic analysis via comparison of their insertion sites between different isolates, strains and species [16].
While SINEs are widespread in the animal kingdom, in both vertebrates and invertebrates, only a few SINEs have been characterized in insect species, such as Bm1 and BmSE in the silkworm, Bombyx mori [17][18], Lm1 in the African migratory locust, Locusta migratoria [19], Feilai in the yellow fever mosquito, Aedes aegypti [20], Sine200 in Anopheles gambiae [21][22], Twin in Culex pipiens [23], Talua, Talub, Taluc and Talud in termites [24][25][26], and SINE3-1_TC in Tribolium castaneum [27]. Here, we describe two tRNA-derived SINE families and one 5S rRNA-derived SINE family in the cotton bollworm, Helicoverpa armigera, which is one of the most important pests of cotton in the world. We investigated the genomic structures and insertion regions of these SINEs. The distribution of these SINEs in closely related insect species was also surveyed. Our findings increase the number of known SINEs in insects, and might assist in understanding the roles of retroelements in H. armigera genome evolution.

Insect strains
A laboratory strain of H. armigera, generously provided by Dr. Yongan Tang (Institute of Plant Protection, Jiangsu Academy of Agricultural Science, China), was maintained on artificial diet in an insectary kept at 28 °C with a photoperiod of 16 h light:8 h dark. Several third instar larvae randomly picked from the colony were individually flash-frozen in liquid nitrogen and stored at −80 °C for subsequent genomic DNA isolation.

DNA extraction and genome walking
A previous study has shown that transposable elements are enriched within or in close proximity to xenobiotic-metabolizing cytochrome P450 genes in Helicoverpa zea [28]. To identify putative novel SINEs in H. armigera, we performed genome walking to obtain the 3′-flanking sequence of an insecticide resistance-associated cytochrome P450 gene, CYP6AE12, in H. armigera [29]. Genomic DNA was isolated from individual third instar larvae using the procedure described by Wang et al. [30]. Gene-specific primers (Table S1) based on the known sequence of the cDNA (accession number: DQ256407) and four general primers provided by the Genome Walking Kit (TaKaRa, Dalian, China) were used for each round of genome walking. PCR products were cloned into the pGEM-T Easy vector (Promega, Madison, WI, USA) and sequenced.

Genomic sequences
While the full genome of H. armigera has not been sequenced, a total of 1.963 Mb of genomic DNA sequence covered by 18 BACs recently became available, representing approximately 0.5% of the H. armigera genome [31]. In this work, almost all sequences of the HaSE1 and HaSE2 families were retrieved from these BAC sequences downloaded from GenBank (accession numbers: FP340418-FP340435).

Database search strategy
Database searches were performed in four steps. Firstly, the 3′-flanking sequence of the P450 gene, CYP6AE12, was compared with the non-redundant databases using blastn on the NCBI server (www.ncbi.nlm.gov/cgibin/BLAST). Sequences of high homology, together with their 200 bp upstream and downstream flanking regions, were extracted and analyzed for hallmarks of SINEs such as an internal RNA polymerase III promoter (Box A and Box B) and target site duplications (TSDs), and the consensus sequence of the first tRNA-derived SINE family in H. armigera, HaSE1, was determined. Secondly, the tRNA-related region in the consensus sequence of HaSE1 was searched against the non-redundant database using blastn, and the second tRNA-derived SINE family in H. armigera, HaSE2, was identified. Thirdly, the unknown 3′-region of HaSE1 was searched against the non-redundant database using blastn, and the 5S rRNA-derived SINE family in H. armigera, HaSE3, was identified. Finally, both nucleotide (nr/nt) and EST (est_others) collections were searched using the consensus sequences of these three SINE families as queries to detect new members of these SINEs in species other than H. armigera.
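The TSD check in the first step can be illustrated with a minimal Python sketch. It assumes that the element boundaries are already known exactly and that only perfect repeats are of interest; the function name `find_tsd` and the length limits are hypothetical choices for illustration, not the procedure used in this study.

```python
def find_tsd(upstream: str, downstream: str,
             min_len: int = 5, max_len: int = 20) -> str:
    """Return the longest perfect direct repeat that ends the upstream
    flank and starts the downstream flank, or "" if none is found.

    Assumption: `upstream` ends immediately before the element and
    `downstream` starts immediately after it.
    """
    up = upstream.upper()
    down = downstream.upper()
    longest = min(max_len, len(up), len(down))
    # Try the longest candidate first so the first match is the longest one.
    for size in range(longest, min_len - 1, -1):
        if up[-size:] == down[:size]:
            return up[-size:]
    return ""


if __name__ == "__main__":
    # Toy flanks, not real H. armigera sequence.
    print(find_tsd("GGCCATTGACCTA", "TTGACCTAGGA"))
```

Imperfect TSDs, which also occur among the elements described here, would need an alignment-based comparison rather than exact string matching.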

Sequence analysis
Sequence alignment was performed using CLUSTAL X [32] with default settings. The A and B boxes of the promoter for RNA polymerase III in HaSE1 and HaSE2 and boxes A, IE, and C in the promoter region of HaSE3 were labeled manually. REPFIND [33] was used to identify identical direct repeats. The tRNA-like structure of each SINE was checked with tRNAscan-SE [34], with the mixed model and a cove score cutoff value of 0.01 as defaults.
To test for the presence of potential genes in the flanking regions of each SINE insertion, 10 kb sequences in each direction (upstream and downstream) were extracted from the BAC clone sequences and used to search against the non-redundant databases using BlastX on the NCBI server (www.ncbi.nlm.gov/cgibin/BLAST). In the analysis of the GC-content distribution of the elements, 2 kb sequences in each direction were used to calculate GC percent with GEECEE (http://bioweb.pasteur.fr/seqanal/interfaces/geecee.html).
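The flanking GC calculation can be sketched in a few lines of Python. This is a stand-in for GEECEE rather than a reproduction of it: the function names, the 0-based half-open coordinates, and the decision to ignore ambiguous bases are assumptions made for illustration.

```python
def gc_percent(seq: str) -> float:
    """GC percentage over unambiguous bases (A, C, G, T) only."""
    bases = [b for b in seq.upper() if b in "ACGT"]
    if not bases:
        return 0.0
    gc = sum(1 for b in bases if b in "GC")
    return 100.0 * gc / len(bases)


def flank_gc(contig: str, start: int, end: int, flank: int = 2000):
    """GC percent of the upstream and downstream flanks of an element.

    Assumption: `start` and `end` are 0-based, half-open coordinates
    of the element within `contig`. Flanks are clipped at contig ends.
    """
    upstream = contig[max(0, start - flank):start]
    downstream = contig[end:end + flank]
    return gc_percent(upstream), gc_percent(downstream)


if __name__ == "__main__":
    # Toy contig: GC-only upstream flank, AT-only downstream flank.
    toy = "GGGG" + "ATAT" + "AATT"
    print(flank_gc(toy, 4, 8, flank=4))
```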
For evolutionary analysis, the sequences were aligned using SOAP software [35]. The phylogenetic trees were constructed using MEGA 5 [36] with the maximum likelihood (ML) and maximum parsimony (MP) methods. To confirm the reliability of the ML and MP trees, we also used MrBayes 3.1.2 to conduct a Bayesian inference analysis assuming the optimal models estimated by Modeltest [37]. The reliability of the trees was tested using 1000 bootstrap replications.

Sequence deposition
The sequences of the HaSE1.1 element inserted in the 3′-flanking region of the CYP6AE12 gene and of the HaSE3 elements obtained by PCR amplification in this study were deposited in GenBank under accession numbers JQ308191-JQ308206.

Results
A novel tRNA-derived SINE family, HaSE1, in Helicoverpa armigera
A novel tRNA-derived SINE family, designated as HaSE1 (the first SINE family discovered in H. armigera), was identified by genome walking and subsequent database searches. The HaSE1.1 element identified in the 3′-flanking sequence of the P450 gene, CYP6AE12, is 324 bp in length and located 649 bp downstream of the translation stop codon of the CYP6AE12 gene in the reverse orientation. A total of 21 full length sequences with high homology to HaSE1.1 were identified from the non-redundant database, and named HaSE1.2-HaSE1.22 (Table 1). Figure S1 shows the alignment of these sequences and the deduced consensus sequence. As shown in Figure S1, these sequences present the typical structural features of tRNA-derived SINE elements: all HaSE1 copies are flanked by 5 to 18 bp short direct repeats (DRs) or target site duplications (TSDs), presumably generated during retroposition; the A and B boxes of the RNA polymerase III promoter as well as a tRNA-like region were found at the 5′-end; and various numbers of perfect or imperfect TGA trinucleotide repeats were found at the 3′-end.
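As a rough illustration of the last feature, the number of perfect TGA units terminating a sequence can be counted with a regular expression. The sketch below is an assumption-laden simplification: the function name is hypothetical, and imperfect repeats such as those also seen in HaSE1 copies are not handled.

```python
import re


def trailing_tga_repeats(seq: str) -> int:
    """Count perfect TGA units that run uninterrupted to the 3' end."""
    match = re.search(r"(?:TGA)+$", seq.upper())
    return len(match.group(0)) // 3 if match else 0


if __name__ == "__main__":
    # Toy sequence, not a real HaSE1 copy.
    print(trailing_tga_repeats("ACGTTGATGATGA"))
```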
The consensus sequence of the HaSE1 family members is 386 bp long (Figure 1A and B). It includes a short A-tag followed by a 74-bp tRNA-related region at the 5′-end (58% identity to the 72-bp tRNA of Drosophila melanogaster, Figure 1B). The tRNA-related head retains the ability to fold into a cloverleaf structure, and the scanning for secondary structures indicated that HaSE1 was derived from a pseudogenic tRNA(Glu) (Figure S2). The body part is not related to any known sequences. Comparative analysis showed that the match between the HaSE1 elements and the consensus sequence ranged from 77% to 97%, with a median similarity of 89%.
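The kind of comparison summarized above (range and median of identity to the consensus) can be sketched as follows. The text does not state how gaps were scored, so this sketch assumes that a gap opposite a base counts as a mismatch and that columns gapped in both sequences are ignored; the function names are hypothetical.

```python
from statistics import median


def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity between two rows of the same alignment.

    Assumptions: gap-versus-base columns count as mismatches;
    gap-versus-gap columns are skipped.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" and b == "-":
            continue
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def summarize_identity(consensus: str, copies: list) -> tuple:
    """Return (minimum, median, maximum) identity of copies to the consensus."""
    values = [percent_identity(consensus, copy) for copy in copies]
    return min(values), median(values), max(values)


if __name__ == "__main__":
    # Toy alignment rows, not real HaSE1 copies.
    print(summarize_identity("ACGT", ["ACGT", "ACGA", "AC-T"]))
```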
In a computer-assisted similarity search, we found that the 3′-terminal 43 bp fragment of the HaSE1 consensus sequence was very similar (82% identity) to the 3′-end of HaRTE1.1 (Figure 1B), a member of the novel LINE family HaRTE1 that we identified in a BAC clone of H. armigera (accession number: FP340435; position: 64987-66631). Thus, this region was designated as the 3′-LINE-related region. The HaRTE1.1 element was 1,645 bp long, flanked by 15 bp TSDs, encoded an RT domain with 42% amino acid sequence identity to that of RTE-3-BM in B. mori [38], and was terminated by a region of TGA trinucleotide repeats in the short 3′ UTR (Figure 2A and B).
In addition, we identified several 5′-truncated forms of HaSE1, and these elements were named HaSE1.23-HaSE1.43 (Table 2). The lengths of these truncated HaSE1 sequences varied between 75 bp and 356 bp. These sequences are flanked by 5-17 bp perfect or nearly perfect TSDs (Table 2).

The second tRNA-derived SINE family, HaSE2, in Helicoverpa armigera
Using the 74-bp tRNA-related region of HaSE1 as a query, an additional search identified 7 sequences (HaSE2.1-HaSE2.7) of the second tRNA-derived SINE family, HaSE2 (Table 1, Figure S3). The 5′-structure of the 286 bp consensus sequence of HaSE2 is highly similar to that of HaSE1 (94% identity over a 132-nt region), including a 70-bp tRNA-related region immediately followed by a 62 bp conserved central domain (Figure 1A and C). Note that the 154 bp 3′-region showed high identity (98%) with one mariner-like element, Hamar1.3, in H. armigera [39] (Figure 1C). Five out of the 7 HaSE2 elements are flanked by TSDs, but no simple repeat or long stretch of A or T was found in the 3′-tail of HaSE2 elements (Figure S3). We found that HaSE2.7 is more degenerate than the other copies, sharing only 66% sequence identity with the HaSE2 consensus sequence.
(Table S2, Table S3), suggesting the tendency of HaSE1 and HaSE2 to reside within intronic regions. There are no copies of HaSE1 and HaSE2 located in exons. A total of 5 copies of full length HaSE1, 7 copies of 5′-truncated HaSE1 and 2 copies of HaSE2 were found to be located close to other TEs (Table S2, Table S3). One copy of HaSE1 was found in a microsatellite DNA locus.

HaSE1 and HaSE2 flanking regions characterization
Interestingly, analysis of the 5′- and 3′-flanking sequences of HaSE2.4 revealed a new LINE element in H. armigera, which was named HaRTE2.1. The inserted HaSE2.4 is in the sense orientation relative to the ORF of HaRTE2.1 (Figure 3A). The HaSE2.4-nesting HaRTE2.1 is 994 bp long in total, flanked by 10 bp TSDs, encoded an RT domain with 81% amino acid sequence identity to that of Bm-RTE in B. mori (accession number: ADI61812), and was terminated by a region of TGA trinucleotide repeats in the short 3′ UTR (Figure 3A and B).
In addition, we examined the GC contents of the flanking regions of HaSE1 and HaSE2 elements. We found that the insertion sites of these two SINE families had a GC content similar to that of the sequenced part of the H. armigera genome (36.1%) [31]. The average GC contents of the 5′- and 3′-flanking regions of HaSE1 were 36.5% and 36.4% for full length elements, and both 35.8% for 5′-truncated elements, and those of HaSE2 were 36.4% and 35.1%, respectively (Table 1, Table 2).

Distribution of HaSE1 and HaSE2 in other insect species
All the above results suggested that HaSE1 and HaSE2 are two novel tRNA-derived SINEs. GenBank homology searches were further performed to detect HaSE1 and HaSE2 sequences in insect species other than H. armigera. Two copies of HaSE1-like elements, including one full-length copy (named HzSE1.1) and one 5′-truncated copy (HzSE1.2), were identified in the first and seventh introns of one cytochrome P450 gene, CYP9A14, in H. zea (accession number: DQ788840) (Figure 4). These two SINE elements were previously recognized as TE-like elements HzIS1-1 and HzIS1-2, respectively [28]. In Heliothis subflexa, one 5′-truncated copy of HaSE1-like elements (HsSE1.1) was found in an intron of the ABC transporter family C protein gene (accession number: GQ332573) (Figure 4). By searching with the consensus HaSE2 sequence as a query, two full length copies of HaSE2-like elements (HsSE2.1, HsSE2.2) were found in introns of the ABC transporter family C protein gene from H. subflexa (accession number: GQ332573). Three copies (SfSE2.1, SfSE2.2 and SfSE2.3) were also found in BAC clones of Spodoptera frugiperda (accession numbers: FP340417, FP340410 and FP340416) (Figure 5).
The HaSE2 consensus sequence was also used to search the EST database using blastn, and 50, 14, 7 and 5 matches with an E-value less than 1e-25 were detected in Heliothis virescens, Spodoptera littoralis, Spodoptera litura and Danaus plexippus, respectively; of these, 30, 12, 7 and 5 matches were 150 bases or longer. Representative examples of these sequences with perfect direct repeats are shown in Figure 5. Interestingly, one 267 bp EST sequence from Aphis gossypii (accession number: GW550229) showed high identity (96%) with the HaSE2 consensus sequence.
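The two thresholds used in this EST survey (E-value below 1e-25, alignment length of at least 150 bases) can be applied programmatically. The sketch below assumes the hits are available in the standard 12-column BLAST tabular format (alignment length in column 4, E-value in column 11); the searches reported here were run on the NCBI web server, so the input format and function names are illustrative assumptions only.

```python
def parse_hit(line: str) -> dict:
    """Parse one line of 12-column BLAST tabular output (assumed format)."""
    fields = line.rstrip("\n").split("\t")
    return {
        "subject": fields[1],
        "length": int(fields[3]),     # alignment length
        "evalue": float(fields[10]),  # expectation value
    }


def filter_hits(lines, max_evalue: float = 1e-25, min_length: int = 150):
    """Return (significant hits, significant hits at least min_length long)."""
    significant = [h for h in map(parse_hit, lines) if h["evalue"] < max_evalue]
    long_hits = [h for h in significant if h["length"] >= min_length]
    return significant, long_hits


if __name__ == "__main__":
    # Made-up hit lines for illustration; subject names are hypothetical.
    demo = [
        "HaSE2\tEST1\t96.0\t267\t10\t0\t1\t267\t1\t267\t1e-120\t430",
        "HaSE2\tEST2\t90.0\t120\t12\t0\t1\t120\t1\t120\t3e-30\t130",
        "HaSE2\tEST3\t85.0\t80\t12\t0\t1\t80\t1\t80\t1e-10\t60",
    ]
    significant, long_hits = filter_hits(demo)
    print(len(significant), len(long_hits))
```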
The HaSE2 sequences were determined and analyzed for their evolutionary interrelationships with similar sequences from other insect species including A. gossypii. The results obtained by different phylogenetic methods were mostly congruent. We chose to present the topology obtained by the ML method (Figure 6). All the other trees obtained by the different methods are provided in Figure S4. The result indicates the existence of two major groups separated by a high bootstrap value of 100%. Elements from species of Heliothinae, including H. armigera, H. virescens and H. subflexa, together with one element from A. gossypii form one group, while elements from Amphipyrinae, including S. frugiperda, S. littoralis and S. litura, together with one element from Papilionoidea (D. plexippus) are clustered into the second group. Compared with the phylogenetic relationship deduced from cytochrome oxidase subunit I (CO I) gene sequences in these insect species, our data indicate that horizontal transfer might have occurred between heliothine species and A. gossypii.

A 5S rRNA-derived SINE family in Helicoverpa armigera and related species
Using the 246-bp 3′-region of HaSE1 as a query, an additional search identified one putative repeat sequence in an intron of the diapause hormone-pheromone biosynthesis gene from H. armigera (accession number: AY382615) as well as two putative repeat sequences in introns of the ABC transporter family C protein gene from H. subflexa (accession number: GQ332573). Further analysis showed that the 5′-regions of these putative repeat sequences were most similar to 5S ribosomal RNA in B. mori (accession number: K03316), suggesting that these were 5S rRNA-derived SINEs. These sequences were named HaSE3.1, HsSE3.1 and HsSE3.2, respectively (Figure S5, Figure 7). While HsSE3.1 and HsSE3.2 were flanked by 19 bp and 8 bp TSDs, respectively, no TSDs were detected in HaSE3.1.
For a more comprehensive survey of the HaSE3 elements in H. armigera, we performed PCR using genomic DNA as template and two oligonucleotide primers that were specific for the 5S rRNA-related region and the 3′-LINE-related region, respectively (Table S1). A total of 15 different repeat sequences were obtained, and named HaSE3.2-HaSE3.16 (Figure S5). Comparative analysis showed that the 3′-region of HaSE3 is very similar to that of HaSE1 (94% identity between the two consensus sequences). The match between the HaSE3 elements and the consensus sequence ranged from 73% to 99%. HaSE3.1 and HaSE3.16 showed notable differences from the other elements (only 73% and 80% identical to the HaSE3 consensus sequence, respectively), whereas HaSE3.7, HaSE3.8 and HaSE3.9 are only 1% divergent from the HaSE3 consensus sequence.

Conserved central domain in two tRNA-derived SINE families
Since the first discovery of rodent B1 and B2 elements as well as primate Alu elements about 30 years ago, more than 120 SINE families have been isolated from the genomes of mammals, reptiles, fishes, mollusks, ascidia, flowering plants, and insects [3,40]. The majority of SINEs in eukaryotic genomes characterized to date are derived from tRNAs. They generally have a composite structure comprising a 5′-terminal tRNA-related region, which is followed by a non-tRNA-related region. In this study, we have identified two tRNA-derived SINE families, HaSE1 and HaSE2. Apart from the high identity between the tRNA-related regions in their 5′-regions, the following 62 bp regions also showed high identity. The finding of a conserved central domain in part of the body shared by distant SINE families is not surprising. To date, four such domains have been described: the CORE domain in vertebrates [41], the V-domain in fishes [42], the Deu-domain in deuterostomes [10] and the Ceph-domain in cephalopods [43]. The sequence identity between the conserved central domains of the HaSE1 and HaSE2 consensus sequences is 96.8%, whereas that of their tRNA-related regions is 95.7%. Such a high level of sequence identity between the conserved central domains of the two different SINE families suggests that the conserved central domain has been under strong selective constraint, most likely because it is functionally important for their retroposition.

HaSE2, the first SINE family with its 3′-region derived from DNA transposon
While the 5′-structure of the HaSE2 family shares high similarity with HaSE1, this SINE family has some features that distinguish it from the canonical SINEs. For instance, we did not detect a poly(A) tail or short direct repeats at the 3′ end of the HaSE2 elements, while a conserved purine-rich ATAAAAA sequence was observed near the 3′ end. SINEs lacking a poly(A) tail have been found in most mammalian genomes and probably utilize other mechanisms for interacting with reverse transcriptase during retroposition [44][45].
Quite notably, the 154 bp 3′-region showed high identity with one mariner-like element, Hamar1.3, in H. armigera [39]. To the best of our knowledge, this is the first report of a SINE family with its 3′-region derived from a DNA transposon. It is quite likely that HaSE2 elements resulted from integration of reverse transcriptases transcripts into the genomic copies of mariner-like elements or recombination of tRNA and transcripts of mariner-like elements during the template switching common for reverse transcriptases [46][47].

HaSE3, a putative chimera of 5S rRNA and tRNA-derived SINE
Since Weiner predicted a new class of SINEs derived from 5S rRNA [48], such elements have been found in zebrafish and later in other fish species [9] and in a few mammals [10], fruit bats [11], and springhare [49]. Until recently, very little was known about the 5S rRNA-derived SINEs in insect genomes. Only one such repeat, SINE3-1_TC, has been found in the genome of the red flour beetle, T. castaneum [27]. In this study, the 5′-region of HaSE3 elements displays considerable similarity with 5S rRNA. The region of highest similarity was found in the promoter region composed of boxes A, IE, and C, indicating that HaSE3 is a novel 5S rRNA-derived SINE family in insects.
The body of SINEs is usually unique for each SINE family and its origin is largely obscure. Quite notably, the 3′-region of HaSE3, including the body part and the 3′-tail, is very similar to that of HaSE1. The sharing of a 3′-tail among SINE families is not unique to HaSE1 and HaSE3. For example, an approximately 70-bp-long 3′-tail is conserved in three SINE families, SImI, HpaI and OS-SINE1, and exhibits significant homology to the 3′-tail of the salmonoid RSg-1 LINE [50]. In this study, the high homology between HaSE3 and HaSE1 in the 3′-region suggested that the novel HaSE3 family might have been generated by template switching from the tRNA-derived HaSE1 RNA to 5S rRNA during the process of cDNA synthesis in retroposition [10]. Thus, our study may represent the first description of a chimera of a 5S rRNA-derived and a tRNA-derived SINE in insect species.

Partner LINE
Unlike DNA transposons, integration of new copies of SINEs into new genomic locations occurs via a mechanism of reverse transcription, which relies entirely on the enzymatic machinery of an autonomous partner LINE in the same genome. Recognition of SINE transcripts by the LINE reverse transcriptase (RT) is guaranteed either by the common 3′-tail shared by the SINE and its partner LINE (stringent recognition) or by the presence of the poly(A) tail (relaxed recognition) [51][52].
Until recently, little was known about LINEs in H. armigera. Several LINE elements were identified in the first intron of the cadherin gene and in several microsatellite clones [53] as well as in BAC clones [31]. In the present study, we identified one repeat element HaRTE1.

Abundance and distribution of SINE families in H. armigera
The number of copies of SINEs varies from family to family. In H. armigera, assuming that the BAC contigs are representative and that the genome size of H. armigera is 400 Mb [54], a total of 11,000 copies of HaSE1 and 1,200 copies of HaSE2 would be predicted. We did not find any HaSE3 in BAC clones, thus the abundance of HaSE3 seems much lower than that of HaSE1 and HaSE2. The variation in copy number among these three SINE families is probably due to differences in retroposition efficiency. The lack of a poly(A) tail or short direct repeats at the 3′ end of HaSE2 might contribute to its relatively low retroposition efficiency and copy number. On the other hand, while HaSE3 shares the same 3′ end with HaSE1, their promoters might have quite different transcriptional activities. As proposed by Kapitonov and Jurka [9], the type 1 promoters in 5S rRNAs depend much more on upstream signals than do type 2 promoters in tRNAs. As a result, the Pol III promoter in a retroposed 5S rRNA copy presumably remains silent or is expressed at a low level.
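The genome-wide prediction above is a proportional extrapolation from the sequenced BACs (1.963 Mb) to the assumed 400 Mb genome. A minimal Python sketch of that arithmetic follows; the function name is hypothetical, and the example count of six BAC-resident copies is an illustrative input rather than a figure stated in the text.

```python
def extrapolate_copy_number(copies_in_sample: int,
                            sampled_mb: float,
                            genome_mb: float) -> float:
    """Scale a copy count from a sampled region to the whole genome.

    Assumption: copies are distributed uniformly, so the sampled
    region is representative of the genome.
    """
    if sampled_mb <= 0:
        raise ValueError("sampled_mb must be positive")
    return copies_in_sample * genome_mb / sampled_mb


if __name__ == "__main__":
    # Hypothetical example: 6 copies observed in 1.963 Mb of BAC sequence,
    # scaled to a 400 Mb genome.
    print(round(extrapolate_copy_number(6, 1.963, 400.0)))
```

Because the scaling factor is about 204 (400/1.963), each additional copy observed in the BACs changes the genome-wide estimate by roughly 200 copies, so such estimates are sensitive to how representative the sequenced BACs are.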
SINEs are generally thought to be transmitted vertically from parents to offspring, and the probability of independent emergence of the same SINE family in unrelated species is negligible. Thus, species sharing the same SINE family or a SINE inserted in the same locus are considered to be related. Consistent with this notion, our searches against various GenBank databases revealed the presence of HaSE1 in H. zea and H. subflexa, two species closely related to H. armigera. Likewise, HaSE2 was found in H. subflexa, H. virescens, S. frugiperda, S. littoralis, S. litura and D. plexippus, and HaSE3 was found in H. subflexa, H. virescens, T. ni and M. brassicae. However, neither HaSE2 nor HaSE3 was found in H. zea, which is the closest relative of H. armigera and is thought to be derived from a founder population of H. armigera approximately 1.5 million years ago [55]. This is possibly because of the currently limited availability of H. zea sequence, rather than the true absence of the two SINE families from the H. zea genome. More extensive investigation of the distribution of the SINEs described in this paper would be very useful in phylogenetic analysis of large and diverse families in Lepidopterans.
The fact that we did not find any of the three SINE families in the fully-sequenced B. mori genome suggests that these three SINE families are probably narrowly distributed in subgroups of Lepidopterans. Thus, our finding of one HaSE2 element in A. gossypii, a non-lepidopteran species, suggests the occurrence of horizontal transfer of HaSE2. This is not unusual, as intensive horizontal transfer was observed for salmonid retroposons to the schistosome genomes [50,56]. A recent study shows that poxviruses are possible vectors for the horizontal transfer of Sauria SINE from reptiles to mammals [57]. It is possible that some disease pathogens or natural enemies (e.g. parasitoids) that infect or attack both heliothine species and A. gossypii may serve as the vectors for the horizontal transfer of HaSE2. Further research is necessary to test this possibility and to identify the possible vectors.