Genome-wide characterization of PRE-1 reveals a hidden evolutionary relationship between suidae and primates

We identified and characterized a free PRE-1 element inserted into the promoter region of the porcine IGFBP7 gene whose integration mechanisms into the genome, including copy number, distribution preferences, capacity to exonize and phyloclustering pattern are similar to that of the primate Alu element. 98% of these PRE-1 elements also contain two conserved internal AluI restriction enzyme recognition sites, and the RNA structure of PRE1 can be folded into a two arms model like the Alu RNA structure. It is more surprising that the length of the PRE-1 fragments is nearly the same in 20 chromosomes and positively correlated to its fracture site frequency. All of these fracture sites are close to the mutation hot spots of PRE-1 families, and most of these hot spots are located in the non-complementary fragile regions of the PRE-1 RNA structure. Sequence homology analysis showed that the PRE-1 element seemed to share a common ancestor 7SL RNA with primates but was generated by different evolutionary model, which suggests that the suidae may be the closest relatives to primates in laurasiatheria.

element is the presence of two "A-rich" regions.. The first "A-rich" region is near the center of the element with a consensus sequence of A5TACA6. It separates the sequence into two halves, left arm and right arm, and is often called the "A-linker". The 5' half of each sequence contains an RNA-polymerase-III promoter (A and B boxes). The second "A-rich" region is located 3' of the right arm, 150 bp downstream of the first A-rich region, near the end of the Alu body, and almost always consists of a run of As that is only occasionally interspersed with other bases. It is often called the "A-tail". (j) Alignment of the AluY_12 element with the PRE-1 element.

The copy number of the PRE-1 repeats in pig genome
To identify more PRE-1 elements, we used BLASTn to search "somewhat similar sequences" to this first identified PRE-1 element against the database of the NCBI Pig Genome, and 1037475 blast hits (length:12-350bp) of the PRE-1 repeat were found on both strands in the genome, which together account for nearly 8.21% of the sequenced region of the pig genome.
The correlation coefficient between the copy number of PRE-1s and the chromosome length is 0.973 (Fig. 2a), large chromosomes were necessarily associated with more PRE-1 repeats. Additionally, we performed a cluster analysis to investigate the correlation among the number of PRE-1 repeats with other important genomic features (numbers of genes, protein-coding sequences, splice variants, exons, introns, and the length of chromosome); all these data are grouped in terms of the chromosome, and Ward hierarchical clustering was applied (Supplementary Table. 1). The dendrogram revealed that the number of PRE-1 repeats is correlated with six variables in the following order of correlation strength: chromosome length > exons > introns > splice variants > genes > protein-coding sequences. (Fig. 2b), so we selected a small sample including the gene that is composed of the most exons in each chromosome and calculated that the correlation coefficient of the numbers of introns and PRE-1s in each gene region is only 0.287 (Fig. 2c). As primate Alu elements accumulate preferentially in gene-rich regions, we further compared the density distribution of genes and PRE-1 elements for 20 chromosomes, where each chromosome was split into 1000 kb intervals and PRE-1 size and gene size was calculated individually for all of these regions. We found that the overall trend of the density distribution between of the two is very similar, with some exceptions in a few areas in chromosome 1;, the average correlation coefficient j of 20 chromosomes is 0.482. (Fig. 2d,  We deduced that gene enrichment in a genomic region is likely to attract more PRE-1s parasited in the introns, but the genomic feature in the gene-rich region, such as the intron size instead of intron number, may be a necessary condition for accommodating more repeats.     BLASTn search showed that the dynamic size range of these "somewhat similar sequences" is approximately 12 -350 bp. We calculated the length-frequencies of PRE-1 repeats on both strands of 20 chromosomes. Surprisingly, not only the total numbers, but the length-frequency distributions of sense and antisense PRE-1 elements are highly similar on each chromosome. When plotting the positive Y-value of the length-frequency of sense PRE-1 elements and negative Y-value of antisense PRE-1 elements from 10-350 bp at 10 bp segments along the X-axis, the length-frequency distributions of sense and antisense PRE-1 elements mirrored each other vertically, thus presenting a perfect symmetrical "tadpole" type curve (Fig. 3a). Only the Y chromosome is limited by insufficient PRE-1 repeats (Fig. 3b), all the other 19 chromosomes follow the same rule ( Supplementary Fig. 2 For investigating how the copy number of PRE-1s affect the degree of symmetry of the length-frequency distribution curve, we selected 13 pairs of data (calculating the correlation coefficients corresponding to the number of PRE-1 elements) by an increasing interval step in a region ranging from 1 to 20000000 bp on chromosome 1 and plotted a growth curve of the correlation coefficient (Fig. 3e). By performing non-linear regression analysis, the 4 parameter logistic non-linear regression model had been tested to fit the curve well. According to this non-linear regression model, we deduced that when the total number of PRE-1 elements cumulated to approximately 784, the Pearson coefficient will reach 0.9; when the number of PRE-1 elements increased to 1000, the Pearson coefficient will plateau at 0.92-1 (Fig. 3f, Supplementary Table 6).
According to the regression equation, we know that the number of PRE-1 elements is only 721 on the Y chromosome, so the correlation coefficient is only 0.902, which is short of the plateau and explains the lack of high symmetry of the curve.
We further calculated the percentage of integrating and break sites of these repeats initiated from the complete PRE-1 elements on all 20 chromosomes by the hit location information extracted from the results of the BLASTn search. In total 1037475 PRE-1 blast hits were used, and the proportion of integrating sites of the PRE-1 blast hits was displayed in a plot chart as shown in Fig.  3g We found that 64.82% of the integrating sites in the PRE-1s are initiated from position 1 and that 91.45% of them are from the A region; there is only one significant integration site in the B region at position 136 ( Fig. 3 g). Meanwhile, 42% of the fracture sites are concentrated in the tail region. Positions 216 and 235 are located in the uncharacteristic region and have the highest fracture probabilities, accounting for 8.79% and 5.80%, respectively, but in the trinucleotide repeat region, the fracture probability showed a normal distribution (Fig. 3 h). There are two fracture hotspots in the B region, one is at position 120 (0.89%) and the other is at position 139 (0.57%). The frequency distribution of fracture sites and fragment lengths showed a highly positive correlation, with a 0.946 correlation coefficient (Fig. 3i, 3j). The correlation of frequency number between fracture sites and fragment lengths on the whole pig genome (The interval of frequency distribution for fracture sites and fragment lengths is 10 bp).

Exonization of the PRE-1 elements in pig genome
Although Alu elements are ubiquitously interspersed in gene-rich regions throughout the human genome, with the majority located in introns, most Alu elements have remained silent, whereas a small subset have undergone exonization. More than 5% of the alternatively spliced internal exons in the human genome are derived from Alu elements 27 . Compared to normal cells, Alu exonization occurred in one of the non-coding transcripts or in most leukemia/cancerous cells at a very high level 28 . The exonization of Alu sequences has been shown to be regulated not only by splice sites but also by splicing regulatory sequences within the Alu elements 29 .
In total, 2337 PRE-1 blast hits were found on the 1913 transcripts, including 673 splice variants from 1240 genes found by BLASTn search against the NCBI reference RNA sequence of Sus scrofa, consisting of 311 non-coding RNAs (accounting for 18% of total non-coding RNAs) and 929 mRNAs (accounting for 4% of total coding RNAs) (Fig. 4a).
We human Alu element (Fig. 4b) 30 , although most of the PRE-1 blast hits in the protein-coding regions are less than 100 bp and adjacent to the coding region boundary. A few large fragments of PRE-1 elements were found integrated in coding regions and involved in the formation of new proteins such as MHC class I related antigen 2 (MIC-2) (Fig. 4c). We assembled all of these 347 PRE-1 sequences into 5 consensus sequences using DNA Dragon. By aligning these sequences, we found that each integration initiation site or termination site is located ahead of a hypervariable region in the multiple alignment; three mutation regions with the largest degree of variation are all linked with sites: position 216, position 235 and the trinucleotide repeat region. (Fig. 5b).
We found 98% of the 347 PRE-1 elements contain two conserved AluI restriction sites by performing restriction enzyme analysis.

Discussion
Unlike all other transposons, most SINE families do not share a common origin and were independently generated in different host lineages from cellular and LINE modules 12 . The specific SINEs or SINE families are usually restricted to respective orders, rarely crossing order boundaries 33 . In recent years, a novel kind of 7SL RNA-derived SINE, Tu type I, had been identified to be a 7SL RNA-derived SINE in scandentians (tree shrews) 11 . These 7SL RNA-derived SINEs have already been considered a supraprimates-specific SINEs 13 . The 5'end of SINEs are derived from 7SL RNA (Alu, B1, and their derivatives in primates, rodents, and tree shrews), the origin of the central part of most SINE families is obscure, and the 3' end of SINEs descend from the 3' end of the partner LINE 34 .
The first PRE-1 element was identified in the genome of the miniature swine by Singer et al. in 1987 35 and was thought to be an arginine-tRNA gene 36 , which diversified at least 43.2 million years ago 37 . Approximately 98% of PRE-1 elements contain two conserved AluI restriction sites, ( Supplementary Fig. 4), but only 40% Alu representative sequences in the AF-1 database have conserved double-AluI recognition sites in the B arm region, and the evolution of the PRE-1 element from 7SL RNA is more evident than that for the Alu element (Fig. 6a). The conclusion from our results is that the PRE-1 element maybe a unique SINE in the pig genome. It is more like a 7SL RNA-derived SINE than a tRNA-derived SINE. In the process of integration into the genome PRE-1 will fracture into fragments. As 7SL RNA was generated from tRNA, these small indirect tRNA derived fragments are usually considered to be mistakenly derived from tRNA.
Just like human Alu elements, PRE-1 elements tend to be clustered in the introns of gene-rich regions, so we abandoned the optimal interval of each chromosome to calculate the correlation instead of a unified interval length (1000 kb), concluding that the average correlation coefficient of the density distributions of genes and PRE-1 elements for all 20 chromosomes is only 0.482. As we observed on the COL7A1 gene, which has 119 exons on chromosome 13, most introns are extruded into "narrow introns," within which 12 PRE-1 elements are accommodated, so we concluded that the numbers of "wide introns" may play a key role to explaining why PRE-1 elements are parasitic in gene-rich regions.
Although the PRE-1s propagated the nearly same number of repeat on both sense and antisense strands, and sensitive between them, but the length distribution of PRE-1 fragments is most intriguing and indicates that PRE-1 propagation is a continuous fragmented insertion process. The amplification mechanism of PRE-1 repeats is very similar to that of the primate Alu repeats, the variation of repeat size is because any position of the complete replicated strand can be inserted into the target site using a nick at its genomic integration site to allow target-primed reverse sites. According to the alignment of 347 complete PRE-1 elements, we found that all fracture hotspots are followed by a mutation region; it seemed that a higher degree of variation indicated a higher probability of fracture.
We further predicted the secondary structure of PRE-1 RNA using RNAstructure Webservers (http://rna.urmc.rochester.edu/RNAstructureWeb/Servers/Predict1/Predict1.html), The modle was chose for it's the most minimal energy value shows that even without an A-linker, the structure of PRE-1 RNA can still be folded into a two-arm model where most mutations in the tail region are located in the non-complementary single-stranded region, which is the most fragile region of the PRE-1 RNA structure. The PRE-1 RNA structure further revealed that it evolved from 7SL RNA as it shares similarity to the primate Alu element based on the RNA structure. The PRE-1 RNA maybe also transcribed from the adjacent genomic site and is thought to assemble into a SRP9/14 heterodimer, with polyA-binding protein (PABP) and at least one other unidentified protein binding to the RNA structure. The SRP9/14 proteins and PABP help the PRE-1 RNA associate with a ribosome to associate with ORF2. PRE-1 RNAs then utilize the purloined ORF2 protein to copy themselves at a new genomic site using a process termed target-primed reverse transcription 23 . and now, we speculated that the gene cluster seems has dual effects of attraction and insulation on the PRE-1 or other SINEs during the integration process, the number of PRE-1 inhaled in the non-coding region of the gene region is significantly higher than that of in the non-gene regions, We are not sure about how long the fragment is at some insertion site, but only get the probability of the fragment.
In the past, we were always much more concerned with the evolution of genes rather than with other genomic elements and used genes as the most important markers to trace phylogeny, regardless of the fact that the genome was coevolved by the gene together with other genomic elements. In our research, the genomic performance of PRE-1 in terms of 7SL RNA-derived SINEs seemed convincing enough to classify the suidae into a family mainly inhabited by primates, so the divergence time of 7SL RNA evolutionary events may be re-adjusted back to 80-100 million years ago 38 , before the boreoeutherians diversified into Laurasiatheria and Euarchontoglires 39 . 7SL RNA began to diversify or hybridize from different tRNAs to form a variety of 7SL RNA-like SINE in boreoeutherian ancestors. One branch was found in the ancestor of primates and evolved in supraprimates; the other was found in the ancestor of suidaes and evolved in laurasiatherias. The similarity of PRE-1 and Alu elements further revealed a hidden kinship, such that suidaes are more closely related to the suraprimate/primate than any other laurasiatherias based on the 7SL RNA derived SINE composition of their genomes.

Methods
The numbers and positions of PRE-1 repeats in complete genome were identified using BLASTn (http://blast.ncbi.nlm.nih.gov/Blast.cgi). Information about the associated genes, protein coding sequences, introns, exons, and splice variants of pig genome was retrieved from the annotations of each chromosome available on NCBI's gene database. Correlation between PRE-1 repeats and gene density was calculated for non-overlapping windows along the whole chromosome at 1000 kb intervals in term of chromosome size. The total PRE-1 size (base pairs in an interval occupied by a PRE-1 elements) was taken as measures of PRE-1 density, rather than its numbers.
Statistical analyses were performed using Excel, XLSTAT, and Graphpad Prism 6 software packages. Sequence alignments and graphic displays were generated using the Geneious Pro software package (version 4.8.3). The phylogenetic tree was constructed using the Maximum Likelihood method in MEGA (V.6.06), and the PRE-1 contigs were assembled by DNA Dragon 1.6.0 with 80% identity of complete overlapping fragments. Fig. 1 Clustering tree of porcine Alu-like SINE/PRE-1 with representative primate Alu sequences. A maximum likelihood tree produced by MEGA 6.06 using 40 primate Alu sequences from the AF-1 database with porcine Alu-like SINE/PRE-1 and 7SL RNAs of pig and human.     chromosome. Because the length and gene cluster of every chromosome isn't same, every chromosome has its own optimal bin width value for its highest correlation coefficient. Through a series of calculations, we find that the bin width value of 10000000 or 5000000 is an ideal option to obtain a relatively high correlation coefficients, except for chromosome Y (bin width value is 200000), every correlation coefficient is chosen from the higher value between the frequency distribution of bin width values of 10000000 and 5000000. Table 4 Length-frequency of PRE-1 repeats on all 20 chromosomes of the pig.

Supplementary information
A table describing the frequency value of PRE-1 repeats on both strands of each pig chromosome.
The frequency value of repeats on antisense strands are set negative. of each chromosome to perform a paired T-test. Table 6 Growth curve of the Pearson coefficient of the length frequency in the region ranging from 1 to 20000000 bp on chromosome 1. A table describing the increase in the Pearson coefficient of the length frequency between repeats on both strands with the range increased, including the number of the various sizes of sense and antisense repeats in a gradually increasing range.