Repetitive Elements in the Siberian Larch (Larix Sibirica Ledeb.) Nuclear Genome

Background: Repetitive elements (REs) or repeats are sequences that occur multiple times in the genome. They represent a signicant part of the gigantic conifer genomes (70-80%) relative to mammals and other plants and complicate whole genome sequencing and annotation. However, REs play important roles in evolution and adaptation processes in both plants and animals. Moreover, amino acid repeats play an important role in plant immunity being a structural element of the products of some disease resistance genes. Analysis of REs in conifer genomes is an important fundamental task. Results: REs were identied de novo and partly classied in the Siberian larch (Larix sibirica Ledeb.) nuclear genome for the rst time. In total, 20.9 million REs were detected with the total size of 4.8 Gbp, which comprises about 39% of the 12.3 Gbp larch genome. Resistance genes with leucine-rich repeats (LRRs) were also identied and analyzed in the transcriptome data of autumn buds obtained using RNA-seq. Conclusions: For the rst time, REs were identied and classied in the Siberian larch genome and transcriptome. In addition, LRRs and resistance genes were identied and analyzed in the Siberian larch transcriptomes from autumn buds. The larch genome contains twice as less RE compared to other conifers in the same Pinaceae family (39 vs 70-80%), and it might explain why it also has almost twice as smaller genome size (12 vs 18-31 Gbp).

more complex in structure and practically do not form blocks that follow each other. Interspersed repeats mainly occurred as a result of the mobile element or transposon activity and are scattered throughout the genome with unstable location for active transposons [18,19].
These repeats are divided into autonomous and nonautonomous transposons. The former ones encode their own enzymes able to excise themselves and paste into another genome location. Nonautonomous elements use enzymes of autonomous transposons for these aims. For example, SINEs are nonautonomous retroelements that use the enzymatic machinery of autonomous LINEs. Many SINE families have been ampli ed recently leading to insertional polymorphism between closely related individuals, which is a desirable trait for a molecular genetic marker. In addition, this class has been identi ed in many plant families including Gramineae, Commelinaceae, Rosaceae, Solanaceae, Fabaceae, and Brassicaceae and can be used as a meaningful classi cation criterion to evaluate phylogenetic relations among species [20].
In addition to noncoding nucleotide repeats, there are also tandem codon repeats in protein-coding genes that encode amino acid repeats in protein sequences. They can represent a structural motif that performs a speci c function. Leucine-rich repeats (LRRs) have been found in many functionally diverse proteins.
They form horseshoe structures and might be involved in protein-protein interactions. The most representative class of plant-pathogen resistance proteins or R proteins in plants are NBS-LRR proteins.
As their name implies, the NBS-LRR proteins include a nucleotide-binding site (NBS), also called a central nucleotide-binding domain NB-ARC, and a domain containing LRR. The LRR regions demonstrate high variation in type and number of the LRR units between and within species, which provides the speci city of pathogen molecules recognition. Thus, LRRs play an important role in plant immunity [21][22][23][24].
In the normal condition, the host defence mechanism supresses transposon activity and does not allow new transposon insertions. The regions of transposon activity are very prone to the mutation accumulation. According to [25], this ultimately leads to the formation of the so-called 'genomic dark matter'. In such genome regions with high transposon activity, transposable elements (TEs) can overlay each other, which in combination with a higher mutation rate can result in hardly identi able TEs [19,[25][26][27].
Nystedt et al. [2] suggested that mechanisms of removing TEs (e.g., via unequal recombination) were less active in conifers than in most other organisms. The chromosomes of the most recent conifer ancestor expanded at a slow but steady pace due to the activity of different mobile elements such as Gypsy and Copia, which are very abundant in conifer genomes. The insertion of TEs into genes can lead to the appearance of large introns and in combination with translocation to an abundant number of pseudogenes.
Siberian larch represents an unusual genus, which is the only genus with deciduous species in the Pinaceae family. Like other conifers it has also a relatively large genome size of 12.3 Gbp, but it is much smaller than in pines, spruces, and Douglas-r [13]. Here we provide pioneering preliminary data on the larch repeatome.

Search and identi cation of REs
RepeatModeler allowed for the generation of species-speci c de novo repeat library containing 1721 mobile elements that were found in the current assembly of Siberian larch. Assembling the consensus sequences of 21 million clusters (with cluster size threshold of 200 reads per cluster) with Inchworm from TrinitRnaSeq package resulted in 31 thousand consensus sequences that likely represent repeated regions of Siberian larch genome. To validate these sequences, we compared them to the RepeatModelerderived library and to a PIER repeat library (Wegrzyn et al. 2013;Neale et al. 2014). Homologs were found for 12000 consensus sequences among the RepeatModeler-derived library, and for 7000 sequences among PIER database. Reciprocal BLAST showed that 1045 out of 1721 RepeatModeler-derived sequences had a close homology to a clustering-derived consensus sequences.
The proportion of classi ed families observed in the Siberian larch genome was the same as in other conifers. The RepeatMasker and combined repeat library methods allowed to nd 20.9 million REs with the total size of 4.8 Gbp, which comprises about 39% of the 12.3 Gbp genome (Table S1). LTRs were the most common REs and comprised 37.5% of all repeats in total. Gypsy superfamily prevailed over the Copia one, which is typical for conifers ( Fig. 1). They also abundantly presented in other conifers [2,7,28,29] and angiosperm species [30]. However, most of the LTR-retrotransposons have not been classi ed into smaller families; they were named "other LTR" in Table S1. Substantial portion of LTRs was homologues to a loblolly pine bacterial arti cial chromosome (BAC) library and fosmid sequences [31,32], PtTalladega (3646 copies in the Siberian larch genome), PtOuachita (1025), IFG (990), PtAppalachian (773), PtConagree (731) and eight more repeat families were identi ed (Table 1). The portion of non-LTR is not much lower than LTR, which distinguishes larch from other conifers. LINE elements take the largest part of the all classi ed REs in larch and comprised the biggest part of non-LTR retrotransposons and all classi ed REs in other conifers [2,4,7,8]. As for genome coverage, retrotransposons LINE, Penelope and SINE together comprise the majority in the non-LTR group ( Fig. 1).
DNA transposons comprised 11.63% of all REs in total. The largest part of these transposons was not classi ed to families, and they were marked as "other DNA" (Table S1). The majority of classi ed DNA repeats consisted of TIR and Helitron repeat families (Fig. 1).
The majority of repeats among different repeat families was also relatively small in length, less than 1 Kbp. A small part of the longest repeats reached almost 15 Kbp, they belong to LINE elements and uncharacterized LTR. The most frequent REs for each family were shorter than 1 Kbp. Some repeat groups have a bimodal length distribution (Gypsy, DIR, LINE/I, Helitron, Penelope), but both of peaks in the distributions were less than 1 Kbp (Figs S1-S5). The most of LTRs homologous to a loblolly pine BAC library did not exceed 500 bp. The relatively large number of the PtTalladega LTRs (relative to other families in the loblolly pine BAC library) were shorter than 1 Kbp and did not reach 4 Kbp (Figs S2 and S3).

LRR domains identi cation
In total, 195 transcripts containing the LRR domain were detected among 22116 transcripts in the autumn buds of Siberian larch using hidden Markov model (HMM) method HMMER3 to correctly assign homologous sequences to one or more Pfam families of LRR (0.9%). In comparison, in the transcriptome of the Sitka spruce bud, only 51 among 10105 transcripts contained the LRR domain (0.5%), which was signi cantly less than in the transcriptome of the Siberian larch bud according to the Fisher's exact test (two-tail p = 0.0002; Table 2). The LRR-1, LRR-4, LRR-8, and LRR-6 families encompassed the largest portion of the found putative LRR domains identi ed in both the larch and Sitka spruce transcriptomes (Fig. 2). As it can be seen in Fig. 2, the largest number of transcripts contained LRR that matched several LRR families, and the LRR-4 family was the most common match in both species. There were also a few sets of unique sequences that contained LRR that matched only one of the LRR families (highlighted in green in Fig. 2). The largest set of 24 such sequences contained LRR that was detected only by matching to the LRR-4 family. In Sitka spruce, this family matched LRR in almost all the sequences that were also revealed by other LRR families, except one set of 10 unique sequences found only by using this LRR-4 family in the search ( Fig. 2B).

Identi cation of NB-ARC among all transcripts and transcripts containing LRRs
In total, 38 sequences among all the sequences in the larch transcriptome contained the NB-ARC domain, but only seven of them contained also the LRR domain. Meanwhile in spruce, six sequences with the NB-ARC domain were identi ed, but none of them contained the LRR domain.
NBS-LRR proteins are among the largest proteins known in plants, usually ranging from ~ 860 to ~ 1,900 amino acids [33], but most of the proteins encoded by transcripts containing the NBS-LRR genes were shorter than 300-400 amino acids in both Siberian larch and Sitka spruce (Fig. 3).

Veri cation of transcripts with LRRs by OmicsBox
The functional annotation by InterProScan did not reveal the presence of other functional domains in these sequences (Fig. 4). ARC-and LRRs-containing sequences in Siberian larch could be resistance genes because they include P-loop NTPase and LRRs families. Sequences containing only ARC might be incomplete or truncated in both Sitka spruce and Siberian larch transcriptomes because they included only P-loop NTPase and some parts that were classi ed as gene families connected with pathogen resistance and playing binding function (Fig. 5). ARC domain-containing sequences mostly consisted of NB-ARC. The LRR-containing sequences in both species mostly were assigned to the LRR superfamily by InterProScan.
Based on the blast data, most of the transcripts with LRR and ARC in both species matched mostly homologous sequences of other conifers (Figs S6 and S7). Among all other best matches were also homologous sequences representing Amborella trichopoda, Nelumbo nucifera, Prosopis alba, Glycine max, Glycine soja, Nymphaea colorata, Ceratodon purpureus, and Selaginella moellendor i.

Discussion
The REs classi ed in the Siberian larch genome were typical for other studied other studied conifers. Among them LTRs, represented by Gypsy and Copia elements, were the most common. A large amount of LINE elements observed in the Siberian larch genome was speci c for larch genome and distinguishes larch from other conifers [2]. In other conifers LTRs mainly prevailed over non-LTR and the number of LINE elements was substantially lower (Fig. 2 in [2]).
RE identi ed in this study cover ~ 39% of the entire genome of larch, which is signi cantly less compared to other conifers. The reason for this can be due to fragmented assembly and/or smaller genome size. It is possible that we found only fragments of full-size repeats. The bulk of repeats of different families are rather short, less than 1 Kbp (Figs S1-S5).
The largest number of the LRR-containing transcripts were detected using the models LRR-1, LRR-4, LRR-6 and LRR-8. The LRR-4 family was the most frequent with more than 86% of all LRR transcripts, as well as it accounted for 98% of all LRR domains in the Sitka spruce transcriptome based on the autumn bud sample and used merely for comparison. OmicsBox functional annotations con rmed the presence of domains ARC and LRRs in the identi ed transcript sequences. Only several sequences contained both ARC and LRR domains. The exact function of the sequences that contained either ARC or LRRs domains is uncertain because most of them are too short to be R-genes, which also must contain both of these domains as well.

Conclusions
For the rst time, REs were identi ed and classi ed in the Siberian larch genome. Their analysis helps us better understand organization of such large genomes as in conifers. The Siberian larch genome contains twice as less RE compared to other conifers in the same Pinaceae family (39 vs 70-80%), and we think that it might explain why it also has almost twice as smaller genome size (12 vs 18-31 Gbp).
In addition, LRRs and resistance genes were identi ed and analyzed in the Siberian larch transcriptomes from autumn buds. The LRR-containing sequences are associated with immune response of plants to biotic stress. Their further study will also help us better understand genetic mechanisms of disease resistance in larch and other plants.

Methods
Larch RNA-sequencing and transcriptome assembly The Siberian larch nuclear genome and autumn bud transcriptome were sequenced and assembled in the Laboratory of Forest Genomics of Siberian Federal University (Krasnoyarsk, Russia) [13]. RNA was isolated from autumn buds from a reference Siberian larch tree using the Qiagen RNeasy Mini Kit (Qiagen, Hilden, Germany). The RNA-seq library was prepared using the TruSeq RNA Sample Preparation Kit v2 (Illumina Inc., San Diego, CA, USA). The PE-sequencing of the obtained library was carried out on the Illumina MiSeq platform using the MiSeq Reagent Kit v2 (300-cycles) (Illumina Inc., San Diego, CA). FastQC software v. 0.11.9 was used to evaluate the quality of the sequencing data. The raw sequencing data were processed using Trimmomatic program v. 0.39 [34] (9-bp headcrop, minimum read quality of Q=23, and minimum read length of 35 bp). SortMeRNA version 4.0.0 [35] was used for ribosomal RNA removal. In addition, Rcorrector [36] was used for sequencing error correction. Finally, de novo assembler Trinity v. 2.9.1 [37] was used to assemble the Larix sibirica transcriptome.

Search and identi cation of REs
To assess the relative abundance of previously characterized repeat families, RepeatMasker was used on a whole assembly of Siberian larch (12.3 Gbp).
To search for REs, we used RepeatModeler v.1.0.11, which is based on de novo RE detection programs RepeatScout and RECON [38,39]. Since RepeatScout does not use all scaffolds or contigs for the analysis, but only a part of them that is randomly selected, it was decided to analyse 2 869 scaffolds from a larch genome assembly longer than 100 Kbp (Table 3). This derived RepeatModeler de novo library was augmented by clustering of frequently occurring reads from whole-genome sequencing data. Clusters of reads were assembled with Inchworm from TrinityRnaSeq v2.2.0, which resulted in consensus sequences that should represent highly repeated regions of the larch genome. Unrecognized elements from de novo repeat library generated by RepeatModeler and frequently occurring reads were classi ed by TEclass v2.1.3. This program classi es transposons using the Support Vector Machines (SVM) and LVQ neural network [40]. . Also LRR clan contains families LRR-10, LRR-11, LRR-12, FNIP, DUF285, Recep_L_domain, and TTSSLRR, but they were excluded because they represent bacteria, animals, and myxomycetes [42]. Statistics of ORFs in transcriptome assemblies are presented in Table 4. The computing was mostly performed using a 96-core server with symmetric parallel multiprocessing (IBM x3950 X6) and 3 TB of RAM. The computing cluster also included an IBM dx360 M4 hybrid computational server with two NVIDIA Tesla K20 GPUs, as well as an IBM Storwize V3700 48Tb storage subsystem. The cluster runs on SuSe Linux Enterprise Server 11 with installed parallel le system IBM GPFS, monitoring system Ganglia and Torque batch processing system.

Gene ontology (GO) analysis
The OmicsBox [43] was used for BLASTing, GO mapping, annotation and statistical analysis. Gene ontology (GO) terms associated with the obtained BLAST results were extracted, and evaluated GO annotation was obtained. The annotation step was reduced because graph construction was not possible with sequences extracted from total sequence. Enzyme codes were inferred by mapping with equivalent GOs, while InterPro motifs were directly queried at the InterProScan web service. The GO annotation was visualized by reconstructing the structure of the GO relationships and pathways.

Declarations
Ethics approval and consent to participate Not applicable.

Consent for publication
Not applicable.

Competing interests
The authors declare that they have no competing interests.

Funding
This work including the study and collection, analysis and interpretation of data, and writing the manuscript was supported by research grant № 14.Y26.31.0004 from the Government of the Russian Federation. The funding agencies played no role in the design of the study and collection material, analysis and interpretation of data, and in writing the manuscript. Publication cost has been funded by the Open Access Publication Funds of the University of Göttingen.