Characterization of repeated DNA sequences in genomes of blue-flowered flax

Background: Members of different sections of the genus Linum are characterized by wide variability in the size, morphology and number of chromosomes in their karyotypes. Since such variability is determined mainly by the amount and composition of repeated sequences, we conducted a comparative study of the repeatomes of species from four sections forming a clade of blue-flowered flax. Based on the results of high-throughput genome sequencing performed in this study as well as available WGS data, bioinformatic analyses of repeated sequences from 12 flax samples were carried out using a graph-based clustering method. Results: The genomes of closely related species with a similar karyotype structure were also similar in repeatome composition. In contrast, the repeatomes of karyologically distinct species differed significantly, and no similar tandemly organized repeats were identified in their genomes. At the same time, many common mobile element families were identified in the genomes of all species; among them, the Athila Ty3/gypsy LTR retrotransposon was the most abundant. The 30-chromosome members of sect. Linum (including the cultivated species L. usitatissimum) differed significantly from the other studied species in the large number of satellite DNA families and in their relative content in the genomes. Conclusions: The evolution of the studied flax species was accompanied by waves of amplification of satellite DNAs and LTR retrotransposons. The observed inverse correlation between the total contents of dispersed repeats and satellite DNAs suggests a relationship between the two classes of repeated sequences. Significant interspecific differences in satellite DNA sets indicated a high rate of evolution of this genomic fraction. The phylogenetic relationships between the investigated flax species, obtained by comparison of the repeatomes, agreed with the results of previous molecular phylogenetic studies.
Electronic supplementary material The online version of this article (10.1186/s12862-019-1375-6) contains supplementary material, which is available to authorized users.


Background
Repeated sequences are important components of eukaryotic genomes and often constitute a significant part of plant genomes [1]. They can be dispersed across the genome or arranged in large stretches of tandem repeats (satellite DNA). Repeated sequences are predominantly located in functionally important regions of eukaryotic chromosomes (e.g., centromeres, telomeres and some interstitial chromosomal regions), where they are the main constituent of heterochromatic bands [2][3][4]. Dispersed repeats often occupy a significant fraction of the genome and are one of the major determinants of genome size differences in eukaryotes. They are mainly represented by different classes of transposable elements (TEs), which are capable of moving from one location in the genome to another. Since TE insertion has been shown to affect gene function directly, it has been suggested that the activation of TE movement in response to environmental stress, and the associated generation of new mutations, may play some role in the adaptation of an organism to a changing environment [5][6][7][8][9][10]. Satellite sequences are the most rapidly evolving part of the eukaryotic genome. The functions of satellite sequences are still not fully understood, but there is evidence of their involvement in spatial chromosome organization and in the mechanisms of chromosome pairing and segregation [11]. Satellite DNA transcripts have been found to participate in the formation and maintenance of heterochromatin structure [12]. It has been suggested that the high variability of centromeric satellite DNAs can contribute to speciation by creating reproductive barriers between populations [13,14]. Even though the study of repetitive sequences is essential for understanding the principles of functional regulation and evolution of genomes, until recently information on the structure, organization and abundance of repetitive sequences in genomes was very fragmentary.
This situation was mainly related to the technical difficulties of studying the repetitive elements in genomes [15]. The further development of NGS methods and new bioinformatic approaches has made it possible to change this situation radically. In particular, a set of software tools called "RepeatExplorer", which identifies repetitive sequences from raw WGS reads [16,17], was developed. Based on this approach, the repeatomes of many plant species have already been successfully investigated [18][19][20][21][22][23][24][25][26][27].
The genus Linum L. (Linaceae) includes about 200 wild species. It is assumed that the genus originated in Eurasia and later spread to Africa, Australia, and North and South America. Recent molecular phylogenetic studies have shown that the genus Linum L. is not monophyletic but includes representatives of two sister clades: yellow-flowered and blue-flowered flax [28,29]. Each of these clades is further subdivided into several groups of closely related species. Thus, yellow-flowered flax includes groups of species corresponding to the sections Cathartolinum (Reichenb.) Griseb., Syllinum Griseb., Linopsis (Reichenb.) Engelmann and the genera Cliococca Bab., Hesperolinon (A. Gray) Small, Radiola Hill. and Sclerolinon C. M. Rogers. Blue-flowered flax is also subdivided into groups according to the botanical sections Stellerolinum Juz. ex Prob., Dasylinum (Planchon) Juz., Adenolinum (Reichenb.) Juz. (syn. Linum perenne L. group) and Linum L. The sect. Linum includes the important cultivated species L. usitatissimum. Many wild flax species are important medicinal or ornamental plants. They are also considered potential donors of economically valuable traits for improving the cultivated species. In the last decade, the genome of cultivated flax L. usitatissimum has been studied intensively [30][31][32][33][34][35][36][37][38][39][40][41][42][43], and a chromosome-scale genome assembly has recently become available [30,43]. However, the number of studies of the genomes of wild representatives of the genus Linum is still very limited. Many wild species have been studied using molecular-karyological approaches [44][45][46][47][48]. For four species, transcriptomes were recently sequenced [49]. In addition, some molecular phylogenetic studies of the genus Linum have been carried out [28,29,45][49][50][51][52][53][54]. In particular, it was shown [28,54] that the only representative of the sect. Stellerolinum, L. stelleroides Planch.
(2n = 20), formed the basal phylogenetic branch of the extant blue-flowered flax. After this branch, the branch represented by modern species of sect. Dasylinum (2n = 16, 32) separated from the common trunk of the phylogenetic tree. Then, the branches corresponding to the sections Adenolinum (2n = 18, 36) and Linum separated from each other. Unlike species of most sections of blue-flowered flax, representatives of sect. Linum were not a homogeneous group and were in turn subdivided into several subgroups of karyologically distinct species. In particular, L. narbonense L. (2n = 4x = 28), the 16-chromosome species L. grandiflorum Desf. and L. decumbens Desf., the 30-chromosome species L. usitatissimum L. and L. angustifolium Huds., and the highly polyploid species from Australia and New Zealand (L. marginale A.Cunn. ex Planch. and L. monogynum G. Forst. (2n = 84)) represented individual subbranches within sect. Linum. In the present study, to better understand the organization and evolution of flax genomes, we conducted low-depth sequencing and a comparative repeatome study of representatives of blue-flowered flax using NGS data.

Genome size estimation
The nuclear DNA content of the flax species was determined by comparative Feulgen photometry according to Boyko et al. [55]. Briefly, the modification of this method includes synchronization of root meristem cells by cold treatment at 0-4 °C overnight, fixation in ethanol:acetic acid fixative (3:1) at 2-4 °C, hydrolysis at 50 °C in 1 N HCl for 40 min, maceration with Cellulysin (Calbiochem, USA) and squashing of root tips on microscope slides. About 50 prophase nuclei were measured for each flax sample with an Opton scanning microphotometer. Measurements were made relative to diploid rat hepatocytes containing 7.8 pg DNA per 2C nucleus [56] and Hordeum vulgare var. Odesski 31 containing 22.6 pg DNA per 4C nucleus [57].

DNA extraction, library construction, and sequencing
Flax seedlings were used for DNA extraction as described earlier [50]. High-quality DNA was used for DNA library preparation with the TruSeq DNA Sample Prep Kit (Illumina, USA): 1000 ng of each sample was fragmented by nebulization, and then end repair, 3′-end adenylation, and adapter ligation were performed following the manufacturer's protocol. DNA fragments of about 500-700 bp were excised from an agarose gel and purified with the MinElute Gel Extraction Kit (Qiagen, USA). Enrichment of DNA fragments was performed using PCR Master Mix and Primer Cocktail (Illumina, USA). The quality and concentration of the obtained libraries were evaluated using an Agilent 2100 Bioanalyzer (Agilent Technologies, USA) and a Qubit 2.0 fluorometer (Life Technologies, USA). The libraries were sequenced on a MiSeq sequencer (Illumina, USA), and paired-end reads (300 + 300 nucleotides) were obtained.
To identify and classify repetitive sequences in flax genomes, raw paired-end WGS reads were analyzed using the RepeatExplorer toolkit [17]. For each studied genome, the WGS reads were filtered by quality, randomly sampled to a final genome coverage of about 0.1X, trimmed to a length of 100 bp and clustered using the graph-based clustering algorithm. A read similarity cut-off of 90% was used for clustering. The reads belonging to the same cluster were assembled into contigs. A minimum sequence overlap of 55% was used for assembly. The obtained sequence clusters were identified by a similarity search against the repeat database implemented in RepeatExplorer. Tandem repeat sequences were identified and characterized with the TAREAN tool of RepeatExplorer. Clusters containing satellite repeats were identified by the globular or ring-like shape of their cluster graphs. Monomer sequences of the satellite repeats were reconstructed using k-mer analysis. We took into consideration only putative satellite sequences that had a probability of being satellite DNA of at least 0.1 and constituted at least 0.01% of the genome. The obtained putative satellite repeats were compared with known sequences from NCBI by BLASTN.
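As a rough illustration of the subsampling step, the number of reads needed for a target coverage follows from coverage = (number of reads × read length) / genome size. The sketch below is not part of the RepeatExplorer toolkit. The function name `reads_for_coverage` is hypothetical, and the genome size in the example is the 1C value reported for L. stelleroides in the Results.

```python
import math
from fractions import Fraction


def reads_for_coverage(genome_size_bp: int, coverage: float, read_length_bp: int) -> int:
    """Return the smallest number of reads giving at least the requested coverage.

    Hypothetical helper for illustration only.
    """
    if genome_size_bp <= 0 or read_length_bp <= 0 or coverage <= 0:
        raise ValueError("all arguments must be positive")
    # Convert via the decimal string so that 0.1 is treated as exactly 1/10
    # rather than as its nearest binary floating-point value.
    exact_coverage = Fraction(str(coverage))
    return math.ceil(exact_coverage * genome_size_bp / read_length_bp)


# Example: 0.1X coverage of a 1376 Mbp genome with reads trimmed to 100 bp.
n_reads = reads_for_coverage(1_376_000_000, 0.1, 100)
print(n_reads)
```

Under these assumptions, about 1.4 million 100-bp reads correspond to 0.1X coverage of the largest diploid genome in the study, and proportionally fewer reads are needed for the smaller genomes.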

Primers design and PCR-amplification of putative satellite DNAs
For all flax samples except L. usitatissimum and L. angustifolium (syn. L. bienne), the tandem organization of the putative satDNA families found was additionally examined by PCR amplification. For this purpose, primers were designed in opposite orientation to the most conserved regions of the monomer consensus sequences (Table 1).
The characteristic ladder pattern of tandem repeats was checked after electrophoresis in a 2% agarose gel. The identity of the PCR products was confirmed by Sanger sequencing. For L. usitatissimum and L. angustifolium, the reliability of the putative satDNA families was verified by BLASTN comparison with a BioNano genome (BNG) optical map of L. usitatissimum cv. CDC Bethune [59] available in GenBank (NCBI) under GenomeProject ID #68161 (accession numbers CP027619-CP027633).

Phylogenetic analysis and statistical evaluation
A previously published approach [60] was used to construct a phylogenetic tree based on repeatome composition. Clusters corresponding to plastid and mitochondrial DNA sequences were filtered out prior to phylogenetic inference. Each abundance was divided by a correcting factor (largest abundance/65) to make all numbers in the matrices ≤ 65, as required for TNT tree searches [61,62]. A data matrix containing the relative proportions of the 200 most abundant clusters was converted to the TNT format (modified Hennig86). Resampling was performed using 100 replicates. The tree was visualized with the iTOL tool [63].
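The rescaling step can be sketched as follows. This is a minimal illustration and not the published pipeline [60]. The function name, the toy matrix and the species labels are hypothetical. It assumes that a single correcting factor (largest abundance/65) is applied to the whole matrix.

```python
def rescale_for_tnt(abundances, max_value=65.0):
    """Rescale cluster abundances so that the largest value equals max_value.

    `abundances` maps a sample name to a list of cluster abundances.
    Dividing by the correcting factor (largest / max_value) is written here
    as a * max_value / largest, which is mathematically equivalent.
    """
    largest = max(max(row) for row in abundances.values())
    if largest <= 0:
        raise ValueError("matrix must contain at least one positive abundance")
    return {
        sample: [min(a * max_value / largest, max_value) for a in row]
        for sample, row in abundances.items()
    }


# Toy matrix with hypothetical values (three clusters, two samples).
toy = {
    "species_A": [1300.0, 260.0, 13.0],
    "species_B": [650.0, 520.0, 0.0],
}
scaled = rescale_for_tnt(toy)
for sample, row in scaled.items():
    print(sample, row)
```

The `min(...)` guard only protects against floating-point rounding pushing a value marginally above the limit. It does not change the relative proportions between clusters.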

General characteristics of genomes
The analysis of genome size and repeatomes showed that species from different sections differed significantly in genome size and repeatome composition (Table 2). The only representative of sect. Stellerolinum, the diploid species L. stelleroides, had the largest genome among the diploids studied (1C = 1376 Mbp), 39.47% of which was represented by mobile elements and only 0.55% by satellite DNA. The genome of L. hirsutum from sect. Dasylinum (1C = 1066 Mbp) contained 34.52% TEs and 2.27% satellite DNA. The genomes of the diploid species of sect. Adenolinum were similar in size and repeatome composition. On average, their genome size was 446 Mbp, with 29.57% TEs, 4.46% unclassified dispersed repeats and ~4.67% satellite DNA. Unlike the previous sections, sect. Linum united several groups of species that differed significantly in the structure of their karyotypes. We investigated representatives of the three karyo-groups most widely represented in the European flora. Along with their karyological differences, representatives of these three groups also differed significantly in the molecular organization of their genomes. Among them, the group represented by the autotetraploid species L. narbonense (2n = 4x = 28) had the largest genome size, 1C = 1617 Mbp (or 808 Mbp per subgenome). Dispersed repeats represented a significant portion of its genome (54.78%), and the proportion of satellite DNAs (0.87%) was smaller than in the representatives of the previous sections. The second group of sect. Linum comprised two karyologically similar diploid species, L. grandiflorum and L. decumbens (2n = 16). Their genomes were also very similar (1C ~480 Mbp) and contained ~25% dispersed repeats and ~10% satellite DNAs. The third group included karyologically similar allotetraploids (2n = 30): the cultivated species L. usitatissimum and its wild ancestor L. angustifolium.
Both had the smallest genomes (1C ~330 Mbp, an average of 165 Mbp per subgenome), with ~18% dispersed repeats and ~13% satellite DNAs.

Dispersed repeats
Comparison of the repeatomes of the studied species showed that they all contained TE spectra of similar composition (Table 2). In blue-flowered flax, Ty3-gypsy LTR elements were the most widely represented. Among them, the most common was Athila. Its amount in the genomes of different species varied from 25% in L. hirsutum to 0.6% in the 30-chromosome species of sect. Linum. In L. stelleroides, the Ty3-gypsy LTR element Tekai was also widely represented (~16%) together with Athila, while the genome of L. narbonense contained a considerable amount (~39%) of the Ty3-gypsy LTR element Ogre. In contrast, the genomes of the other species contained minimal amounts of Ogre. Amplification of Tekai in L. stelleroides and of Ogre in L. narbonense resulted in an increase in the genome sizes of these species.

Satellite DNA
In total, 44 families of tandemly organized repeats were identified in the genomes of the studied species. Our results showed that the satellite DNA of blue-flowered flax evolved much faster than dispersed repeats. Common families of satellite DNAs were found only within the three groups of most closely related species with similar karyotypes (representatives of sect. Adenolinum, and the 16- and 30-chromosome species of sect. Linum). Only one family of putative satellite repeats common to sect. Adenolinum and sect. Linum was identified using RepeatExplorer. However, this putative satellite had low confidence, and PCR amplification did not confirm its tandem organization. For the other, karyologically different species, strictly specific sets of satellite DNA families were identified. At the same time, DNA sequences homologous to satellite repeats of one species were often found in the genomes of other, not closely related species, but in low copy numbers (Additional file 1). The genomes of different species also differed in the content and diversity of satellite DNA (Table 2 and Additional file 1). For instance, in L. stelleroides, only one family of putative satellite repeats was detected, with low confidence; it made up about 0.55% of the genome. After PCR amplification, only very weak bands corresponding to the monomers and dimers of this repeat were detected. In our opinion, the question of whether the putative satellite of L. stelleroides really has a tandem organization needs further investigation. In the L. hirsutum genome, four families of satellite DNAs were detected and confirmed by PCR. Their total content was 2.3% of the genome. In the genomes of species of sect. Adenolinum, the presence of five satellite families was confirmed. Their total content and the abundance of each family varied between species (the mean value was 4.4%). In L. narbonense, only two families of satellite DNA were discovered, which together made up ~0.9% of the genome.
In the 16-chromosome species of sect. Linum, two families of satellite DNAs common to both L. grandiflorum and L. decumbens were identified. One more family was found only in L. grandiflorum, and one was specific to L. decumbens. The highest number of satellite repeat families (28 families) was found in the 30-chromosome species of sect. Linum. As in representatives of sect. Adenolinum, the amount of each satellite DNA family in these species varied between L. usitatissimum and L. angustifolium. Thus, the studied species showed a distinct tendency for the total amount and diversity of satellite repeats to increase as the content of dispersed repeats and the genome size decreased (Fig. 1).
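The direction of this trend can be checked roughly with a rank correlation. The sketch below is an illustration rather than an analysis performed in this study. It uses only the group-level genome sizes and satellite DNA contents quoted in the text above, which are approximate for the 16- and 30-chromosome groups and an average for sect. Adenolinum. The helper `spearman_rho` is a hypothetical implementation that assumes no tied values.

```python
def ranks(values):
    """Rank values from 1 (smallest) upward; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman_rho(x, y):
    """Spearman rank correlation for two equally long, tie-free sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    if len(set(x)) != len(x) or len(set(y)) != len(y):
        raise ValueError("ties are not supported in this sketch")
    n = len(x)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# (1C genome size in Mbp, satellite DNA in % of genome), as quoted in the text.
groups = {
    "L. stelleroides": (1376, 0.55),
    "L. hirsutum": (1066, 2.27),
    "sect. Adenolinum (mean)": (446, 4.67),
    "L. narbonense": (1617, 0.87),
    "16-chromosome species": (480, 10.0),
    "30-chromosome species": (330, 13.0),
}
sizes = [size for size, _ in groups.values()]
satellites = [sat for _, sat in groups.values()]
rho = spearman_rho(sizes, satellites)
print(f"Spearman rho = {rho:.3f}")
```

A negative coefficient is consistent with the inverse relationship described above. With only six group-level points, it should be read as descriptive rather than as a formal test.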

Phylogeny
Based on the comparison of the repeatomes, a phylogenetic tree of blue-flowered flax was constructed (Fig. 2). L. stelleroides and L. hirsutum were the most phylogenetically remote from the rest of the species. The representatives of sect. Adenolinum formed one common cluster, and the members of sect. Linum formed three separate subclusters, which corresponded to L. narbonense and to the 16- and 30-chromosome species.

Discussion
Phylogenetic connections in the clade of blue-flowered flax
The phylogenetic relationships of representatives of the genus Linum have been investigated using various molecular phylogenetic markers, including different methods of genomic fingerprinting, and also by comparison of a large number of coding sequences from cytoplasmic and nuclear genomes [28,29,45][49][50][51][52][53][54]. In addition, the time of origin was estimated for most taxa of the Linaceae family [28,29,53]. It was shown that the genus Linum comprised representatives of two related clades: yellow-flowered flax and blue-flowered flax. Both clades were subdivided into a number of subclusters. In particular, the blue-flowered clade was subdivided into seven subclusters. The two basal ones were formed by representatives of the botanical sections Stellerolinum and Dasylinum, respectively. Then the trunk of the phylogenetic tree was subdivided into two branches, which corresponded to the members of the sections Adenolinum and Linum. The branch represented by the species of sect. Linum was split into four subclusters. The first three of them included L. narbonense, the 16-chromosome species (L. grandiflorum and L. decumbens) and the 30-chromosome species (L. angustifolium and L. usitatissimum), respectively. The fourth subcluster was represented by highly polyploid and closely related species from Australia and New Zealand (L. marginale A.Cunn. ex Planch. and L. monogynum G. Forst.), which were not examined in this study. Thus, the phylogeny of blue-flowered flax reconstructed here, which was based on the similarities and differences of the repeatomes, agreed with the results of molecular phylogenetic studies reported earlier.

Interspecies variability of dispersed repeats
As in other angiosperms [64][65][66], genome size in flax species was mainly determined by the content of repeated sequences, especially dispersed repeats. The data on the size and composition of the genomes in the studied group of flax were in good agreement with the results of karyological investigations performed earlier [45][46][47][54]. In particular, the species with the largest genomes (L. stelleroides, L. hirsutum and L. narbonense) had the largest chromosomes, whereas the 30-chromosome species of sect. Linum, which had the smallest chromosomes, had the smallest genomes.
We found that all the studied genomes were characterized by similar sets of transposable elements. Among them, LTR retroelements were the most widely represented. This was probably due to the common origin of those genomes. At the same time, differences were found both in the total content of dispersed repeats and in the content of individual classes of mobile elements in different species. As in other plants, there was a direct relationship between the content of dispersed repeats in the genome and genome size in the studied species. Thus, the extremely large genome sizes in L. stelleroides, L. hirsutum, and L. narbonense could be related to a significant amplification of certain mobile elements. The smallest genome sizes were found in the 30-chromosome species of sect. Linum (L. usitatissimum and L. angustifolium), which also contained the lowest amount of dispersed repeats (20 and 18%, respectively). It should be noted that the content of dispersed repeats in L. usitatissimum determined using RepeatExplorer was lower than that obtained earlier (24%) [30,59]. This could be related to genomic differences among the investigated flax varieties and/or to the lower accuracy of the estimation method applied here. The small genome sizes of the 30-chromosome allotetraploid Linum species [59] could be a result of the genome downsizing that occurs after the formation of allotetraploids, which has been described for many plant species [64][65][66][67][68][69].

Interspecific variability of satellite DNAs
It is known that satellite DNA families are often retained in related species over long evolutionary periods. In particular, in some plant groups, there are families of satellite DNAs specific to taxa of different rank, up to families and orders [70][71][72][73]. However, satellite DNAs sometimes evolve very rapidly and can differ considerably even in closely related species [74]. It is believed that the rapid divergence of satellite sequences occurs in reproductively isolated populations through a mechanism of concerted evolution [75][76][77]. In blue-flowered flax, the karyologically distinct species had different satellite DNA sequences, whereas closely related species with similar karyotypes had similar satellite DNAs. At the same time, we found that the content and diversity of satellite DNAs could differ even in closely related species and in different samples of the same species. The most significant differences in sets of satellite repeats were found in phylogenetically distant species. The total content and the number of families of satellite DNA tended to increase with decreasing genome size. We found that sequences homologous to the satellite DNAs of certain species could also be found in the genomes of phylogenetically distant species, but in low copy numbers; these findings were in good agreement with the 'library' hypothesis [78]. According to this hypothesis, related species share a library of ancestral repeated sequences in low copy numbers. Some of these sequences can be differentially amplified, creating the satellite DNAs of a particular species. It has been suggested that a change in the amount of satellite DNAs might arise from unequal crossover, chromosomal translocations, segmental chromosomal duplications or deletions, as well as rolling-circle replication of extrachromosomal circular DNAs and their reinsertion into chromosomes [77,79,80].
The opposite directions of change observed in the amounts of tandem and dispersed repeats during the evolution of blue-flowered flax were probably not accidental. Both types of repeated sequences are predominantly located in the heterochromatic regions of chromosomes, where they are often interspersed with each other [81][82][83]. A growing body of data indicates that mobile elements can influence the evolution of satellite DNAs by participating in the formation of new families of tandem repeats and in their distribution along the genome [84][85][86][87][88].

Conclusions
The obtained results showed that the evolution of the blue-flowered Linum species was accompanied by waves of amplification of satellite DNAs and LTR retrotransposons. Those events, together with polyploidization, resulted in significant differences in the genome size and karyotype structure of blue-flowered flax. The genomes of the studied flax species contained similar sets of mobile elements but differed in their amounts. At the same time, comparison of satellite DNAs showed that most of the detected tandem repeats were species-specific (except in some very closely related species), which indicated the rapid and concerted evolution of this genome fraction. The phylogenetic relationships between the studied flax species, obtained by estimating the similarity of their repeatomes, agreed with previous data based on other phylogenetic markers. A direct relationship between genome size and the total content of dispersed repeats, and an inverse relationship between genome size and the total content and diversity of satellite DNAs, were revealed. The lowest amount of dispersed repeats and the highest content of satellite DNAs in the genomes of 30-chromosome flax from sect. Linum were probably a result of their allotetraploid origin. Thus, our findings provide new information about Linum genome evolution and a valuable set of data that could be used in future investigations of the Linum genome.

Additional file
Additional file 1: Putative satellite DNA sequences in genomes of blue-flowered flax. (DOCX 1037 kb)