Development of SSR markers in Paeonia based on De Novo transcriptomic assemblies

Peony is a famous ornamental and medicinal plant in China, and peony hybrid breeding is an important means of germplasm innovation. However, research on the genome of this species is limited, thereby hindering the genetic and breeding research on peony. In the present study, simple sequence repeat (SSR) locus analysis was performed on expressed sequence tags obtained by the transcriptome sequencing of Paeonia using Microsatellite software. Primers with polymorphism were obtained via polymerase chain reaction amplification and electrophoresis. As a result, a total of 86,195 unigenes were obtained by assembling the transcriptome data of Paeonia. Functional annotations were obtained in seven functional databases including 49,172 (Non-Redundant Protein Sequence Database: 57.05%), 38,352 (Nucleotide Sequence Database: 44.49%), 36,477 (Swiss Prot: 42.32%), 38,905 (Clusters of Orthologous Groups for Eukaryotic Complete Genomes: 45.14%), 37,993 (Kyoto Encyclopedia of Genes and Genomes: 44.08%), 26,832 (Gene Ontology: 31.13%) and 37,758 (Pfam: 43.81%) unigenes. Meanwhile, 21,998 SSR loci were distributed in 17,567 unigenes containing SSR sequences, and the SSR distribution frequency was 25.52%, with an average of one SSR sequence per 4.66 kb. Mononucleotide, dinucleotide, and trinucleotide were the main repeat types, accounting for 55.74%, 25.58%, and 13.21% of the total repeat times, respectively. Forty-five pairs of the 100 pairs of primers selected randomly could amplify clear polymorphic bands. The polymorphic primers of these 45 pairs were used to cluster and analyze 16 species of peony. The new SSR molecular markers can be useful for the study of genetic diversity and marker-assisted breeding of peony.

Introduction Peony, which is a deciduous woody plant that belongs to genus Paeonia, is one of the traditional famous flowers in China. Peony had been cultivated for more than 1,600 years as an ornamental plant with high ornamental, edible and medicinal values. Paeonia is divided into three sections: Paeonia, Moutan, and Onaepia. Hybridization, as an important means of peony germplasm innovation, is the source of many excellent varieties of peony [1]. The offspring obtained from the distant hybridization between sect. Paeonia and sect. Moutan shows obvious heterosis. 'Itoh' known as the future of Paeonia, is the most famous hybrid with rich color, long flowering period, and vigorous growth [2]. However, in China, the distant hybridization of Paeonia is still in the primary stage. Although most of the interspecific distant hybridization of Paeonia can complete fertilization, embryos in the developmental process often shrink or die, and even individual full hybrids cannot develop into normal seedlings [3]. Embryo abortion can be caused by the abnormal development of embryo sac and endosperm, genetic incoordination between hybrid embryo, and endosperm, gene regulation and endogenous hormone regulation [4]. In recent years, many studies on embryo abortion after hybridization among peony have been reported. The Hybridization between P. ostii 'Fengdanbai' and P. veitchiias parents was conducted, and the result indicated that the free nuclear endosperm of aborted breeding embryo can not develop normally to provide nutrition for the original embryo [5]. However, only limited research has been done on the molecular mechanism of embryo abortion, and relatively complete genomic studies are lacking at present.
Transcriptome refers to the collection of all RNA, including mRNA and non-coding RNA, transcribed by cells or tissues of an organism in a specific state. Genes are annotated and labeled, through RNA-sequencing (RNA-seq) to analyze the expression and function of genes in different organisms [6]. The transcriptomic analysis of different tree peony organs, such as carpel, floral organ, and seed were reported. A comparative transcriptome was performed between two cultivars of P. rockii with different development patterns of carpel. From the transcriptome data, a total of 66,563 unigenes and 28,155 differentially expressed genes (DEGs) were identified. Among these DEGs, the genes PsMYB114-like, PsMYB12, and PsMYB61-like from the MYB gene family were probably the main characters that regulated the carpel quantitative variation [7]. A total of 29,275 unigenes were obtained from the bud transcriptome of tree peony. Among the DEGs, 64 flowering-related genes, and the genes of PsAP1, PsCOL1, PsCRY1, PsCRY2, PsFT, PsLFY, PsLHY, PsGI, PsSOC1, and PsVIN3 probably regulated reblooming [8]. A large number of DEGs on the transcriptome data of tree peony (P. ostii) seeds were related to oil biosynthesis and fatty acid metabolism [9]. In the present study, RNA-seq was used to discover the related pathways and important genes involved in the hybridization process of Paeonia.
Simple sequence repeat (SSR) is a widely distributed nucleotide sequence in the eukaryotic genome. SSR also has the characteristics of repeated occurrence and short sequence. In addition, SSR is polymorphic because of the different number of repeat sequences. Moreover, SSR is one of the most effective molecular markers among numerous molecular markers due to reproducibility, multi-allelic nature, high stability, relative abundance, co-dominant inheritance and good genome coverage [10,11]. SSR loci are uniformly distributed in the entire genome of eukaryotes. The mutation frequency of SSRs is 30 times higher than the general random mutation, providing a good choice for the construction of plant gene map, genotype analysis and the study of genetic diversity. SSR markers are generally divided into genomic SSRs (G-SSRs) and expressed sequence tags (EST-SSRs) according to their position in the genome. EST-SSRs are more conservative than G-SSRs [12][13][14]. EST-SSR molecular markers have been widely used in the genetic diversity, genetic linkage map, and genome-wide association study analyses of many plants. The first high-density genetic linkage map of peony was constructed by using EST-SSRs [15]. A total of 2,253 potential SSRs were detected from 1,969 unigenes in the transcriptional sequencing analysis of tree peony flower buds, and 17,705 SSR motifs distributed in 13,797 sequences were obtained from the transcriptome of tree peony's underground renewal buds [16,17]. The detected EST-SSR markers were used in genetic diversity evaluation. The distribution frequency and regulation of SSR in the EST sequence of Phalaenopsis aphrodite were analyzed, and the polymorphism and universality of the SSR were evaluated [18].
Peony has a long history of cultivation and a complex genetic background. Reliable and effective molecular markers are needed for the analysis of the relationship among peony species, identification of hybrid seedlings, establishment of genetic map, and marker-assisted breeding [19]. SSR is one of the most useful molecular markers among many molecular markers, but SSR is rarely used in peony due to the insufficient number of SSR markers. Therefore, developing a large number of SSR molecular markers is urgently needed for peony research. In the present study, 21,998 SSR loci were identified, and 45 pairs of SSR primers were developed on the basis of the transcriptome sequence of the hybrid embryos of Paeonia to analyze the genetic diversity of peony germplasm, which provides a theoretical basis for the research on the abortion of hybrid embryos.

Plant materials
Hybrid embryos were obtained from the hybridization of P. ostii 'Fengdan' and P.lactiflora 'Fenyunu' from Henan Agricultural University with evident abortion differences in three different periods (pre-stage, mid-stage and late-stage). Three individual embryos from each stage were collected and pooled for each sample, and this collection was repeated three times to provide biological replicates. RNA extraction was conducted after mixed sampling as further research materials.
The 16 peony materials for screening polymorphic markers were provided by Zhengzhou Botanical Garden (Table 1, Fig 1). Three or four leaves of each species were harvested, immediately frozen in liquid nitrogen and placed at −80˚C for DNA extraction.

Method
DNA and RNA extraction. Genomic DNA was extracted using a DNA extraction kit (DP301) of Tiangen Biotech (Co., Ltd., Beijing). DNA quality was detected in 1% agarose gel electrophoresis.
RNA was extracted using a total RNA extraction kit (DP441) of Tiangen Biotech (Co., Ltd., Beijing), and the concentration and quality of RNA were investigated by the NanoDrop ND-5000 Ultraviolet-visible spectrophotometer produced by Thermo Company. Qualified RNA was sent to BGI Genomics (Wuhan) for library construction, and three biological replicates and technique replicates were used in high-throughput the next-generation sequencing conducted on the BGISEQ -500 platform.
Transciptome analysis. Raw data were filtered using SOAPnuke software, and reads with joint contamination, low quality, and unknown bases were removed. De novo transcriptome was assembled using Trinity software. Assembled transcripts were clustered, and redundancy was removed to form non-redundant unigenes.
Functional annotation of unigenes. All the unigene sequences obtained from the three sequencing libraries were compared with the seven databases, including Non-Redundant Protein Sequence Database (NR), Nucleotide Sequence Database (NT), Gene Ontology (GO), Swiss-Prot, Kyoto Encyclopedia of Genes and Genomes (KEGG), Pfam, and Clusters of Orthologous Groups for Eukaryotic Complete Genomes (KOG) databases, by using Blastx software to obtain the functional annotation, biological process, and metabolic pathway information of relevant unigenes.
Identification of SSR loci and design of SSR primers. SSR loci from transcriptome sequence were identified using the Microsatellite (MISA) software (Microsatellite, http://pgrc. ipk-gatersleben.de/misa/). One to six nucleotides minimum repeats of 12, 6, 5, 5, 5 and 5 times were used as the identification criteria. The number and proportion of SSR types in Paeonia transcriptome and the frequency of different repeat times of each repeat type were counted.
The primers of each unigene sequence containing SSR loci were designed using Primer 3.0 software (http://bioinfo.ut.ee/primer3). Default values were selected as software parameters. One hundred pairs of primers were randomly selected and synthesized by Sangon Biotech (Co. Ltd., Shanghai) for the analysis of polymorphism and universality of primers.
Selection, amplification and validation of SSR markers. Polymerase chain reaction (PCR) was performed at a volume of 10 μl containing 5 μl of 2 × Taq PCR Master Mix (Biomiga), 1 μl of template DNA, 0.5 μl of forward primer (10 ng /μl) and 0.5 μl of reverse primer (10 ng /μl), 3 μl ddH2O. Amplification was performed according to the following procedure: pre-degeneration at 94˚C for 5 min, degeneration at 94˚C for 30 s, annealing temperatures for 30 s (annealing temperatures were set according to each primer in the range of 50˚C-65˚C), 40 cycles of 72˚C extension for 40 s, and final extension 72˚C for 10 min. The PCR products were processed in 1% agarose gel for 50 min at 90 v, and SSR primers with good stability were screened out.
Primers with good stability were used for PCR amplification in the SSR fluorescence primer reaction system. Fluorescent PCR was performed at a volume of 10 μl that contained of 5 μl of 2× Taq PCR Master Mix (Biomiga), 2 μl of template DNA (50 ng /μl), 0.15 μl forward primers (10 ng /μl) and 0.3 μl of reverse primer (10 ng /μl), 2.55 μl ddH2O. Forward primers were synthesized with an additional 18 nucleotides from the M13 universal primer appended to the 5 0 end. The fluorescence amplification procedure was the same as the ordinary amplification procedure. PCR amplification products (0.5 μl) with fluorescence labeling were pooled and combined with Hi-Di formamide (9 μl) and LIZ-500 size standard (0.5 μl) (Applied Biosystems, USA) and analyzed using an ABI3730xl DNA Analyzer (Applied Biosystems, USA). Polymorphic locus information was identified through Genographer v2.1 software (http:// sourceforge.net/projects/genographer). Genetic diversity analysis. Sixteen peony cultivars were analyzed using SSR primers with polymorphism. Raw data, including polymorphic information content (PIC), observed number of alleles (Na), effective number of alleles (Ne), heterozygosities (Ho), expected heterozygosities (He), and Shannon's diversity index (I) were analyzed using POPgene v1.32 software [20]. The genetic similarity of these materials was analyzed using NTSYS-PC software [21]. Based on these data, the cluster dendrogram of these varieties was drawn, and then the relationship among these varieties was analyzed.

Sequencing and assembly
A total of 92.02 Gb of data were measured on the BGISEQ-500 platform, and 86,195 unigenes were obtained by assembling clean reads with Trinity. All data related to this study has been public available on NCBI with SRA accession PRJNA592882. The total transcriptome sequence In the NR database, Vitis vinifera (28.7%) had the highest matching degree with the unigene sequence, followed by Juglans regia (6.25%), Nelumbo nucifera (3.79%), and Hevea brasiliensis (3.54%).Ziziphus jujuba had the lowest matching degree (3.36%), and 54.38% of the unigenes did not match the protein sequences of other species (Fig 3). Short unigene sequences are less likely annotated. In this study, the unigenes that did not match the protein sequences of other species were the unigenes (24.31%) with fragment length of 201-400 bp.
All unigenes compared with the NR database were further annotated into the GO database, and 26, 832 genes were annotated. GO database divided the unigenes into three functional categories including biological process, cell composition and molecular function. Seventeen types of genes including 15,509 unigenes were related to biological processes. Among them, 8,753 unigenes were involved in cellular processes, 3,557 unigenes were involved in biological regulation, 1,921 unigenes were involved in metabolic process, and 1,178 unigenes were involved in response to stimulus (Fig 4).
In the KEGG database, 37,993 unigenes were further annotated, analyzed, and classified into five major metabolic pathways: cell process, environmental information processing, genetic information processing, organismal systems and metabolism. Metabolism accounted for the largest proportion (56.46%) among the pathways. The five metabolic pathways were further divided into 20 subcategories. A maximum of 3,350 metabolites were related to carbohydrate metabolism; 3,186 metabolites were related to translation; 2,787 metabolites were related to folding, classification and degradation; 1,745 metabolites were related to transport and catabolism; and 1,737 metabolites were related to amino acid metabolism (Fig 5).

Distribution and quantitative characteristics of SSR loci in peony transcriptome
SSR loci were detected in the assembled unigenes using MISA (http://pgrc.ipk-gaterslebende/ misa/) software. The results showed that 21,998 SSRs were distributed in 17,567 unigenes, and the distribution frequency of SSR was 25.52%, with an average of one SSR per 4.66 kb. The main SSR motifs were mononucleotide, dinucleotide and trinucleotide, which account for 55.74%, 25.58%, and 13.21% of the total number of repeats, respectively ( Table 2). In addition, tetranucleotide, pentnucleotide and hexonucleotide accounted for 0.65%, 1.32%, and 3.48% of the total number of repeats, respectively. The distribution frequency of SSR loci of different types of peony was consistent with their proportion. The repeat times of SSR motifs were distributed between 4 and 56 times, and the repeat times of mononucleotide were more than 12 times. The repeat times of dinucleotides were mainly 6, 7, and 8 times. The main repeat times in trinucleotides and tetranucleotides were 5 and 4 times respectively. Moreover, the main repeat times of pentnucleotides and hexonucleotides were 4 times.

Screening of SSR primers and analysis of polymorphism
One hundred pairs of primers were randomly selected from the designed SSR primers for synthesis, and then 16 peony germplasm were amplified by RCR. Sixty pairs of primers capable of amplifying specific bands, among which 45 pairs showed good polymorphism, accounting for 45% of the total primers.
According to the analysis by POP gene software, 290 alleles were obtained by polymorphic primer amplification; with the distribution range of 2-11 alleles, and Na was 6.4444 (Table 3). SSR 43 was the primer that obtained the greatest number of alleles in the amplification. The mean value of the Ne of SSR locus was 3.9070, and the variation range was between 1.6187 and 7.0244. The mean value of Ho was 0.7785, and the variation range was 0.2667 to 1.0000. The mean value of He was 0.7302, and the variation range was 0.3954 to 0.9231. The average value  NTSYS-PC software was used for the clustering analysis of 16 species of peony (Fig 6). The genetic similarity coefficient of the 16 peony species ranged from 0.11 to 0.81, and these germplasm were divided into two categories. 'Yaohuang' was a separate species, and the 15 other peony species were grouped. Such a group had a genetic similarity coefficient of 0.59, which indicated that 'Yaohuang' was far different from the other species. The genetic similarity  coefficient of 'Yinlin Bizhu' and 'Fugui Mantang' was the highest at 0.81, indicating that the two were closely similar to each other. The next closest relatives were 'Wulong Yaohui' and 'Shengge Jin'. 'Jingyu' and 'Qingxiang Bai' of the same color series and peony varieties of the same flower type were not grouped together.

Transcriptome sequencing, assembly, and functional annotation
Peony is a non-model plant, and only little genetic information about peonies is available for research. The continuous development of new varieties makes the genetic diversity of its germplasm resources increasingly difficult to study. In this research, RNA-seq was used for investigating the peony transcriptome, and a total of 92.02 Gb of data were obtained. Taking N50 as an indicator to evaluate the assembly quality, the longer the length of N50 is within the range of 1000-2000 bp, the better the assembly quality can be [22]. In this study, N50 was 1,780 bp, indicating a good assembly. The average length of measured sequence obtained was 1,189 bp, which was higher than that of other tree peony researches [8,17], Syringa oblata (853 bp) [23], flowering Chinese cabbage (779 bp) [24], and bamboo (736 bp) [25]. After assembly, a total of 86,195 unigenes were obtained, among which 49, 172 were annotated into the NR database.
Among the homologous sequences of other species, the matching degree with Vitis vinifera was the highest. Transcriptome data annotated into the KEGG database showed that 37,993 unigenes were distributed in five categories of metabolic pathways. Among which, 3,350, 1737, 1,540, and 1,209 metabolites were related to carbohydrate metabolism, amino acid metabolism, lipid metabolism, and nucleotide metabolism. In this study, the abundant transcriptome information of peony was obtained through transcriptome sequencing, and the expressed genes were compared with various major databases for functional annotation, functional classification and metabolic pathway analysis. Hence, this study laid a theoretical foundation for the further screening of the key genes for embryo abortion in peony hybridization.

Analysis of SSR markers in peony
In this study, 86,195 unigenes were searched for SSR loci based on the transcriptome sequence of Paeonia, and the results showed that 21,998 SSR loci were distributed in 17,567 unigenes. One SSR site was found per 4.66 kb on average, with a distribution frequency of 25.52%, which was higher than that of Rhododendron fortune (20.58%) [26] and Sorbus pohuashanensis (16.68%) [27], but lower than that of Tagetes erecta (28.29%) [28]. The results showed that the frequency and density of SSR site distribution in different types of plants were different, which might be caused by the differences in species, in the number and length of unigenes, in analysis tools and SSR site screening criteria. In this study, the distribution of repeating motif types in the SSR loci of Paeonia was evidently unbalanced. Mononucleotide to trinucleotide were the main repetitive motifs, accounting for 94.53%, whereas tetranucleotide to hexanucleotide only accounted for 5.46% of the total. The partiality of these higher motifs was affected by the EST population size and was more likely limited by their own length. Except for mononucleotide, the frequency of dinucleotide was the highest, followed by that of trinucleotide. This result was similar to that of Sorbus pohuashanensis [27], and Actinidia eriantha [29], but different from that of Hordeum vulgare [30], flowering Chinese cabbage [24], and other plants, whose main type of SSR repeat motif was trinucleotide. In Gai's research, dinucleotide motifs were found to be the most frequent motif type (62.78%) in P. subaequalis, followed by trinucleotide (35.61%), tetranucleotide (1.21%) and hexanucleotide (0.40%) [16]. In other peony studies, tri-and hexanucleotide repeats were found to be the most frequent motif types [12]. The proportion of SSR repeat motifs was different in various plant species probably due to the different evolutionary events experienced by different plant genomes [31].
In this study, 60 pairs of primers were used for the PCR amplification of 16 peony germplasm. Among these primers, 45 pairs showed polymorphism, accounting for 75% of the total primers, which was higher than other peony (47.3%) [12], Sorbus pohuashanensis (11.54%), [27] and Rhododendron fortune (71.11%) [26] but lower than Tagetes erecta (80.56%) [28]. The number of samples, source, genetic diversity, and other factors affected this ratio. The polymorphism ratio of peony was high, which provides an important basis for the subsequent use of these primers. Genetic diversity of peony germplasm was analyzed using 45 pairs of primers. PIC value is an index that indicates the information content of molecular markers, and is widely used in the evaluation of molecular marker polymorphism. PIC value > 0.5 indicates that the primer has high polymorphism, PIC value in the range of 0.25-0.5 indicates moderate polymorphism, and PIC value < 0.25 indicates low polymorphism. In this study, 38 of the 45 pairs (84%) of primers showed high polymorphism, and the rest showed moderate polymorphism. The highest PIC value was 0.8418, and the average value was 0.6619, indicating that the molecular markers of peony were rich in information and highly recognizable. Among the 45 pairs of primers, the number of alleles was 290, Na was 6.4444, Ho was 0.7785, and He was 0.7302, and the I was 1.4839. An et al. used 15 pairs of primers to amplify 65 polymorphic loci in 57 samples of P. rockii, which has an average number of alleles of 4.333 at each locus, an average PIC of 0.618, and an average He of 0.671 [32]. Yu et al. amplified 42 polymorphic loci from 48 peony samples with 12 pairs of primers, and found that the average PIC was 0.468, average Ho and He were 0.785 and 0.541, respectively, and the average I was 0.906 [33]. The experimental results showed that different cultivars of peony had differences in various genetic parameters. Nevertheless, the polymorphism information content, allele number, Shannon's diversity index, and other indicators revealed that peony was rich in genetic diversity. Therefore, the 45 pairs of polymorphic primers in this study were feasible for the study on the genetic diversity of peony.
In this research, the classification of the genetic clustering of peony varieties had no significant relationship with the flower type and color of peony. The peony used here was different from the previous classification of peony varieties with flower type and color as the first and second level criteria. 'Wulong Yaohui' and 'Shengge Jin' were closely clustered probably because they have the same color and flower type. 'Manjiang Hong' and 'Doulv' were grouped together but have different color and flower type. The incongruence between the dendrogram and the division of cultivars might be due to the unclear genetic background of Chinese peony cultivars, the limited number of samples and markers, and the environment [12]. The genome of peony is complex, and the location of gene variation is uncertain. Hence, the genetic diversity of peony must be further studied.

Conclusions
In conclusion, using transcriptome sequences is a convenient way to identify SSR markers in tree peony. In this study, RNA-seq was used to assemble the transcriptome data of peony, and a total of 86,195 unigenes were obtained. Finally, 49,172 (NR: 57.05%), 38,352 (NT: 44.49%), 36,477 (Swiss-Prot: 42.32%), 38, 905 (KOG: 45.14%), 37,993 (KEGG: 44.08%), 26,832 (GO: 31.13%), and 37,758 (Pfam: 43.81%) unigenes were annotated via functional annotation. The distribution frequency of SSR was 25.52%, with an average of one SSR sequence per 4.66 kb, and mononucleotide, dinucleotide, and trinucleotide were the main repeat types. Based on the above analysis results, the SSR primers of peony were developed, and 45 pairs of primers, which showed good polymorphism, were selected to amplify the polymorphic bands. These novel SSR markers have high polymorphism and transferability and thus represent a powerful tool for the genetic linkage map construction, germplasm identification, genetic diversity analysis, genetic improvement and molecular marker-assisted breeding of Paeonia.