Application of EST-SSR markers developed from the transcriptome of Torreya grandis (Taxaceae), a threatened nut-yielding conifer tree

Torreya grandis (Taxaceae) is an ancient conifer species endemic to southeast China. Because of its nutrient-rich and delicious seeds, this species has been utilized for centuries by the Chinese. However, transcriptome data and transcriptome-derived microsatellite markers for population genetics studies are still insufficient for understanding of this species’ genetic basis. In this study, a transcriptome from T. grandis leaves was generated using Illumina sequencing. A total of 69,920 unigenes were generated after de novo assembly, and annotated by searching against seven protein databases. In addition, 2,065 expressed sequence tag–simple sequence repeats (EST-SSRs) were detected, with the distribution frequency of 2.75% of total unigenes and average number of 0.03 SSRs per unigene. Among these EST-SSRs, 1,339 primer pairs were successfully designed, and 106 primer pairs were randomly selected for the development of potential molecular markers. Among them, 11 EST-SSR markers revealed a moderate level of genetic diversity, and were used to investigate the population structure of T. grandis. Two different genetic groups within this species were revealed using these EST-SSR markers, indicating that these markers developed in this study can be effectively applied to the population genetic analysis of T. grandis.


INTRODUCTION
The Chinese nutmeg tree, Torreya grandis Fort. et Lindl. (Taxaceae), is a conifer species restricted to only a few mountainous regions in Zhejiang, Anhui, Jiangxi, Hunan, and Fujian, China (Kang & Tang, 1995;Cheng et al., 2007). Its seeds, well-known as a nutrient-rich delicacy, have been utilized by Chinese people for centuries (Kang & Tang, 1995;Li et al., 2005;Cheng et al., 2007). Moreover, its wood, being highly valued for its durability, along with pest and decay resistance, was widely used to make furniture, resulting in overexploitation (Yang & Luscombe, 2013). In combination with habitat destruction, the population of T. grandis has decreased drastically in recent years, and was listed as a threatened species on the International Union for Conservation of Nature Red List (Yang & Luscombe, 2013). Additionally, T. grandis "Merrillii," the only commercial cultivar, has been subjected to long-term planting, which has resulted in declined quality and yields (Li et al., 2004;Cheng et al., 2009). Therefore, the objective evaluation of wild resources is crucial for improving breeding programs and conservation of T. grandis.
The application of genetic information is often fundamental for the conservation and sustainable use of wild resources (Wayne & Morin, 2004;Hoban et al., 2013). The best way to develop strategies for the breeding and conservation of wild populations of T. grandis may be to obtain genomic information, thus revealing genetic diversity and population structure using molecular markers such as simple sequence repeats (SSRs). Due to the large and complex conifer genome, only a few conifer species have been subjected to genome sequencing (e.g., Picea abies, Nystedt et al., 2013;Pinus taeda, Neale et al., 2014). Alternatively, transcriptome sequencing by next-generation sequencing technologies, like expressed sequence tag (EST), is a promising strategy to obtain genomic information. Although a transcriptome have been generated for the cultivar T. grandis "Merrillii" that was selected by conventional breeding (Li et al., 2004;Yi et al., 2016), its limited genomic information might be insufficient for the investigation of wild genetic sources of T. grandis. Meanwhile, many studies on T. grandis have mainly focused on its medicinal component (Saeed et al., 2007), physiological ecology (Shen et al., 2014;Tang et al., 2015), nutrient content (Li et al., 2005;Feng et al., 2011), and field investigation (Li et al., 2005;Wu et al., 2006). Few studies have researched its genetic diversity. Thus far, a series of molecular markers were developed, but only amplified fragment length polymorphism and chloroplast microsatellites loci (cpSSR) have been applied to evaluate the genetic diversity of T. grandis (Min et al., 2009;Yi & Qiu, 2014;Zeng et al., 2014). However, these markers were dominant and/or uniparental inheritance, and underestimated the genetic diversity of T. grandis. Despite a set of codominant EST-SSRs developed from T. grandis "Merrillii" (Yi et al., 2016), these EST-SSRs were not employed to evaluate the genetic diversity and population structure of wild T. grandis.
In the present study, a comprehensive transcriptome from T. grandis leaves was obtained by Illumina sequencing platform, then the sequencing data were assembled and annotated, and a set of novel EST-SSR markers were developed from the transcriptome data. To effectively use and conserve wild T. grandis, the genetic diversity and population structure were also evaluated for six populations across its natural distribution using these EST-SSR markers.

Plant materials and DNA and RNA extraction
The fresh leaves of 84 individuals were collected from six wild populations of T. grandis across its natural distribution in China: 15 from Xinning, Hunan (XN), eight from Tonggu, Jiangxi (TG), 15 from Lichuan, Jiangxi (LC), 15 from Huangshan, Anhui (HS), 14 from Songyang, Zhejiang (SY), and 17 from Zhuji, Zhejiang (ZJ) ( Fig. S1; Table S1). All sampled leaves were immediately preserved in silica gel. Total genomic DNA was extracted using a modified cetyltrimethylammonium bromide (CTAB) procedure (Doyle & Doyle, 1987). In addition, fresh young leaves of three different individuals from the ZJ population were rapidly frozen in liquid nitrogen after sampling and stored at -80 C. Total RNA for these individuals were extracted using the TRIzol kit following the manufacturer's procedures (Life Technologies, Carlsbad, CA, USA). The purity, concentration, and integrity of each RNA sample were determined using the NanoPhotometer spectrophotometer (IMPLEN, Westlake Village, CA, USA), the Qubit RNA Assay Kit in Qubit 2.0 Flurometer (Life Technologies, Carlsbad, CA, USA), and the RNA Nano 6000 Assay Kit of the Agilent Bioanalyzer 2100 system (Agilent Technologies, Santa Clara, CA, USA), respectively. Equal amounts of qualified RNA samples from each individual were pooled together for cDNA library construction and transcriptome sequencing.

cDNA library construction and transcriptome sequencing
A three mg RNA sample was used for the RNA sample preparations. Sequencing libraries were generated using NEBNext Ultra RNA library Prep Kit for Illumina (New England BioLabs, Ipswich, MA, USA) following manufacturer's recommendations. mRNA was purified using poly-T oligo-attached magnetic beads and sheared into short fragments using divalent cations under elevated temperature in NEBNext First Strand Synthesis Reaction Buffer (5Â). First strand cDNA was synthesized using random hexamer primer and M-MuLV Reverse Transcriptase (RNaseH -). Second strand cDNA synthesis was subsequently performed using DNA Polymerase I and RNase H. Remaining overhangs were converted into blunt ends via exonuclease/polymerase activities. After adenylation of the 3′ ends of the DNA fragments, NEBNext Adaptors with hairpin loop structure were ligated to prepare for hybridization. In order to preferentially select cDNA fragments of 150-200 bp, the library fragments were purified with AMPure XP system (Beckman Coulter, Brea, IN, USA). Then three ml USER Enzyme (New England BioLabs, Ipswich, MA, USA) was used with size-selected, adaptor-ligated cDNA at 37 C for 15 min, followed by 5 min at 95 C before PCR. Then, PCR was performed with Phusion High-Fidelity DNA polymerase and universal PCR primers. Finally, PCR products were purified (AMPure XP system) and library quality was assessed on the Agilent Bioanalyzer 2100 system. The cDNA library was sequenced on the Illumina HiSeq 2500 platform (Illumina, San Diego, CA, USA) with 101-base paired-end reads at Novogene Biological Information Technology Co., Ltd., Beijing, China. The raw sequences of the transcriptome are available at figshare (https://doi.org/10.6084/m9.figshare.7006220.v2).

Transcriptome assembly and functional annotation
After removing reads containing adaptor, reads containing more than 10% poly-N, and low-quality reads (>50% with Phred quality score (Q) 5 bases) from the raw sequencing data using in-house Perl scripts (Novogene Biological Information Technology Co., Ltd., Beijing, China), high-quality clean data were obtained and then assembled using Trinity software with min_kmer_cov set to 2 and all other parameters set to default (Grabherr et al., 2011). Briefly, the clean reads were firstly partitioned into many smaller pieces and these sequence data were assembled into contigs using the de Bruijn graph with k-mer 25 after a trial of different k-mer sizes. Next, the reads were mapped back to contigs with paired-end reads to detect contigs from the same transcript and measure the distances between these contigs. Finally, the contigs were linked to get assembled transcripts that could not be extended on either end. The longest transcripts were defined as the unigenes for gene functional annotation.
Gene function was annotated based on public protein databases: Nr (NCBI nonredundant protein sequences); Nt (NCBI non-redundant nucleotide sequences); KOG (Clusters of Orthologous Groups of proteins); Swiss-Prot (a manually annotated and reviewed protein sequence database) by NCBI blast 2.2.28+ with e-value 1.0e-5; Pfam (Protein family) by HMMER 3.0 package with e-value 0.01; and KO (KEGG Ortholog database) by KEGG Automatic Annotation Server with e-value 1.0e-10. Based on the annotated results of Nr and Pfam databases, the unigenes were assigned in Gene Ontology (GO) using Blast2GO 2.5 (Conesa et al., 2005) with e-value 1.0e-6. The distribution of GO functional classification was plotted using the Web Gene Ontology Annotation Plot (Ye et al., 2006) by three categories: biological process, molecular function, and cellular component.

Identification of EST-SSRs
The MIcroSAtellite program (Thiel et al., 2003) was used to detect and locate EST-SSRs from unigenes. We set the detection criteria as at least six repeats for dinucleotide motifs, and five repeats for tri-, tetra-, penta-, and hexanucleotide motifs. Primer pairs flanking each EST-SSR were designed using PRIMER 3 (Untergasser et al., 2012), with PCR product size ranging from 80 to 300 bp, GC content between 40% and 60%, primer length ranging from 18 to 25 bp, and melting temperature between 55 and 65 C.
PCR amplification was performed in a 10 ml reaction that included five ml sterile water, five ml 2 Â Taq PCR MasterMix (Tiangen, Beijing, China), one ml of each primer (2.5 mM), and one ml 20-40 ng genomic DNA. The PCR procedure included an initial denaturation for 5 min at 94 C, followed by 30 cycles of 40 s at 94 C, 30 s at 55-65 C for each locus, and 50 s at 72 C, ending with a final extension of 7 min at 72 C. The PCR products were initially detected using silver-stained non-denaturing polyacrylamide gels, then the polymorphic PCR products were sequenced by capillary electrophoresis using an ABI 3730 DNA Analyzer (Applied Biosystems, Foster City, CA, USA). Allele sizes were determined using GeneMarker version 2.2.0 (SoftGenetics, State College, PA, USA).

Transcriptome sequencing and assembly
A total of 39,526,668 raw reads were generated from the Illumina sequencing of T. grandis. After removing adaptor-containing, more than 10% poly-N, and low-quality reads, 38,626,498 clean reads were obtained with 97.12% Q20 bases and 0.03% sequencing error rates. The total clean bases were 4.83 Gb with 43.54% GC content. Using the Trinity assembly software, the clean reads were assembled into 93,133 transcripts with mean length of 959 bp and N50 length of 1,838 bp. Of these, 64,356 (69.10%) transcripts ranged in length from 201 to 1,000 bp (Table 1). The transcripts were assembled into 69,920 unigenes, with mean length of 766 bp and N50 length of 1,503 bp. Of these, 54,564 (78.04%) transcripts ranged in length from 201 to 1,000 bp (Table 1).
All assembled unigenes were subjected to further functional prediction and classification by KOG and KO databases. There were 8,914 unigenes divided into 26 groups in the KOG database; the largest group was "General function prediction only" (1,902 unigenes), followed by "Posttranslational modification, protein turnover, chaperones" (1,135 unigenes) and "Signal transduction mechanisms" (743 unigenes) (Fig. S2). For KO analysis, 7,797 unigenes were assigned into 32 groups of five clusters; the largest group was "Signal transduction" (663 unigenes) of Environmental Information Processing cluster, followed by "Carbohydrate metabolism" (650 unigenes) of Metabolism cluster and "Translation" (642 unigenes) of Genetic Information Processing cluster (Fig. S3).

Polymorphism of EST-SSR markers and population genetic diversity and structure
We estimated the polymorphism of 11 EST-SSR markers of T. grandis, and analyzed the population genetic diversity for six populations using these markers (Table 2).   (Table 3).
For the population structure of T. grandis, the most likely population genetic cluster K was estimated using lnP(D) and DK statistics with the STRUCTURE analysis. Because the lnP(D) increased gradually as K increased from 1 to 4, and did not show an apparent peak value, it was difficult to determine the most likely population genetic cluster. On the contrary, the DK statistic detected the highest peak at K = 2 (Fig. S4). The six populations of T. grandis were then assigned into two groups: Group 1 included the XN, TG, LC, HS, and SY populations; and Group 2 included only the ZJ population (Fig. 3A). Consistent with the STRUCTURE analysis, the NJ phylogenetic tree also revealed two groups-based D A distances (Fig. 3B). The D A distances between Table 2 Primer sequences, repeat motif, size range of alleles, and annealing temperature (T m ) of 11 polymorphic EST-SSR markers developed for T. grandis.

Locus
Primer pair (5′-3′) Repeat motif Size range (bp) T m ( C)     populations from Group 1 ranged from 0.041 to 0.137, while the D A distances ranged from 0.151 to 0.229 between the ZJ (Group 2) population and the Group 1 populations (Table S4).

Transcriptome characterization
Transcriptome sequencing is an effective and rapid approach to obtain genomic information for non-model plants (Ward, Ponnala & Weber, 2012;Strickler, Bombarely & Mueller, 2012), especially for conifer species with a large and complex genome (Morse et al., 2009;Zimin et al., 2014). In this study, a total of 38,626,498 clean reads (4.83 Gb) were generated from the transcriptome sequencing of T. grandis. The total number of assembled unigenes (69,920) was much larger than previously reported for T. grandis "Merrillii" (46,000) (Yi et al., 2016). Similarly, the mean and N50 length of the unigenes in our study, 766 and 1,503 bp (Table 1), respectively, were also longer than those reported for the cultivar (731 and 1,426 bp, respectively) (Yi et al., 2016), indicating that the transcriptome sequencing data were well assembled for T. grandis in the present study.
All of the unigenes identified in T. grandis were successfully annotated through seven protein databases (Nr, Nt, KOG, Swiss-Prot, Pfam, KO, and GO) ( Table S2). The entire annotation for all unigenes was reported only in this study, but was not achieved for the other conifers, including Pinus, Picea, Taxus, Cephalotaxus, and Sequoia (Lorenz et al., 2012). It may suggest that the unigenes of T. grandis have more conserved functions. For example, complete genome sequencing revealed that the conifer species, Picea abies, has a large genome, but its number of functional genes is nearer to that of Arabidopsis thaliana (Nystedt et al., 2013). In fact, the success rates of functional annotation will be continually increased following the development of sequencing technologies and the abundance of protein databases. In addition,18,907,8,914,and 7,797 unigenes of T. grandis were assigned into 56 functional groups in GO category (Fig. 1), 26 in KOG (Fig. S2), and 32 in KO (Fig. S3), respectively, suggesting that these unigenes have various functions and will be valuable for functional gene discovery of T. grandis.

Frequency and distribution of EST-SSRs
In this study, 1,923 unigenes containing 2,065 EST-SSRs were detected from total 69,920 unigenes in the transcriptome of T. grandis. The distribution frequency (2.75% of the total unigenes) and average number (0.03 SSRs per unigene) of EST-SSRs were lower than those reported for T. grandis "Merrillii" (3.72% and 0.037 SSRs) (Yi et al., 2016). The detection criteria of EST-SSRs may be generated by the difference between T. grandis and other conifer species. For example, in this study, a minimum of six, five, five, five, and five repeats were, respectively, used for di-, tri-, tetra-, penta-, and hexanucleotide motifs, whereas six, five, four, four, and four repeats were applied in T. grandis "Merrillii" (Yi et al., 2016), and six, five, five, four, and four repeats employed in Pinus dabeshanensis (Xiang et al., 2015). Therefore, the uniformly detected criteria of EST-SSRs are important and should be considered for the evaluation of distribution frequency and density of EST-SSRs.

Polymorphism of EST-SSR markers and population genetic diversity and structure
In the present study, 106 primer pairs of T. grandis were randomly selected from 1,339 primer pairs that were successfully designed for EST-SSR loci. Within these primer pairs, most of primer pairs (70.75%) yielded multiple or dispersive PCR bands, 31 (29.25%) had stable and clear PCR bands with expected size; the successful amplification was lower than that for T. grandis "Merrillii" (34.26%). Finally, 11 primer pairs (10.38%) were found with polymorphism (Tables 2 and 3), the success rate was also lower than that for the cultivar (17.59%) (Yi et al., 2016). Because the Torreya species are typical diploid plants (2n = 22) (Murray, 2013), the failed amplification that produced multiple or dispersive bands may be due to the highly repetitive sequences in conifer species (Kovach et al., 2010;Nystedt et al., 2013). Furthermore, N A across six populations of T. grandis (2.636) was lower than that reported for the cultivar (3.059) (Yi et al., 2016), C. japonica (3.227) (Ueno et al., 2012), and P. dabeshanensis (3.000) (Xiang et al., 2015), while higher than that for P. squamata (2.200) (Mao et al., 2015). The average PIC of 0.357 was higher than that reported for C. japonica (0.325) (Ueno et al., 2012) and P. squamata (0.283) (Mao et al., 2015), while lower than that for P. ponderosa (0.618) (Lesser, Parchman & Buerkle, 2012) and P. dabeshanensis (0.380) (Xiang et al., 2015). In addition, the average H O across six populations of T. grandis (0.540) was higher than that reported for T. grandis "Merrillii" (0.439), while the average H E (0.432) was lower than that reported for the cultivar (0.522) (Yi et al., 2016). These results suggest that the polymorphism level of the 11 EST-SSR markers developed in this study is moderate in comparison with other conifer species and comparable to T. grandis "Merrillii" (Yi et al., 2016). Thus, the EST-SSR markers developed here, along with those previously developed, will be useful to investigate population genetic diversity and structure of T. grandis.
The STRUCTURE analysis and NJ phylogenetic tree revealed a distinct genetic population (ZJ) among the six populations of T. grandis (Fig. 3). Meanwhile, nine of the 11 EST-SSR markers developed in this study significantly deviated from HWE for the ZJ population, while one to three EST-SSR markers deviated from HWE for the remaining five populations (Table 3). A number of mechanisms could cause this disequilibrium and some are hard to be easily excluded, such as mutation, selection, and migration (Frankham, Ballou & Briscoe, 2010). But we prefer to attribute it to the introduction of some foreign genes. Because Torreya species are dioecious, wind-pollinated plants (Kang & Tang, 1995), gene introgression from related species into the population is possible. Moreover, T. grandis cultivation has been carried out for thousands of years in China and the ZJ population is located in Zhuji (Zhejiang, China) where is the main cultivation region of T. grandis, while cultivars are usually directly screened from wild populations (Li et al., 2004;Cheng et al., 2007), gene introgression from the cultivar could not have generated such a strong divergence among the T. grandis populations. Gene introgression from congeneric species, such as T. jackii, sympatric distribution with T. grandis, is more possible, because the two species have large genetic divergence; interspecific hybridization has been found between them, based on molecular and morphological evidence (Kou et al., 2017). In addition, it is also possible that the distinct genetic composition of the ZJ population came from another related species, T. nucifera, endemic to Japan. T. nucifera is the most closely related species with T. grandis in morphology and genetic relationship (Kou et al., 2017), and the species has records of cultivation because its seeds are also edible (Katsuki & Luscombe, 2013). However, this speculation needs to be verified by more population samples and further investigation of molecular markers.

CONCLUSION
In summary, a whole transcriptome from the leaves of the conifer species T. grandis were produced by Illumina sequencing technology. A total of 69,920 unigenes were assembled, and all were annotated by searching against public protein databases. Across the transcriptome sequences, EST-SSRs (2,065) were detected and characterized with the distribution frequency of 2.75% of total unigenes and average number of 0.03 SSRs per unigene. Among these EST-SSRs, 1,339 primer pairs were successfully designed, and then 106 were randomly selected for the development of potential molecular markers. Of the 106 primer pairs, 11 polymorphic pairs, as the EST-SSR markers, showed a moderate level of genetic diversity (N A = 2.636; PIC = 0.357; H O = 0.540; H E = 0.432). In addition, genetic structure and phylogenetic analysis revealed two different genetic groups, indicating that the set of EST-SSR markers developed in this study can be effectively applied to population genetic investigation, and may be used for genetic breeding and conservation of T. grandis in the future.