Transcriptome Analysis of Cinnamomum chago: A Revelation of Candidate Genes for Abiotic Stress Response and Terpenoid and Fatty Acid Biosyntheses

Cinnamomum chago, an endangered species endemic to Yunnan province, possesses large economic and phylogenetic values in Lauraceae. However, the genomic information of this species remains relatively unexplored. In this study, we used RNAseq technology to characterize and annotate the C. chago transcriptome and identify candidate genes involved in special metabolic pathways and gene-associated simple sequence repeats (SSRs) and single-nucleotide polymorphism (SNP). A total of 129,097 unigenes, with a mean length of 667 bp and an N50 length of 1,062 bp, were assembled. Among these genes, 56,887 (44.07%) unigenes were successfully annotated using at least one database. Furthermore, 47 and 46 candidate genes were identified in terpenoid biosynthesis and fatty acid biosynthesis, respectively. A total of 22 candidate genes participated in at least one abiotic stress response of C. chago. Additionally, a total of 25,654 SSRs and 640 SNPs were also identified. Based on these potential loci, 55 novel expressed sequence tag (EST)-SSR primers were successfully developed. This work provides comprehensive transcriptomic data that can be used to establish a valuable information platform for gene prediction, signaling pathway investigation, and molecular marker development for C. chago and other related species. Such a platform can facilitate further studies on germplasm conservation and utilization of Lauraceae species.


INTRODUCTION
Genomic/transcriptomic techniques are information tools used to assess the physiological conditions of organisms in response to multiple stressors (Allendorf et al., 2010). Next-generation sequencing (NGS) is an inexpensive technology that can produce larger volumes of sequencing data than conventional methods, and NGS has revolutionized genomic and transcriptomic approaches in the field of biology (Metzker, 2010;Davey et al., 2011). With the advent of NGS and de novo assembly technology, RNA sequencing (RNA-seq) has been proven to be an efficient and cost-effective approach for studying the transcriptome of non-model organisms without using established genomic databases (Wang et al., 2009;Berdan et al., 2016). Also, RNA-seq can be used for the simultaneous study of nucleotide variations, gene expression levels across different tissue types, disparate time periods, and ecologically relevant variables in high resolution and large dynamic ranges (Wang et al., 2009;Harris et al., 2015). Therefore, this approach accelerates the examination of expression differences in ecologically important traits (Bonneaud et al., 2011;Schvartzman et al., 2018), phenotypic and behavioral plasticity (Aubin-Horth and Renn, 2009;Brawand et al., 2014), and genes with potential adaptive significance in changing environments for non-model species (Niu et al., 2015;Harris et al., 2015;Todd et al., 2016;Palma-Silva et al., 2016). This new sequencing tool is also valuable for the investigation, validation, and assessment of allelic information [e.g., single nucleotide polymorphisms (SNPs) and simple sequence repeats (SSRs)]. Furthermore, RNA-seq can be used to develop genetic markers for the population-level consequences of genetic variation (Davey et al., 2011;Zhang et al., 2017).
Species of Lauraceae are recognized for their significant ecological and economic value. Most of these species are not only conspicuous elements of tropical and subtropical evergreen broad-leaved forests but also important sources of camphor, spices, perfumes, nutritious fruits, phytomedicine, and highquality wood (Ravindran et al., 2003;Wang et al., 2007;Huang et al., 2016). To date, the genome of Lauraceae species has not been completely sequenced. Scholars have obtained the transcriptome sequences of few economic species through RNAseq. These species include Lindera glauca (Niu et al., 2015), Litsea cubeba (Han et al., 2013), Neolitsea sericea , and Persea americana (Chanderbali et al., 2009). Thus, the conversion and utilization of Lauraceae are complicated and restricted because of insufficiently systematic and in-depth research on the genomic/transcriptomic information of specific compounds and on the adaptability of these plants to a heterogeneous environment.
To date, transcriptome studies on Lauraceae have focused on the biosynthesis of terpenoids (Niogret et al., 2013;Niu et al., 2015;Yan et al., 2017) and fatty acids (FAs; Kilaru et al., 2015;Ibarra-Laclette et al., 2015). Few works have investigated the potential genes associated with the responses of laurel plants to multiple stresses under changing environments. Similar to other living organisms, plants are subjected simultaneously to different abiotic stresses, such as high irradiance, extreme temperature, high salinity, and water deprivation (Rizhsky et al., 2004). These stresses can disturb cellular homeostasis and lead to severe retardation in growth and development and death (Kotak et al., 2007). Therefore, the complex regulatory networks and metabolic pathways of plants in response to single and multiple concurrent abiotic stresses must be delineated to promote plant growth and development (Rasmussen et al., 2013). Elucidation of the transcriptome-level responses of plants to abiotic stresses offers a considerable opportunity to assess genes involved in the process of adaptation to environmental changes (Alvarez et al., 2015). Some studies described transcriptome changes in response to abiotic stresses in plants, such as Arabidopsis thaliana (Coolen et al., 2016), Cucumis sativus , Tamarix hispida (Yang et al., 2014), and Solanum dulcamara (Lortzing et al., 2017). These results revealed related genes or metabolism pathways that mediate stress tolerance.
Cinnamomum chago, an endangered Lauraceae species, was first reported by Sun and Zhao (1991). C. chago possesses evident morphological features similar to the two Asian Cinnamomum sections (Sect. Camphora Meissn. and Sect. Cinnamomum) (Sun and Zhao, 1991;Dong et al., 2016;Huang et al., 2016). Hence, C. chago plays an important role in the phylogeny and evolution of Cinnamomum. Furthermore, C. chago is a potential source of timber and oil. Dong et al. (2016) detected that the C. chago seeds contain higher protein content (14.5%) and lower fat content (45.66%) than most nuts. However, this perennial species presents a restricted and fragmented distribution along the mountains upstream of the tributaries of Lancang River of Yunlong County, Yunnan province (Sun and Zhao, 1991;Dong et al., 2016). The transcriptomic information of C. chago is indispensable for identifying and understanding the biological processes involved in the accumulation of terpenoid and FA in this plant and its environmental adaptation.
Considering the endangered status and the large ecological value of C. chago in Lauraceae, we performed the transcriptome sequencing of C. chago leaves based on an Illumina HiSeq 4000 sequencing platform. This study aims to (1) characterize and analyze the functional annotation of the C. chago transcriptome, (2) identify candidate genes involved in terpenoid biosynthesis, FA biosynthesis, and response to abiotic stress, and (3) identify and validate gene-associated SSRs and SNPs. To the best of our knowledge, this study is the first to report a de novo transcriptome analysis on C. chago. Characterization of the genetic elements in terpenoid biosynthesis, FA biosynthesis, and response to abiotic stress and the development of gene-associated molecular markers will facilitate the conservation and sustainable utilization of the germplasm resources of C. chago and other Lauraceae species.

Plant Materials and RNA Extraction
Leaf samples were obtained from three C. chago seedlings that grew for 1 year in a greenhouse. The samples were dissected, immediately frozen in liquid nitrogen, and stored at −80 • C until RNA extraction. According to the manufacturer's instructions, total RNA was isolated using the TRIzol R Reagent (Invitrogen, CA, United States) and purified using the Plant RNA Purification Reagent (Invitrogen, CA, United States) to remove the genomic DNA contamination. The RNA integrity was evaluated by agarose gel electrophoresis. The yield and purity of RNA were evaluated using a NanoDrop 2000 spectrophotometer (Thermo, MA, United States), with concentration higher than 50 ng/µL and 28S:18S higher than 1.8. The RNA integrity numbers were analyzed by Agilent 2100 (Agilent Technologies, CA, United States). Qualified RNA was subsequently used in the preparation of cDNA library and for Illumina deep sequencing.

cDNA Library Construction and Sequencing
The mRNA-seq library was constructed using the TruseqTM RNA Sample Prep Kit (Illumina, CA, United States). The poly (A) mRNA was isolated using poly-T oligo-attached magnetic beads, mixed with the fragmentation buffer, and randomly broken into small pieces of 300 base pair (bp) by divalent cations under increased temperatures. The first-strand cDNA was synthesized by a random hexamer primer and reverse transcriptase. The second-strand cDNA was synthesized by RNase H and DNA polymerase I. Subsequently, the synthesized cDNA with cohesive terminus was resolved with EB buffer for end reparation, poly (A) addition, and sequencing adapter ligation. The obtained fragments were separated by agarose gel electrophoresis. Suitable sequencing templates were selected to generate cDNA libraries through polymerase chain reaction (PCR) amplification. Finally, the libraries were sequenced at the Major Company (Shanghai, China) using the Illumina HiSeq4000 sequencing platform to obtain 2 × 150 bp paired-end reads. The raw sequencing data and the assembled data were deposited in the NCBI Sequence Read Archive (SRA) database and the Transcriptome Shotgun Assembly (TSA) sequence database, respectively.

Sequence Assembly and Annotation
Raw sequencing data were filtered by discarding contaminated adaptors, reads with excessive poly-N, reads of < 30 bp in length, empty reads, and reads with Q < 20 [Q = −10log10(E), E represents the sequencing error rate] using the SeqPrep program 1 to screen high-quality clean read data for de novo assembly. The obtained clean data were de novo assembled using the Trinity program 2 by default parameters, and the assembled contigs were further filtered and optimized using the TransRate 3 and CD-HIT-EST 4 programs. In addition, the Q20 and Q30 values, GCcontent, and sequence duplication level of the clean data were calculated (Grabherr et al., 2011). The fragments per kilobase per million reads (FPKM) values were generated using RSEM v.1.2.9 5 .
The assembled unigenes were aligned by BLASTX against public databases (E-value < 1e-5), including Nr, String, Swiss-Prot, Kyoto Encyclopedia of Genes and Genomes (KEGG), and Pfam, to obtain the protein functional annotation and classification information (Camacho et al., 2009). The Gene Ontology (GO) annotation information of these unigenes was obtained and categorized with respect to biological process, molecular function, and cellular component by the Blast2GO program (Conesa et al., 2005). We also classified and analyzed these unigenes in the COG database 6 according to the COG number obtained from String. The WEGO program (Ye et al., 2006) was used to classify all of the unigenes based on the GO annotation information. For removing possible contaminants from the transcriptome, we firstly filtered the contaminant sequences by BLAST searches and removed the non-green plant sequences (microbial and human contamination) based on the annotated categories in the KEGG and GO databases (Gray et al., 2010;Casseb et al., 2012). Secondly, we detected the candidate genes involved in terpenoid biosynthesis, FA biosynthesis, and response to abiotic stress, which are expressed in three of the libraries.

Identification of SSRs and SNPs
Potential SSRs were detected using the MISA version 1.0 program 7 . The SSR parameters were designed to contain mononucleotides, dinucleotides, trinucleotides, tetranucleotides, pentanucleotides, and hexanucleotides with minimum repeat numbers of 10, 6, 5, 5, 5, and 5, respectively. The expressed sequence tag (EST)-SSR primers were designed using Primer3 8 with default parameters. Later, the designed EST-SSR primers were synthesized by Sangon Company (Shanghai, China).
Candidate SNPs were detected using Samtools 9 and VarScan V.2.2.7 10 based on the following criteria: (1) minimum coverage of ten reads, (2) base quality where base calls show low Phred quality (<30), and (3) the frequency of mutated bases is higher than 30% among all reads covering the position.
Five accessions (each accession contains three individual plants) of C. chago obtained from Yunnan Province were selected to validate the putative SSRs and evaluate the polymorphism of related markers. The intact genomic DNA of each individual was extracted from dried leaves with the modified CTAB method (Doyle, 1991). The PCR reactions were referenced and modified into a total reaction volume of 25 µL, which contained DNA (15 ng), 10 × PCR buffer (2.5 µL), dNTPs (10 mM, 2 µL), primer (0.5 µL each), Taq DNA polymerase (5 U/µL, 0.5 µL; Takara, Shiga, Japan), and double-distilled water (18.5 µL). For each reaction, we used the following conditions: initial 5 min of denaturation at 94 • C, 35 cycles of 30 s at 94 • C, 30 s of annealing at Tm with different primers, 15 s of extension at 72 • C, and a final extension for 7 min at 72 • C. All purified PCR products of the EST-SSR marker were visualized on an ABI 3730xl Capillary DNA Analyzer (Sangon, Shanghai, China). The number of alleles and the observed heterozygosity (Ho) and expected heterozygosity (He) of all microsatellite loci were calculated by GenAlEx version 6.3 (Peakall and Smouse, 2006). The polymorphism information content (PIC) values were calculated by PIC_CALC version 0.6 (Nagy et al., 2012). The Hardy-Weinberg equilibrium (HWE) of all loci was determined by POPGENE version 1.31 (Yeh et al., 1997).

Sequence Analysis and de novo Assembly
To characterize the first transcriptome of C. chago, this study generated 56,891,547 raw reads from the constructed cDNA libraries (Supplementary Table S1). After a rigorous quality check and data filtering, 55,525,751 single clean reads with 98.29% Q20 bases (qualities that are larger than 20 for every base) were obtained, and these reads contained a large base number of 8,235,167,620 bp. The GC percentages for these clean reads were 48.57%. A total of 179,491 high-quality transcripts were assembled from clean reads, with a mean length of 628 bp and a N50 length of 1,025 bp (Table 1). Subsequently, these transcripts were attributed to alternative splicing of 129,097 unigenes with a mean length of 667 bp and a N50 length of 1,062 bp ( Table 1). The lengths of the unigenes of C. chago are less than those of N. sericea (733 bp, Chen et al., 2015), L. cubeba (834 bp, Han et al., 2013), and C. camphora (680 bp, Shi et al., 2016b). This result may be due to the fact that most of the unigenes (90,287, 69.95%) exhibit lengths ranging from 1 to 500 bp (Supplementary Figure S1) and due to the absence of a well-assembled reference genome in C. chago (Zhou et al., 2016). This result provided sufficient and highquality unigenes for the investigation of potential genes and identification of SSRs and SNPs in C. chago. Abundant candidate genes involved in a specific metabolic pathway and response to abiotic stress were identified. These transcriptome data will provide a basis for future studies on molecular biology, molecular breeding, physiology, and biochemistry, thereby, facilitating the protection and utilization of C. chago and other Lauraceae species.

Sequence Functional Annotation
Among the 129,097 unigenes, 56,887 (44.07%) unigenes were successfully annotated to at least one database (Figure 1). This ratio is comparable to those of other non-model organisms of Lauraceae, in which the percentage of annotated unigenes ranges from 38 to 56% (Han et al., 2013;Chen et al., 2015;Niu et al., 2015;Shi et al., 2016b). This result may be due to the fact that the unannotated unigenes represent untranslated mRNA regions, chimeric transcript sequences, nonconserved protein genes, assembly errors or specific novel genes to the species, and the different environmental stresses experienced by this species (Qiu et al., 2017;Horn et al., 2017).

GO Classification
The GO annotation was used to categorize the functions of the predicted C. chago unigenes (Gene Ontology Consortium, 2004). We categorized 17,166 (30.18%) unigenes to at least one main GO category, and these unigenes included 8,004 unigenes in cellular component, 9,043 unigenes in molecular function, and 32,193 unigenes in biological process (Figure 2). The distribution of the ontology categories was consistent with the transcriptomes of other plant species, such as Tetrix japonica (Qiu et al., 2017), N. sericea , Vaccinium cyanococcus (Rowland et al., 2012), and Salix integra (Shi et al., 2016a). This result demonstrated that most of the sequenced unigenes were responsible for the fundamental biological metabolism and composition of C. chago (Niu et al., 2015), such as cell, cell part, and membrane. Given that abundant unigenes were assigned with more than one GO term, the total number of assigned categories was larger than the total number of annotated unigenes (Qiu et al., 2017). Subsequently, these annotated unigenes were further grouped into 50 subcategories. For the molecular function, a majority of the annotated unique sequences were assigned to catalytic activity (9,560, 55.69%, GO: 0003824), binding (7,907, 46.06%, GO: 0005488) and transporter activity (973, 5.67%, GO: 0005215). For the cellular component, cell (5,955, 34.69%, GO: 0005623), cell part (5,888, 34.30%, GO: 0044464), and membrane (5150, 30.00%, GO: 0016020) were the most represented GO terms. In the biological process category, metabolic process (9,043, 52.68%, GO: 0008152), cellular process (8004, 46.63%, GO: 0009987), and single-organism process (5384, 31.36%, GO: 0044699) were the most represented categories. In addition, analysis of the top 20 subcategories of GO annotations showed that nine terms were from the biological process category, and seven terms were from the cellular component category (Figure 2).
Terpenoids are important raw materials for flavors, fragrances, spices and as medicines against cancer, malaria, inflammation, and a variety of infectious diseases (Wang et al., 2005;Zhao et al., 2014). All the obtained candidate genes serve as an important basis for further exploration of the regulatory mechanism of C. chago terpenoid biosynthesis and provide references for the investigation of other aromatic plants.

Candidate Genes Related to FA Biosynthesis
Plants represent a large reservoir of FA diversity and synthesize at least 200 different types of FAs (Van De Loo et al., 1995).
Currently, most plant oils constituting FAs are used primarily as edible oils and as a renewable and easily extracted resource for a variety of industrial applications (e.g., biodiesel, paints, lubricants, coatings, or inks) (Lee et al., 1998;Simopoulos, 2001;He et al., 2013). Therefore, the factors limiting the accumulation of unusual FA in plants should be further understood (Thelen and Ohlrogge, 2002).
Most enzymes involved in lipid biosynthesis were identified in the C. chago transcriptome. A total of 65, 38, 21, and 60 unigenes were homologous with 15, 12, 4, and 15 known enzymes in FA biosynthesis, biosynthesis of unsaturated FAs (UFAs), linoleic acid (LA) metabolism, and alpha-linolenic acid (ALA) metabolism, respectively (Supplementary Table S5). Among these unigenes, one unigene (FPKM value was 5.02) was identified to encode the acetyl-CoA carboxylase (K11262; Supplementary Table S6), which is a regulatory enzyme controlling the rate of FA synthesis and catalyzing the first reaction to generate an intermediate malonyl-CoA (Li et al., 2015). This enzyme was also detected in P. americana and oil palm (Dussert et al., 2013;Kilaru et al., 2015).
In conclusion, results showed that an active FA biosynthesis pathway caused by abundant key related genes was detected in C. chago. Considering the crucial role of the obtained candidate genes, further studies focusing on the identification of factors and encoding functional genes would probably reveal the molecular mechanisms underlying FA biosynthesis in C. chago and other related species.
We also discovered that 22 identified genes existed in multiple-stress responses in C. chago ( Table 2). The EREBP-like factor (EREBP, K09286), heat shock protein 90kDa beta (HSP90B/TRA1, K09487), and annexin A7/11 (ANXA7_11, K17095) were implicated in water deprivation, cold, and heat simultaneously. A total of 12 genes were both assigned to water deprivation and cold stress response, of which Gst, phospholipase D1/2 (K01115), and sucrose synthase (K00695) were the most abundant enzymes. Moreover, five genes synchronously participated in cold       and heat stress responses. The heat shock 70kDa protein 1/8, calcium-binding protein CML, and chaperonin GroEL (groEL/HSPD1, K04077) were the top three genes. The solute carrier family 15 (peptide/histidine transporter), member 3/4 (SLC15A3_4/PHT) was the only gene that concurrently existed in the process of water deprivation and pH stress response ( Table 2).
In general, the plant multiple-stress tolerance mechanism consists of a complex network by which several pathways overlap and interact with one another (Timperio et al., 2008). Among the 22 candidate genes, the interaction mechanism of some genes referring to abiotic stress, such as GST, HSP, and EREBP, was comprehensively studied in other species (Dietz et al., 2010;Kumar et al., 2013;Zhu et al., 2014). However, in C. chago, even in Lauraceae, the role of these putative genes in plant growth, development, and responses to environmental stresses is unknown. Investigation on these candidate genes related to abiotic stress will further facilitate our understanding of the genetic basis of adaptation and provide information to the growing body of research on the ecological and evolutionary consequences of environment in C. chago and other Lauraceae species.

Identification of SNPs and SSRs
The SSRs and SNPs, representing uncorrelated patterns of diversity and divergence, exhibit the advantage of being highly polymorphic, codominant, highly reproducible, stable, and reliable (DeFaveri et al., 2013). Transcriptome sequencing on multiple individuals is an effective way to identify and validate SNPs and SSRs (Huang et al., 2015). The identification and validation of gene-associated SSRs and SNPs will provide an important tool in understanding the genetic basis of adaptive trait, population genetic structuring, and species relatedness for C. chago and other related species in Lauraceae Huang et al., 2015;Jardine et al., 2016).
According to the putative EST-SSRs, we randomly selected and designed 100 primer pairs to synthesize and evaluate their ability to amplify and assess polymorphism in C. chago. Among these EST-SSR primers, 85 primers successfully amplified PCR products. The 15 remaining primer pairs failed to generate PCR amplification or amplified remarkably weak bands at various annealing temperatures. Among the 85 successful primers, 60 primers presented the expected correct size of amplifications, 16 of them were longer or shorter than the expected size, and 9 primer pairs generated significantly multiple bands. Finally, 55 of the 60 validated primers were polymorphic among the five C. chago accessions, and the five remaining primers were monomorphic (Supplementary Table S15).
Diversity estimation from microsatellites showed that a total of 329 alleles were obtained from the 55 novel EST-SSR primers. The number of alleles ranged from 2 to 13 with an average of 5.982 per locus. The H O and H E values were in the range of 0.00-1.00 (mean, 0.759) and 0.111-0.761 (mean, 0.519), respectively. The PIC values ranged from 0.124 to 0.872, with a mean value of 0.570. Furthermore, 14 loci showed PIC values that were smaller than 0.50, and 41 loci showed PIC values that were larger than 0.50 (Supplementary Table S15). The putative EST-SSR loci in C. chago obtained by the novel EST-SSR markers were more highly polymorphic than those in other Lauraceae species.

SNPs
We obtained a total of 640 putative EST-SNPs with the frequency per kb of 0.01 in C. chago (Table 3). Among them, transition SNPs were the most predominant, of which 396 (61.87%) SNPs were identified; these SNPs contained 202 A/G (31.56%) and 194 C/T (30.31%) ( Table 3). Additionally, the most common base variations were A/T (74; 11.56%) in transversion SNPs, sequentially followed by G/T (63; 9.84%) and A/C (63; 9.84%) ( Table 3). This result was consistent with the conclusion that transition mutations are better tolerated than transversions because their synonymous mutations are generated in proteincoding sequences during natural selection (Mantello et al., 2014).
The estimated locations were obtained for 291 of the total 640 SNPs. The remaining locations were uncertain because they extended over both estimated coding and non-coding regions. Most SNPs (145, 23.00%) occurred frequently in the third codon regions (Supplementary Figure S5), which may be due to selective pressures on SNPs in the coding regions (Li et al., 2013).

CONCLUSION
Considerably valuable transcriptome sequencing data and annotation resources were developed for C. chago using the RNA-seq technology. A total of 129,097 unigenes with a mean length of 667 bp and a N50 length of 1,062 bp were assembled from 55,525,751 clean reads with 98.29% Q20 bases. Among these unigenes, only 56,887 (44.07%) unigenes were successfully annotated using at least one database; these unigenes contained 40,323 unigenes (31.23%) for Pfam, 40,810 unigenes (31.61%) for Swiss-Prot, 29,660 unigenes (49.47%) for KEGG, 8,701 unigenes (22.97%) for COG databases, 17,166 unigenes (13.30%) for GO databases, and 42,549 unigenes (32.96%) for the Nr database. Furthermore, a total of 116 unigenes with 47 candidate genes and 184 unigenes with 46 candidate genes were identified in terpenoid backbone biosynthesis and FA biosynthesis, respectively. A large number of putative genes related to abiotic stress were also identified in C. chago. Among these genes, 22 candidate genes participated in at least one stress response. In addition, 25,654 SSRs and 640 SNPs were identified, respectively. A total of 55 novel EST-SSR primers were also successfully developed for C. chago. These assembled transcriptome data and annotation information can be used to establish a valuable information platform for future research on genetic and genomic mechanisms underlying the specific metabolic pathway and germplasm conservation and utilization in C. chago.

DATA ACCESSIBILITY
Raw sequence reads were deposited in the Short Read Archive (SRA) (http://www.ncbi.nlm.nih.gov/sra) under BioProject PRJNA387488 and SRA Accession No. SRP107900 (SRS2220410; SRS2220411; SRS2220412). Assembled data have been deposited in TSA Database with the above BioProject identification number.

AUTHOR CONTRIBUTIONS
S-KS, XZ, and Y-HW designed the study. S-KS and Y-HW obtained the funding. XZ, S-KS, and Y-HW performed the fieldwork and seedling propagation. XZ, YZ, and S-KS performed the laboratory work and analyzed the data. XZ and S-KS wrote and revised the manuscript. All authors read and approved the manuscript.