Identification, expression, and phylogenetic analyses of terpenoid biosynthesis-related genes in secondary xylem of loblolly pine (Pinus taeda L.) based on transcriptome analyses

Loblolly pine (Pinus taeda L.) is one of the most important species for oleoresin (a mixture of terpenoids) in South China. The high oleoresin content of loblolly pine is associated with resistance to bark beetles and other economic benefits. In this study, we conducted transcriptome analyses of loblolly pine secondary xylem to gain insight into the genes involved in terpenoid biosynthesis. A total of 372 unigenes were identified as being critical for oleoresin production, including genes for ATP-binding cassette (ABC) transporters, the cytochrome P450 (CYP) protein family, and terpenoid backbone biosynthesis enzymes. Six key genes involved in terpenoid biosynthetic pathways were selected for multiple sequence alignment, conserved motif prediction, and phylogenetic and expression profile analyses. The protein sequences of all six genes exhibited a higher degree of sequence conservation, and upstream genes were relatively more conserved than downstream genes in terpenoid biosynthetic pathways. The N-terminal regions of these sequences were less conserved than the C-terminal ends, as the N-terminals were quite diverse in both length and composition. The phylogenetic analyses revealed that most genes originated from gene duplication after species divergence, and partial genes exhibited incomplete lineage sorting. In addition, the expression profile analyses showed that all six genes exhibited high expression levels during the high-oleoresin-yielding phase.


INTRODUCTION
The loblolly pine's amenability to plantation management, high wood yields, and fast growth make it one of the most economically important forest species in the world (Lu et al., 2017). Loblolly pine is also a potential renewable feedstock for alternative energy and fuel, as well as an important source for oleoresin (Olatunde et al., 2017). Oleoresin is a viscous mixture of terpenoids stored in resin canals or blisters in the stems of conifers (Trapp massoniana (Liu et al., 2015), Lindera glauca (Niu et al., 2015), Thapsia laciniata (Drew et al., 2013), Isodon rubescens (Su et al., 2016), and Calotropis procera (Pandey et al., 2016) were conducted to identify candidate genes related to terpenoid biosynthesis. The loblolly pine (Pinus taeda) genome has been sequenced (Zimin et al., 2014) and contains 20.15 billion bases, approximately seven times that of the human genome. It is the largest sequenced genome and the most complete published conifer genome sequence, yet few studies have reported transcriptome analyses of Pinus taeda secondary xylem. We performed transcriptome analyses of loblolly pine secondary xylem for two reasons: first, the secondary xylem is the main tissue involved in the secretion of terpenoids; second, the collection of resin in South China is based on the bark streak method (Fig. 1A) instead of using the pine needles as the material for alcohol extraction. We conducted an integrated analyses of terpenoid biosynthesis encompassing the identification of related genes, the phylogenetic relationships among representative genes in other characterized species, the expression profiles of different oleoresin-yielding stages, and the prediction of conserved motifs. We provide an extensive perspective of regulatory factors involved in terpenoid biosynthesis in the loblolly pine, as well as a portfolio of candidate genes for future study.

Plant materials
Because of the content of oleoresin was different between individuals, and influenced by the meteorological factors. In this study, three 15-year-old loblolly pine trees from the Yingde Research Institute of Forestry in Guangdong Province, China, were used to RNA sequencing to find out the genes involved in terpenoid biosynthetic pathway. They shared no lineage relationships. During the high-oleoresin-yield stage (August) (An & Ding, 2010), secondary xylem tissue samples ∼1 mm in thickness were collected from the trees after removing the bark at breast height using a firewood chopper and removing the phloem and cambium using a 75% alcohol-treated grafting knife (Fig. 1B). The samples were immediately immersed in liquid nitrogen and then stored at -80 • C until RNA extraction.

RNA isolation
Total RNA was extracted from each sample following a protocol modified from Chang, Puryear & Caimey (1993). Briefly, secondary xylem tissues (∼150 mg) were ground into powder under liquid nitrogen, then immediately transferred to 1.5 mL RNase-free centrifuge tubes. Next, 1 mL CTAB extraction buffer (100 mM Tris-HCl, pH 8; 25 mM EDTA, pH 8; 2 M NaCl; 2% CTAB; 2% PVP) and 20 µL β-mercaptoethanol were added to the tubes, mixed by vortexing for 2 min, and incubated at 65 • C for 20 min. Following incubation, samples were centrifuged for 10 min at 12,000 rpm at 4 • C. The supernatant was collected into a new 1.5 mL centrifuge tube and an equal volume of chloroform/isoamyl alcohol (24:1) was added, mixed by vortexing for 2 min, and centrifuged for 5 min at 12,000 rpm at 4 • C. After an additional chloroform/isoamyl alcohol extraction, the supernatant was collected into a new tube and a 1 4 -volume of 10 M LiCl was added, mixed, and incubated at 4 • C for 6 h. After precipitation for 6 h, the sample was centrifuged for 10 min at 12,000 rpm at 4 • C, and the supernatant was discarded. The sediment was washed twice with 500 µL 75% EtOH and 500 µL anhydrous alcohol precooled at -20 • C, respectively. The sample was dried for 5 min, then resuspended in 50 µL DEPC-treated water. Total RNA concentration and integrity were determined, and high-quality RNAs were used for cDNA library construction. cDNA library construction, sequencing, and sequence assembly cDNA libraries were constructed following the protocol described by Foucart et al. (2006). Libraries were sequenced with 2 × 100 paired-end reads using the Illumina HiSeq 4000 platform at Science Corporation of Gene (Guangzhou, China). The Trinity method (Haas et al., 2013) was used to assemble the high-quality reads into unigenes. RNA sequence data were deposited in the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA) under the study accession number PRJNA482703.

Functional annotation and identification of terpenoid biosynthesisrelated genes
To identify genes related to terpenoid biosynthesis, all unisequences were queried against five public protein databases (UniProt, Nr, KEGG, KOG, and GO) with an Evalue threshold <10 −5 to detect unigenes homologous to previously identified relevant genes. The Blast2GO software (version 4.1.5) with the default parameters (Conesa, Terol & Robles, 2005) was used to obtain GO annotation of unigenes. WEGO (Ye et al., 2006) was used to perform GO functional categorization for all unisequences. KEGG pathway analyses were performed using BLASTx queries against the KEGG database (http://www.genome.jp/kegg/).

Phylogenetic analyses
Phylogenetic analyses were performed based on the protein sequences of six completely assembled representative genes (if more than one homologous unigene was fully assembled, the unigene with the largest sequence length was used for subsequent analyses); alignments were performed using the MUSCLE (v3.8.30) program. Phylogenetic trees were constructed using two methods: maximum likelihood (ML) in RAxML (8.1.5) with 1,000 bootstrap replicates in view of the best-scoring model, and Bayesian inference (BI) in MrBayes (3.2.6) with 10,000,000 generations. The best models were selected using ProtTest3, and the BI tree was reliable with a standard deviation of split frequencies <0.01. The phylogenetic trees were drawn with MEGA 7.0 (Kumar, Stecher & Tamura, 2016).

Quantitative reverse transcription PCR (qRT-PCR) analyses
To investigate the expression profiles of six representative genes involved in the biosynthesis of terpenoids, total RNAs from three different oleoresin-yielding stages were extracted as described for the cDNA library preparation. Samples from two low-oleoresin-yield phases were collected in April and October; the high-oleoresin-yield RNA samples (August) were the same samples used for cDNA library construction. Reverse transcription was performed using a PrimeScript TM RT reagent kit with a gDNA Eraser (Code No: RR047A; TaKaRa, Dalian, China) following the manufacturer's instructions. The primers for the six candidate gene sequences were designed using Primer Premier 5.0, and Actin (F: gagcaaagagatcactgcacttg; R: ctcatattcggtcttggcaatcc) was selected as an internal control (primers are listed in The qRT-PCR analyses were performed using a LightCycler 480 System (Roche, Basel, Switzerland) and a TB Green Premix Ex Taq II kit (Code No: RR820A; TaKaRa). A 20 µL reaction containing 2 µL synthesized cDNA, 10 µL 2×TB Green Premix Ex Taq mix, 0.8 µL 10 µM forward primer, 0.8 µL 10 µM reverse primer, and 6.410 µL sterile distilled water was amplified using the following procedure: one cycle of 95 • C for 30 s, followed by 40 cycles of 95 • C for 5 s, 55 • C for 30 s, and 72 • C for 30 s. Three biological replicates (consistent with the samples used for library construction) of each oleoresin-yielding stage were performed, and each PCR reaction included three technical replicates. The relative expression levels of candidate genes were calculated using the 2 − Ct method (Livak & Schmittgen, 2001).

RNA-Seq and transcriptome assembly
To obtaine a summarization of the loblolly pine transcriptome in secondary xylem tissues during the high-oleoresin-yielding phase (August), total RNAs extracted from three samples were used for RNA-Seq on an Illumina HiSeq 4,000 platform. After quality appraisal and low-quality data screening, 29,559,842 high-quality reads were generated and assembled

Functional annotation and classification
A total of 31,586 unigenes (42.45%) were annotated with a threshold E-value <10 −5 after BLASTx searches against the five protein databases (Table S2) Fig. 3). Based on the Nr database, 5,807 (18.38%) unigenes exhibited significant homology with sequences from Picea sitchensis, and 1,877 (6.10%) and 1,305 (4.13%) unigenes shared high similarity with sequences from Amborella trichopoda and Nelumbo nucifera, respectively (Fig. S1). GO assignments based on sequence homology were used to classify the functions of annotated unigenes. In all, 22,433 unigenes were distributed into three main GO terms (''biological process'', ''cellular component'' and ''molecular function'') and 44 secondaryclass GO terms (Fig. 4). The two largest GO functional groups of the ''molecular function'' term were ''catalytic activity'' and ''binding''. The most represented ''cellular component'' categories were ''membrane part'' and ''cell part''. The most represented ''biological process'' categories were ''metabolic process'', ''single organism process'' and ''cellular process''. Several important secondary-class GO terms related to terpenoid metabolism were also represented at lower numbers, such as ''response to stimulus'', ''transporter activity'' and ''biological regulation'' (Fig. 4). In general, the GO analyses revealed that most unigenes were responsible for basic biological process regulation and metabolism.

Identification of oleoresin biosynthesis-related genes
After combining the functional annotation with previous research results, a total of 249 unigenes (not including the 123 unigenes involved in the terpenoid biosynthetic pathway) were identified as crucial for oleoresin biosynthesis. Among these, 105 ABC transporter genes were identified and mainly divided into seven subfamilies (A, B, C, D, F, G, and I) (Table S4). In all, 67 genes were annotated as cytochrome P450 family proteins, including many known subfamilies related to terpenoid biosynthesis, such as cytochrome P450 family members CYP720, CYP71, CYP701, CYP51, CYP76, CYP704, and CYP716 (Table S5). In addition, there were 20 alcohol dehydrogenase genes, 10 aldehyde dehydrogenase genes, 17 pathogenesis-related protein (PR5 and PR10) genes, 13 ethylene-responsive transcription factor genes, 13 terpenoid synthase genes (Table 2), three non-specific lipid-transfer protein genes, and one phosphomethylpyrimidine synthase gene (Table S6).

Multiple sequence alignment and conserved motif prediction
BLASTP searches against the Nr database for comparative sequence analyses of DXR, DXS, HMGR, IPPI, GGPS, and FPS proteins were conducted separately. A total of 14 additional representative plant species containing the six protein sequences were retrieved (if multiple protein sequences were retrieved for a species, the one with the highest BLASTP score was selected for analyses) (Table 3). Among them were three gymnosperm species, four monocotyledons, and seven dicotyledons. Multiple sequence alignment analyses revealed that the FPS (Fig. S5) and GGPS (Fig. S5F) sequences were less conserved compared to the other four protein sequences, and the C-terminal regions of the six protein sequences were more conserved than the N-terminal ends. The N-terminals of these protein sequences were quite diverse in both length and composition (Fig. S5). In our study, the MEME motif search tool was used to discover the motifs of the 15 DXR, DXS, HMGR, IPPI, GGPS, and FPS protein sequences, respectively. The results identified a total of 10 motifs for each sequence (except the IPPI protein sequence, which had 9), named motifs 1-10, separately ( Fig. 7 and Fig. S6). For the DXR protein sequences, six identical instances of motif 3, and two identical instances of motifs 7, 8, and 9 were discovered in different locations, respectively. All members contained motifs 2, 3a, 4, 3b, 5, 3c, 6, 7b, 8a, 9b, and 10. Interestingly, the gymnosperm clade were all missing motifs 5, 3d, and 8b (except the Picea sitchensis DXR sequence) (Fig. 7). Among the DXS protein sequences, motifs 5-10 were more conserved than other motifs. Excepting the Taxus × media DXS sequence, the other 14 members contained motif 1 (Fig. S6A). Analyses of conserved sequence motifs of IPPI sequences revealed that motifs 4-8 were present in all members, and all but Taxus × media contained motif 9. Motifs 1-3 were apparently less conserved as they were only detected in two, six, and one members, respectively (Fig. S6C). For the HMGR protein sequences, 13 members contained motifs 1-10; the Elaeis guineensis HMGR sequence lost motif 10 and Pinus taeda was missing motifs 6-10 (Fig. S6B). For FPS protein sequences, motifs 2-5 and 7-10 were present in all 15 members; only Oryza sativa japonica was missing motif 1. Interestingly, four of the five members containing  motif 6 came from the gymnosperm clade (Fig. S6E). For GGPS protein sequences, only four motifs were present in all 15 members, and only the three gymnosperm members contained motif 2 (Fig. S6D).

Phylogenetic analyses
To further investigate the evolutionary relationships of Pinus taeda terpenoid backbone biosynthesis-related genes to those from other plants, six genes annotated as encoding terpenoid backbone biosynthesis key enzymes were selected. Among them were three upstream genes (DXS, DXR, and HMGR), two downstream genes (FPS and GGPS), and one gene located at the branching point (IPPI ) (Fig. 8). For each gene, a phylogenetic tree was built using amino acid sequences from 15 species (Fig. 9). Phylogenetic analyses results showed that DXS, DXR, HMGR, and IPPI had the same unrooted topology. Amino acid sequences of DXS, DXR, HMGR, and IPPI genes from the 15 species were divided into three distinct monophyletic clades representing gymnosperms, dicotyledons, and monocotyledons (Figs. 9A, 9B, 9E, 9F). According to the branch lengths of each species in the phylogenetic tree, we speculate that the degree of evolution of these four genes roughly corresponds to the species divergence and traditional classification. However, FPS and GGPS exhibited anomalous unrooted molecular trees. The four FPS genes from the monocotyledons were divided into two branches, and the FPS gene of Ananas comosus had the longest branch length (Fig. 9D). In the phylogenetic tree of GGPS genes, the branch containing the two monocotyledons Oryza sativa japonica and Elaeis guineensi s displayed a significantly higher degree of evolution than the other branches. In addition, the Zea mays (a typical monocotyledon) GGPS gene was clustered into the dicotyledon clade (Fig. 9C). Based on the branching patterns of the phylogenetic trees, roughly similar tree topologies were seen for all genes in the gymnosperm and monocotyledon clades, respectively. For Pinus taeda in the gymnosperm clade, except for the DXS gene (orthologous to the Ginkgo biloba gene), the other five genes were orthologous to the Picea sitchensis genes (Fig. 9). A similar situation also occurred in monocotyledon clade. For Oryza sativa japonica, excepting the GGPS gene (orthologous to the Elaeis guineensis gene), the remaining five genes were orthologous to the Zea mays genes. However, in the dicotyledon clade, Glycine max and Morus notabilis were clustered on the same branches in the DXR and DXS gene trees, while they were on different branches in phylogenetic trees of the other four genes. The DXR and HMGR genes from Populus trichocarpa and Arabidopsis thaliana were more closely related than the DXR and HMGR genes of the other dicotyledons. This implies that many gene duplication events have occurred independently in different species, regardless of their true species trees. Still, the DXS and IPPI genes of Arabidopsis thaliana, a representative dicotyledon, branched separately from the other dicotyledon sequences.

Expression profile analyses by qRT-PCR
To further understand the expression profiles of the six representative genes, RNA samples from three different oleoresin-yielding stages were subjected to qRT-PCR analyses. All six were up-regulated in August compared to April and October (Fig. 10). Compared to expression levels in April, four genes (IPPI, DXS, FPS, and DXR) increased more than 6.6-fold in August, while GGPS and HMGR increased 2.8-and 3.6-fold, respectively. Interestingly, compared to levels in October, all six genes increased ∼2-fold in August (Fig. 10). In summary, the six representative genes involved in terpenoid biosynthesis exhibited high expression levels during the high-oleoresin-yielding stage.

DISCUSSION
Terpenoids play a critical role in the chemical and physical defense systems of conifers, allowing them to cope with attacks from herbivores and pathogens. In addition, as important secondary metabolites, terpenoids are widely used in industrial chemicals.
Research on the mechanisms of terpenoid biosynthesis and increasing terpenoid yields has important economic value and biological significance. In this study, transcriptome analyses were performed on secondary xylem tissues of Pinus taeda. A total of 29,559,842 clean reads were obtained and assembled into 74,402 unisequences with a mean length of 1,459 bp. Only 31,586 unigenes (42.45%) were annotated by the five public databases. A total of 5,807 unigenes exhibited significant homology with sequences of Picea sitchensis. However, only 450 unigenes matched proteins from Pinus taeda. One possible explanation for this is the lack of available genomic information on Pinus taeda in public databases. Another possibility is related to the transcript material. Many studies on the genome of Pinus taeda are based on pine needles or seeds, and few studies have examined secondary xylem tissues (Zimin et al., 2014;Westbrook et al., 2013). After GO annotation, a mass of unigenes were identified as being involved in catalytic activity, binding, cellular processes, and metabolic processes. These results indicate that oleoresin biosynthesis in Pinus taeda involves many unique processes and pathways. Furthermore, KOG functional classification and KEGG pathway analyses classified only 18,786 and 10,266 unigenes, respectively, far less than the number of unigenes annotated by the KOG (24,871) and KEGG (27,942) databases, as many genes in these databases have not been assigned functional classifications or pathways. Based on the functional annotation results and previous research, a total of 372 unigenes were identified as involved in oleoresin biosynthesis. These unigenes were mainly speculated to encode terpenoid synthase, terpenoid backbone biosynthesis enzymes, cytochrome P450, ABC transporters, pathogenesis-related proteins, alcohol dehydrogenase, and aldehyde dehydrogenase, all of which have been previously reported as crucial for oleoresin biosynthesis. One previous study showed that ABC transporter expression is related to terpenoid production in loblolly pines (Materna et al., 2006). Pathogenesis-related proteins (PR5 and PR10) are commonly recognized as playing important roles in plant biological defense, particularly in the fight against fungal pathogens. PR5 proteins are induced by different phytopathogens in many plants and share significant sequence similarity with thaumatin-like protein (Han et al., 2017). The expression of PR10 proteins is also associated with many biotic stresses (Hashimoto et al., 2004;EL-kereamy et al., 2009). These results are consistent with previous reports that transcript levels of a PR10 gene from Prunus domestica L. were significantly increased after two varieties of fungal infections (Lois, Gallego & Campos, 2000). Aldehyde dehydrogenase and alcohol dehydrogenase were reported to participate in the oxidation process of dihydroartemisinic aldehyde to dihydroartemisinic acid (Teoh, Polichuk & Reed, 2009) and artemisinic alcohol to artemisinic aldehyde, respectively (Polichuk et al., 2010). Cytochrome P450 family proteins are one of the largest classes of proteins involved in plant terpenoid secondary metabolites. In our study, a total of 67 unigenes of CYPs were identified and classified into 22 CYP subfamilies based on homology analyses. Among them were many important CYP subfamilies that participate in oxidation of various terpenoids, such as the CYP720B subfamily which is involved in the two consecutive oxidations processes in the formation of a series of diterpene resin acids with diterpene alcohols and aldehydes as substrates (Hamberger et al., 2011;Geisler et al., 2016). However, in some cases the third oxidation step of diterpene resin acids may be catalyzed by an aldehyde dehydrogenase (Lupien et al., 1999). In addition, many other subfamilies including CYP71A (Bertea et al., 2001), CYP71B (Ginglinger et al., 2013), CYP71D (Schalk & Croteau, 2000), CYP716A (Tamura et al., 2017), CYP701 (Itkin et al., 2016), CYP88 (Helliwell et al., 2001), CYP76B, andCYP76C (Höfer et al., 2014) have all been reported to be involved in the monooxygenation process of various terpenoids.
To date, many genes involved in the terpenoid biosynthetic pathway have been identified and well-studied in many other plants. In our study, 74 unigenes were identified to participate in terpenoid backbone biosynthesis. We further analyzed six representative unigenes (DXR, DXS, HMGR, IPPI, GGPS, and FPS) previously reported to be involved in the terpenoid biosynthetic pathway. In a previous study, two types of DXS (type I and type II) were identified to catalyze the first reaction of the MEP pathway and play a rate-limiting role in the production of MEP-derived isoprenoids (Schmidt & Gershenzon, 2008). In addition to DXS, DXR (Cordoba, Salmi & León, 2009) and IPPI (Zhao et al., 2016;Berthelot et al., 2012) also have rate-limiting roles in isopentenyl diphosphate and dimethylallyl diphosphate synthesis. It is well-accepted that HMGR is a key enzyme of the MVA pathway, which catalyzes the conversion of 3-hydroxy-methylglutaryl-CoA to mevalonate (Cao et al., 2010). Previous studies have suggested that over-expression of HMGR genes significantly increased the accumulation of terpenoids (Jayashree et al., 2018). FPP and GGPP are the precursors for the biosynthesis of sesquiterpenes and diterpenes, respectively. FPS is a key enzyme in isoprenoid biosynthesis, and is associated with plant growth and development (Wang et al., 2018). In Pinus massoniana, the expression levels of GGPS presented a substantially linear distribution when plotted against their corresponding oleoresin yields (Chen et al., 2018).
To further examine the evolutionary relationships of these six genes in plants, molecular phylogenetic trees were conducted based on their protein sequences, separately. The results showed that DXS, DXR, HMGR, and IPPI have the same unrooted topology, and the 15 amino acid sequences were divided into three distinct monophyletic groups. This indicates that these genes originate from gene duplication after species divergence (Paszek & Górecki, 2016;Baldauf, 2003;Krushkal et al., 2003;Kim et al., 2009). However, in the dicotyledon clade, the DXS and IPPI genes of Arabidopsis thaliana branched apart from other dicotyledon sequences. This may result from incomplete lineage sorting (Chan, Ranwez & Scornavacca, 2017). The FPS and GGPS genes exhibited relatively anomalous unrooted trees. The four monocotyledon FPS genes were divided into different branches, and the Zea mays GGPS gene was clustered into the dicotyledon clade. This may partly be due to the numerous unique mutations that can arise from rapidly evolving sequences (Degnan, 2013). Multiple sequence alignment analyses and conserved motif prediction results showed that the GGPS and FPS protein sequences were less conserved compared to the other four protein sequences, and for the DXS and DXR protein sequences, multiple identical motifs were discovered in different locations. This may be because upstream genes are expected to face stronger selective constraints and be more pleiotropic (Yang, Zhang & Song, 2009;Lu & Rausher, 2003).

CONCLUSION
Transcriptome analyses of secondary xylem from loblolly pine were performed using RNA-Seq technology. A total of 74,402 unigenes were assembled from 29,559,842 highquality reads, and 31,586 unigene sequences were annotated using five public protein databases. Based on the results of functional annotation and previous studies, 372 unigenes of terpenoid biosynthesis-related enzymes were identified. From these, six representative genes (DXR, DXS, HMGR, IPPI, GGPS, and FPS) were selected for multiple sequence alignment analyses, conserved motif prediction, phylogenetic analyses, and expression profile analyses. The results showed that all six genes exhibited high expression levels during the high-oleoresin-yielding phase. The protein sequences of DXS and DXR were more conserved and pleiotropic than the other sequences. In addition, most of these genes originated from gene duplication after the species divergence. Taken together, our results provide an extensive perspective of regulatory factors involved in oleoresin biosynthesis of the loblolly pine, as well as a portfolio of candidate genes for future study. Further research should focus on exploring the patterns of evolutionary rates among genes of the terpenoid biosynthetic pathway.