Transcriptome Analysis of Yellow Horn (xanthoceras Sorbifolia Bunge): a Potential Oil-rich Seed Tree for Biodiesel in China

Background: Yellow horn (Xanthoceras sorbifolia Bunge) is an oil-rich seed shrub that grows well in cold, barren environments and has great potential for biodiesel production in China. However, the limited genetic data means that little information about the key genes involved in oil biosynthesis is available, which limits further improvement of this species. In this study, we describe sequencing and de novo transcriptome assembly to produce the first comprehensive and integrated genomic resource for yellow horn and identify the pathways and key genes related to oil accumulation. In addition, potential molecular markers were identified and compiled.


Introduction
Due to the crises of fossil fuel depletion and worsening global environmental conditions, oil-rich seed plants that can be used to produce renewable and environmentally friendly biodiesel have received much attention [1][2][3]. A variety of vegetable oils that are obtained from rapeseed (canola), soybean, sunflower, peanut, safflower, palm and Jatropha, among others, have been used to produce biodiesel. However, the great majority of these plants are grown on farmland and are used as cooking oil. Under these circumstances, in some developing countries that have limited per capita arable land, the use of food crops to produce biodiesel is not realistic. Therefore, biodiesel production from non-food crops that can be planted in areas that are unsuitable for traditional crops is an ideal solution to this problem [4][5][6].
Yellow horn (Xanthoceras sorbifolia Bunge.) is an oil-rich seed shrub that belongs to the Sapindaceae family and has a life span of more than 200 years [7,8]. The seeds of yellow horn contain abundant oil (55-70%), of which 85-93% is unsaturated fatty acids [9,10]. According to previous studies, the molecular composition of yellow horn oil is similar to the ideal fatty ester structure for biodiesel [11,12]. Therefore, it has been identified as a major biodiesel tree species and the Chinese Government provided special support to aid its development because it can produce over 800 gallons of oil per acre of cultivation [13,14]. Unlike other energy-resource trees, such as palm and Jatropha, that cannot survive low temperatures, it can not only grow well in barren, salty and drought soil, it can also survive temperatures as low as 230 to 241uC. In addition, the yellow horn tree has many other uses, including multiple entries in the Chinese Pharmacopoeia, it can assist in eliminating desertification and erosion, and it is grown as an ornamental tree and used as a source of high-level woody natural oil for cooking [15]. During the last decade many studies have examined yellow horn seed oil; however, these focused on the extraction of oil and methods of biodiesel production from the seed oil [10,[16][17][18]. Unlike other oil crops, such as rapeseed, soybean and Jatropha, no genome-level studies have attempted to determine the oil synthesis metabolic pathway, which could be used to improve the seed yield and oil content. Although conventional breeding strategies continue to play an important role in crop improvement, genetic engineering methods are more rapid and precise, and allow the specific redesigning of crops for target characteristics [19]. For non-model plants, such as yellow horn, for which little or no molecular information is available, nextgeneration sequencing (NGS) technologies provide a ready means of obtaining genetic information [20][21][22]. The advent of NGS, such as RNA-Seq, in recent years has created unprecedented opportunities for generating genomic information in previously uncharacterised systems. NGS facilitates rapid, inexpensive and comprehensive analyses of complex genomes due to the collection of large-scale sequence data that can be used for gene discovery [23], expression profiling [24], molecular marker development [25] and functional, comparative and evolutionary genomics studies [26]. To date, the transcriptomes of a large number of plants, including many oil crops such as palm [27], peanut [28], sesame [29], safflower [30], rape [31] and Jatropha [32], have been analysed using NGS.
In this study, we report the results of using Roche 454 RNA-seq technology, which can generate sufficiently long sequence reads [33], to analyse the yellow horn transcriptome, which was derived from a pooled sample of DNA from multiple tissue types (buds, leaves, flowers and seeds). The analysis included functional annotation of the transcripts, identification of unigenes that are involved in oil biosynthesis and metabolism, and the discovery of a series of molecular markers (SSRs, SNPs and InDels). These transcripts represent the first yellow horn sequence dataset. We believe the data will open new perspectives for improving and selecting elite yellow horn varieties to produce a greater quantity of high-quality biodiesel.

Transcriptome Assembly
One and a quarter plates of pyrosequencing reactions were conducted using a 454 GS FLX titanium platform. Approximately 600 Mb of data from 1,221,677 raw reads with a GC content of 43.7% were produced; the read lengths ranged from 23 to 1,478 bp with an average length of 491.1 bp and a median length of 537 bp. After SeqClean was used to cut the adaptors and SMART primers and LUCY2 were used to remove low-quality regions and bases, 88.43% of bases were retained and 1,147,624 trimmed reads (GC content = 43.6%) were generated, with total and average lengths of 530. 6 Figure 1C).

Functional Classification of Unigenes by COG, GO and KEGG
To further evaluate the completeness of our transcriptome library and the effectiveness of our annotation process, we used the annotated unigene sequences to search for genes involved in COG classifications, GO assignments and KEGG pathway assignments to predict and classify their functions.
Of the unigenes, 26,643 were assigned into the three main GO functional categories and then into 50 sub-categories ( Figure 3). For the three main categories, 23,978 unigenes (90.03%) were assigned into the largest category, molecular function, followed by biological process (22,187,83.28%) and cellular component (15,012,56.35%). The biological process category was assigned into 25 sub-categories; the most abundant was ''metabolic process'', which contained 18,891 unigenes (36.42% of the total), indicating that these genes were enriched in the yellow horn transcriptome libraries. The cellular components category was divided into 11 small groups; the largest sub-category was ''cell'', which included 11,855 (22.86% of the total) unigenes. For molecular function, 23,978 unigenes were categorised into 14 GO terms; the majority fell into ''binding'' and ''catalytic activity'', which contained 17,084 and 15,045 unigenes, respectively.

Unigenes Related to FA Biosynthesis
In oil plants, fatty acids are stored as a form of TAG and their biosynthesis pathway can be divided into three steps [34]. The first step is de novo biosynthesis of fatty acids; this process occurs in plastids and is catalysed mainly by the fatty acid synthase complex (FAS). The second step is the synthesis of triacylglycerol (TAG), which occurs in the endoplasmic reticulum (ER), and the last is the formation of oil bodies (OBs), where TAG is combined with oleosin to form an oil body, which is released from the ER into the cytoplasm [35,36].
According to the KEGG pathway assignment and functional annotation of the unigenes, 40 unigenes were annotated as encoding ten key enzymes involved in FAs biosynthesis ( Table 3). The reconstructed pathway of FAs biosynthesis was based on these identified enzymes ( Figure 4). First, acetyl-CoA carboxylase (ACCase, EC: 6.4.1.2), as a rate-limiting enzyme in the FAs biosynthesis pathway, catalyses acetyl-CoA to form malonyl-CoA [37]; 16 unigenes that encode its four subunits were identified (four for a-carboxyltransferase, three for b-carboxyltransferase, six for biotin carboxylase and three for biotin carboxyl carrier protein). Next, a series of condensation reactions of malonyl-CoA with a growing ACP-bound acyl chain are catalysed by FAS, consecutively adding two carbon units per cycle over six or seven cycles to form 16:0-ACP or 18:0-ACP, which can then be catalysed by acyl-ACP desaturase (AAD, EC: 1.14.19.2) to form 16:0-ACP or 18:1-ACP [38]. Nineteen unigenes that encoded the five components of FAS were found, including one that encoded malonyl-CoA-ACP transacylase ( 2), free FAs are released from the acyl carrier protein (ACP). Two unigenes that encoded FATA, three that encoded FATB and one that encoded PCH which is also involved in FA elongation were identified in this study.
In addition, 16 unigenes that encode long-chain acyl-CoA synthetases (LACS), which catalyse the esterification of free FAs to CoA upon arrival in the cytoplasm [39], and seven that encoded acyl CoA binding protein (ACBP), which binds medium-and longchain acyl-CoA esters with very high affinity and may function as an intracellular carrier of acyl-CoA esters [40], were also identified based on the functional annotation of the transcriptome.
As has been reported previously, overexpression of ACCase, a crucial enzyme in fatty acid synthesis, can alter the fatty acid composition of seeds and increase the fatty acid content, which would lead to an increased oleic acid content and seed yield [41,42]. Most research on FAS has concentrated on ACP and KAS. Functional expression of an ACP from Azospirillum brasilense in Brassica juncea can improve the content of 18:1 and 18:2 in seeds, enhance the ratio of monounsaturated (18:1) to saturated fatty acids, increase the ratio of 18:2 to 18:3 and reduce the erucic acid content (22:1) [43]. In plastids, three types of KAS are found: KAS I, KAS II and KAS III. KAS II catalyses 16:0-ACP to elongate to 18:0-ACP and KAS III condenses acetyl-CoA with malonyl-ACP to form 4:0-ACP. Using hairpin RNAi to reduce the activity of KAS II can lead to an increase in 16:0 accumulation, up to about 53% of the total, but some transgenic offspring are deformed during early embryonic development [44]. Overexpression of KAS III can also improve 16:0 accumulation, but the rate of lipid synthesis is reduced [45]. In our study, we did not find any unigenes that encoded KAS I, which is highly active with acyl-ACP with chain lengths from C2 to C14, is far less effective for 16:0-ACP and almost inactive for 18:0-ACP [46].

Unigenes Related to TAG and OBs Biosynthesis
In the suggested pathway for TAG biosynthesis [47,48], a total of 33 unigenes that encode six enzymes were found. As the data in Table 4 and Figure 4 show, initially, one unigene that encodes glycerol kinase (GK, EC: 2.7.1.30), which catalyses glycerol to form glycerol-3-phosphate (G-3-P), an initial substrate in the Kennedy pathway, was detected. Then, 12 unigenes that encode the key component of TAG biosynthesis, glycerol-3-phosphate acyltransferase (ATS1 and GPAT, EC: 2.3.1.15) (three for ATS1, one for GPAT5, three for GPAT6, two for GPAT8 and three for GPAT9), which catalyses the first step of the Kennedy pathway, and 11 unigenes that encode lysophosphatidyl acyltransferase (LPAT, EC: 2.3.1.51) were identified (four for LPAT1, two for LPAT2, three for LPAT4 and two for LPAT5). Under catalysis by these two enzymes, sequential esterification of acyl chains from acyl-CoA to the positions of sn-1 and sn-2 of G-3-P occur to form lysophosphatidic acid (Lyso-PA) and phosphatidic acid (PA), respectively. The next reaction is catalysed by phosphatidate phosphatase (PP, EC: 3.  (Table 4 and Figure 4).
Overexpression of a plastidial safflower GPAT and an Escherichia coli GPAT in Arabidopsis can improve the seed oil content, with average increases of 22 and 15%, respectively [49]. Similarly, substantial increases of 8 to 48% in seed oil content and increases in both overall proportions and amounts of very-long-chain fatty acids in seed TAGs were obtained by overexpression of a mutant form of yeast LPAT in Arabidopsis and Brassica napus [50]. DGAT is a key enzyme regulating the rate of the Kennedy pathway. It has four types, and we detected DGAT1 in this study. Ectopic expression of DGAT will improve the oil content in seeds, which has been confirmed in Arabidopsis, soybean and maize [51][52][53]. In addition, compared to the single unigene that encoded GPAT1, the finding of four unigenes that encoded PDAT1 in our study provides further evidence that yellow horn has the potential to channel fatty acids that are incorporated in membrane lipids, such as PC, into TAG biosynthesis.
After biosynthesis, pools of TAGs can be stored as a form of OB surrounded by a membrane composed of a layer of phospholipids embedded with several proteins: oleosin, caleosin and steroleosin in mature seeds [35,54]. According to the annotations of the unigenes, nine encoded oleosin, seven encoded caleosin and four encoded steroleosin (Table S2). Oleosin is the most abundant structural protein in OBs; it helps stabilise OBs through increased space bit resistance and charge repulsion, preventing fusion of OBs [35,55]. Caleosin is not only involved in the synthesis and metabolism of OBs, but may also be associated with plant drought tolerance [55,56]. Steroleosin-like proteins may represent a class of dehydrogenases/reductases that are involved in plant signal transduction regulated by various sterols [57]. In conclusion, the detection of unigenes that are involved in oleosin, caleosin and steroleosin biosynthesis will contribute to future functional studies and improvements in production levels by metabolic engineering of yellow horn.

Unigenes Related to Catabolism Pathways for TAGs and FAs
The complete breakdown of TAGs can be divided into two steps [60]. First, TAGs are metabolised to free FAs; in other words, lipases catalyse the hydrolysis of ester bonds that link fatty acyl chains to the glycerol backbone. During this research, three unigenes that encode triacylglycerol lipase (TGL, EC: 3.1.1.3), which releases fatty acids and intermediate products (DAG or monoacylglycerol) from TAG or DAG, were identified in the transcriptome library (Table 4). In the second step, fatty acids are catabolised to acetyl-CoA, allowing them to be further broken down by oxidation or to follow other metabolic pathways, including re-esterification with glycerol, to form new acylglycerols [61]. Based on the KEGG pathway assignment, we identified 83 unigenes that code for enzymes related to fatty acid catabolism; three key enzymes were acyl-CoA oxidase (ACOX, EC: 1.  (Table 3). Acetyl-CoA generated through fatty acid catabolism is then used to produce energy for the cell via the citrate cycle or may participate in the synthesis of TAG.
TAG and FA catabolism proceeds in a direction opposite that of their synthesis. Therefore, identifying ways to suppress enzymes involved in TAG and FA catabolism may be another method of increasing the accumulation of lipids under conditions that do not affect plant growth. However, suppressing the expression of TGL increases TAG levels but results in severely stunted growth [62].
To identify TFs that regulate seed oil synthesis, BLASTx was used to search against the AGRIS (Arabidopsis Gene Regulatory Information Server) database with an e-value cut off 10 25 [66]. The results showed that 3,341 unigenes were annotated with 905 independent coding sequences of Arabidopsis TFs belonging to 49 known TF families. However, none of the unigenes were annotated to the AtRKD family (Table S3). The largest number of unigenes (817) was annotated to the Trihelix family, followed by the C2H2 family (519 unigenes). TF genes in the PHD, CAMTA and JUMONJI families were identified in the yellow horn transcriptome.

Identification of Molecular Markers Located in Unigenes Related to Oil Biosynthesis and Metabolism
Simple sequence repeats (SSRs), single nucleotide polymorphisms (SNPs) and insertions and deletions (InDels) are valuable tools for genetic analysis because they are highly polymorphic within species. They can be used for molecular marker-assisted selection (MAS), which is a rapid approach to the development of new crop varieties (particularly in perennials and trees) that reduces the assessment time considerably [67,68], association mapping to find genes related to good agronomic characteristics [69], and polymorphism analysis, among other techniques. Molecular markers located in genes will likely be related to the functions of those genes. Some markers that are located in coding regions related to oil synthesis have been developed and applied [70,71].
To increase the authenticity of SNP and InDel identification, we also filtered the results based on stricter multiple criteria, including read depth and allele frequency, compared with previous studies (see materials and methods). In total, 16,925 SNPs distributed across 4,401 different isotig groups that corresponded to 4,234 different unigenes had a total length of 6.10 Mb (Figure 5), resulting in an SNP occurrence rate of 0.003 per base and four SNPs per unigene. Transitions contained 9,225 (54.51%) SNPs and were primarily transversions (7,700 SNPs, 45.49%). The A/G and C/T transition genotypes had similar percentages, but among the four transversion genotypes, C/G transversions were less frequent than the other three (A/T, G/T, A/C). A total of 6,201 InDels were identified from 3,161 isotig groups that corresponded to 3,094 unigenes, with a total length of 4.41 Mb (Figure 5), which indicates there were two variations per unigene. Insertions (2,862, 46.15%) were slightly less common than deletions (3,339, 53.85%). The length of insertions ranged from 1 to 24 bp, and deletions were 1 to 22 bp in length.
Because we used larger samples from different plant regions and 1.25 runs to obtain the transcriptome, we found 29,833 (SSRs,  SNPs and InDels) potential molecular markers in 9,720 unigenes (18.74%) with a total length of 10.37 Mb, for an average of 3.07 markers per unigene and spaces of 347.58 bp between markers. These markers will play significant roles not only in the production of genetically improved varieties of yellow horn with different oil compositions, increased oil yield and improved agronomic characteristics, but also for studying the evolution and origin of Xanthoceras, and even the Sapindaceae.

Conclusions
In this study, we report the first comprehensive yellow horn sequencing effort using 454 GS FLX. Transcriptome analysis using four tissues (buds, leaves, flowers and developing seeds) of yellow horn found 51,867 unigenes (45 to 10,088 bp), which corresponded to 36.1 Mb and mean, N50 and median lengths of 696, 928 and 570 bp, respectively. These unigenes provide a strong basis for future genomic research to develop microarrays for gene expression assays and can serve as a reference transcriptome for future yellow horn RNA-seq experiments. In addition, 281 unigenes that code for key enzymes and TFs that are involved in reconstructed metabolic pathways for FA and TAG biosynthesis and metabolism were identified. Moreover, a large number of potential molecular markers (6,707 SSRs, 16,925 SNPs and 6,201 InDels) were predicted. Among them, 26 SSRs,194 SNPs and 60 InDels were identified in 74 unigenes that are related to oil biosynthesis and metabolism. These findings will make a substantial contribution to efforts to improve crop characteristics and will accelerate the breeding of new yellow horn varieties.

Collection of Tissues for RNA Extraction
Yellow horn is widely distributed in China, so it has not been listed as an endangered or protected species. In this study, we used yellow horn buds, leaves (young and mature leaves), flowers, and developing seeds (10,20,30,40,50,60, and 70 days after pollination) which were collected from 30 plants at the two locations. One is located in Xishan forest farm in Haidian, Beijing, China (E116u049, N40u039). This forest farm belongs to Beijing Forestry University and the yellow horn trees in this farm have been used for scientific research for several years. The other one is located in a small hill in Chengde city, Hebei province, China (E117u559, N40u599). The yellow horn trees which growing in this location were planted for scientific research by Dr. Ao Yan, a co-authors of this research. These samples were immediately frozen in liquid nitrogen and stored at 280uC. Because our research team is also engaged in the work of conservation and utilization of wild plant resources, we confirm that the field studies did not involve any endangered or protected species.

cDNA Library Construction and 454 Sequencing
Total RNA was extracted separately from the buds, leaves, flowers and developing seeds using RNeasy Plant Mini Kits (Qiagen, Inc., Valencia, CA, USA) following the manufacturer's protocol. Extracted RNA was qualified and quantified using a Nanodrop ND-1000 Spectrophotometer (Nanodrop Technologies, Wilmington, DE, USA) and all the samples showed a 260/ 280 nm ratio from 1.9 to 2.1. Poly(A) + RNA was purified from total RNA by Oligotex mRNA Mini Kit (Qiagen, Inc., Valencia, CA, USA) following the manufacturer's protocol. After that, equal quantities of total RNA from buds, leaves, flowers and seeds were mixed together. A total of 10 mg of total RNA was used for cDNA library construction. cDNA library construction and normalisation were performed using protocols described previously [72]. The resulting library was sequenced

Analysis of 454 Transcriptome Sequencing Results
The raw reads were trimmed before assembly. First, adaptors and SMART primers that were used in the pyrosequencing reactions were cut using the SeqClean software. Then, we used the LUCY2 software [73] to remove low-quality regions and bases. Trimmed reads that were shorter than 45 bp were discarded and the remaining reads were assembled into isotigs and singlets using Newbler (version 2.6) [40]. Finally, after clustering the isotigs and singlets using CD-HIT (version 4.5.6) [74], the obtained unigenes were used in further analyses.
To understand their functions, the yellow horn unigenes were annotated using BLASTx alignment with an E-value cut-off of 10 25 against the following protein databases: NCBI nonredundant (NR), Swiss-Prot, Conserved Domain Database (CDD), Pfam protein families database (Pfam), UniProtKB/ TrEMBL Protein Database (TrEMBL), Clusters of Orthologous Groups of proteins (COG), Gene Ontology (GO), Kyoto Encyclopaedia of Genes and Genomes (KEGG), and The Arabidopsis Information Resource (TAIR). GO functional classifications and KEGG pathway assignments were performed, as was described previously [75].

Detection of SSRs, SNPs, InDels and TFs
To detect simple sequence repeats (SSRs) in the yellow horn transcriptome, the MISA (http://pgrc.ipk-gatersleben.de/misa/) software was used to identify all 2-6-bp motifs in the unigenes. The minimum repeat unit size was set at six for di-nucleotides and five for tri-, tetra-, penta-, and hexa-nucleotides.
The ssahaSNP software tool [76] was used to identify SNPs and InDels (1 to 100 bp in size) that had high coverage depths. To add the polymorphism and reliability of the SNPs and InDels, as in previous studies [26,77,78], we kept only SNPs and InDels that met the following strict quality criteria: (1) For SNP detection, we identified SNPs that had coverage depths of at least 20 and an alternate allele that was present at a minimum frequency of 20% in all isotigs that contained at least 20 reads. (2) For InDel detection, similar to the standard for SNPs, for insertions or deletions of one base, the coverage depth of the isotig and the InDel allele frequency were set at 20 and 20%, respectively. For insertions or deletions of two or more bases, the coverage depth of the isotigs and the InDel allele frequency were set at 10 and 10%, respectively.
To identify TFs, a BLAST search for all unigenes was conducted using AtTFDB (Arabidopsis transcription factor database) with shared identities .77%, as described previously [23].

Data Deposition
The Roche 454 reads of yellow horn were deposited in the NCBI and can be accessed in the Short Read Archive (SRA) under accession number SRP026671.