Transciptome analysis reveals flavonoid biosynthesis regulation and simple sequence repeats in yam (Dioscorea alata L.) tubers

Yam (Dioscorea alata L.) is an important tuber crop and purple pigmented elite cultivar has recently become popular because of associated health benefits. Identifying candidate genes responsible for flavonoid biosynthesis pathway (FBP) will facilitate understanding the molecular mechanism of controlling pigment formation in yam tubers. Here, we used Illumina sequencing to characterize the transcriptome of tubers from elite purple-flesh cultivar (DP) and conventional white-flesh cultivar (DW) of yam. In this process, we also designed high quality molecular markers to assist molecular breeding for tuber trait improvement. A total of 125,123 unigenes were identified from the DP and DW cDNA libraries, of which about 49.5% (60,020 unigenes) were annotated by BLASTX analysis using the publicly available protein database. These unigenes were further annotated functionally and subject to biochemical pathway analysis. 511 genes were identified to be more than 2-fold (FDR < 0.05) differentially expressed between the two yam cultivars, of which 288 genes were up-regulated and 223 genes were down-regulated in the DP tubers. Transcriptome analysis detected 61 unigenes encoding multiple well-known enzymes in the FBP. Furthermore, the unigenes encoding chalcone isomerase (CHS), flavanone 3-hydroxylase (F3H), flavonoid 3′-monooxygenase (F3’H), dihydroflavonol 4-reductase (DFR), leucoanthocyanidin dioxygenase (LDOX), and flavonol 3-O-glucosyltransferase (UF3GT) were found to be significantly up-regulated in the DP, implying that these genes were potentially associated with tuber color formation in this elite cultivar. The expression of these genes was further confirmed by qRT-PCR. Finally, 11,793 SSRs were successfully identified with these unigenes and 6,082 SSR markers were developed using Primer 3. This study provides the first comprehensive transcriptomic dataset for yam tubers, which will significantly contribute to genomic research of this and other related species. Some key genes associated with purple-flesh trait were successfully identified, thus providing valuable information about molecular process of regulating pigment accumulation in elite yam tubers. In the future, this information might be directly used to genetically manipulate the conventional white-fleshed tuber cultivars to enable them to produce purple flesh. In addition, our SSR marker sets will facilitate identification of QTLs for various tuber traits in yam breeding programs.


Background
Yam (Dioscorea alata L.) is an important tuber crop valued for its dietary carbohydrate, amino acids and essential minerals. It is widely cultivated in tropical and subtropical regions. Most D. alata tubers have white flesh, but occasionally, purple-flesh tubers with high anthocyanidin content are produced because of spontaneous variation [1]. Anthocyanins are responsible for the deep purple to red pigmentation of various flowers, fruits, leaves, and other plant tissues [2]. Anthocyanins are perhaps the best characterized flavonoids with studies indicating their important role in plant physiology, in particular, plant defense against herbivores and pathogens. They have also been shown to have multiple health benefits for humans including immunomodulatory, anticancer, cardio-protective, vasodilation, antithrombotic, and UV-protection due to their antioxidant, and antiinflammatory properties [3]. Therefore, it is no surprise that the purple-flesh yam tubers have recently been selling at a premium price owing to consumer awareness about its health benefits. The current study is aimed at understanding how spontaneous variation leads to anthocyanin synthesis in certain yam strains. Understanding the molecular mechanism of triggering anthocyanin biosynthesis and accumulation in these strains makes it possible to transfer the purple pigment trait to conventional white-flesh cultivar, thus improving the tuber quality and market value.
In the past decade, the flavonoid biosynthesis pathway (FBP) has been well characterized genetically and biochemically in model and non-model plants [4,5]; and a number of genes encoding important enzymes and transcription factors responsible for the FBP have been cloned from a dozen organisms such as Arabidopsis [6], grapevine [7], Petunia [8], and maize [9]. Nevertheless, the complicated mechanism that controls anthocyanin catabolism in different plant species and tissues is far from conclusive. It is reasonable to expect that the loss or accumulation-of-color adaptations in yams are relatively unconstrained because they can be obtained in various ways [10], and affected by multiple intracellular factors such as co-pigmentation [11], pH [12] and metal-chelation in vacuoles [13]. The promotion or suppression of any one of the enzymes catalyzing a series of reactions that make up a pathway will change its final product. For example, Chen et al. [14] found that for independent events causing accumulation of red pigments in variegated peach flowers, a particular subset of genes (C4H, CHS, CHI and F3H) were enhanced and coregulated in the FBP. Lou et al. [15] also revealed that the loss of delphinidin (blue pigment) resulted from the gene suppression of FLS and DFR in grape hyacinth.
Recently, transcriptomic analysis based on next generation sequencing (NGS) technology has emerged as an extremely powerful method for identifying novel genes associated with biosynthesis of various secondary metabolites in non-model plant species [16,17]. Specially, it has been widely applied to investigate molecular mechanisms of color variation in plant species such as blueberry [18], grape [15], Brassica Juncea [19], and potato [20]. In yam, transcription profiles of leaf tissues from one anthracnose susceptible (TDa 95-0310) and two resistant yam genotypes (TDa 87-01091, TDa 95-0328) were analyzed upon infection with the anthracnose fungus; a set of genes involved in defense against anthracnose were identified [21]. Anthocyanins are considered as an important quality trait in yam [22]. However, to date, no effort has been made to uncover molecular basis of different color formation in yam tubers by using RNA-Seq. A previous study by Zhou et al. [23] through RACE technology and RT-PCR analysis, reported that DaANS1 (a member of ANS genes in FBP) controls anthocyanin accumulation of purple-flesh tubers based on its regulation at transcription level. However, a limitation in this study was the use of single cultivar (purple-flesh tuber) and study of one gene (ANS). Without comparing the global transcriptional differences between the purple-flesh cultivar and conventional white-flesh cultivar, it is impossible to separate candidate genes related to color formation. Therefore, the molecular mechanism underlying the purple-flesh formation has not yet been fully understood in yam.
Further, being a non-model species, there is lack of genomic resources, in particular, information on SSR markers for marker-assisted breeding (MAS) of yam. Previous genetic inheritance study revealed that some important traits (such as resistance) are controlled by a single dominant locus in yam [24]. Genic-SSR markers appear to be tightly linked to specific gene functions and perhaps even play a direct role in controlling important traits [25,26]. Recent studies have demonstrated the variability in cultivated yam accessions in terms of tuber shape, color, taste and yield [27,28]. However, very limited knowledge is available on the genetic regions associated with these variations, in particular SSR or SNP markers for important genes [29,30]. Therefore, identification of SSR markers from yam transcriptome is crucial for the future of marker-assisted breeding programs.
Here, we report use of RNA-Seq to investigate the transcriptomic differences between yam tubers of a purpleflesh cultivar (DP) and conventional white-flesh cultivar (DW). Differentially expressed genes and their expression patterns were analyzed, and some potential candidate genes responsible for the FBP were successfully identified. We expect this genome-wide transcriptome comparison to provide a novel resource to understand the molecular mechanisms underlying the purple-flesh trait. Moreover, transcriptomic datasets were further exploited to identify a large number of gene-based SSR markers that enable linkage mapping and marker assisted breeding of yams.

Results and discussion
Sequencing statistics and assembly The variation in pigment expression of the purple-flesh cultivar and the conventional white-flesh cultivar of yam is shown in Figure 1. To characterize the transcriptome differences between the two cultivars, two cDNA libraries were prepared from their tubers and subjected to RNA-Seq analysis based on the Illumina HisSeq 2000 platform. After removing adaptors and reads of unknown or low-quality nucleotides, in total, 35,645,052 and 34,585,554 clean reads were respectively obtained from the DP and DW libraries. The information of all high-quality reads has been deposited in the Sequence Read Achieve (SRA) database under the accession ID SRX652481 for DP, and SRX652483 for DW. The highquality reads from the two libraries were subsequently de novo assembled into 125,123 unigenes using Trinity program; the size distribution of these unigenes is shown in Additional file 1. As a result, the in silico assembled unigenes ranged from 200 to 14,799 bp with an average length of 592 bp; the N50 value was 875 bp and total size was approximately 71.8 Mb. Furthermore, in order to estimate the efficiency of short-read usage during the de novo assembling, we mapped our RNA-Seq reads to the assembled unigenes using TopHat analysis package. A total of 29,676,058 and 29,022,640 sequences from DP and DW library respectively were matched (~80%) ( Table 1), indicating that the set of assembled unigenes is applicable to carry out the downstream analysis.
Currently, the yam EST library found in Genbank database (http://www.ncbi.nlm.nih.gov/nucest/?term=Dios-corea+alata) contains 44,134 ESTs from leaves of three genotypes differing in resistance to anthracnose disease [21]. To estimate the level of transcript coverage in this study, we downloaded these ESTs from GenBank and compared them to our transcriptome unigenes using BLASTN (e ≤ 1.00 × 10 −7 ). Only 32.88% ESTs (14,512,) from GenBank matched to 23,874 unigenes (Additional file 2). This was probably associated with different tissues used for transcriptome analysis. It also highlights the high level of sequencing depth achievable through NGS compared to low coverage obtained using conventional cDNA library sequencing. Furthermore, 97,379 novel yam unigenes were discovered, some of which may be specifically expressed in yam tuber tissue. These novel unigenes may serve as a crucial genomic resource for future studies, such as gene identification, cloning and functional analysis.
Annotation, functional classification and KEGG pathway analysis of the unigenes To acquire the most informative and complete annotation, all assembled unigene sequences were matched against the NCBI non-redundant protein (NR), the Arabidopsis thaliana protein dataset of NR (ATNR), Gene Ontology (GO), and the Kyoto Encyclopedia of Genes and Genomes (KEGG) by BLASTX (e ≤ 1.00 × 10 −5 ). Out of the 125,123 unigenes, 60,020 (49.5%) represented significant match to genes encoding proteins or putative function in at least one of these public databases (Table 2), whereas 50.5% unigenes could not be annotated to predicted coding regions with unknown functions in other species. In comparison with previous publications for yam and other non-model plant species [31,32], the low rate of annotated unigenes indicated that assembled unigenes, particularly sequences without a significant homologous hit, were potentially novel gene sequences not yet reported in other crops.
For unigene sequences in the NR annotations, Blast search analysis further revealed that a total of 11,115, 3,542, 3,304, 2,943, 2,859 unigenes respectively matched with the sequences from Vitis vinifera, Oryza sativa, Populus trichocarpa, Zea mays, and Ricinus communis with the highest homology ( Figure 2A). Similar distributions were also observed for yam in previous study [21]. Moreover, the identifying distribution pattern showed that 19.02% of the sequences had a similarity higher than 80%, while 71.68% showed a moderate similarity (40-80%), and the remaining 9.29% showed a lower similarity (18-40%) ( Figure 2B).
When describing the properties of genes and their products, their functional classification is the most important. In this study, GO functional analysis was performed using Blast2GO to characterize all assembled unigenes; the result is shown in Figure 3. A total of 43,594 unigenes were classified into 51 functional terms, including 23 terms in biological process, 14 terms in cellular component, and 14 terms in molecular function. Within biological process, "cellular process" (GO:0009989) with 28,608 unigenes and "metabolic process" (GO:0008152) with 27,564 unigenes were predominant. Under the cellular component, the "cell" (GO:0005623, 25,988 unigenes), "cell part" (GO:0044464, 25,988 unigenes), and "organelle" (GO:0043226, 15,029 unigenes) represented the majority of this category. Similarly, for molecular function, the terms of "binding" (GO:0005488, 32,229 unigenes) and "catalytic activity" (GO:0003824, 25,042 unigenes) were the most abundant assigned terms. These GO annotations demonstrated that the unigenes expressed in yam tuber encode diverse structural, regulatory and stress response proteins.
Furthermore, all annotated unigene sequences were matched against Cluster of Orthologous Groups (COG) database to predict and classify possible functions. Out of 58,439 NR hits, a total of 23,633 sequences with COG annotations were assigned into 25 COG categories ( Figure 4). The "general function prediction only" category represented the largest group (4,747), followed by "posttranslational modification", "protein turnover and chaperones" (2,011), "signal transduction mechanisms" (1,992), "replication, recombination and repair" (1,898) and "Transcription" (1,869), and only one sequence was assigned into extracellular structures.
In addition, to identify which metabolic pathways were enriched, a pathway-based analysis was conducted through the KEGG pathway database using BLASTX with an E-value cutoff of <10 −5 . In total, 24,289 unigenes were assigned to 279 KEGG pathways (Additional file 3). The result reveals that metabolic pathway (Ko01100) was the most enriched (4,064 unigenes), followed by biosynthesis of secondary metabolites (Ko1110; 1,687 unigenes) and microbial metabolism in diverse environments (Ko1120; 1,020 unigenes). The focus of this study was differential anthocyanin accumulation in the DP cultivar. Therefore, genes associated with two secondary metabolic pathways including flavonoid biosynthesis, flavone and flavonol biosynthesis were separately analyzed. A total of 61 genes were found to be directly or indirectly involved in flavonoid biosynthesis, and they were mapped and highlighted in this pathway (Ko00941) (Additional file 4). In contrast, relatively few genes (25) were found to encode key enzymes in the flavone and flavonol biosynthesis (Ko00944) (Additional file 5). Overall, these findings provide useful information to further uncover the molecular mechanism of anthocyanin accumulation in yam tuber.     expressed in both cultivars. This indicated that some unique genes may play an important role in the accumulation of purple pigment. Based on the false discovery rate (FDR) ≤ 0.05, and fold change (FC) ≥ 1, 511 DEGs were identified from the two libraries, among which 288 genes were up-regulated, and 223 genes were down-regulated in DP versus DW. For a detailed comparison, see Additional file 6. There were more up-regulated genes than down-regulated ones, suggesting that many genes were positively regulated for biosynthesis of anthocyanins. Similar results were also reported in other species [19,20]. Annotation of differentially expressed unigenes revealed that 433 unigenes were grouped into 45 GO groups while the remaining 78 unigenes could not be classified (not shown). The most common categories were "intracellular part" (27 up-regulated and 17 down-regulated) and "protein binding" (24 up-regulated and 17 down-regulated), followed by "cellular macromolecule metabolic process", and "intracellular organelle".

Identification of candidate genes associated with the flavonoid biosynthesis pathway
Flavonoids are a class of important secondary metabolites including hydroxycinnamic acids, isoflavones, flavonols, phlobaphenes, pro-anthocyanidins and anthocyanins. In our annotated yam transcriptome, multiple unigenes of encoding almost all known enzymes associated with biosnythesis of anthocyanin and its derivatives in the FBP were identified (Table 3, Figure 5).
primary precursor (chalcone) and lead compounds (DHK, DHQ) of all flavonoids. Similar results were also observed during the differential pigment deposition in potato tubers [20], peach and grape flowers [14,15].
On the other hand, in the downstream of FBP, three up-regulated genes (DFR, LDOX, UF3GT) also play a critical role during formation of colored anthocyanins. We found that DFR and LDOX unigenes were not expressed in the white-flesh tubers, whereas two LDOX unigenes were expressed at high levels in the purpleflesh tubers (Additional file 6). In addition, the upregulated glycosyltransferase (UF3GT) can potentially make structural modifications to anthocyanins. Two anthocyanins (cyanidin-3-O-glucoside and pelargonidin-3-O-glucoside) in the purple-flesh tuber are glycosylated at the 3-postion of the C-ring by this enzyme. Similar results were also reported in previous studies [35][36][37]. For instance, two glycosyltransferases, UGT79B1 and UGT84A2 were found to cause high levels of anthocyanin modifications (3-O-glucosylated anthocyanidins) in Arabidopsis flavonoid biosynthesis [35], whilst anthocyanins were drastically reduced in the UGT79B1 and UGT84A2 knockout mutants. Besides, the O-methyltransferase is one of the most important modification reactions of flavonoids and the resulting O-methylated flavonoids have been shown to display new biological activities [38]. In this study, the down-regulated O-methyltransferase (FOMT) was assigned to code quercetin-3-O-methyltransferase protein and may have redundant function in the FBP. Taken together, these results indicate that key genes responsible for the FBP have a higher expression level in the purple-flesh tubers of yam. This finding is an important explanation of well-known higher antioxidant activity in pigmented tissues found in a number of tissues.

Identification of genes associated with transcription factors (TFs)
Besides structural genes, it is well known that transcription factors play an essential role in regulating the overall activity of flavonoid biosynthesis. In most species, the anthocyanin branch within the FBP is controlled by a ternary complex of MYB-bHLH-WD40 TFs [5,39], which generally regulate expression of many structural genes. In our transcriptome database, a total of 183, 146, 95 unigenes were respectively predicted to code bHLH, MYB, and WD40 proteins including a large number of its members. Of these genes, the transcriptomic analysis detected four TFs that were differentially expressed between the two cultivars of yam tubers, including three WD40 repeat proteins with one up-regulated (uni-gene050252) and two down-regulated (unigene041043, unigene056944) in purple-flesh tubers. Further, one MYB4R1 protein (unigene029894) was also found to be up-regulated in purple-flesh tubers (Additional file 6).
The high variation in expression of structural genes associated with the FBP in the purple-flesh tubers may most likely be regulated by one or more of these TFs. However, the specific function of these TFs in the FBP of yams still needs to be validated using a functional genomics approach.

Gene validation and expression analysis
It was reported that several genes involved in the FBP showed special expression patterns in different species such as CHS, F3H, DFR, LDOX genes in Brassica juncea Seed Coat [19], Solanum tuberosum L. tuber [20], Carthamus tinctorius L flower [40], Magnolia sprengeri pamp flower [41]. Therefore, to experimentally confirm that the unigenes obtained in this study from transcriptome analysis were indeed differentially expressed, eight DEGs (Additional file 7) associated with the FBP were chosen for real-time quantitative PCR assay. The expression profiles of these unigenes are shown in Figure 6. Results showed that unigene 003987 (CHS), uni-gene005154 (F3H), unigene014794 and unigene004018 (F3'H), unigene004195 (DFR), unigene028912 and uni-gene017716 (LDOX), and unigene02509 (UF3GT) were up-regulated in the purple-fleshed tuber of yam, which was well consistent with those observed by transcriptome analylsis (Table 3, Figure 5). This result further confirms the reliability of RNA-seq analysis.

Identification of simple sequence repeats (SSRs) in yam
Usually, gene-derived SSRs are more transferable between species than random genomic SSRs. This is perhaps because they are associated with functional genetic variation, as opposed to non-coding SSRs, with presence in transcribed regions potentially influencing gene function, transcription or translation [26]. In this study, transcriptome analysis of the two yam tuber cultivars (DP and DW) led to identification of 11,793 SSRs within 121,253 unigenes, of which, 977 sequences contained more than 1 SSR, and 1706 SSRs were present in compound form. The observed frequency of unigenes was 8.6% (10,426); considering that approximately 71,983 kb total size was analyzed, and least one SSR per 6.1 kb could be detected in the expressed sequences of yam.
The motifs of 11,793 SSRs contained 5,788 (49.08%) dinucleotides, 5,582 (47.33%) trinucleotides, 335 (2.85%) tetranucleotides, 44 (0.37%) pentanucleotides and 44 (0.37%) hexaucleotides ( Table 4). The most abundant repeat type was AG/CT, followed by AAG/CTT, and AAAT/ATTT. Further, 6,082 SSR primer pairs were successfully designed using Primer 3. The details of the frequency of SSR motif and genic-SSR primers sequences (including designing parameters) are summarized in Additional file 8. Very recently, we revealed that Chinese yam species have rich genetic diversity and phenotype traits including stable tuber yield, taste, texture, and dry matter content [28]. In comparison with previous study using Roche 454 sequencing technology [21], a larger number of new genic-SSR markers were developed in this study, and they may be closely linked to these qualitative traits. In the future, these functional gene-based markers, will make it possible to construct a high density linkage map or association map for identification of quantitative traits loci (QTL) associated with tuber quality traits in yams.

Conclusions
The focus of this study was use of NGS-based Illumina paired-end sequencing platform to characterize the gene expression differences between an elite purple-flesh tuber and conventional white-flesh tuber of yam. A total of 125,123 unigenes were identified from the two cDNA libraries, which will contribute significantly to further genome-wide research and analyses of this species and other related species. Analysis of the transcriptome data revealed a number of candidate genes which are possibly involved in purple-flesh tuber formation. The candidates include not only structural genes such as CHS, F3H, F3'H, DFR, LDOX and UF3GT, but also some transcription factors (bHLH, MYB, and WD40) that potentially regulate development of purple-flesh in yam tubers. Such knowledge can be used to genetically enhance tuber color of conventional white-flesh cultivar. In addition, we also used transcriptomic data as a resource to develop new SSR markers. These marker sets will facilitate identification of quantitative traits loci (QTL) associated with yam tuber quality in future.

Plant materials
The elite purple-flesh tubers and the conventional white-flesh tubers of yam (D. alata) were cultivated in a yam producing region (Wenzhou city, Zhejiang province, China; 121°09′48.82″ E and 28°27′53.62 N). Both were planted at the same time and cultivated in similar conditions. Tubers were harvested 10 days after new tuber emergence (DAM) and used for transcriptome

Reads assembly and transcriptome annotation
The clean reads were obtained by read trimming of raw data by removing adaptors, reads in which unknown bases were more than 10%, low-quality reads with quality scores less than Q30, and low-quality bases less than (Q30) at the 3′ end. Next, the high-quality filtered reads were further assembled using a de novo assembly program Trinity (released 2011-05-19, http://trinityrnaseq.sourceforge.net/) with the main parameters "K-mer = 25, group_pairs_distance = 500, min_glue = 2, min_kmer_cov = 1" [42].
. Briefly, for each library (DP and DW), short reads were first assembled into longer contiguous sequences (contigs) according to their overlap regions, and then these reads were mapped back to the contigs based on their pairedend information. With paired-end reads it is possible to detect the contigs from the same transcript as well as the distances between these contigs. Afterwards, the contigs were further assembled, and the assembled sequences that could not be extended on either end and were defined as unigenes. Finally, the potential unigenes from DP and DW library were clustered using the TGICL clustering tool [43] to acquire a single set of non-redundant unigenes. In addition, to obtain assembly statistics profile about reads that could be mapped back to the assembled unigenes, TopHat (version 2.0.8) (released 2013-02-26, http:// tophat.cbcb.umd.edu/) [44] with the parameter "mateinner-dist = 250", was used to align short reads to the constructed transcripts by de novo assembling.
All assembled unigenes were annotated by matching against the NCBI non-redundant protein (NR), the Arabidopsis thaliana protein dataset of NR (ATNR), Gene Ontology (GO), and the Kyoto Encyclopedia of Genes and Genomes (KEGG) using the BLASTX analysis with a cut-off E-value of 10 −5 . Based on NR annotation, the Blast2GO software (version 2.3.5) was used to obtain GO annotations according to molecular function, biological process and cellular component ontologies (http://www.geneontology.org) [45]. The unigene sequences were subsequently matched against the COG database to predict and classify possible functions. The KEGG pathway annotation was also performed by comparison against the KEGG database using the online KEGG Automatic Server (KAAS) (http://www.genome.jp/ tools/kaas/) [46,47].

Differentially expressed genes (DEGs) between the DP and DW tubers
In order to assess the differential expression between the two investigated yam cultivars, TopHat (version 2.0.8) was first used to match against the assembled unigenes, which was followed by estimation of total mapped reads [44]. After the alignment, cufflinks (version 2.1.1) (released 2013-04-11, http://cufflinks.cbcb.umd.edu) was used to estimate the abundances of unigenes as Fragments Per Kilobase of transcript per Million fragments mapped (FPKM) [33]; and cuffdiff was carried out to perform pairwise comparisons between different investigated cultivars. Differentially expressed genes (DEGs) were further characterized and estimated using the R software module edge R (R v2.14; edgeRv 2.3.52) in term of the results from cufflinks [48]. False discovery rate (FDR) <0.05 and an estimated absolute log 2 fold-change (log 2 FC) ≥ 1 were used as threshold for determining significant difference in gene expression between the purple-and white flesh tubers of yam. Moreover, all DEGs were mapped to terms in the KEGG database and searched for KEGG terms to identify pathways related to purple-flesh trait in yam tubers.

qRT-PCR verification
Total RNA was extracted from the white and purple flesh-tubers of yam as described above. Approximately 2 mg of total RNA per sample was treated with DNaseI (Takara), and reverse transcribed into cDNA using