De novo transcriptome analysis and glucosinolate profiling in watercress (Nasturtium officinale R. Br.)

Watercress (Nasturtium officinale R. Br.) is an aquatic herb species that is a rich source of secondary metabolites such as glucosinolates. Among these glucosinolates, watercress contains high amounts of gluconasturtiin (2-phenethyl glucosinolate) and its hydrolysis product, 2-phennethyl isothiocyanate, which plays a role in suppressing tumor growth. However, the use of N. officinale as a source of herbal medicines is currently limited due to insufficient genomic and physiological information. To acquire precise information on glucosinolate biosynthesis in N. officinale, we performed a comprehensive analysis of the transcriptome and metabolome of different organs of N. officinale. Transcriptome analysis of N. officinale seedlings yielded 69,570,892 raw reads. These reads were assembled into 69,635 transcripts, 64,876 of which were annotated to transcripts in public databases. On the basis of the functional annotation of N. officinale, we identified 33 candidate genes encoding enzymes related to glucosinolate biosynthetic pathways and analyzed the expression of these genes in the leaves, stems, roots, flowers, and seeds of N. officinale. The expression of NoMYB28 and NoMYB29, the main regulators of aliphatic glucosinolate biosynthesis, was highest in the stems, whereas the key regulators of indolic glucosinolate biosynthesis, such as NoDof1.1, NoMYB34, NoMYB51, and NoMYB122, were strongly expressed in the roots. Most glucosinolate biosynthetic genes were highly expressed in the flowers. HPLC analysis enabled us to detect eight glucosinolates in the different organs of N. officinale. Among these glucosinolates, the level of gluconasturtiin was considerably higher than any other glucosinolate in individual organs, and the amount of total glucosinolates was highest in the flower. This study has enhanced our understanding of functional genomics of N. officinale, including the glucosinolate biosynthetic pathways of this plant. Ultimately, our data will be helpful for further research on watercress bio-engineering and better strategies for exploiting its anti-carcinogenic properties.

In this study, we used an Illumina NextSeq500 sequencer to analyze the transcriptome of N. officinale seedlings and generated 69,570,892 raw reads that were assembled into 69,635 transcripts. The N. officinale transcriptome showed highest species similarity and annotation ratio to Arabidopsis thaliana. From the transcriptome data, we identified several candidate genes that encode enzymes related to glucosinolate biosynthetic pathways. To validate the spatial distribution of glucosinolate-related genes, we analyzed the expression of glucosinolate biosynthesis genes and transcription factors in different organs of N. officinale using quantitative real-time RT-PCR. Metabolite profiling using HPLC-UV analysis identified eight different glucosinolates in the different organs of N. officinale, and the total glucosinolate contents were found to be highest in flowers. Among the eight identified glucosinolates, the level of gluconasturtiin was considerably higher than that of any other glucosinolate, irrespective of the organ. Taken together, the data obtained from this comprehensive transcriptomic and metabolomic profiling will provide an invaluable resource for a better understanding of glucosinolate biosynthetic pathways, as well as strategies for exploiting the anti-carcinogenic properties in N. officinal.

Plant material and growth conditions
Nasturtium officinale seeds were purchased from Asia Seeds Co., Ltd (Seoul, Korea) and grown under field conditions at the experimental greenhouse of Chungnam National University (Daejeon, Korea). Different organs were harvested from mature plants at approximately 2 months after sowing. The samples were immediately frozen in liquid nitrogen and then stored at -80°C for RNA isolation or freeze-dried for subsequent analysis by high performance liquid chromatography (HPLC).

Illumina sequencing of the transcriptome
Total RNA was isolated from frozen seedlings of N. officinale using the RNeasy Mini Kit (Qiagen, USA) and cleaned by ethanol precipitation. We removed rRNAs in total RNA using the ribo-zero rRNA removal kit (Epicentre, RZPL11016) and constructed a cDNA library for RNA sequencing using the TruSeq stranded total RNA sample prep kit-LT set A and B (Illumina, RS-122-2301 and 2302) according to the manufacturer's protocols (Illumina, San Diego, CA, USA). The cDNA library was sequenced in 76 bp length paired-end (PE) reads in an Illumina Next-Seq500 sequencer (Illumina Inc., San Diego, CA, USA) to produce 69,570,892 raw sequencing reads.

De novo assembly and annotation of the watercress transcriptome
The quality-trimmed reads of watercress RNAs were assembled as contigs of the watercress transcriptome using the Trinity software package (http://trinityrnaseq.github.io/) [43]. The Trinity program combines the overlapping reads of a given length and quality into longer contig sequences without gaps. The characteristic properties, including N50 length, average length, maximum length, and minimum length of the assembled contigs were calculated using Transrate software (http://hibberdlab.com/transrate) [44]. We clustered the watercress transcriptome contigs based on sequence similarity using CD-HIT-EST software (http:// weizhongli-lab.org/cd-hit) [45]. To infer the biological functions of watercress transcripts, we performed a homology search of the transcripts in the various public protein and nucleotide databases. A BLASTX search was performed using the National Center for Biotechnology Information (NCBI) (http://blast.ncbi.nlm.nih.gov) nr and Clusters of Orthologous Group (COG) (http://www.ncbi.nlm.nih.gov/ COG) protein databases, BRAD (http://brassicadb.org/ brad) Brassica rapa protein database, TAIR (TAIR10, http://www.arabidopsis.org) Arabidopsis thaliana protein database, and the EBI Swiss-Prot (UniProt) database. A BLASTN search was performed using the NCBI nucleotide database. The best scored hit from the BLASTX and BLASTN results passed the cutoff of e-value < 10 −5 and was selected for annotation of query transcripts for each database search. Transcript lists and sequences are presented in Additional files 1 and 2. The functional category distributions of watercress transcripts in terms of Gene Ontology (GO) and COG were evaluated using the results of the homology search. COG functional category information attached to the hit COG proteins was used for determining COG functional category distribution, and GO information attached to the hit UniProt proteins was collected and re-analyzed using the WEGO tool (http:// wego.genomics.org.cn) [46] in terms of the level for the three GO categories.

Differentially expressed gene analysis
To quantify watercress transcript expressions, we aligned preprocessed quality-trimmed reads on the watercress transcript sequences and calculated the expression values with the aligned read counts for each transcript. Bowtie2 (http://bowtie-bio.sourceforge.net/bowtie2) software [47] was used to align the quality-trimmed reads on the transcript sequences, and eXpress (http://bio.math.berkeley.edu/eXpress) software [48] was used to evaluate gene expression, in terms of fragments per kilobase of exon per million mapped fragments (FPKM), from the aligned results. The FPKM method provides a comparison between genes within a sample or between samples by normalizing the amount of sequencing for samples and gene length bias during gene expression evaluation.
Identification of candidate genes related to glucosinolate biosynthetic pathways We searched for candidate genes involved in glucosinolate biosynthetic pathways using functional annotation data based on the orthologous gene names. In addition, the glucosinolate biosynthetic genes of Arabidopsis obtained from TAIR were used as queries to search for homologous sequences in the watercress transcriptome database. Following this, each sequence was confirmed by the BLAST program in the NCBI GenBank database.

Quantitative real-time RT-PCR
For quantitative real-time RT-PCR, gene-specific primer sets were designed for each gene using the Primer3 website (http://frodo.wi.mit.edu/primer3/). Real-time RT-PCR was performed in a CFX96 real time system (BIO-RAD Laboratories, USA) using 2x Real-Time PCR Smart mix (BioFACT, Korea) under the following conditions: 95°C for 15 min, followed by 40 cycles of 95°C for 15 s, annealing for 15 s at 56°C, and elongation for 20 s at 72°C. PCR products were analyzed using Bio-Rad CFX Manager 2.0 software. Gene expression was normalized to that of the UBC9 gene, used as a housekeeping gene. The results of the real-time RT-PCR assay were calculated as the mean of three different biological experiments using seeds and different plant organs (all leaves, stems, roots, and flowers) of individual plants. Real-time RT-PCR product sizes and primer sequences are shown in Additional file 3: Table S1.

Results
Sequencing and de novo assembly of the N. officinale transcriptome As shown in Fig. 1, watercress can grow to a height of 50 to 120 cm and has slender hollow stems and small round leaves (Fig. 1a). Tiny white flowers are formed in clusters and become small pods containing two rows of seeds (Fig. 1b). To provide an overview of the N. officinale transcriptome, we performed RNA-sequencing analysis of N. officinale seedlings using the Illumina NextSeq500 platform (Fig. 1c). After removal of adaptor sequences, 69,570,892 reads comprising a total of 5,287,387,792 nucleotides were obtained for assembly (Table 1). These reads were assembled using the Trinity program, resulting in 123,433 contigs with an average length of 724 nt and an N50 of 994 nt. After clustering with CD-Hit-EST, the contigs were assembled into 69,635 transcripts with a mean size of 681 nt and N50 of 930 nt. The size distribution of the transcripts exhibited the following pattern: 25.51% (17,770) of the transcripts were less than 300 nt, 55.38% (38,564) of the transcripts ranged from 300 to 1000 nt in length, 18.09% (12,603) of the transcripts ranged from 1000 to 3000 nt, and 1.0% (698) were more than 3000 nt in length (Additional file 4: Figure S1).

Functional annotation and classification of N. officinale transcripts
For functional annotation, the transcripts were identified based on the BLASTX algorithm (available at the NCBI website) against the non-redundant (NR) protein database and nucleotide (NT) database with an E-value cutoff of 1 x 10 −5 ( Table 2). Of the total 69,635 transcripts, 57,550 transcripts (82.65%) had BLAST hits to known proteins in the NR database and 61,020 transcripts (87.63%) had BLAST hits to nucleotides in the NT database. In addition, some transcripts were aligned to public databases, including 46   The E-value distribution of the transcripts in the NR databases showed that 54.6% of aligned transcripts had strong similarity with an E-value <1e-60, whereas the remaining 45.4% of the homologous sequences ranged from 1e-5 to 1e-60 (Fig. 2a). The similarity distribution in the NR database showed that 81% of the sequences had a similarity higher than 80.7 and 19.3% of the sequences had a similarity lower than 80% (Fig. 2b). In the species distribution, the N. officinale transcriptome showed 20.5% similarity with that of Arabidopsis thaliana, with lower similarities to other species, including Camelina sativa (19.4%), Arabidopsis lyrata (18.6%), Eutrema salsugineum (12.9%), Capsella rubella (12.8%), Brassica napus (5.7%), Arabis alpina (4.1%), Brassica rapa (2.4%), and others (3.6%) (Fig. 2c). Most BLAST hits (approximately 96.4%) were to sequences from the Brassicaceae family. The N. officinale transcriptome showed highest species similarity and annotation ratio to A. thaliana, which is an important plant model species.
Arabidopsis is a member of Brassicaceae family such as N. officinale and contains 25,498 genes encoding proteins from 11,000 families [51]. Arabidopsis and N. officinale have similar morphology and significant sequence homology, indicating the correlation between mouse-ear cress and watercress.
Expression analysis of glucosinolate-related genes in different organs of N. officinale Brassica rapa has 102 putative glucosinolate genes, which are orthologs of 52 glucosinolate genes in A. thaliana. The homologous glucosinolate genes in B. rapa and A. thaliana share 59%-91% nucleotide sequence identity [52]. To identify the expression of genes that encode enzymes related to the glucosinolate biosynthetic pathways, we analyzed the N. officinale transcriptome dataset. On the basis of the functional annotation of the N. officinale transcriptome, we found that seven glucosinolate transcription factors and 26 glucosinolate biosynthetic genes were highly similar to those of species belonging to the Brassicaceae such as A. thaliana, B. oleracea, and B. rapa (Table 3). e Amino acid transport and metabolism; f Nucleotide transport and metabolism; g Carbohydrate transport and metabolism; h Coenzyme transport and metabolism; i Lipid transport and metabolism; j Translation, ribosomal structure and biogenesis; k Transcription; l Replication, recombination and repair; m Cell wall/membrane/ envelope biogenesis; n Cell motility; o Post-translational modification, protein turnover, chaperones; p Inorganic ion transport and metabolism; q Secondary metabolite biosynthesis, transport and catabolism; r General functional prediction only; s Function unknown; t Signal transduction mechanisms; u Intracellular trafficking, secretion, and vesicular transport; v Defense mechanisms; w Extracellular structures; x Phage-derived proteins, transposases and other mobilized components; y Nuclear structure; z Cytoskeleton   and seeds of N. officinale by quantitative RT-PCR (Fig. 5). The expression of NoMYB28 and NoMYB29 was highest in the stems, which is consistent with the transcript levels of BrMYB28, BrMYB29-2, and BrMYB29-3 in the stems of B. rapa [53]. NoMYB34, NoMYB51, NoMYB122, and NoDof1.1 were more strongly expressed in the roots compared with other organs. Finally, the highest expression of NoIQD1-1, which is involved in both aliphatic and indolic glucosinolate biosynthesis, was observed in leaves. Most glucosinolate biosynthetic genes were more highly expressed in the flowers compared with the leaves, stems, roots, and seeds. However, NoMAM1, NoMAM3, NoCYP83A1, NoGSTU20, NoST5c, and NoFMO GS-OX2, which are involved in aliphatic glucosinolate biosynthesis, had the highest expression levels in stems, roots, leaves, seeds, roots, and leaves, respectively (Fig. 6). In addition, among the indolic glucosinolate biosynthetic genes, the highest expression levels of NoCYP79B3, NoGSTF10, NoCYP81F3, and NoPEN2 were observed in roots (Fig. 7).

Analysis of glucosinolate content in different organs of N. officinale
In HPLC analysis, we identified eight different glucosinolates in the different organs of N. officinale; glucoiberin, glucotropaeolin, 4-hydroxyglucobrassicin, glucosiberin, glucohirsutin, glucobrassicin, 4-methoxyglucobrassicin, and gluconasturtiin ( Table 4). The levels of these glucosinolates distributed over the different organs of N. officinale  (Table 5). The amount of total glucosinolates was highest in the flower, 6.1, 3.0, 2.3, and 1.2 times higher than that in the root, stem, leaf, and seed, respectively. Among the eight glucosinolates, the level of gluconasturtiin was considerably higher than any other glucosinolate, irrespective of the organ. In particular, the gluconasturtiin content in the flower and seed was considerably higher than that in other organs. The content of gluconasturtiin in the flower was 9.8, 2.9, 2.2, and 1.3 times higher than that in the root, stem, leaf, and seed, respectively. The content of glucotropaeolin was also highest in the flower, with concentrations 12.0, 4.5, and 2.3 times higher than that in the stem, leaf, and root, respectively. The second highest level of total glucosinolates was observed in the seed. The seed contains higher amounts of glucoiberin, glucosiberin, and glucohirsutin than the other organs of N. officinale. The level of glucoiberin was 7.9, 5.6, 3.3, and 1.8 times higher in the seed than in the root, leaf, stem, and flower, respectively. The content of glucosiberin was highest in the seed, with levels 8.6, 7.8, 6.8, and 1.4 times higher than those in the stem, root, leaf, and flower, respectively. The amount of glucohirsutin was highest in the seed, being 9.6, 7.6, 6.5, and 1.6 times higher than that in the stem, leaf, root, and flower, respectively. Although the total glucosinolate content was lowest in the root, the amount of 4-hydroxyglucobrassicin was highest in the root, with levels 36.8, 32.7, 26.7, and 2.9 times higher than that in the seed, stem, leaf, and flower, respectively. The root also contained the highest amount of glucobrassicin.

Discussion
Despite the health-benefiting importance and economical value of watercress, there is still limited genomic and   [27][28][29][30][31]. Intensive research on the relationship between shikimate pathway genes and glucosinolate biosynthetic genes in watercress will enhance our understanding of functional genomic approach, including glucosinolate biosynthetic pathways.
Many indole glucosinolate biosynthetic genes are specifically expressed at highest levels in the roots and flowers. In N. officinale, the accumulation patterns of indole glucosinolates, such as 4-hydroxyglucobrassicin and glucobrassicin, coincide with the expression patterns of the genes related to these indole glucosinolates. In contrast, the accumulation patterns of aliphatic glucosinolates did not coincide with the expression pattern of aliphatic glucosinolate-related genes in N. officinale. Most of the aliphatic glucosinolate biosynthetic genes were more highly expressed in the flowers compared with the leaves, stems, roots, and seeds, whereas the contents of aliphatic glucosinolates, such as glucoiberin, glucosiberin, and glucohirsutin, were relatively higher in the seed. Several genes involved in the regulation of glucosinolate biosynthetic pathways and external stimulations could be linked to accumulate the glucosinolate contents [54,55]. Although there are many reasons for this discordance, the shift in developmental stage from flower to seed might possibly explain the discrepancy between gene expression pattern and metabolite content.

Conclusions
In RNA sequencing analysis using an Illumina Next-Seq500 sequencer, we identified a total 69,635 transcripts and annotated 64,876 transcripts, which provide basic information for further research on the secondary metabolites in N. officinale. Our transcriptome data reveal that several genes encoding enzymes related to glucosinolate biosynthetic pathways are well conserved in N. officinale and that these genes have high similarity to those in other cruciferous plants such as Arabidopsis thaliana, Brassica rapa, and Camelina sativa.
On the basis of our gene expression study and HPLC analysis, we identified that most glucosinolate biosynthetic genes are highly expressed in flowers and that the content of total glucosinolates was also higher in flowers than in other organs, indicating a positive correlation between the expression of glucosinolate-related genes and glucosinolate Total glucosinolates were measured in 2-month-old N. officinale (μg g −1 dry weight). Each value represents the mean of three replicates and error bars are SDs