Identification of putative flowering genes and transcription factors from flower de novo transcriptome dataset of tuberose (Polianthes tuberosa L.)

Polianthes tuberosa is commercially popular because of its economic importance in floriculture, for cut and loose flowers, and in the perfume industry because of its unique fragrance. Despite this commercial importance, no ready-to-use transcript sequence information is available in public databases. We sequenced RNA obtained from tuberose flowers on the Illumina HiSeq 2000 platform and carried out a de novo analysis of the transcriptome data. The de novo assembly generated 11,100 transcripts, representing a total of 7876 unigenes that were considered for downstream analysis. These 7876 unigenes were annotated using Blast2GO and assigned to metabolic pathways using the Kyoto Encyclopedia of Genes and Genomes (KEGG) database to determine their biochemical functions. Of the tuberose transcripts, 4591 matched genes in KEGG pathways and 66 were mapped to the flavonoid biosynthesis pathway. Twenty-one flowering genes were identified in this tuberose transcriptome. Transcription factor analysis identified a large number of transcripts similar to key genes in the flowering regulation network of Arabidopsis thaliana. Among the transcription factors identified, NAC, which is associated with plant stress response, was the most abundant category, followed by APETALA2 (AP2)/ethylene-responsive element binding proteins (EREBPs), which play various roles in floral organ identity and respond to different biotic and abiotic stresses.


Subject area: Plant Biotechnology and Bioinformatics
More specific subject area: Transcriptome
Type of data: Data is with this article, and the raw sequence data generated has been deposited in the SRA database (http://www.ncbi.nlm.nih.gov/bioproject/321962) for public access (BioSample accession ID: SAMN05006898).

Value of the data
This is the first report of de novo transcriptome analysis of the Polianthes tuberosa flower. Tuberose transcripts were assigned to KEGG pathways, and flowering genes and transcription factors were successfully identified from the transcriptome data.
Transcriptome data will provide a strong foundation for research on gene expression, genomics and functional genomics in Polianthes tuberosa and other important members of Amaryllidaceae.
The data generated in this work add information on a plant that had no genomic information in the public domain and should also help in studies of other economically important plants of the same family, such as daffodils, snowflakes, onions and garlic.
The data will help in better understanding expression patterns and their relation to function and regulation, as well as the genetic mechanisms of tuberose and its evolutionary relationships with other plants.
This transcriptomic analysis opens up prospects for a better understanding of tuberose genomics and updates the current gene resource.

Data
In spite of its considerable industrial importance, genomic information on tuberose is very scarce. There are no public Expressed Sequence Tags (ESTs) or ready-to-use transcripts for Polianthes tuberosa. This is the first time that high-throughput RNA sequencing (RNA-Seq) of the P. tuberosa flower transcriptome has been carried out to generate a database that will be useful for further functional analyses. An overview of the sequencing and assembly of the P. tuberosa transcriptome data is presented in Table 1. The length distribution of unigenes is shown in Fig. 1. The BLAST search returned significant hits against the reported datasets for 79.76% (6282) of the unigenes. When the annotation was considered by species, the greatest similarity was to Elaeis guineensis, followed by Phoenix dactylifera, both of which are monocotyledons (Fig. 2).
Using gene ontology, 1446 ESTs were classified under the cellular component category, 2521 ESTs under biological process and 1493 ESTs under molecular function. A summary of the number and percentage of unigenes annotated in each GO slim term is shown in Fig. 3. According to the data, 4122 unique sequences were classified into 24 COG categories (Fig. 4). KEGG Orthology (KO) identifiers for the unigenes were retrieved (Supplementary Tables S1a and S1b). We identified 21 unigenes that showed homology to Arabidopsis thaliana flowering genes (Table 2). Analysis of transcription factors in tuberose revealed a total of 511 unigenes, representing 6.48% of the transcriptome, classified into 59 putative transcription factor (TF) families (Supplementary Table S2; Fig. 6).
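The proportions quoted above can be re-derived from the counts in the text. The short sketch below assumes that both percentages use the 7876 unigenes as the denominator; the constant and function names are illustrative only. Under that assumption the BLAST hit rate reproduces the reported 79.76%, while the transcription factor share evaluates to about 6.49% when rounded to two decimals, which suggests that the 6.48% in the text is a truncated value.

```python
# Counts quoted in the text. Using 7876 as the denominator for both
# percentages is an assumption made for this sketch.
TOTAL_UNIGENES = 7876
BLAST_HITS = 6282
TF_UNIGENES = 511


def percent(part, whole):
    """Return part/whole expressed as an unrounded percentage."""
    return 100.0 * part / whole


blast_pct = percent(BLAST_HITS, TOTAL_UNIGENES)
tf_pct = percent(TF_UNIGENES, TOTAL_UNIGENES)

print(f"Unigenes with significant BLAST hits: {blast_pct:.2f}%")
print(f"Unigenes classified as transcription factors: {tf_pct:.2f}%")
```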

Plant material
Fully opened tuberose flowers of cultivar Shringar were collected, immediately frozen in liquid nitrogen and stored at −80 °C.

RNA extraction, cDNA library construction and sequencing
Total RNA was extracted from frozen flower tissues using the Nucleospin RNA isolation kit (Macherey-Nagel GmbH & Co. KG, Düren, Germany). An Agilent 2100 Bioanalyzer (Agilent Technologies) was used to assess the quality and quantity of RNA. Only RNA with an RNA integrity number (RIN) of 8.0 was considered for mRNA purification. Oligo-dT beads (Illumina® TruSeq® RNA Sample Preparation Kit v2) were used to purify mRNA from one microgram of total RNA. The purified mRNA was fragmented at elevated temperature (90 °C) in the presence of divalent cations. cDNA synthesis was done using random hexamers with Superscript II Reverse Transcriptase (Invitrogen Life Technologies). Agencourt Ampure XP SPRI beads (Beckman-Coulter) were used to clean the cDNA. Illumina adapters were ligated to the cDNA molecules after end repair and the addition of an 'A' base, followed by SPRI clean-up. The resultant cDNA library was amplified by PCR to enrich adapter-ligated fragments, quantified using a Nanodrop spectrophotometer (Thermo Scientific) and validated for quality with a Bioanalyzer (Agilent Technologies). The libraries were then sequenced on the Illumina HiSeq 2000 platform at the SciGenom Next-Gen sequencing facility, Cochin, India.

Sequence data assembly and analysis
NGSQC Toolkit version v2.3.3 [1] was used to remove low quality reads (Phred score < 30) and to generate sequencing statistics. The high quality, filtered paired-end reads obtained (15.9 Gb) were used for de novo assembly with the Velvet (v.1.2.08) and Oases (v.0.2.08) pipeline [2]. Velveth assembly was run over a range of k-mer values (71-83), and the optimal assembly was attained at k-mer 83. The Oases tool was used to identify non-overlapping isoforms/splice variants at a minimum transcript length of 100. Since our initial target was to identify unique genes, transcripts were clustered using CD-HIT-EST [3] at 90% similarity. The ORF Predictor web server (http://bioinformatics.ysu.edu/tools/OrfPredictor.html) [4] was used to predict proteins from all the non-redundant transcripts (≥ 100 bp) using the default cut-off value of 1e-5, and the 7876 proteins predicted were considered for annotation. The raw sequence data generated has been deposited in the SRA database (http://www.ncbi.nlm.nih.gov/bioproject/321962) for public access (BioSample accession ID: SAMN05006898).
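To illustrate what Phred-based read filtering with a threshold of 30 involves, the following minimal sketch decodes FASTQ quality strings and keeps a read pair only when both mates pass. It is not the NGSQC Toolkit implementation: the function names, the Phred+33 encoding and the rule that at least 70% of bases must reach the threshold are all assumptions made for this example.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_string]


def is_high_quality(quality_string, min_phred=30, min_fraction=0.7):
    """Return True if at least min_fraction of bases reach min_phred.

    min_fraction is an illustrative assumption, not a value from the text.
    """
    scores = phred_scores(quality_string)
    if not scores:
        return False
    passing = sum(1 for q in scores if q >= min_phred)
    return passing / len(scores) >= min_fraction


def filter_pairs(pairs):
    """Keep a read pair only if both mates pass the quality check.

    Each mate is a (sequence, quality_string) tuple.
    """
    return [
        (r1, r2)
        for r1, r2 in pairs
        if is_high_quality(r1[1]) and is_high_quality(r2[1])
    ]


# Toy reads (made up): 'I' encodes Phred 40 and '#' encodes Phred 2.
toy_pairs = [
    (("ACGTACGT", "IIIIIIII"), ("TTGCAAGC", "IIIIIII#")),
    (("ACGTACGT", "IIIIIIII"), ("TTGCAAGC", "########")),
]
kept = filter_pairs(toy_pairs)
print(f"{len(kept)} of {len(toy_pairs)} pairs kept")
```

Discarding the whole pair when either mate fails keeps the two read files synchronised, which paired-end assemblers generally expect.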

Functional annotation and biological classification of transcripts
Functional annotation of the predicted tuberose transcripts was performed using the Blast2GO pipeline with default settings [5]. BLASTP [6] was performed with an E-value cut-off of 1e-5 against the NCBI non-redundant (nr) protein database for the homology search. BLAST results (XML format) were imported into Blast2GO v.3.0.11. GO annotation was performed with default settings, after which an InterProScan [7] search was run and its results were merged with the GO annotations.
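As a rough illustration of how an E-value cut-off of 1e-5 selects queries with significant hits from BLAST XML output, the sketch below parses a small, made-up XML fragment using only the Python standard library. The element names follow the classic BLAST XML layout; the query identifiers, hit descriptions and the function name are hypothetical, and this is not the Blast2GO import routine.

```python
import xml.etree.ElementTree as ET

# Toy BLAST XML in the classic layout; all identifiers and values are made up.
TOY_XML = """<BlastOutput>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_query-def>unigene_0001</Iteration_query-def>
      <Iteration_hits>
        <Hit>
          <Hit_def>hypothetical protein A</Hit_def>
          <Hit_hsps><Hsp><Hsp_evalue>3e-40</Hsp_evalue></Hsp></Hit_hsps>
        </Hit>
      </Iteration_hits>
    </Iteration>
    <Iteration>
      <Iteration_query-def>unigene_0002</Iteration_query-def>
      <Iteration_hits>
        <Hit>
          <Hit_def>hypothetical protein B</Hit_def>
          <Hit_hsps><Hsp><Hsp_evalue>0.02</Hsp_evalue></Hsp></Hit_hsps>
        </Hit>
      </Iteration_hits>
    </Iteration>
    <Iteration>
      <Iteration_query-def>unigene_0003</Iteration_query-def>
      <Iteration_hits></Iteration_hits>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>"""


def queries_with_significant_hits(xml_text, max_evalue=1e-5):
    """Map each query to its best E-value if that value is <= max_evalue."""
    root = ET.fromstring(xml_text)
    significant = {}
    for iteration in root.iter("Iteration"):
        # Read the query name from the first matching child element.
        query = next(iteration.iter("Iteration_query-def")).text
        evalues = [float(e.text) for e in iteration.iter("Hsp_evalue")]
        if evalues and min(evalues) <= max_evalue:
            significant[query] = min(evalues)
    return significant


hits = queries_with_significant_hits(TOY_XML)
total_queries = 3  # number of Iteration elements in the toy input
print(f"{len(hits)} of {total_queries} queries have a significant hit")
```

Queries without any hit, or whose best hit exceeds the cut-off, are simply left out of the returned mapping, so the annotated fraction is the size of the mapping divided by the number of queries.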