Data of the first de novo transcriptome assembly of the inflorescence of Curcuma alismatifolia

Curcuma alismatifolia is an Asian crop of the Zingiberaceae family, widely used as an ornamental plant in the floriculture industry of Thailand and Cambodia. Varieties with a wide range of colors can be found in this species. To date, few breeding programs have been conducted on this species, and most commercially important cultivars are hybrids that are propagated vegetatively. Unlike for other flowering plants, transcriptome-based data on the functions of genes related to flower color are still lacking for C. alismatifolia. The raw data presented in this article comprise new transcriptome data for two cultivars of C. alismatifolia, generated with Illumina HiSeq 4000 RNA-Seq technology; this is the first such report for this plant. The data are accessible via the European Nucleotide Archive (ENA) under project number PRJEB18956.



Value of data
The data, obtained using an Illumina sequencer, are the first transcriptome data for this species and can be useful for breeders of other ornamental gingers.
The data presented here can be used by other researchers to identify differentially expressed genes (DEGs) and pathways that may play a significant role in putative gene discovery.
Further analysis of these data can support the development of specific simple sequence repeat (SSR) and single nucleotide polymorphism (SNP) markers for phylogenetic analysis in breeding programs of the genus Curcuma.

Data
This dataset provides inflorescence transcriptome data for two cultivars of Curcuma alismatifolia, 'Chiang Mai Pink' and 'UB Snow 701', which have purple and white bracts, respectively. The data were generated on the Illumina HiSeq 4000 platform from polyA-enriched cDNA libraries prepared from extracted total RNA.

Experimental design, materials and methods
The rhizomes of two cultivars of C. alismatifolia were obtained from the Curcuma Nursery (Ubonrat), Thailand. Rhizomes were grown in a screen house at Field 2, Universiti Putra Malaysia, Malaysia. The inflorescences of the two cultivars were harvested at the full-bloom stage and immediately stored at −80 °C until RNA extraction.
Total RNA was isolated from the purple and white bracts of the inflorescences using a modified TRIzol method [1]. The concentration and purity of the isolated RNA were determined using a NanoDrop 2000 (Thermo Fisher Scientific Inc.). RNA quality was verified by electrophoresis on a 1.5% agarose gel. The two total RNA samples were sent to the Beijing Genomics Institute (BGI) (Shenzhen, China) for the construction of cDNA libraries using mRNA fragments as templates, according to the manufacturer's instructions. The two samples were sequenced on the Illumina HiSeq 4000 system.
After sequencing, raw reads were first filtered to remove low-quality reads, adaptor-contaminated reads, reads with a high content of unknown bases (N), empty reads, and non-coding RNA (such as rRNA, tRNA and miRNA), yielding clean reads. The clean reads were stored in FASTQ format [2]. A total of 131.93 Mb of good-quality reads was obtained after the removal of low-quality reads. The clean reads were assembled de novo using Trinity (v2.0.6) [3], and transcripts of 200 bp and above were retained for further analysis (Table 1).
Notes to Table 1. N50: a weighted median statistic such that 50% of the total length is contained in transcripts greater than or equal to this value. GC (%): the percentage of G and C bases in all transcripts. Q20: the rate of bases whose quality is greater than 20.
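The three summary statistics defined for Table 1 can be illustrated with a minimal Python sketch. This is not the pipeline used to produce the dataset; the function names (`n50`, `gc_percent`, `q20_rate`) are hypothetical, and the Phred+33 quality encoding is an assumption that should be checked against the actual FASTQ files. The Q20 threshold follows the definition given here (quality strictly greater than 20).

```python
def n50(lengths):
    """Return the length L such that transcripts of length >= L
    contain at least 50% of the total assembled bases."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    return 0  # empty input


def gc_percent(sequences):
    """Percentage of G and C bases over all transcript sequences."""
    total = sum(len(seq) for seq in sequences)
    if total == 0:
        return 0.0
    gc = sum(seq.upper().count("G") + seq.upper().count("C") for seq in sequences)
    return 100.0 * gc / total


def q20_rate(quality_strings, offset=33):
    """Percentage of bases whose Phred quality is greater than 20.

    `offset=33` assumes Phred+33 encoded quality strings (an assumption).
    """
    total = 0
    passed = 0
    for qual in quality_strings:
        for char in qual:
            total += 1
            if ord(char) - offset > 20:
                passed += 1
    return 100.0 * passed / total if total else 0.0


if __name__ == "__main__":
    # Toy inputs for illustration only; these are not values from the dataset.
    print(n50([200, 300, 500, 1000]))
    print(gc_percent(["GGCC", "ATAT"]))
    print(q20_rate(["II##"]))
```

With real data, the transcript lengths and sequences would come from the assembled FASTA file, and the quality strings from every fourth line of the clean-read FASTQ files.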