RNA-seq data from whole rice grains of pigmented and non-pigmented Malaysian rice varieties

Pigmented rice is enriched with antioxidants, macronutrients and micronutrients. A comprehensive investigation of gene expression patterns among pigmented rice varieties would help to explain the cellular mechanisms and biological processes of rice grain pigmentation. Hence, we performed RNA sequencing and analysis on whole grains of dehusked mature seeds from six selected Malaysian rice varieties with varying grain pigmentation. These varieties were black rice (BALI and Pulut Hitam 9), red rice (MRM16 and MRQ100) and white rice (MR297 and MRQ76). An Illumina HiSeq™ 4000 sequencer was used to generate approximately 53 Gb of raw nucleotides in total. From 353,937,212 total paired-end raw reads, 340,131,496 total clean reads were obtained. The raw reads were deposited in the European Nucleotide Archive (ENA) database and can be accessed via accession number PRJEB34340. This dataset allows us to identify and profile expressed genes with functions related to nutritional traits (i.e. antioxidants, folate and amylose content) and a quality trait (i.e. aroma) across both pigmented and non-pigmented rice varieties. In addition, the transcriptome data will be valuable for the discovery of potential gene markers and functional SNPs related to functional traits to assist rice breeding programmes.


© 2020 The Author(s).

Value of the data
• These RNA-seq data, obtained from six selected rice varieties, represent the first complete set of transcriptome data generated from rice varieties with varying grain pigmentation (black, red and white).
• This dataset allows us to discover functional genes related to rice grain pigmentation, nutritional and aromatic properties.
• These data permit comparative transcriptomics between pigmented and non-pigmented rice varieties. Differential gene expression profiles between varieties could help in understanding the molecular mechanisms and biological processes that are responsible for certain valuable rice traits.
• These RNA-seq data, together with rice genomic data, are important for the identification of functional markers such as single nucleotide polymorphisms (SNPs) and microsatellites related to nutritional and quality traits for future rice genetic improvement research.

Data description
The dataset in this article comprises RNA-seq raw reads from dehusked whole rice grains obtained from mature seeds of four pigmented (BALI, Pulut Hitam 9, MRM16 and MRQ100) and two non-pigmented (MR297 and MRQ76) rice varieties. Raw data generated on the Illumina HiSeq™ 4000 sequencer were deposited in FASTQ format in the ENA database (accession number: PRJEB34340). The accession number for each rice variety is given as the ENA run primary accession in Table 1. Sequencing statistics for each rice variety, including raw and clean reads and raw and clean nucleotides, are shown in Table 2. The quality of the clean reads was assessed and the percentage of high-quality clean reads was obtained. The number of mapped reads was estimated by mapping the clean reads to the Oryza sativa japonica cv. Nipponbare reference genome (Table 3). The Nipponbare genome was used for mapping because it is well assembled and well annotated; although a few indica rice cultivars have been sequenced, those genomes were not well annotated [3]. Additionally, transcript assembly against the reference genome with a threshold of FPKM ≥ 0.1 predicted the number of transcripts for each rice variety, as listed in Table 3.
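As a convenience for readers who want to locate the per-variety FASTQ files, the following minimal sketch builds a file-report query for the ENA Portal API and parses the tab-separated reply. It is an illustration rather than part of the original workflow: the helper names `build_filereport_url` and `parse_filereport` are hypothetical, and the choice of fields (`run_accession`, `sample_alias`, `fastq_ftp`) is an assumption about what a reader would want to list.

```python
import csv
import io
from urllib.parse import urlencode

# ENA Portal API endpoint for per-run file reports.
ENA_FILEREPORT = "https://www.ebi.ac.uk/ena/portal/api/filereport"


def build_filereport_url(accession, fields=("run_accession", "sample_alias", "fastq_ftp")):
    """Build a file-report URL for a study accession (hypothetical helper)."""
    query = urlencode({
        "accession": accession,
        "result": "read_run",
        "fields": ",".join(fields),
        "format": "tsv",
    })
    return f"{ENA_FILEREPORT}?{query}"


def parse_filereport(tsv_text):
    """Map each run accession to its list of FASTQ locations."""
    runs = {}
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        # Paired-end runs list both FASTQ files in one field, separated by ';'.
        runs[row["run_accession"]] = [f for f in row["fastq_ftp"].split(";") if f]
    return runs


url = build_filereport_url("PRJEB34340")
```

The resulting `url` can be fetched with any HTTP client (for example `urllib.request.urlopen`) and the decoded text passed to `parse_filereport`; the run accessions returned should correspond to those listed in Table 1.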

Plant materials, total RNA extraction and quality assessment of total RNA
Mature seeds of each pigmented and non-pigmented rice variety were obtained from field plots at MARDI Seberang Perai, Penang, Malaysia. The seeds were dehusked and the whole rice grain tissue (three plants of each variety) was ground into fine powder using liquid nitrogen. Total RNA extraction was performed using the MTL method [4] with modifications. A NanoDrop ND-1000 ultraviolet spectrophotometer (Thermo Scientific, Waltham, MA, USA) was used to quantify the isolated total RNA, and 1% (w/v) agarose gel electrophoresis was used to check for RNA degradation and contamination.

Library preparation and transcriptome sequencing
Messenger RNA was isolated from high-quality total RNA samples with RIN values ≥ 6.5 using oligo(dT) beads, and cDNA synthesis was performed using random hexamers and SuperScript II Reverse Transcriptase (Invitrogen, USA) according to the manufacturer's instructions. Second-strand synthesis by nick translation was then carried out using a custom second-strand synthesis buffer (Illumina) supplemented with dNTPs, RNase H and Escherichia coli polymerase I. The cDNA library was constructed after a round of purification, terminal repair, A-tailing, ligation of sequencing adapters, size selection and PCR enrichment. The cDNA library concentration was quantified using a Qubit 2.0 fluorometer (Life Technologies, USA), and the library was then diluted to 1 ng/μl before the insert size was checked on an Agilent 2100 Bioanalyzer (Agilent Technologies, USA). Paired-end sequencing of the cDNA fragments from the resulting libraries was performed on the Illumina HiSeq™ 4000 platform with a read length of 150 bp at each end.
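The 150 bp read length can be cross-checked against the reported totals with simple arithmetic. The sketch below assumes that the 353,937,212 raw reads reported in the abstract count individual reads (both mates of each pair) and that every raw read is the full 150 bp; under those assumptions the product comes to roughly 53 Gb, in line with the reported yield.

```python
TOTAL_RAW_READS = 353_937_212  # total paired-end raw reads reported in the abstract
READ_LENGTH_BP = 150           # read length at each end


def total_bases(n_reads, read_length_bp):
    """Total nucleotides, assuming every read has the same length."""
    return n_reads * read_length_bp


def to_gigabases(n_bases):
    """Convert a base count to gigabases (1 Gb = 1e9 bases)."""
    return n_bases / 1e9


raw_gb = to_gigabases(total_bases(TOTAL_RAW_READS, READ_LENGTH_BP))
print(f"Estimated raw yield: {raw_gb:.2f} Gb")
```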

Repository and processing of RNA-seq raw data
The raw sequencing reads were deposited in the European Nucleotide Archive (ENA) (https://www.ebi.ac.uk/ena) under accession number PRJEB34340. Table 1 shows the ENA run primary accession numbers of the individual pigmented and non-pigmented rice transcriptomes in the ENA database. The raw reads were subsequently filtered using Trimmomatic version 0.36 [5] to remove adapter sequences, contamination and low-quality reads. Table 2 shows the statistics of raw and clean reads for each rice transcriptome after sequence processing and analysis.
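To put the filtering step in perspective, the overall proportion of reads retained can be derived from the dataset-wide totals reported in the abstract. This is a minimal sketch; the function name `retention` is hypothetical, and per-variety values should be taken from Table 2 rather than from this aggregate.

```python
RAW_READS = 353_937_212    # total raw reads across all six varieties
CLEAN_READS = 340_131_496  # total clean reads after filtering


def retention(raw, clean):
    """Fraction of raw reads that survive filtering."""
    if raw <= 0:
        raise ValueError("raw read count must be positive")
    if not 0 <= clean <= raw:
        raise ValueError("clean read count must lie between 0 and the raw count")
    return clean / raw


removed = RAW_READS - CLEAN_READS
kept = retention(RAW_READS, CLEAN_READS)
print(f"Removed {removed:,} reads; retained {kept:.2%}")
```

On these totals, roughly 96% of the raw reads pass filtering.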

Reads mapping, transcripts assembly and gene expression analysis
The clean reads were mapped to the reference genome of Oryza sativa japonica cv. Nipponbare. Bowtie2 version 2.3.0 was used to index the reference genome, while TopHat2 version 2.0.12 [6] was used to map the clean reads onto the reference genome. Default parameters were used for these analyses. HTSeq version 0.6.1 [7] was used to estimate the Fragments Per Kilobase of transcript per Million mapped reads (FPKM) for each rice gene. Genes with FPKM ≥ 0.1 were considered to be expressed. Cufflinks version 2.1.1 [8] was used to assemble the mapped reads into transcripts. The number of mapped reads, the percentage of mapped reads and the number of transcripts are shown in Table 3. These sequences and the associated information will be used for downstream analyses such as differential gene expression analysis, gene co-expression network analysis and SNP calling.
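The FPKM normalisation and the ≥ 0.1 expression threshold can be illustrated with a short sketch. It uses the standard FPKM definition (fragments, divided by gene length in kilobases and by mapped fragments in millions) and is not the authors' script; the gene identifiers, counts, lengths and library size below are made-up toy values, and the function names are hypothetical.

```python
def fpkm(fragment_count, gene_length_bp, total_mapped_fragments):
    """Fragments Per Kilobase of transcript per Million mapped fragments."""
    if gene_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("gene length and library size must be positive")
    return fragment_count * 1e9 / (gene_length_bp * total_mapped_fragments)


def expressed_genes(counts, lengths_bp, total_mapped_fragments, threshold=0.1):
    """Return {gene: FPKM} for genes at or above the expression threshold."""
    result = {}
    for gene, count in counts.items():
        value = fpkm(count, lengths_bp[gene], total_mapped_fragments)
        if value >= threshold:
            result[gene] = value
    return result


# Toy inputs for illustration only (not taken from the dataset).
counts = {"geneA": 100, "geneB": 1, "geneC": 0}
lengths_bp = {"geneA": 2000, "geneB": 5000, "geneC": 1500}
library_size = 10_000_000

expressed = expressed_genes(counts, lengths_bp, library_size)
```

In this toy example only `geneA` clears the threshold; `geneB`, with a single fragment over 5 kb, falls below 0.1 and would be treated as not expressed.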