The first transcriptome dataset of roselle (Hibiscus sabdariffa L.) calyces during maturation

Roselle (Hibiscus sabdariffa L.) is recognized for its phytochemical compounds such as anthocyanins, which possess pharmacological potentials in the treatments of hypertension, diabetes, cancer, hyperlipidaemia and hyperglycaemia. The calyx is the most commercially valuable part of the roselle and usually harvested at maturation. However, genetic study to understand the transcriptome changes in the calyx during maturation has yet to be explored. In this study, we sequenced the transcriptomes of roselle calyces at maturation stages III and IV using Illumina NextSeq 500 platform. These are the two most critical maturation stages in roselle, as these stages are often associated with the quality of the calyx. Over 200 million good quality paired-end reads were generated and de novo assembled into a reference transcriptome consisting of 221,334 transcripts with N50 score of 491bp. Among these transcripts, 92,974 transcripts (42%) were successfully annotated. The total number of significantly differentially expressed genes (DEGs) and the top five most significantly regulated genes in each of the maturation stage were presented. Twenty-one genes implicated in the biosynthesis of anthocyanins and their relative expressions in the calyx tissues at the two maturation stages were reported. Two secondary metabolites biosynthesis pathways that attained a relatively higher number of DEG mappings compared to other pathways were also reported. The findings from this work provide novel insights to better understand the transcriptional changes in roselle during calyx maturation, and the data made available here is intended for continued genetic study on roselle. The work is registered under NCBI Bioproject PRJNA664826. The raw sequencing reads are available in Short Read Archive with the accession numbers SRX9171161, SRX9171162, SRX9171163, SRX9171164, SRX9171165 and SRX9171166.


a b s t r a c t
Roselle ( Hibiscus sabdariffa L.) is recognized for its phytochemical compounds such as anthocyanins, which possess pharmacological potentials in the treatments of hypertension, diabetes, cancer, hyperlipidaemia and hyperglycaemia. The calyx is the most commercially valuable part of the roselle and usually harvested at maturation. However, genetic study to understand the transcriptome changes in the calyx during maturation has yet to be explored. In this study, we sequenced the transcriptomes of roselle calyces at maturation stages III and IV using Illumina NextSeq 500 platform. These are the two most critical maturation stages in roselle, as these stages are often associated with the quality of the calyx. Over 200 million good quality paired-end reads were generated and de novo assembled into a reference transcriptome consisting of 221,334 transcripts with N50 score of 491bp. Among these transcripts, 92,974 transcripts (42%) were successfully annotated. The total number of significantly differentially expressed genes (DEGs) and the top five most significantly regulated genes in each of the maturation stage were presented. Twenty-one genes implicated in the biosynthesis of anthocyanins and their relative expressions in the calyx tissues at the two maturation stages were reported. Two secondary metabolites biosynthesis path-ways that attained a relatively higher number of DEG mappings compared to other pathways were also reported. The findings from this work provide novel insights to better understand the transcriptional changes in roselle during calyx maturation, and the data made available here is intended for continued genetic study on roselle.

Value of the Data
• The transcriptome dataset is useful to offer insight into the transcriptome regulation of key genes of interest that are responsible for the desired traits such as the genes involved in the biosynthesis of anthocyanins and flavonoids in roselle. • The transcriptome dataset could benefit studies that focuses on pigment development, antioxidant activities and pharmacological research directed to roselle calyces. It could also be used as a reference data for researchers working on roselle. • The transcriptome dataset allows for further downstream applications associated with newly constructed genetic resources such as genic SSR marker development and discovery of transcription factors.

Data Description
The maturation stage of the calyx is commonly used as a measure to reflect its quality, where it is preferably harvested during stage III over stage IV. To date, no scientific study has been conducted to reveal the transcriptome changes in the calyces of the two stages. Six sequencing libraries prepared from stage III calyx (samples A, B, C) and stage IV calyx (samples a, b, c) were sequenced using Illumina NextSeq 500. The datasets contained raw data sequence converted to FastQ format. The raw reads were deposited into NCBI Short Read Archive with te Bioproject number PRJNA664826 (accession numbers SRX9171161, SRX9171162, SRX9171163, SRX9171164, SRX9171165 and SRX9171166). The number of total reads and good quality reads generated in this study is presented in Table 1 . Transcript annotation based on SwissProt/Pfam/GO associations and the number of unique database hits are displayed in Fig. 1 . Gene ontology classifications derived from stage III and stage IV roselle calyx transcriptomes are presented in Fig. 2 and Fig. 3 , respectively. Fig. 4 and Fig. 5 show the scatter plot matrices comparing normalised expression values from biological replicates of stage III and stage IV calyces. Volcano plot comparing transcripts expression in stage III and stage IV calyx tissues is visualised in Fig. 6 , and the total number of significantly differentially expressed transcripts between stage III and stage IV calyx tissues is summarised in Table 2 . Pathway reconstruction of significantly differentially expressed genes (DEG) based on KEGG database is displayed in Fig. 7 . The anthocyanin-related genes and the number of transcripts associated with them are listed in Table 3 . The top five most significantly regulated DEGs in stage III and stage IV are shown is Table 4 .
The two secondary metabolites biosynthesis pathways that attained a relatively higher number of DEG mappings compared to other pathways within the same category were the phenylpropanoid biosynthesis pathway and the isoquinoline and alkaloid biosynthesis pathway. Table 1 Number of total raw reads, good-quality reads before and after quality trimming and filtering, and the number of lowquality reads discarded.

Sample
Raw

Plant material
Hibiscus sabdariffa variety UMKL1 (voucher number: CY001) was used in this study. The roselle planting materials were obtained from the Department of Agriculture Terengganu, Malaysia. It is a red variety registered as 'HS2' by the Malaya University for national crop list under the Department of Agriculture Malaysia.
Calyx tissues were harvested at stages III and IV determined based on the number of days post-anthesis (DPA), in reference to the calyx tissue maturation index guidelines provided by the Federal Agricultural Marketing Authority, Malaysia ( Table 5 ) [1] . Three biological replicates of the calyx tissues for each maturation stage were used in this study. Biological replicates of stage III calyces (designated as samples A, B and C) were collected on 32 nd DPA; whereas biological replicates of stage IV calyces (designated as samples a, b and c) were collected on 59 th DPA. Collected samples were immediately flash-frozen in liquid nitrogen and stored in a -80 ˚C until RNA extraction as a preventive measure against rapid RNA degradation. A separate set of calyx samples were harvested during sampling, observed, and dissected, and the images were captured for the evaluation of phenotypic differences.

RNA extraction, rRNA-depletion, cDNA sequencing library construction and RNA sequencing
Total RNA was extracted from calyx tissue samples using RNeasy Plant Mini Kit (Qiagen, Hilden, Germany) based on the manufacturer's protocol version 2012. RNA extractions were conducted in triplicate for calyx samples from each maturation stage. RNA quality and quantity were determined using Qubit Fluorometer 2.0 (Life Technologies Corporation, Carlsbad, USA) and NanoDrop spectrophotometer (Thermo Fisher Scientific, Wilmington, USA) for concentration assessment and Agilent 2100 Bioanalyzer (Agilent Technologies, Germany) via a Pico Chip and 1% agarose gel electrophoresis for preliminary assessment of band size and quality. RNA samples having a RIN value above seven and a concentration of more than 1 μM were selected rRNA-depletion.
RNA samples were subjected to rRNA-depletion using Terminator TM 5'-Phosphate-Dependant Exonuclease (Epicentre, Madison, USA) for the purpose of removing the abundance of ribosomal RNAs (rRNA) from the sample to ensure most of the sequencing output generated accounts for reads derived from messenger RNAs (mRNA) which is important for gene expression analysis. The resulted rRNA depleted RNA samples were assessed using Qubit Fluorometer 2.0 (Life Technologies Corporation, Carlsbad, USA) and Agilent 2100 Bioanalyzer (Agilent Technologies, Germany). rRNA-depleted samples having a concentration of more than 0.5 ng/μl and a rRNA contamination below 10% were selected to construct the cDNA sequencing libraries.
Constructions of the cDNA sequencing libraries were carried out using ScriptSeq TM v2 RNA-Seq Library Preparation Kit (Epicentre, Madison, USA) based on the manufacturer's handbook to reverse transcribe the less stable RNA into a more stable molecule, complementary DNA (cDNA) that is used as the template for sequencing. The ScriptSeq cDNA libraries were quantified using Qubit Fluorometer 2.0 (Life Technologies Corporation, Carlsbad, USA) and the size distribution was assessed using the Agilent 2100 Bioanalyzer (Agilent Technologies, Germany) via a High Sensitivity DNA Chip. cDNA sequencing libraries with more than 60% of the fragment sizes falling within the 20 0-10 0 bp range were selected and quantitated using the CFX96 Touch TM Real-Time PCR Detection System (Bio-Rad Laboratories Inc, USA). Libraries were sequenced using Illumina NextSeq 500 Sequencing platform (Illumina, USA), resulted in the generation of 76 bp paired-end sequences.

Data analysis 2.3.1. De novo assembly and sequence annotation
Raw sequencing reads were cleaned by filtering the PhiX (Illumina co-sequencing positive control) sequences using Bowtie2 version 2.2.3 software [2 , 3] . The filtered reads were then subjected to Illumina sequencing adapter removal, base quality trimming (Q ≥30) and removal of short reads ( ≤35 base pairs) with its pair via BBDUK software [4] to generate good quality sequencing reads. De novo reconstruction of transcriptomes using the good quality reads was performed using Trinity version 2.2.0 software [5 , 6] for the assembly of contiguous sequences representing full or partial fragments of the transcriptome. Protein coding region of the assembled transcripts were then predicted using Transdecoder version 2.0.1 software, and subsequently functionally annotated using Trinotate version 3.0.1 software. Sequence similarity were searched against Swiss-Prot database [7] with BLAST version 2.2.31 + [8] for protein sequences and Pfam-A.hmm dat abase [9] via HMMER version 3.1b2 [10] software for protein domain recognition. The sequence similarity and protein domain identification results were used to link the transcripts to eggNOG [11] for identification search of biological processes, and gene ontology [12] for the exploration and retrieval of information concerning three important components: biological process, molecular function, and cellular component. The generated functionally annotated reference transcriptome was then integrated into SQLite database. Raw reads were deposited to NCBI Sequence Read Archive under the Accession PRJNA664826.

Transcript expression analysis
Transcript expression analysis was carried out using the CLC Genomic Workbench (Version 10.1.0). Raw reads were subjected to quality control (PhiX and adapter sequences filtering, base quality trimming Q ≥20, read length ≥30bp) and the good quality reads of each sample were mapped to the de novo assembled transcripts. Transcript abundance for every transcript was quantified in transcripts per million (TPM). Expression profiles of biological replicates were first examined to ensure the well correlation of replicates and to probe relationship among samples. Scatter plots and Pearson correlation coefficient graphs were constructed between biological replicates within samples of each stage to measure their interrelationship. Principal component analysis (PCA) was performed using RStudio IDE (version 4.1.2.) software [13] to emphasize variations and bring out strong patterns in the dataset. Pair-wise differential expression analysis was performed on the six samples. Transcripts having a fold change value (FC) of ≥2 differences and an adjusted p -value (FDR) of ≤0.05 were defined as significantly differentially expressed genes (DEGs) and further utilised to form a volcano plot to visualize and highlight the degree of which genes are significantly differentially expressed. A Venn diagram was constructed via Interactivenn software [14] to illustrate the distribution of DEGs between stages. An overall, comprehensive annotation bar chart was further established for each group of staged specific transcripts via WEGO 2.0 software [15 , 16] .

Kyoto encyclopedia of genes and genomes (KEGG) pathway analysis
The KEGG Mapper -Search and Color Pathway software was employed for pathway analysis. Significantly differentially expressed transcripts identified from the differential expression analysis were used to obtain a list of Trinotate-annotated K numbers, which were then used for pathway reconstruction using the Kyoto Encyclopedia of Genes and Genomes web server against the KEGG pathway database. The overall flow chart of the RNA-sequencing pipeline used in this study is summarised in Fig 8 .

Ethics Statements
This study does not involve human participants and samples derived from human. It also does not involve animals including live vertebrates and higher invertebrates.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data Availability
Transcriptome profiling of Hibiscus sabdariffa L during calyx maturation (Original data) (NCBI SRA database).