Data of de novo assembly and functional annotation of transcriptome of Peninsular Malaysian Amomum Roxb. species

The genus Amomum comprises perennial, tropical, herbaceous plants that are distributed globally and have both medicinal and ornamental value. These plants contain a variety of secondary metabolites, including compounds with antioxidant and antimicrobial properties. This is the first transcriptomic analysis of seven Amomum species from Peninsular Malaysia, using leaves, stems, and roots as sample material. Paired-end Illumina HiSeq technology was used to generate the data, which include raw data, cleaned reads, de novo assembly, and functional annotation. The data are accessible via NCBI BioProject (PRJNA936673).


Subject: Plant Science
Specific subject area: Transcriptomics
Type of data

Value of the Data
• This is the first transcriptome data for seven Peninsular Malaysian Amomum species, and it can serve as a resource for studying the remaining species.
• The RNA-seq data can be used to predict genes and identify key genes involved in the biosynthesis of secondary metabolites, such as flavonoids, steroids, and terpenoids.
• Because Amomum species are important ornamental and medicinal plants, these data are significant for future genomic studies and gene expression analysis.
• These transcriptomic datasets can serve as a reference for designing targeted experiments to explore secondary metabolite biosynthesis pathways or to investigate gene expression patterns in other Amomum species of Peninsular Malaysia that are yet to be studied.

Objective
The objective of this study is to generate transcriptomic datasets using RNA-Seq technology to characterize and compare differences in important genes, pathways and networks that are involved in the production of secondary metabolites between selected Amomum species of Peninsular Malaysia.

Data Description
The following are transcriptomic data for selected Peninsular Malaysian Amomum species, generated using Illumina NovaSeq technology. Approximately 170 GB of raw data were generated from total RNA of seven species (14 samples), covering different plant tissues, namely leaves, stem, and roots. The absorbance readings, RNA concentration, and RIN (RNA integrity number) values are provided in Supplementary Material, Table S1. Table 1 summarizes the raw and clean reads generated from the sequencing. A total of 806,223,725 clean reads were obtained for the 14 samples. The Q20 and Q30 percentages were > 96 and > 90, respectively. The GC% ranged from 48.12% to 51.86%. The de novo assembly produced 470,621,303 total nucleotides for transcripts, with an average length of 623 bp and an N50 of 674, and 177,787,746 nucleotides for unigenes, with an average length of 592 bp and an N50 of 342 (Tables 2 and 3). The findings show that 14,346 unigenes, or 18.99% of the total transcripts, had coding sequences (CDS). A total of 1041 sequences, or 0.72% of them, had both start and stop codons. A total of 50,457 sequences (35.17%) were categorized as "5 prime partial len" and had only a termination codon, while 622 sequences (0.433%) were classified as "3 prime partial len" and contained only an initiation codon. A total of 32,805 sequences (22.86%) were classified as "internal len" because they lacked both an initiation and a termination codon.
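The N50 values reported for transcripts and unigenes follow the usual definition: the length of the sequence at which the cumulative length, counted from the longest sequence downward, first reaches half of the total assembly length. A minimal sketch of that calculation is given below; the function name `n50` and the toy lengths are illustrative assumptions and are not taken from the dataset or from the assembly software.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Sequences are sorted from longest to shortest; N50 is the length of the
    sequence at which the running total first reaches half of the summed length.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("lengths must not be empty")


# Toy example (hypothetical lengths, not from the Amomum assembly):
# total = 1500, half = 750; 500 + 400 = 900 reaches the half-way point.
toy_lengths = [100, 200, 300, 400, 500]
print(n50(toy_lengths))  # 400
```

Applied to the full list of transcript or unigene lengths, the same function would be expected to give the values summarized in Table 2, provided the same length filter is used.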
Gene Ontology (GO) annotation classified the unigenes into three main GO domains: Biological Process (BP), Cellular Component (CC), and Molecular Function (MF). Within Biological Process, the highest number of genes (47,810) was assigned to cellular process. The largest Cellular Component GO term comprised 33,538 genes, whereas binding (44,297) and catalytic activity (34,579) were the largest terms in Molecular Function ( Fig. 2 ). A total of 30,966 unigenes were annotated in KOG and classified into 26 KOG functional categories ( Fig. 3 ). The largest functional category was Posttranslational modification, protein turnover, chaperones with 4643 unigenes, followed by Translation, ribosomal structure, and biogenesis with 4159 unigenes. Among pathway categories, the maximum number of unigenes (7096) was assigned to Environmental Information Processing, followed by Genetic Information Processing (5786 unigenes). There were 1127 unigenes assigned to metabolism of terpenoids and polyketides and 1217 unigenes assigned to biosynthesis of secondary metabolites ( Fig. 4 ).

Plant material
Three different young plant tissues (leaves, stem, and roots) of seven Amomum species ( A. uliginosum, A. testaceum, A. elan, A. smithiae, A. trilobum, A. aculeatum, and A. curtisii ) were collected from their natural habitat during their initial flowering period. The plant tissues were stored in RNALater solution upon collection to avoid RNA degradation. Table 4 lists the Amomum species used in this study, with location and accession number. RNA was extracted from each tissue type of each species using an optimized commercial kit protocol (Macherey-Nagel RNA extraction kit), and the resulting RNA from all three tissue types was pooled to obtain total RNA. RNA quality and quantity were assessed using a UV-visible spectrophotometer (Beckman Coulter, U.S.A.). Each sample was measured for optical density (OD) at 230 nm, 260 nm, and 280 nm. RNA samples with a RIN value of 6.5 or higher, A260/280 purity readings of ∼ 2.0, and A260/230 values ranging from 2.0 to 2.2 were sent to Apical Scientific Sdn. Bhd., Malaysia (an NGS service company) for preparation of cDNA libraries and subsequent transcriptomic sequencing (RNA-seq) using Next Generation Sequencing (NGS) technology (NovaSeq-PE150).
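The acceptance criteria described above (RIN of 6.5 or higher, A260/280 of about 2.0, and A260/230 between 2.0 and 2.2) can be expressed as a simple filter. The sketch below is illustrative only: the `RnaSample` class, the `passes_qc` function, the sample names and readings, and in particular the tolerance of 0.1 used to interpret "∼ 2.0" are assumptions, not values stated in the study.

```python
from dataclasses import dataclass


@dataclass
class RnaSample:
    """Hypothetical container for the QC readings of one RNA sample."""
    name: str
    rin: float
    a260_280: float
    a260_230: float


def passes_qc(sample, ratio_tolerance=0.1):
    """Check a sample against the thresholds described in the text.

    ratio_tolerance is an assumed window around A260/280 = 2.0;
    the source only states "approximately 2.0".
    """
    rin_ok = sample.rin >= 6.5
    a260_280_ok = abs(sample.a260_280 - 2.0) <= ratio_tolerance
    a260_230_ok = 2.0 <= sample.a260_230 <= 2.2
    return rin_ok and a260_280_ok and a260_230_ok


# Made-up readings for illustration; real values are in Table S1.
samples = [
    RnaSample("sample_A", rin=7.2, a260_280=2.02, a260_230=2.1),
    RnaSample("sample_B", rin=6.0, a260_280=2.00, a260_230=2.1),
]
accepted = [s.name for s in samples if passes_qc(s)]
print(accepted)  # ['sample_A']
```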

De novo assembly and functional annotation
The resulting sequence data were filtered to eliminate low-quality reads using FastQC [1]. The clean reads underwent de novo assembly using Trinity version 2.6.6 [2]. Corset version 4.6 [3] was employed to cluster transcripts, eliminate redundant sequences, and obtain unigenes. The protein coding sequences (CDS) were predicted using TransDecoder [4].
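The CDS completeness categories reported in the Data Description (sequences with both codons, "5 prime partial", "3 prime partial", and "internal") depend on whether a predicted coding sequence begins with an initiation codon and ends with a termination codon. The sketch below illustrates that logic only; it is not TransDecoder's implementation, and the function name `classify_orf`, the category labels, and the example sequences are assumptions made for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def classify_orf(cds):
    """Classify a coding sequence by the presence of start and stop codons.

    complete        : starts with ATG and ends with a stop codon
    5prime_partial  : stop codon only (the 5' end, with the start, is missing)
    3prime_partial  : start codon only (the 3' end, with the stop, is missing)
    internal        : neither start nor stop codon
    """
    cds = cds.upper()
    has_start = cds[:3] == "ATG"
    has_stop = cds[-3:] in STOP_CODONS
    if has_start and has_stop:
        return "complete"
    if has_stop:
        return "5prime_partial"
    if has_start:
        return "3prime_partial"
    return "internal"


# Short made-up sequences, one per category.
for seq in ["ATGAAATAA", "AAACCCTAA", "ATGAAACCC", "AAACCCGGG"]:
    print(seq, classify_orf(seq))
```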
A series of bioinformatic analyses was performed using the clean reads. RSEM v1.2.28 [5] was used for mapping and for estimating the abundance of unigenes. Functional annotation was performed by BLASTx searches with e-value = 1e-5 against the NCBI non-redundant protein database (data downloaded in December 2019), as well as other established databases such as Swiss-Prot using Diamond version 0.8.22 (data downloaded in December 2022) [6], and Pfam using hmmscan from HMMER 3.1 (data downloaded in November 2022) [7]. Blast2GO (b2g4pipe_v2.5) [8] with e-value = 1e-6 was used for assigning gene ontology (GO) terms.
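To make the e-value cutoff concrete, the sketch below filters alignment hits in the standard 12-column BLAST tabular layout (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore) and keeps the best-scoring hit per unigene at or below 1e-5. It is a post-processing illustration under stated assumptions, not the authors' pipeline: the function name `best_hits`, the use of bit score to choose the best hit, and the identifiers in the example are all hypothetical.

```python
import csv
import io


def best_hits(tabular_text, max_evalue=1e-5):
    """Return {query: (subject, evalue, bitscore)} for hits passing the cutoff.

    Expects tab-separated rows in the 12-column BLAST tabular layout;
    rows with fewer columns are skipped.
    """
    best = {}
    reader = csv.reader(io.StringIO(tabular_text), delimiter="\t")
    for row in reader:
        if len(row) < 12:
            continue
        query, subject = row[0], row[1]
        evalue, bitscore = float(row[10]), float(row[11])
        if evalue > max_evalue:
            continue  # hit is weaker than the chosen e-value threshold
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best


# Made-up hits: unigene_1 has two passing hits, unigene_2 has none.
example = (
    "unigene_1\tprot_A\t90.0\t100\t10\t0\t1\t300\t1\t100\t1e-30\t200\n"
    "unigene_1\tprot_B\t95.0\t100\t5\t0\t1\t300\t1\t100\t1e-50\t350\n"
    "unigene_2\tprot_C\t40.0\t50\t30\t0\t1\t150\t1\t50\t0.01\t30\n"
)
print(best_hits(example))
```

Choosing the highest bit score per query is one common convention; ranking by lowest e-value instead would be an equally reasonable design.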

Table 1
Summary of RNA-Seq data generated from seven Amomum species.

Table 2
Assembly statistics of Amomum species of Peninsular Malaysia.

Table 3
Length distribution of Transcripts and Unigenes.

Table 4
List of Amomum species with location and accession number.