Transcriptome dataset of sago palm in peat soil

Sago palm (Metroxylon sagu Rottb.) is an important starch-producing palm that contributes to Malaysia's economy, especially in the State of Sarawak, Malaysian Borneo. In this palm, starch is stored in the central part of the plant. Under normal conditions, sago palm develops its trunk 4-5 years after planting. However, sago palms planted on deep peat soil have failed to develop a trunk even 17 years after planting. This phenomenon, known as ‘non-trunking’, eliminates the economic value of the palms. Numerous studies have addressed the phenomenon, but the molecular mechanisms by which sago palm responds to the responsible stresses remain poorly understood. Therefore, in this study, leaf samples were collected from trunking (normal) and non-trunking sago palms planted on peat soil for total RNA extraction, followed by next-generation sequencing on the BGISEQ-500 platform. The raw reads were cleaned and de novo assembled using the Trinity software package. A total of 40.11 Gb was sequenced from the sago palm leaf samples. The assembly produced 102,447 unigenes, with an N50 of 1809 bp and a GC content of 44.34%. Alignment of the unigenes against seven functional databases (NR, NT, GO, KOG, KEGG, SwissProt and InterPro) resulted in the annotation of 65,523 (63.96%) unigenes. TransDecoder detected 46,335 coding DNA sequences. A total of 30,039 simple-sequence repeats distributed on 21,676 unigenes were detected using Primer3 software, and 2355 transcription factor-coding unigenes were predicted using getorf and hmmsearch. This work is registered under NCBI BioProject PRJNA781491. The raw RNA sequencing data are available in the Sequence Read Archive (SRA) database under accession numbers SRX13165895, SRX13165896, SRX13165897, SRX13165898, SRX13165899, and SRX13165900. Gene expression and annotation information are accessible in the public functional genomics data repository Gene Expression Omnibus (GEO) under accession number GSE189085.



Value of the Data
• These data are useful for the scientific community as they provide insights into the transcriptome of M. sagu.
• These data provide comprehensive transcriptomic expression profiles obtained by paired-end sequencing of two sets of samples, each with three biological replicates, to help understand the gene expression contributing to the non-trunking phenomenon in M. sagu.
• Researchers working on omics studies of M. sagu could also use these data as cross-reference information to support their findings.

Data Description
Sago palm grows through a series of developmental stages and takes up to twelve years to be ready for harvest. M. sagu generates suckers (soboliferous) every 18 months as successors of the mother plant, which dies after fruiting (hapaxanth). Under good conditions, mature sago palm yields 15-25 metric tons of air-dried starch per hectare at the end of an 8-year growth cycle [1]. The advantage of sago palm as a starch-producing crop able to grow in seasonally waterlogged peat soil prompted the Land Custody and Development Authority Sarawak [2] to initiate a commercial plantation in Mukah, Sarawak, in 1987. However, non-trunking sago palms occurred even after ten years of cultivation. Non-trunking reduced the starch yield per hectare of land, resulting in instability of the sago starch market. It also reduced plantation income, consequently restricting the development of sago industries and eroding confidence in this palm among potential and current sago palm farmers [3].
Numerous studies have addressed the non-trunking problem, including studies of soil physicochemical properties [2], the soil microbiome [4] and molecular studies [5][6][7]. Overall, these studies indicated that mineral deficiency causes non-trunking in sago palm, but how this deficiency affects sago palm development remains unanswered. Several genomics and proteomics studies of this palm are currently under way. In conjunction with those studies, this study uses transcriptome analysis to compare gene expression between trunking and non-trunking sago palm leaf tissue, to highlight the differentially expressed genes and their correlation with the non-trunking phenomenon in sago palm.
The information in this article includes the transcriptomes of trunking sago palm (control) and non-trunking sago palm (target of interest) grown on peat soil. Global gene expression in trunking and non-trunking sago palm was compared by differentially expressed gene analysis. The transcriptome dataset files, comprising 6 libraries of raw data and 2 sets of processed data, were submitted to the NCBI Sequence Read Archive (SRA) and Gene Expression Omnibus (GEO) databases.

Sample collection
Sago palm leaf tissues were used in this study. There were six samples, consisting of three biological replicates of each of the 2 phenotypes. All the samples were collected from Dalat Sago Plantation (Table 1). The samples were wiped with a kitchen towel containing 70% ethanol to remove debris. They were then placed in containers and snap-frozen in liquid nitrogen. The samples were kept in liquid nitrogen before being transferred to a -80 °C freezer for long-term storage.

RNA extraction and RNA-seq information
Total RNA was extracted from the six samples using a CTAB protocol and sequenced on the BGISEQ-500 platform. The transcriptomes of trunking and non-trunking sago palm (M. sagu) were successfully sequenced, and the raw RNA sequence reads were deposited in NCBI's Sequence Read Archive (SRA) database under the accession numbers SRX13165895, SRX13165896, SRX13165897, SRX13165898, SRX13165899, and SRX13165900.
The total RNA samples were subjected to mRNA enrichment before RNA sequencing. About 40.11 Gb of raw sequence reads were generated from the six RNA samples on the BGISEQ-500 sequencing platform. Raw reads containing more than 5% unknown (N) bases, reads contaminated with adaptor sequence, and reads in which more than 20% of the bases had a quality score lower than 15 were discarded; the remaining reads were designated clean reads. The clean read ratio exceeded 95%, with high accuracy reflected by Q30 (equivalent to a probability of an incorrect base call of 1 in 1000) above 90% and Q20 (equivalent to a probability of an incorrect base call of 1 in 100) above 95% (see Table 2).
The N50 length is used to assess assembly continuity; the higher, the better. N50 is a weighted median statistic such that 50% of the total assembled length is contained in Unigenes that are equal to or longer than this value; N70 and N90 are defined analogously. GC (%) is the percentage of G and C bases in all transcripts.
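The read-filtering thresholds and Phred quality scores described above can be illustrated with a minimal Python sketch. It is not the pipeline used to produce this dataset: the function names and the toy reads are hypothetical, quality scores are assumed to be already decoded to integers, and adaptor screening is not modelled.

```python
def phred_error_probability(q):
    """Probability of an incorrect base call for Phred score q (Q30 -> 1 in 1000)."""
    return 10 ** (-q / 10)


def is_clean_read(seq, quals, max_n_frac=0.05, min_q=15, max_low_q_frac=0.20):
    """Return True if a read passes the N-content and low-quality filters.

    seq   -- read bases as a string
    quals -- per-base Phred scores (list of ints), same length as seq
    A read is rejected if more than 5% of its bases are N, or if more
    than 20% of its bases have a quality score below 15.
    """
    if not seq or len(seq) != len(quals):
        return False
    n_frac = seq.upper().count("N") / len(seq)
    low_q_frac = sum(q < min_q for q in quals) / len(quals)
    return n_frac <= max_n_frac and low_q_frac <= max_low_q_frac


# Toy reads (hypothetical data, for illustration only)
good = ("ACGTACGTAC", [30] * 10)
too_many_n = ("ACGTNNGTAC", [30] * 10)             # 20% N bases
low_quality = ("ACGTACGTAC", [10] * 3 + [30] * 7)  # 30% of bases below Q15

print(is_clean_read(*good))         # True
print(is_clean_read(*too_many_n))   # False
print(is_clean_read(*low_quality))  # False
```

The thresholds are exposed as keyword arguments so that the same check can be reused with other cut-offs.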

De novo assembly
The clean reads were then de novo assembled using Trinity to generate the reference sequences (Table 3). The reference sequences were then clustered with the TIGR Gene Indices Clustering Tools (TGICL) to remove redundancy and obtain unique gene (Unigene) sequences (Table 4; Fig. 1).
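The assembly statistics reported for the transcripts and Unigenes (N50, N70, N90 and GC content) can be computed from sequence lengths and bases as sketched below. This is a generic illustration of the definitions, assuming hypothetical helper names and toy values; it is not the code used by Trinity or TGICL.

```python
def nxx(lengths, fraction=0.5):
    """Return the Nxx length: the largest length L such that sequences of
    length >= L together contain at least `fraction` of the total bases.
    fraction=0.5 gives N50, 0.7 gives N70, 0.9 gives N90.
    """
    if not lengths:
        raise ValueError("no sequence lengths given")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= fraction * total:
            return length


def gc_percent(seqs):
    """Percentage of G and C bases over all sequences combined."""
    bases = "".join(seqs).upper()
    if not bases:
        raise ValueError("no bases given")
    return 100 * (bases.count("G") + bases.count("C")) / len(bases)


# Toy example (hypothetical lengths, total 1500 bp)
toy_lengths = [100, 200, 300, 400, 500]
print(nxx(toy_lengths, 0.5))         # 400
print(nxx(toy_lengths, 0.7))         # 300
print(nxx(toy_lengths, 0.9))         # 200
print(gc_percent(["ACGT", "GGCC"]))  # 75.0
```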

Unigene functional annotation
After assembly, the Unigenes were functionally annotated against seven functional databases: the NCBI protein database (NR), the NCBI nucleotide database (NT), Gene Ontology (GO), Eukaryotic Orthologous Groups of proteins (KOG), the Kyoto Encyclopedia of Genes and Genomes (KEGG), Swiss-Prot (the curated protein sequence database of UniProt) and InterPro (Table 5; Fig. 2). Unigene annotation and expression information are deposited in NCBI's Gene Expression Omnibus (GEO) under accession number GSE189085.
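As a worked check of the overall annotation rate, the sketch below assumes that a Unigene counts as annotated when it has a hit in at least one database, so the overall figure is the size of the union of the per-database hit sets. The function name and the toy identifiers are hypothetical; only the totals 65,523 and 102,447 come from this article.

```python
def annotation_rate(total_unigenes, hits_by_database):
    """Return (number annotated, percentage annotated), counting a Unigene
    as annotated if it appears in the hit set of at least one database."""
    annotated = set().union(*hits_by_database.values())
    return len(annotated), 100 * len(annotated) / total_unigenes


# Toy example with hypothetical Unigene identifiers
toy_hits = {
    "NR": {"U1", "U2", "U3"},
    "KEGG": {"U2", "U4"},
    "GO": {"U1"},
}
print(annotation_rate(8, toy_hits))  # (4, 50.0)

# Consistency check of the reported overall figure
print(round(100 * 65523 / 102447, 2))  # 63.96
```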

Unigene expression
Based on the assembly result, the clean reads of each sample were mapped to the Unigenes with Bowtie2, and gene expression levels were calculated with RSEM. The relationship between samples is shown by principal component analysis (PCA) (Fig. 3).
Transcriptome sequencing of the two sago palm phenotypes was completed, with 40.11 Gb sequenced, producing annotated Unigenes together with detected SSRs and transcription factors. The data obtained from this study can be used to understand the gene expression contributing to the non-trunking phenomenon in M. sagu.
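RSEM reports length-normalised expression units such as TPM and FPKM alongside expected read counts. The sketch below shows only how these units are defined from per-Unigene counts and effective lengths; it does not reproduce RSEM's expectation-maximisation estimation, and the function names and toy numbers are hypothetical.

```python
def tpm(counts, eff_lengths):
    """Transcripts per million: per-base read rates rescaled to sum to 1e6."""
    rates = [c / l for c, l in zip(counts, eff_lengths)]
    total = sum(rates)
    if total == 0:
        raise ValueError("all counts are zero")
    return [r / total * 1e6 for r in rates]


def fpkm(counts, eff_lengths):
    """Fragments per kilobase of transcript per million mapped fragments."""
    total_fragments = sum(counts)
    if total_fragments == 0:
        raise ValueError("all counts are zero")
    return [c * 1e9 / (l * total_fragments) for c, l in zip(counts, eff_lengths)]


# Toy example: three Unigenes with hypothetical counts and effective lengths (bp)
toy_counts = [100, 200, 300]
toy_lengths = [1000, 2000, 1000]

print([round(x) for x in tpm(toy_counts, toy_lengths)])  # [200000, 200000, 600000]
print([round(x) for x in fpkm(toy_counts, toy_lengths)])
```

In this toy example the first two Unigenes receive the same TPM although their counts differ, because the second is twice as long; this is the effect that length normalisation is intended to capture.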

Ethics Statement
This work does not contain any studies with humans. The original collections of sago palm (M. sagu) leaves were made with the direct permission of Dalat Sago Plantation, owned by Land Custody and Development Authority (LCDA) Holdings Sdn. Bhd., in the Mukah division. The sago palm leaf samples were not collected from any National Parks or protected wilderness areas. Additionally, the sago palm (M. sagu) is not an endangered species.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.