Leaf transcriptome data of two tropical medicinal plants: Sterculia lanceolata and Clausena excavata

The data presented in this article are associated with the research articles “DOI: 10.1007/s11295-019-1348-3” [1] and “DOI: 10.1007/s13205-018-1162-x” [2]. Clausena excavata Burm. f. and Sterculia lanceolata Cav. are medicinal trees [3,4] native to Southeast Asia and China, and most members of the genera Clausena and Sterculia contain valuable secondary metabolites with great potential for drug development. Although many phytochemical studies have been conducted using extracts from various parts of these plants [4,5], very limited genetic resources are available. RNA sequencing of C. excavata and S. lanceolata was conducted using the paired-end Illumina HiSeq 2500 sequencing system, producing the first de novo transcriptome data for the genera Clausena and Sterculia. Transcriptome shotgun assembly using three different assembly tools [2] generated a total of 16,638 non-redundant contigs (N50, 900 bp) from C. excavata and 7,857 (N50, 423 bp) from S. lanceolata. The data are accessible at NCBI BioProject under PRJNA428402 for C. excavata [2] and PRJNA435648 for S. lanceolata [1].



Data
This article reports RNA sequencing transcriptome data from leaf samples of two medicinal plants, C. excavata and S. lanceolata [3,4,5]. The raw read data were deposited in the NCBI Sequence Read Archive (SRA) database under the accessions SRR6438389 for C. excavata [2] and SRR6798190 for S. lanceolata [1]. Assembled sequence data are accessible at Transcriptome Shotgun Assembly (TSA) under the accessions GEM00000000 for C. excavata [2] and GGIS00000000 for S. lanceolata [1]. Annotation of the assembled contigs showed that many contigs contain only partial coding regions, as shown in Fig. 1. The raw and assembled RNA sequencing data are summarized in Table 1. Simple sequence repeat (SSR) primer sets (464 from C. excavata and 153 from S. lanceolata), most of which have not previously been reported or tested, are provided in Supplementary file 1.
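The N50 values quoted for the two assemblies (900 bp and 423 bp) can be recomputed from the contig lengths. The following is a minimal sketch, assuming the common definition of N50 as the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembled bases. The function names `n50` and `contig_lengths` are illustrative and do not come from the original pipeline.

```python
def contig_lengths(fasta_lines):
    """Yield the sequence length of each record in an iterable of FASTA lines."""
    length = None  # None until the first header line is seen
    for line in fasta_lines:
        line = line.strip()
        if line.startswith(">"):
            if length is not None:
                yield length
            length = 0
        elif line and length is not None:
            length += len(line)
    if length is not None:
        yield length


def n50(lengths):
    """Return the N50 of a collection of contig lengths (assumed definition, see text)."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no contig lengths supplied")
```

For a downloaded TSA FASTA file, `n50(contig_lengths(open(path)))` would give a value to compare with Table 1, where `path` is a hypothetical local file name.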

Sample collection
Leaf samples of fully grown wild C. excavata Burm. f. and S. lanceolata Cav. were collected from Vinh Phuc province or Me Linh field station, Hanoi, Vietnam, in August 2015. Leaf samples were submerged in liquid nitrogen, transferred into RNAlater solution (Ambion Inc., USA), and then stored in a −20 °C freezer.

Value of the data
These data are the first de novo leaf transcriptomes from the genus Sterculia and the genus Clausena. They significantly increase both the amount of sequence information available for the two genera and the potential for discovering genes involved in the biosynthesis of useful secondary metabolites in both species. The data would be very useful for genetic and comparative studies of Clausena or Sterculia species as well as their relatives. The assembled sequences will serve as a reference for future studies and would be valuable resources for examining the molecular characteristics involved in the pharmaceutical properties of Sterculia and Clausena species.

cDNA library construction and sequencing
Leaf samples were removed from RNAlater solution and ground with a pestle and mortar in liquid nitrogen to isolate total RNA using TRIzol reagent (Thermo Fisher Scientific, Korea). The purity and quantity of total RNA were measured using an RNA Pico Chip on the Agilent 2100 Bioanalyzer (Agilent Technologies, USA). Ten μg of total RNA was used for mRNA isolation using oligo-dT beads, and randomly sheared mRNA was used for cDNA synthesis, followed by adaptor ligation at the 3' A overhang.
The mRNA isolation and cDNA library construction were conducted following the procedure of the SureSelect Strand-Specific RNA reagent kit (Agilent, USA). Equal quantities of mRNA from three leaf samples, each from an independent tree, were pooled and used for cDNA library construction. The cDNA library was checked for quality using an Agilent DNA 1000 chip (Agilent Technologies, USA) and sequenced on the Illumina HiSeq 2500 (Illumina, USA).

De novo assembly
The raw reads from sequencing were trimmed and filtered to remove adaptor sequences, empty reads, and low-quality reads, using thresholds of a Phred quality score of 20 and a length of 50 bp, with the NGS tool kits and the Trimmomatic tool [6]. The high quality reads were assembled using three assemblers, CLC  for Trinity) were applied to obtain the best results. All contigs produced by each assembler at the various k-mer values were merged separately for further processing. As Oases does not cluster assembled contigs, CD-HIT-EST was used to cluster the contigs at a sequence identity of more than 90% and a coverage of 100% [7]. The data sets from all assemblers were then combined into a single data set by collapsing identical or near-identical contigs into a single contig using CD-HIT-EST with the same criteria. Because no public reference genome sequence is available for either C. excavata or S. lanceolata, the contigs were annotated by running NCBI BLAST with a cutoff E-value of 10⁻⁶ against the NCBI non-redundant (NR) protein database.
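The E-value cutoff used for annotation can be illustrated with a short filter over BLAST results. This is a minimal sketch, not the authors' script. It assumes the hits were written in BLAST's default 12-column tabular format (`-outfmt 6`), in which the E-value is the eleventh column, that a hit passes when its E-value is at most 10⁻⁶, and that only the lowest-E-value hit per contig is kept. The function name `best_hits` is hypothetical.

```python
import csv

E_VALUE_CUTOFF = 1e-6  # cutoff stated in the text
EVALUE_COL = 10        # 0-based index of the E-value in default tabular output


def best_hits(tabular_lines, cutoff=E_VALUE_CUTOFF):
    """Map each query contig to its (subject, evalue) hit with the lowest E-value.

    Hits whose E-value exceeds the cutoff are discarded (assumed inclusive cutoff).
    """
    best = {}
    for row in csv.reader(tabular_lines, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        query, subject = row[0], row[1]
        evalue = float(row[EVALUE_COL])
        if evalue > cutoff:
            continue
        if query not in best or evalue < best[query][1]:
            best[query] = (subject, evalue)
    return best
```

Contigs missing from the returned dictionary had no hit at or below the cutoff and would remain unannotated under these assumptions.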