Transcriptome dataset from bark and latex tissues of three Hevea brasiliensis clones

Hevea brasiliensis is exploited for its latex production, and it is the only viable source of natural rubber worldwide. The demand for natural rubber remains high due to its high-quality properties, which synthetic rubber cannot match. In this paper, we present transcriptomic data and analysis of three H. brasiliensis clones using latex and bark tissues collected from a 10-year-old plant. The combined, assembled transcripts were mapped onto an H. brasiliensis draft genome. Gene ontology analysis showed that the most abundant transcripts were related to molecular functions, followed by biological processes and cellular components. Simple sequence repeats (SSRs) and single nucleotide polymorphisms (SNPs) were also identified, and these can be useful for the selection of parental and new clones in a breeding program. Data generated by RNA sequencing were deposited in the NCBI public repository under accession number PRJNA629890.

© 2020 The Authors. Published by Elsevier Inc. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

Value of the Data
• These datasets provide information on the biological processes and pathways involved in the formation of natural rubber.
• These datasets can help plant breeders improve rubber breeding programmes, and can be useful to researchers involved in transcriptomic analysis and RNA datasets.
• The data will be of practical use in the development of genetic SNP and SSR markers as a tool in rubber breeding programmes.
• The data will facilitate further analysis aimed at identifying genes and pathways related to latex yield for the development of new rubber clones with improved performance.

Data description
The data presented here contain a combined transcriptome assembly of two different types of tissue (bark and latex) from three rubber clones (Hevea brasiliensis) collected in Malaysia (RRIM 600, PB 260 and RRIM 929). Rubber trees have been planted on a large scale in Malaysia since the 1920s. The transcriptome dataset generated from RNA-Seq can be used to identify highly expressed genes that could be manipulated in order to improve desirable agronomic traits. Currently, there are more than 150 transcriptome datasets and BioProjects of the rubber tree available in various online databases, including NCBI (National Center for Biotechnology Information), EMBL (European Bioinformatics Institute) and DDBJ (DNA Data Bank of Japan). Previous transcriptome studies have focused on, among other topics, phylogenetic relationships [1], latex flow mechanisms [2], and molecular marker development involving single-nucleotide polymorphisms (SNPs) [3] and simple-sequence repeats (SSRs) [4] for rapid identification of parental and new clone characteristics. With the expansion of the transcriptome dataset to three different rubber clones, SNP and SSR markers associated with economic traits, such as latex biosynthesis and disease resistance genes, could be identified to improve the traits of new rubber clones through marker-assisted selection in rubber breeding programs.
In this study, RNA was isolated and sequenced on the HiSeq 2000 platform. The raw reads generated were trimmed after quality analysis. RNA-Seq statistics from bark and latex tissues are shown in Table 1. A summary of the H. brasiliensis transcriptome analysis is shown in Table 2. Analysis showed that 46.5% of the transcriptome data had a significant   (Table 3). A summary of the gene ontology (GO) annotation of the top three functional categories showed that 187,573 transcripts were related to molecular functions, 84,601 to biological processes and 60,422 to cellular components (Table 4). In this study, microsatellite motifs from the merged transcriptomes were identified (Table 5), with dinucleotides being the most abundant, followed by trinucleotides, tetranucleotides and pentanucleotides. The number of SNPs within the rubber transcriptome datasets is shown in Table 6, with the highest number of SNPs being identified in bark and latex tissue from the RRIM 929 rubber clone.

Specimen collection
Bark and latex rubber-tree tissue was collected from a 10-year-old plant obtained from PL Oil Palm & Rubber Sdn Bhd (formerly known as Tradewinds Plantations Bhd), Kedah, Malaysia. Three replicates were collected for each tissue type.

RNA isolation and sequencing
Total RNA from the bark and latex tissue samples was extracted following the Qiagen RNeasy Plant Mini Kit protocol (Qiagen Inc., Chatsworth, CA). RNA quality and integrity were then estimated using standard Qubit Nanodrop spectrophotometry (O.D. ∼ 2.0) and Agilent 2100 Bioanalyzer (RIN value > 8) protocols. Paired-end Illumina mRNA libraries were generated using an Illumina TruSeq Kit from 1 µg of total RNA, following the manufacturer's protocol. Each sample was sequenced in multiple HiSeq 2000 lanes using the TruSeq SBS 36 Cycle Kit (Illumina, San Diego, CA) to obtain 2 × 101 bp reads.

Transcriptome analysis and mapping
Raw reads in FASTQ format obtained from the Illumina platform were analysed using FastQC, version 0.10.1 [5]. The raw reads were first screened for sequencing adaptors and then trimmed using Trimmomatic, version 0.32 [6]. The adaptor-trimmed sequences were then assessed for quality scores and for bases with Q > 20. Sequences containing Ns were removed with Prinseq-Lite, version 0.2.0.4 [7], before downstream analysis. The cleaned paired-end reads in FASTQ format were mapped onto the draft H. brasiliensis genome [8] (accession: PRJDB4387), available from the DDBJ/EMBL/GenBank BioProject database, using TopHat (version 2.1.0) [9] with Bowtie2 as the aligner. Reads were mapped to the genome with default parameters. The mapped reads were then assembled using Cufflinks v2.2.1 [9], with default parameters and a minimum transcript length of 100 bp, to generate the reference transcriptome. The output transcripts were considered as Hevea reference sequences.
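The two read-level checks described above (bases with Q > 20, and removal of reads containing Ns) can be illustrated with a minimal Python sketch. It is not the Trimmomatic or Prinseq-Lite implementation; the function names are hypothetical, and Phred+33 quality encoding is assumed.

```python
from typing import Iterable, Iterator, Tuple

FastqRecord = Tuple[str, str, str]  # (header, sequence, quality string)


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield (header, sequence, quality) from four-line FASTQ records."""
    stripped = (line.strip() for line in lines)
    non_empty = (line for line in stripped if line)
    # Zipping the same iterator four times groups lines into records.
    for header, seq, _plus, qual in zip(non_empty, non_empty, non_empty, non_empty):
        yield header, seq, qual


def fraction_above_q(quality: str, threshold: int = 20, offset: int = 33) -> float:
    """Fraction of bases whose Phred score is strictly above the threshold.

    Assumes Phred+33 encoding (an assumption, not stated in the article).
    """
    if not quality:
        return 0.0
    passing = sum(1 for ch in quality if ord(ch) - offset > threshold)
    return passing / len(quality)


def drop_reads_with_n(records: Iterable[FastqRecord]) -> Iterator[FastqRecord]:
    """Keep only reads whose sequence contains no ambiguous base (N)."""
    for header, seq, qual in records:
        if "N" not in seq.upper():
            yield header, seq, qual
```

For example, `list(drop_reads_with_n(parse_fastq(open_file)))` would keep only N-free reads from an open FASTQ file handle, and `fraction_above_q` could be applied to each quality string to summarise how many bases exceed Q20.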

Transcriptome annotation
The clone transcripts were searched against the UniProtKB/Swiss-Prot protein database using BLASTX (version ncbi-blast-2.2.29+), with a cut-off e-value of 1e-5. The transcriptome was annotated using the BLAST2GO program [10] with the Gene Ontology (GO), Kyoto Encyclopedia of Genes and Genomes (KEGG) and Pfam databases, on the basis of the BLASTX output.
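As an illustration of the e-value cut-off, the sketch below keeps the best-scoring hit per transcript from BLAST tabular output. It assumes the default 12-column tabular layout (`-outfmt 6`), in which the e-value is column 11 and the bit score is column 12; the article does not state which output format was used, and the function name and identifiers in the usage note are hypothetical.

```python
from typing import Dict, Iterable, Tuple

# (subject id, e-value, bit score)
Hit = Tuple[str, float, float]


def best_hits(lines: Iterable[str], max_evalue: float = 1e-5) -> Dict[str, Hit]:
    """Return the highest bit-score hit per query among hits passing the cut-off.

    Assumes BLAST default tabular columns:
    qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
    """
    best: Dict[str, Hit] = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue > max_evalue:
            continue  # hit does not meet the e-value cut-off
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best
```

Given a row for a hypothetical transcript `tx1` with e-value `1e-30` and another with e-value `0.5`, only the first would be retained; transcripts with no hit at or below the cut-off would be absent from the result and treated as unannotated.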

Gene expression analysis
Expression profiles for each tissue sample were calculated from the relative abundance of transcripts, based on reads aligned to the assembled reference sequence with Bowtie2 [11]. The expression levels of transcripts were calculated using Cufflinks (v2.2.1). The output reports an expression value for each transcript with a 95% confidence interval.
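Cufflinks reports abundance as FPKM (fragments per kilobase of transcript per million mapped fragments); the article does not name the unit, so FPKM is an assumption here. The sketch below shows only the basic normalisation arithmetic, not the statistical model Cufflinks uses to assign fragments to transcripts or to derive confidence intervals. The function name is hypothetical.

```python
def fpkm(fragments: int, transcript_length_bp: int, total_mapped_fragments: int) -> float:
    """Fragments per kilobase of transcript per million mapped fragments.

    A simplified normalisation; Cufflinks itself estimates abundances with a
    likelihood model rather than from raw counts alone.
    """
    if transcript_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("length and total fragment count must be positive")
    # 1e9 = 1e3 (bases per kilobase) * 1e6 (fragments per million)
    return fragments * 1e9 / (transcript_length_bp * total_mapped_fragments)
```

Under this definition, a 1,000 bp transcript with 100 fragments in a library of one million mapped fragments has an FPKM of 100.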

Simple sequence repeat (SSR) and single-nucleotide polymorphism (SNP) discovery
Contigs containing microsatellites (SSRs) were identified using the MISA program [12], with the following minimum numbers of repeats: 10 for one base, six for two bases, and five for three, four, five and six bases; the maximum interruption (distance between two microsatellites) was 100 bases. Single-nucleotide polymorphism (SNP) calling was done using SAMtools, version 1.3.1 [13], to generate mpileup output for one or multiple BAM files. VarScan version 2.3.9 [14] was used to perform SNP detection using default parameters. Reads were mapped against the reference transcriptome and filtered by determining which bases differed from the reference. The putative SNPs selected were required to have a read depth equal to or greater than 10; the SNP reads/total reads ratio had to be equal to or greater than 0.25; the minimum Phred score of bases had to be 20; and the SNP quality had to be 50.
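The repeat thresholds and SNP filters above can be sketched in Python as follows. This is a simplified illustration, not the MISA or VarScan implementation: compound microsatellites (the 100-base interruption rule) are not handled, the function names are hypothetical, and the SNP quality of 50 is interpreted as a minimum, which is an assumption.

```python
import re
from typing import List, Tuple

# Minimum repeat count per motif length, as listed in the text.
MIN_REPEATS = {1: 10, 2: 6, 3: 5, 4: 5, 5: 5, 6: 5}


def is_primitive(motif: str) -> bool:
    """True if the motif is not itself a repetition of a shorter unit."""
    return motif not in (motif + motif)[1:-1]


def find_ssrs(seq: str) -> List[Tuple[int, str, int]]:
    """Return (start, motif, repeat count) for perfect SSRs in a sequence."""
    seq = seq.upper()
    hits = []
    for unit_len, min_rep in MIN_REPEATS.items():
        # A motif of unit_len bases followed by at least min_rep - 1 copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_rep - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            if is_primitive(motif):  # e.g. skip "AA" reported as a dinucleotide
                hits.append((match.start(), motif, len(match.group(0)) // unit_len))
    return sorted(hits)


def passes_snp_filters(depth: int, alt_reads: int, base_quality: int, snp_quality: float) -> bool:
    """Apply the depth, allele-ratio and quality thresholds from the text."""
    if depth < 10:
        return False
    return (
        alt_reads / depth >= 0.25
        and base_quality >= 20
        and snp_quality >= 50  # assumption: 50 is treated as a minimum
    )
```

For a candidate site covered by 40 reads, of which 10 support the variant, the allele ratio is exactly 0.25, so the site passes provided both quality thresholds are also met.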

Ethics statement
This article does not contain any studies with human participants or animals performed by any of the authors.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships which have, or could be perceived to have, influenced the work reported in this article.