Dataset for the combined transcriptome assembly of M. oleifera and functional annotation

In this paper, we present the data acquired during transcriptome analysis of the plant Moringa oleifera [1] from five different tissues (root, stem, leaf, flower and seed) by RNA sequencing. A total of 271 million reads were assembled with an N50 of 2094 bp. The combined transcriptome was assessed for transcript abundance across the five tissues. The protein-coding genes identified from the transcripts were annotated and used for orthology analysis. Further, enzymes involved in the biosynthesis of select medicinally important secondary metabolites and vitamins, as well as ion transporters, were identified and their expression levels across tissues were examined. The data generated by RNA sequencing have been deposited in the NCBI public repository under the accession number PRJNA394193 (https://www.ncbi.nlm.nih.gov/bioproject/PRJNA394193).



Value of the Data
• These data provide a transcriptome assembly of M. oleifera along with downstream analyses including relative abundance, orthology relationships and function assignment.
• The data offer a platform for identifying enzymes involved in the biosynthesis of secondary metabolites and vitamins, as well as ion transporters, with the help of an improved bioinformatics pipeline.
• The data will allow the scientific community to carry out additional analyses for commercial production of the secondary metabolites.

Data
The data reported here contain a combined transcriptome assembly of five different tissues (leaf, root, stem, seed and flower) from the drumstick tree (M. oleifera). A total of 17,148 proteins were identified from the set of 66,079 transcripts, assembled with an N50 of 2094 bp. The expression values of the 17,148 gene models were estimated by aligning this transcriptome data to the available M. oleifera genome [2]. Pfam [3] associations for predicted proteins were obtained for 14,624 (85.3%) proteins. Pfam domains were identified in 12,026 (~70%) of proteins. Additionally, more than 16,000 (~95%) proteins had homologues in the UniProt Viridiplantae database (Table 1, Supplementary Data). Orthology analysis was performed using two methods, OrthoMCL and ProteinOrtho. OrthoMCL analysis led to the formation of 7380 orthogroups common to the selected four species. In the ProteinOrtho analysis, 102 orthogroups were common to all 38 species, whereas 51 orthogroups were unique to C. papaya and M. oleifera (Fig. 1, Supplementary Data). The most abundant transcripts from the M. oleifera transcriptome were studied: their GO terms were obtained from the annotation data and enrichment analysis was performed (Fig. 2, Supplementary Data). A set of 36 candidate genes (involved in metabolite and vitamin synthesis, and ion transporters) was identified and their expression in each tissue was analysed (Fig. 3, Supplementary Data).

Transcriptome sequencing and assembly
RNA was isolated from the five samples using the Spectrum Plant total RNA kit (Sigma Aldrich), followed by treatment with Ambion-DNase1 (Thermofisher). RNA quality was assessed using a Bioanalyzer (Agilent Technologies), and samples with an RNA Integrity Number (RIN) > 7 were sequenced on an Illumina HiSeq 1000 in technical duplicates (ten libraries in total). Reads were processed using Trimmomatic (v0.35) [4] and 271 million reads were retained. The assembly was guided by the reference genome [2] using Trinity (v2.4.0) [5] with default parameters.
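The N50 statistic quoted for the assembly can be illustrated with a minimal sketch: N50 is the length of the shortest transcript in the smallest set of longest transcripts that together cover at least half of the total assembled bases. The function name `n50` and the toy lengths below are illustrative assumptions, not values from this dataset.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths (in bp)."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first length at which the running sum reaches half of the
        # total assembled bases is the N50.
        if running >= half_total:
            return length
    raise ValueError("no sequence lengths supplied")


# Toy lengths for illustration only (not the M. oleifera transcripts).
example_lengths = [100, 200, 300, 400, 500]
print(n50(example_lengths))  # 400
```

In practice the lengths would be read from the assembled transcript FASTA file.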

Gene identification and functional annotation
Gene identification was carried out for M. oleifera using MAKER (v2.31.9) [6]. Gene prediction was done through Augustus using gene models from Arabidopsis thaliana. Pfam domains [3,7] were identified in the proteins using HMMSCAN (HMMER v3.1) with an E-value cutoff of 0.01 and the Pfam library (Pfam version 31). Homologues were identified in the UniProt Viridiplantae database using BLAST (v2.7) at an E-value cutoff of 10⁻³.
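As an illustration of applying an E-value cutoff to homology search results, the sketch below keeps the best hit per query from BLAST tabular output. It assumes the standard 12-column tabular format (`-outfmt 6`), in which the E-value is the 11th column; the source does not state which output format was used, and the identifiers in the toy input are made up.

```python
import csv
import io


def best_hits(tabular_text, max_evalue=1e-3):
    """Keep the lowest-E-value hit per query at or below max_evalue."""
    best = {}
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank and comment lines
        query, subject, evalue = row[0], row[1], float(row[10])
        if evalue > max_evalue:
            continue
        if query not in best or evalue < best[query][1]:
            best[query] = (subject, evalue)
    return best


# Hypothetical hits: protA has two acceptable hits, protB has none.
toy_hits = (
    "protA\tsubj1\t90.0\t100\t10\t0\t1\t100\t1\t100\t1e-50\t200\n"
    "protA\tsubj2\t60.0\t100\t40\t0\t1\t100\t1\t100\t1e-10\t80\n"
    "protB\tsubj3\t30.0\t50\t35\t0\t1\t50\t1\t50\t0.5\t20\n"
)
print(best_hits(toy_hits))
```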

Orthology analysis
Orthology analysis was performed on the protein-coding genes identified from the transcripts of M. oleifera using OrthoMCL and ProteinOrtho. OrthoMCL (v2.0.9) was run on M. oleifera and four other plant species (Carica papaya, Theobroma cacao, Arabidopsis thaliana and Oryza sativa) at an E-value cutoff of 10⁻⁵. ProteinOrtho was run using M. oleifera proteins and 37 other proteomes of sequenced plant genomes (as described in Pasha et al. [1]) at an E-value cutoff of 10⁻¹⁰. All the proteomes were obtained from the Phytozome resource (v10.3.1) [8].
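Once orthogroups are available, counting those shared by every species or restricted to a particular pair reduces to set comparisons. The sketch below assumes each orthogroup has already been parsed into a list of (species, protein ID) pairs; the parsing of OrthoMCL or ProteinOrtho output is not shown, and the group and protein identifiers are hypothetical.

```python
def species_in(members):
    """Set of species represented in one orthogroup."""
    return {species for species, _protein in members}


def core_orthogroups(groups, species):
    """IDs of orthogroups with at least one member from every species."""
    required = set(species)
    return sorted(g for g, m in groups.items() if required <= species_in(m))


def exclusive_orthogroups(groups, species):
    """IDs of orthogroups whose members come from exactly these species."""
    wanted = set(species)
    return sorted(g for g, m in groups.items() if species_in(m) == wanted)


# Hypothetical orthogroups for illustration only.
toy_groups = {
    "OG1": [("M. oleifera", "m1"), ("C. papaya", "c1"), ("A. thaliana", "a1")],
    "OG2": [("M. oleifera", "m2"), ("C. papaya", "c2")],
    "OG3": [("A. thaliana", "a2")],
}
all_species = ["M. oleifera", "C. papaya", "A. thaliana"]
print(core_orthogroups(toy_groups, all_species))
print(exclusive_orthogroups(toy_groups, ["M. oleifera", "C. papaya"]))
```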

Differential expression of transcripts across five tissues and GO term enrichment analysis
Transcriptome reads from the ten libraries derived from five tissue samples were mapped to the reference genome [2] using Tophat [9] as described in Pasha et al. [1]. Gene models were generated from each library, and Fragments Per Kilobase of transcript per Million mapped reads (FPKM) values for each transcript were calculated using Cufflinks (v2.2.1) [10]. A merged assembly was created from the individual assemblies using the cuffmerge module. The differential expression, as log2(fold change), of each transcript across different tissues was calculated using the cuffdiff module [11].
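The two quantities named above can be written out as a minimal sketch using the textbook definitions. This is not the Cufflinks or cuffdiff implementation, which estimate abundances and test for differences with their own statistical models; the `pseudocount` parameter is an assumption added here to avoid taking the logarithm of zero, and the numbers are toy values.

```python
import math


def fpkm(fragments, transcript_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments."""
    return fragments * 1e9 / (transcript_length_bp * total_mapped_fragments)


def log2_fold_change(fpkm_a, fpkm_b, pseudocount=1.0):
    """log2 ratio of condition B over condition A (pseudocount is assumed)."""
    return math.log2((fpkm_b + pseudocount) / (fpkm_a + pseudocount))


# Toy numbers: 500 fragments on a 2000 bp transcript, 10 million mapped.
print(fpkm(500, 2000, 10_000_000))
print(log2_fold_change(3.0, 15.0))
```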
The top 100 most abundant transcripts in each tissue were examined for enrichment of GO terms as described in Pasha et al. [1]. The Blast2GO (v5.2) [12] package was used to assign the GO terms, with the significance of each GO term assessed at a p-value threshold of 0.05. The GO terms observed across tissues for the biological process, molecular function and cellular component categories were visualized using the REVIGO web server [13].
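One common way to test a GO term for over-representation in a list such as the top 100 transcripts is a one-sided hypergeometric (Fisher-type) test. The sketch below illustrates that general idea only; it is an assumption that a test of this kind applies here, and it does not reproduce the Blast2GO procedure. All counts are made up.

```python
from math import comb


def enrichment_p_value(k, n, K, N):
    """P(X >= k) for a hypergeometric draw.

    N: annotated transcripts in the background
    K: background transcripts carrying the GO term
    n: transcripts in the test set (e.g. the top 100)
    k: test-set transcripts carrying the GO term
    """
    upper = min(n, K)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / comb(N, n)


# Made-up counts: 15 of 100 top transcripts carry a term that is found
# on 50 of 1000 background transcripts.
p = enrichment_p_value(15, 100, 50, 1000)
print(p, p < 0.05)
```

When many GO terms are tested, a multiple-testing correction would normally be applied on top of the raw p-values.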

Proteins involved in synthesis of secondary metabolites, vitamins and transporters
A dedicated pipeline was developed to identify the enzymes involved in the biosynthesis of medicinally important secondary metabolites and vitamins. Protein queries for each enzyme were identified from the PlantCyc database [14]. These queries were aligned using Clustal Omega [15], and the alignment was used to jump-start PSI-BLAST (E-value: 10⁻⁵, 2 iterations) [16] to identify M. oleifera protein hits. Hits were validated using phylogenetic analysis and mapping of functionally important residues. The abundance of the transcripts encoding these proteins was checked across different tissues.
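The alignment and jump-start steps might be scripted roughly as follows. This sketch only assembles the command lines and does not run them. The file names and database name are hypothetical, and the exact options used by the authors are not given in the source. The flags shown (`clustalo -i/-o/--outfmt`, `psiblast -in_msa/-db/-evalue/-num_iterations/-outfmt/-out`) are standard options of Clustal Omega and NCBI BLAST+.

```python
import shlex


def clustalo_command(queries_fasta, alignment_out):
    """Command to align the query proteins for one enzyme."""
    return ["clustalo", "-i", queries_fasta, "-o", alignment_out,
            "--outfmt=fasta"]


def psiblast_command(alignment, database, hits_out,
                     evalue=1e-5, iterations=2):
    """Command to start PSI-BLAST from a multiple sequence alignment."""
    return ["psiblast", "-in_msa", alignment, "-db", database,
            "-evalue", str(evalue), "-num_iterations", str(iterations),
            "-outfmt", "6", "-out", hits_out]


# Hypothetical file and database names, for illustration only.
align_cmd = clustalo_command("enzyme_queries.fasta", "enzyme_queries.aln.fasta")
search_cmd = psiblast_command("enzyme_queries.aln.fasta",
                              "moringa_proteins", "enzyme_hits.tsv")
print(shlex.join(align_cmd))
print(shlex.join(search_cmd))
```

Each list can be passed to `subprocess.run(..., check=True)` on a system where both tools are installed and the protein database has been built.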
the Gandhi Krishi Vignan Kendra (GKVK), University of Agriculture research, Bangalore for providing plant sample. This work was supported by the grant from Department of Biotechnology, India ( BT/PR10550/BID/7/479/2013 ) and JC Bose fellowship ( SB/S2/JC-071/2015 ) from Science and Engineering Research Board to RS. We also thank NCBS (TIFR) for infrastructural and financial support.

Supplementary materials
Supplementary material associated with this article can be found, in the online version, at doi:10.1016/j.dib.2020.105416.