Data on large cardamom transcriptome associated with Chirke disease

Large cardamom (Amomum subulatum Roxburg), is an ancient spice native to North-Eastern India and Southeast Asia, which belongs to the family Zingiberaceae under the order Scitaminae. Large cardamom is mostly affected by a viral disease termed Chirke caused by Large Cardamom Chirke Virus (LCCV). These disease has spread due to drastic changes in the ecosystem, inadequate rain in dry months and absence of good agricultural practices by the farmers resulting in aphid infestations. In the present study, using HiSeq™ 2000 RNA sequencing technology transcriptome sequencing was performed for both control (disease not expressed) and diseased large cardamom leaf tissues. RNA-seq generated 77260968 (7.72 GB) and 72239708 (7.22 GB) paired raw reads for large cardamom control and diseased samples respectively. The raw data were submitted to the NCBI SRA database under the accession numbers SRX2529373 and SRX2529372 and the assembled transcriptomes were submitted to TSA under the accession numbers GIAV01000000 and GIAW01000000 for the control and diseased samples respectively. The raw reads were quality trimmed and assembled de novo using TRINITY assembler which created 156822 (control) and 148953 (diseased) contigs with N50 values 2107 (control) and 2182 (diseased). The data were used to identify the significantly differentially expressed genes between control and diseased samples.


a b s t r a c t
Large cardamom (Amomum subulatum Roxburg), is an ancient spice native to North-Eastern India and Southeast Asia, which belongs to the family Zingiberaceae under the order Scitaminae. Large cardamom is mostly affected by a viral disease termed Chirke caused by Large Cardamom Chirke Virus (LCCV). These disease has spread due to drastic changes in the ecosystem, inadequate rain in dry months and absence of good agricultural practices by the farmers resulting in aphid infestations. In the present study, using HiSeq™ 2000 RNA sequencing technology transcriptome sequencing was performed for both control (disease not expressed) and diseased large cardamom leaf tissues. RNA-seq generated 77260968 (7.72 GB) and 72239708 (7.22 GB) paired raw reads for large cardamom control and diseased samples respectively. The raw data were submitted to the NCBI SRA database under the accession numbers SRX2529373 and SRX2529372 and the assembled transcriptomes were submitted to TSA under the accession numbers GIAV01000000 and GIAW01000000 for the control and diseased samples respectively. The raw reads were quality trimmed and assembled de novo using TRINITY assembler which created 156822 (control) and 148953 (diseased) contigs with N50 values 2107 (control) and 2182 (diseased). The data were used to identify the significantly differentially expressed genes between control and diseased samples.

Data
The dataset contains raw sequencing data obtained through transcriptome sequencing of leaf samples of large cardamom (Amomum subulatum Roxburg). The data files were deposited at NCBI SRA database under project accession no. PRJNA369131. Information generated from the raw data and that of assembly are provided in Table 1  Transcriptome Shotgun Assembly for the control sample has been deposited at DDBJ/ EMBL/GenBank under the accession GIAV00000000. The version described in this paper is the first version, GIAV01000000. Transcriptome Shotgun Assembly for the diseased sample has been deposited at DDBJ/ EMBL/GenBank under the accession GIAW00000000. The version described in this paper is the first version, GIAW01000000.

Value of Data
Large cardamom is severely affected by many diseases, prominent among them is chirke viral infection which affects the crop productivity. Expression profiling could unravel the over-expression of R genes or genes related to plant stress tolerance. Transcriptome data generated from leaves of plants grown under specific conditions could provide information on the molecular mechanism underlying disease tolerance. Differential expression analysis of control and disease stressed large cardamom could compare the expression variation of particular genes in healthy and infected plants and can be utilized for several downstream applications.

Plant material
Transcriptome sequencing was carried out in leaf samples of large cardamom (Amomum subulatum Roxburg). Large cardamom chirke virus (LCCV) was not expressed in one of the samples which served as the control whereas the disease was expressed in the other sample. Leaf tissues from both sets were collected followed by immediate freezing in liquid nitrogen.

Total RNA isolation and transcriptome sequencing
RNA extraction was done using a modified protocol of the RNeasy Plant Mini Kit (Qiagen) and CTAB method [1] RNA integrity and quality analysis were done using 2100 BioAnalyzer (Agilent Technologies). Illumina sequencing was performed using the HiSeq™ 2000 platform as per the manufacturer's  instructions (Illumina, San Diego, CA). RNA-seq generated paired-end strand-specific 77260968 (101 bases) and 72239708 (101 bases) raw reads which correspond to 7.72 GB and 7.22 GB of sequence data for large cardamom control and diseased samples respectively.

De novo transcriptome assembly
Raw reads were first quality checked using the FastQC [2] tool and the different criteria were crosschecked to determine the integrity of the raw data and based on the quality control data it was determined to trim the raw reads of any adapters present in it. Adapter trimming was done using BBDuk [3] against Illumina universal adapters. Non-coding RNAs such as tRNAs, rRNAs, snRNAs, and snoRNAs were filtred using BBSplit [3] against all non-coding RNA sequences of viridiplantae collected from NCBI, based on further quality checking it was determined that the data was ready for assembly. De novo transcriptome assembly was performed using the Trinity [4] assembler program (Trinity Release v 2.8.5) utilizing three consecutive modules: Inchworm, Chrysalis, and Butterfly to generate contigs. The assembler created 156822 and 148953 contigs for control and infected large cardamom samples ( Table 1). The assembled transcripts were converted into peptides using Transdecoder [5] and the peptides were clustered using cd-hit [6] to produce non-redundant and representative sequences. Further statistical data were generated from the assembly by means of the QUAST tool [7].

Confirmation of chirke virus genome sequences in the assembled transcriptome
Virus genome sequences were fetched from NCBI (https://www.ncbi.nlm.nih.gov/nuccore/? term¼chirke) and found only 4 sequences for chirke (JN257715.1, MH899149.1, MH899148.1, and MH899147.1). These were aligned to both infected and control sequences using BLASTþ [8]. The Alignment generated 140 hits for the infected sequences. Whereas the control sequence showed one hit from all four of the sequences. This might be due to the dormant virus particles present in the control sequences or possible cross-contamination.

Quantification of peptides from the transcripts
A total of 156822 transcripts were generated from the control sample while 148953 were generated from the diseased. While converting the transcripts into peptides the control sample generated 76913 peptide sequences while the treatment generated 74060. The obtained peptides were clustered for non-redundancy which resulted in 30498 unique peptides being generated from control compared to the 29512 that were generated from the diseased (Fig. 1).