Data on RNA-seq analysis of the cocoa pod borer pest Conopomorpha cramerella (Snellen) (Lepidoptera: Gracillariidae)

The cocoa bean (Theobroma cacao L.) is part of the global cocoa and chocolate industry, valued at 44 billion US dollars in 2019. The cocoa pod borer (CPB), Conopomorpha cramerella, is a major pest of cocoa in Malaysia and Indonesia and is responsible for a decline in cocoa production. It has been detected since the 1980s. Unfortunately, current control strategies are inefficient for CPB management. Although biotechnological alternatives, including RNA interference (RNAi), have been proposed in recent years to control insect pests, characterizing the genetics of the target pest is essential for the successful application of these emerging technologies. To improve understanding of the molecular features of CPB, we generated a comprehensive RNA-seq dataset (135,915,430 clean reads) for the larval and adult stages of CPB using the Illumina HiSeq 4000 system. The CPB transcriptome was assembled de novo and annotated. The final assembly produced 249,280 unigenes, of which 75,929 were annotated against the NCBI NR database and were distributed among 156 KEGG pathways. The raw data were uploaded to the SRA database under BioProject ID PRJNA553611. This transcriptomic dataset is the first report of transcriptome information for CPB and is valuable for further exploration and understanding of CPB molecular pathways.



Value of the Data
• The RNA-seq data were obtained from the transcriptomes of cocoa pod borer (CPB) developmental stages. They will enable identification of genes that are differentially expressed between the developmental stages, revealing putative developmental pathways and mechanisms for further exploration.
• These data will benefit molecular biology researchers working on CPB gene discovery, characterization, and cloning.
• The CPB transcriptome data provide a solid foundation for studies of CPB development. The data are valuable for further studies to discover putative genes and proteins that control CPB development. Understanding the molecular mechanisms of CPB may lead to novel methods for controlling the pest.

Data Description
RNA-seq transcriptome data from CPB larval and adult stages, with three biological replicates, were obtained using the Illumina HiSeq 4000 sequencing platform. Raw reads from the sequencing have been uploaded to the NCBI Sequence Read Archive (SRA) database. Links and accession numbers for each sample's FASTQ file are listed in Table 1. A total of 135,915,430 clean reads were obtained from the Illumina sequencing. The number of reads for each sample is given in Table 2. The data were also assembled de novo into a full-length transcriptome (Table 3); the assembly can be replicated using the protocol in the methods section below.
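As a convenience for readers who want to locate the deposited runs programmatically, the following minimal sketch builds a search URL for the BioProject accession PRJNA553611. It assumes the public NCBI E-utilities `esearch` endpoint and its `db`, `term`, `retmax`, and `retmode` parameters; the helper name `sra_search_url` is hypothetical and not part of this dataset. The sketch only constructs the URL and does not send a request.

```python
from urllib.parse import urlencode

# Assumption: the public NCBI E-utilities esearch endpoint.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# BioProject accession reported for this dataset.
BIOPROJECT_ID = "PRJNA553611"


def sra_search_url(bioproject: str, retmax: int = 100) -> str:
    """Build an esearch URL listing SRA records linked to a BioProject.

    Hypothetical helper for illustration; it performs no network access.
    """
    params = {
        "db": "sra",                          # search the SRA database
        "term": f"{bioproject}[BioProject]",  # restrict to this BioProject
        "retmax": retmax,                     # maximum number of IDs returned
        "retmode": "json",                    # ask for a JSON response
    }
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


url = sra_search_url(BIOPROJECT_ID)
print(url)
```

The per-sample accession numbers in Table 1 remain the authoritative list; a query like this is only a quick way to cross-check them.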

Insect collection and rearing
Cocoa pod borer (CPB) adults and larvae were collected from cocoa pods at the Malaysian Cocoa Board Bagan Datuk Plantation, Perak, Malaysia (3.894131° N, 100.8642093° E). Cocoa pods showing premature yellowing of the husk, a characteristic symptom of CPB infestation, were collected in the field.

cDNA library construction and high-throughput sequencing
Each sample was homogenized with liquid nitrogen in a mortar and dissolved in 1 ml of TRI Reagent (Thermo Fisher Scientific) per 100 mg of tissue. Total RNA was purified using the RNeasy Mini Kit (Qiagen, Inc., Valencia, CA) following the manufacturer's RNA extraction protocol with modification. Residual genomic DNA was removed using the DNA-free™ DNA Removal Kit (Invitrogen) according to the manufacturer's instructions. RNA quality was assessed using a NanoDrop 1000 spectrophotometer (Thermo Scientific, USA). The OD 260/280 values of each RNA sample were between 1.8 and 2.0, indicating sufficient quality. Finally, the integrity of the total RNA samples was evaluated using an Agilent 2100 Bioanalyzer (Agilent Technologies, USA), with an expected RNA integrity number (RIN) threshold of 7.0. Poly(A) RNA was isolated using the NEBNext Poly(A) mRNA Magnetic Isolation Module, and libraries were prepared using the NEBNext Ultra Directional RNA Library Prep Kit for Illumina, both following the manufacturer's protocols.
Library construction and sequencing were performed by a commercial service provider, Novogene Med NGS Clinical Laboratory (Tianjin, China), on an Illumina HiSeq 4000 with 150-bp paired-end (PE) reads (Table 1).
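The RNA quality criteria described above (OD 260/280 between 1.8 and 2.0, and a RIN threshold of 7.0) can be expressed as a simple check. The sketch below is illustrative only: it assumes the RIN threshold is a minimum, and the sample names and measurements are made up for the example and are not values from this dataset.

```python
from dataclasses import dataclass


@dataclass
class RnaSample:
    name: str          # hypothetical sample label
    od_260_280: float  # absorbance ratio from the spectrophotometer
    rin: float         # RNA integrity number from the Bioanalyzer


def passes_rna_qc(sample: RnaSample,
                  od_range: tuple = (1.8, 2.0),
                  min_rin: float = 7.0) -> bool:
    """Return True if the sample meets both quality criteria.

    Assumption: the RIN threshold of 7.0 is treated as a minimum.
    """
    low, high = od_range
    return low <= sample.od_260_280 <= high and sample.rin >= min_rin


# Made-up measurements for illustration, not data from the study.
samples = [
    RnaSample("larva_rep1", 1.95, 8.2),  # passes both criteria
    RnaSample("adult_rep1", 1.72, 7.5),  # OD 260/280 below 1.8
    RnaSample("adult_rep2", 1.90, 6.4),  # RIN below 7.0
]

passed = [s.name for s in samples if passes_rna_qc(s)]
print(passed)
```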

Data filtering
Weak signals and low-quality sequences were removed; read ends were also screened and trimmed for Illumina adaptor sequences by the sequencing company. Approximately 142 million (142,027,810) reads were obtained, yielding over 42.7 Gb of paired-end data. Raw reads containing ambiguous 'N' nucleotides (with a ratio of 'N' greater than 10%) and low-quality sequences (with a quality score less than 5) were removed using in-house software (a service provided by Novogene) to obtain clean reads (Table 2). Next, adaptor sequences were removed from the raw reads using Trimmomatic (version 0.36) [1]. Thorough quality control of the trimmed reads was performed using FastQC [2], a Java program that provides summary statistics for FASTQ files.
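The read-level filtering criteria above can be sketched as follows. This is not the in-house Novogene software, whose implementation is not described here; it is a minimal illustration under stated assumptions: quality strings use Phred+33 encoding, and a read counts as low quality when more than half of its bases score below 5 (the 50% cutoff is an assumption, since the text gives only the score threshold). The example reads are made up. The last lines compute the fraction of raw reads retained, using the read counts reported in this article.

```python
from typing import Iterable, Iterator, Tuple

Read = Tuple[str, str, str]  # (read name, sequence, quality string)


def n_fraction(seq: str) -> float:
    """Fraction of ambiguous 'N' bases in a sequence."""
    return seq.upper().count("N") / len(seq) if seq else 1.0


def low_quality_fraction(qual: str, min_q: int = 5,
                         phred_offset: int = 33) -> float:
    """Fraction of bases with Phred score below min_q (Phred+33 assumed)."""
    if not qual:
        return 1.0
    low = sum(1 for ch in qual if ord(ch) - phred_offset < min_q)
    return low / len(qual)


def keep_read(seq: str, qual: str, max_n_fraction: float = 0.10,
              min_q: int = 5, max_low_q_fraction: float = 0.5) -> bool:
    """Keep a read unless 'N' exceeds 10% or it is mostly low quality.

    max_low_q_fraction is an assumption, not a value from the article.
    """
    return (n_fraction(seq) <= max_n_fraction
            and low_quality_fraction(qual, min_q) <= max_low_q_fraction)


def filter_reads(reads: Iterable[Read]) -> Iterator[Read]:
    """Yield only the reads that pass both filters."""
    for name, seq, qual in reads:
        if keep_read(seq, qual):
            yield name, seq, qual


# Made-up reads for illustration ('I' is Q40 and '#' is Q2 in Phred+33).
example_reads = [
    ("r1", "ACGTACGTAC", "IIIIIIIIII"),  # clean read
    ("r2", "ANNTACGTAC", "IIIIIIIIII"),  # 20% N, removed
    ("r3", "ACGTACGTAC", "######IIII"),  # 60% of bases below Q5, removed
    ("r4", "ACGTACGTAN", "IIIIIIIII#"),  # exactly 10% N, kept
]
kept = [name for name, _, _ in filter_reads(example_reads)]
print(kept)

# Read counts reported in this article.
RAW_READS = 142_027_810
CLEAN_READS = 135_915_430
retained = CLEAN_READS / RAW_READS
print(f"Fraction of raw reads retained: {retained:.1%}")
```

With the reported counts, about 95.7% of the raw reads are retained as clean reads.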

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships which have, or could be perceived to have, influenced the work reported in this article.