Characterization of the complete mitogenome data of Ischyja marapok (Lepidoptera: Noctuoidea: Erebidae) from Malaysia

Ischyja marapok is a moth species of the genus Ischyja, a member of the lepidopteran family Erebidae. Erebidae is a highly variable family and the largest in Noctuoidea; however, mitogenome data for the genus Ischyja are scarce. Hence, the mitochondrial genome of Ischyja marapok from Malaysia was completely sequenced using the next-generation sequencing platform Illumina NovaSeq 6000 and analyzed. The mitogenome has a sequence length of 15,421 bp and consists of 13 protein-coding genes (PCGs), 22 transfer RNAs (tRNAs), 2 ribosomal RNAs (rRNAs) and a control region. The mitogenome is A + T biased (80.6%), with a base composition of A (39.2%), T (41.4%), C (11.9%) and G (7.5%). Of the 13 PCGs, 12 are initiated by a standard ATN codon; the exception is COX1, which uses CGA as its start codon. Two PCGs are terminated with an incomplete stop codon, T, while the others end with a TAA codon. Phylogenetic tree analyses showed that the sequenced I. marapok resides within the subfamily Erebinae and is closely related to Ischyja manlia (MW664367), with high bootstrap support and posterior probabilities. This dataset presents the mitogenome of I. marapok from Malaysia, which is valuable for further research on its phylogeny and on the diversification of the genus Ischyja. The dataset can also be used as a reference for assessing environmental changes in terrestrial ecosystems via environmental DNA approaches. The mitogenome of I. marapok is available in GenBank under the accession number ON165249.



Value of the Data
• The mitogenome data presented here provide the complete and novel mitochondrial genome of I. marapok, from the Lepidoptera family Erebidae, originating from Malaysia.
• This dataset provides useful information for other researchers who are working on assembling and annotating the mitogenomes of Erebidae species.
• The dataset can also be used to further analyze the phylogenetic relationships within the family Erebidae and the phylogenetic position of the genus Ischyja.
• Additionally, because I. marapok is a bioindicator species, the mitogenome data will benefit researchers who are applying environmental DNA (eDNA) approaches to biodiversity monitoring.

Objective
Ischyja marapok is a moth species from the family Erebidae, the largest family in Noctuoidea. Owing to advances in next-generation sequencing technologies, approximately 220 complete mitogenome sequences of Erebidae species have been published in the NCBI database; however, mitogenome data for the genus Ischyja are scarce. Additionally, the genus Ischyja has been placed under incertae sedis in the family Erebidae, and more sampling is needed to improve its placement within the family [1,2]. To date, only one complete mitogenome has been reported in the NCBI database for this genus, originating from India [1]; none has been reported from Malaysia. Therefore, this work aims to generate and characterize the complete mitogenome of I. marapok originating from Malaysia and to determine its phylogenetic position in Erebidae.

Data Description
The complete mitochondrial genome of I. marapok is 15,421 bp in length and comprises 13 protein-coding genes (PCGs), 22 transfer RNAs (tRNAs), 2 ribosomal RNAs (rRNAs) and a control region (Fig. 1). Using the Illumina NovaSeq 6000 sequencing technology, we obtained a total of 10,122,328 raw reads, and the final mitogenome has a depth of coverage of 109.5X (Table 1). The mitogenome is A + T biased (80.6%), with a nucleotide composition of A (39.2%), T (41.4%), C (11.9%) and G (7.5%) (Table 2). Across the whole mitogenome, T occurs more often than A and C more often than G, giving an AT-skew of −0.027 and a GC-skew of −0.227. The same pattern was found in the control region, where T exceeds A and C exceeds G.

Fig. 1. Circular mitogenome map of I. marapok generated with OGDraw [3]. The outer circle indicates the heavy strand, while the inner circle indicates the light strand. The arrows represent the direction of transcription, and the inner gray ring shows the GC content of the mitogenome. CR represents the control region.

The mitogenome has the gene order trnM-trnI-trnQ, located between the control region and NAD2, which has been observed in most Lepidoptera mitogenomes. The 13 protein-coding gene sequences have a total length of 11,214 bp, while the transfer RNAs total 1458 bp. The 12S and 16S rRNAs are 821 bp and 1276 bp long, respectively. Most of the genes are located on the heavy strand. On the heavy strand, 9 PCGs (COX1, COX2, COX3, NAD2, NAD3, NAD6, ATP8, ATP6, CYTB) and 15 tRNAs (trnM, trnI, trnW, trnL2, trnK, trnD, trnG, trnA, trnR, trnN, trnS1, trnE, trnT, trnS2, trnH) were observed. The light strand carries the remaining 4 PCGs (NAD1, NAD4, NAD4l, NAD5), 7 tRNAs (trnQ, trnC, trnY, trnF, trnP, trnL1, trnV) and 2 rRNAs (12S and 16S). Of the 13 PCGs, 12 are initiated by a standard ATN start codon (ATT, ATG, ATA); the exception is COX1, which uses the CGA codon. Two PCGs (COX1 and NAD4) are terminated by an incomplete stop codon, T, while the rest of the PCGs end with a TAA stop codon (Table 3).

The control region of I. marapok, also known as the AT-rich region, is located between 12S rRNA and trnM and spans a total length of 208 bp. The conserved motif 'ATAGA' was detected close to the 12S rRNA, followed by a 20 bp poly-T stretch, and microsatellite-like AT elements were detected after the motif 'ATTTA'. A poly-A stretch was also detected upstream of trnM. Additionally, two tandem repeats were detected, as shown in Fig. 2.

Phylogenetic analyses based on Maximum-Likelihood (ML) and Bayesian Inference (BI) were performed to determine the phylogenetic position of I. marapok in the family Erebidae. Thirteen concatenated protein-coding genes from 43 Lepidoptera mitogenomes (including the newly sequenced I. marapok) were used in the analysis (Table 4). Both the ML and BI analyses yielded an identical tree topology, but with different branch lengths (ML = 0.3, BI = 0.4). As shown in Fig. 3, the newly sequenced Ischyja marapok is clustered within the subfamily Erebinae and is phylogenetically closest to Ischyja manlia (MW664367), with a high bootstrap value (ML = 100%) and posterior probability (PP = 1.0). A BLAST analysis of the I. marapok (ON165249) and I. manlia (MW664367) mitogenomes showed that I. marapok is 94.72% similar to the I. manlia sequence deposited in NCBI GenBank. Additionally, a BLAST analysis of the COX1 sequence of I. marapok (ON165249) against other available COX1 sequences of similar species in the databases (GenBank and BOLD) showed similarities of 99.39% to 99.85%.
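The strand assignment reported in this section can be cross-checked with a simple tally. The sketch below is only an illustration: the gene names are copied from the lists given in the text, while the variable and function names are our own and are not part of any annotation tool used in this work.

```python
# Gene names copied from the strand description in the Data Description.
# All variable and function names here are illustrative assumptions.
HEAVY_PCGS = ["COX1", "COX2", "COX3", "NAD2", "NAD3", "NAD6", "ATP8", "ATP6", "CYTB"]
LIGHT_PCGS = ["NAD1", "NAD4", "NAD4l", "NAD5"]
HEAVY_TRNAS = ["trnM", "trnI", "trnW", "trnL2", "trnK", "trnD", "trnG", "trnA",
               "trnR", "trnN", "trnS1", "trnE", "trnT", "trnS2", "trnH"]
LIGHT_TRNAS = ["trnQ", "trnC", "trnY", "trnF", "trnP", "trnL1", "trnV"]
LIGHT_RRNAS = ["12S", "16S"]


def tally_genes():
    """Count genes per type and per strand from the lists above."""
    return {
        "PCGs": len(HEAVY_PCGS) + len(LIGHT_PCGS),
        "tRNAs": len(HEAVY_TRNAS) + len(LIGHT_TRNAS),
        "rRNAs": len(LIGHT_RRNAS),
        "heavy_strand": len(HEAVY_PCGS) + len(HEAVY_TRNAS),
        "light_strand": len(LIGHT_PCGS) + len(LIGHT_TRNAS) + len(LIGHT_RRNAS),
    }


counts = tally_genes()
for label, value in counts.items():
    print(f"{label}: {value}")
```

Counting the listed names gives 13 PCGs, 22 tRNAs and 2 rRNAs in total, with 24 genes on the heavy strand and 13 on the light strand, in agreement with the overall gene content of the mitogenome.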

Sampling, DNA extraction and data pre-processing
The adult sample of I. marapok (voucher no: DIM052) was collected from Ayer Hitam Forest Reserve, Johor, Malaysia (2.03° N, 102.49° E). Genomic DNA was extracted from hind leg tissue using the Qiagen Blood and Tissue Kit (Qiagen, Valencia, CA) and then fragmented using a Bioruptor® system ( https://www.diagenode.com/en/categories/bioruptor-shearing-device ). The library was prepared using the NEBNext® Ultra™ II DNA Library Prep Kit for Illumina®, following the manufacturer's instructions, and was sequenced on the Illumina NovaSeq 6000 in 150 bp paired-end mode. The raw reads were first assessed for quality using the FastQC program ( https://www.bioinformatics.babraham.ac.uk/projects/fastqc/ ) and were trimmed of sequencing adapters using AdapterRemoval v2.3.2 [4]. The trimmed reads displayed a quality score of more than 24 and were therefore retained.
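For readers unfamiliar with read quality scores, the following minimal sketch shows how a Phred score relates to a FASTQ quality string and to the base-calling error probability. It assumes the standard Phred+33 encoding and uses the mean score of a read as the summary statistic; the text does not state which FastQC statistic the threshold of 24 refers to, so the `passes_threshold` helper is a hypothetical illustration rather than the procedure used here.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_string]


def mean_phred(quality_string, offset=33):
    """Mean Phred score of one read; raises on an empty quality string."""
    scores = phred_scores(quality_string, offset)
    if not scores:
        raise ValueError("empty quality string")
    return sum(scores) / len(scores)


def error_probability(q):
    """Base-calling error probability implied by a Phred score Q: 10^(-Q/10)."""
    return 10 ** (-q / 10)


def passes_threshold(quality_string, threshold=24):
    """Hypothetical filter: keep a read whose mean Phred score exceeds the threshold."""
    return mean_phred(quality_string) > threshold


# Toy quality string: 'I' encodes Q40 and '5' encodes Q20 under Phred+33.
example = "IIII5555"
print(mean_phred(example), passes_threshold(example))
print(error_probability(24))
```

Under this relationship, a Phred score of 24 corresponds to an error probability of roughly 0.4% per base.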

Mitogenome assembly, annotation and data analysis
Using a seed sequence from BOLD public data (Sequence ID: LEPKA953-09.COI-5P) as the reference, the mitogenome was assembled with NOVOPlasty v.4.2 [5] and then run through the PALEOMIX BAM pipeline [6] (default parameters) to assess read mapping to the mitogenome. Next, the mitogenome was annotated using the MITOS v2 web server [7]. To improve the annotation, the predicted proteins were verified using the Open Reading Frame (ORF) Finder server ( https://www.ncbi.nlm.nih.gov/orffinder/ ), followed by alignment visualization in Jalview 2 v11.1.4 [8]. Subsequently, Tablet software [9] was used to manually check for indels and sequence coverage. The total base composition was determined using BioEdit software [10], and the skews were calculated using the formulas AT skew = (A − T)/(A + T) and GC skew = (G − C)/(G + C). The circular mitogenome map of I. marapok was generated with OGDraw [3]. Tandem repeats in the control region were predicted using the Tandem Repeats Finder tool ( https://tandem.bu.edu/trf/basic_submit ) with basic parameters.
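As a worked illustration of the two skew formulas, the short sketch below applies them to the whole-mitogenome base composition reported in the Data Description. The function and variable names are our own; in this work the composition itself was obtained with BioEdit, not with this script.

```python
def at_gc_skew(a, t, g, c):
    """Return (AT skew, GC skew) from base counts or percentages.

    AT skew = (A - T) / (A + T); GC skew = (G - C) / (G + C).
    """
    return (a - t) / (a + t), (g - c) / (g + c)


# Whole-mitogenome base composition (%) reported in the Data Description.
composition = {"A": 39.2, "T": 41.4, "C": 11.9, "G": 7.5}

at_skew, gc_skew = at_gc_skew(
    composition["A"], composition["T"], composition["G"], composition["C"]
)
print(f"AT skew = {at_skew:.3f}")
print(f"GC skew = {gc_skew:.3f}")
```

With these percentages the formulas give (39.2 − 41.4)/80.6 ≈ −0.027 and (7.5 − 11.9)/19.4 ≈ −0.227, matching the reported skew values. Because both formulas are ratios, they return the same result for raw base counts and for percentages.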

Phylogenetic analyses
Forty-two available Lepidoptera mitogenomes from the families Erebidae and Lasiocampidae (Superfamily Bombycoidea) were downloaded from NCBI GenBank (Table 4); the ingroup consists of representatives of the five recognized subfamilies in Erebidae [2]. Dendrolimus spectabilis (KU558688) and Euthrix laeta (KU870700) from the family Lasiocampidae were used as outgroups. Prior to the phylogenetic analyses, the protein-coding genes of each Lepidoptera mitogenome were extracted using PhyloSuite v1.2.2 [11] and aligned using MAFFT [12]. Next, ambiguous sites in the 13 protein-coding genes were removed with Gblocks using default settings [13], and the alignments were concatenated. PartitionFinder v2.1.1 was used to determine the best partitioning scheme for the dataset [14]. The Maximum-Likelihood (ML) analysis was performed on the IQ-Tree web server [15] with 5000 ultrafast bootstrap replicates, and the best-fit model was determined by ModelFinder [16]. The Bayesian Inference (BI) tree was built using MrBayes [17], run for 10,000,000 generations with 4 chains and sampled every 1000 generations until the average standard deviation of split frequencies was less than 0.01. Tracer v1.7.2 was used to ensure sufficient parameter sampling and that the effective sample size (ESS) was more than 200 [18]. The resulting trees were visualized using FigTree v1.4.4 ( http://tree.bio.ed.ac.uk/software/figtree/ ).
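The concatenation step can be pictured as joining the per-gene alignments end to end while recording where each gene starts and ends, so that partitions can later be defined on the supermatrix. The sketch below is a simplified stand-in for illustration only: in this work the genes were handled with PhyloSuite, MAFFT and Gblocks, and the function name, taxon labels and toy sequences shown here are hypothetical.

```python
def concatenate_alignments(alignments, gene_order):
    """Concatenate per-gene alignments into one supermatrix.

    alignments: {gene: {taxon: aligned_sequence}}
    Returns (supermatrix, partitions), where partitions is a list of
    (gene, start, end) tuples with 1-based inclusive coordinates.
    """
    taxa = sorted(alignments[gene_order[0]])
    pieces = {taxon: [] for taxon in taxa}
    partitions = []
    position = 0
    for gene in gene_order:
        block = alignments[gene]
        if sorted(block) != taxa:
            raise ValueError(f"{gene}: taxon set differs from {gene_order[0]}")
        lengths = {len(seq) for seq in block.values()}
        if len(lengths) != 1:
            raise ValueError(f"{gene}: sequences are not aligned to equal length")
        length = lengths.pop()
        for taxon in taxa:
            pieces[taxon].append(block[taxon])
        partitions.append((gene, position + 1, position + length))
        position += length
    supermatrix = {taxon: "".join(parts) for taxon, parts in pieces.items()}
    return supermatrix, partitions


# Toy input with hypothetical taxa and very short made-up alignments.
toy = {
    "COX1": {"taxon_A": "ATGAAA", "taxon_B": "ATGA-A"},
    "NAD4": {"taxon_A": "ATT", "taxon_B": "ATA"},
}
matrix, parts = concatenate_alignments(toy, ["COX1", "NAD4"])
print(matrix)
print(parts)
```

The recorded coordinates are the kind of information a partitioning tool needs in order to treat each gene, or each codon position within a gene, as a separate data block.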

Ethics Statements
This work did not involve human subjects, animal experiments, or data collected from social media platforms.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.