The draft genome dataset of the American cockroach, Periplaneta americana (Linnaeus, 1758) (Blattidae: Blattinae)

Periplaneta americana is a cosmopolitan pest cockroach endemic to tropical and subtropical climates. It occurs frequently in urban sewer and wastewater system and transit in human proximities, spreading pathogens that causes serious public health concerns such as asthma, allergies, and others. By using the Next-generation Sequencing (NGS) known as Illumina NovaSeq 6000, this article documents for the draft genome data set of P. americana collected in Penang Island, Malaysia. This article displays the pair-end 150 bp genome dataset and results on the sequence quality. This genome dataset presents the information for further understanding of P. americana populations at molecular level and the opportunity to develop effective control and management strategies for the species. This dataset is available under Sequence Read Archive (SRA) databases with the SRR23867103.


Specifications
Biological Sciences Specific subject area Entomology and insect science, Genomics: Next generation sequencing (NGS) reads of genome Type of data Figures and raw reads of sequenced genome How the data were acquired Paired end reads of P. americana genome were sequenced using Illumina NovaSeq 60 0 0 Data format Raw sequencing data: FASTQ file. Description of data collection Periplaneta americana was baited using a baited glass jar with beer-soaked bread. DNA was extracted using HiYield Plus TM Genomic DNA Mini Kit (Blood/tissue/cultured cells) (

Value of the Data
• Periplaneta americana represents a pest species endemic to tropical and subtropical climates and can be found inhabiting in urban areas, especially in sewers. • Results obtained are particularly useful for urban and medical entomologist in Malaysia.
• The P. americana genome data could be used for the development of species-specific microsatellite marker. • Further study could potentially contribute to pest control and management approach according to genetic diversity of P. americana .

Objective
The American cockroach, or also known as Periplaneta americana ( P. americana ), is predominantly found indoors in warm, humid regions. In addition, unlike German cockroaches which in general do not reproduce outside human structures, the American cockroaches has the capabilities to maintain their large populations in outdoor niches [1] . Particularly by which, the P. americana can reproduce in locations where there are accumulations of waters and of these, several studies implicated sewer systems as the main source of urban American cockroaches [2,3] . This abundance in close proximity to urban areas has shown as a substantial threat to home allergens and human public health [4] similar to mosquito [5] . Despite this, however, the knowledge of the species is still limited due to the lack of genome-related data. To develop effective management strategies for sewer systems and control strategies for the species, it is crucial to study their urban populations at molecular level. Hence, this study aimed to draft the whole genome sequences of P. americana using Next-generation Sequencing (NGS) via Illumina NovaSeq 60 0 0.

Data Description
This article described a dataset of whole-genome pair-end sequencing result of BioSample SAMN33762737 which is under the BioProject PRJNA944812 . This data is registered under the Sequence Read Archive (SRA) databases with the accession number SRR23867103 . The data set comprised of two high throughput sequencing fastq files: PA1_DKDN220024683-1A_HNVCCDSX5_L4_1.fq and PA1_DKDN220024683-1A_HNVCCDSX5_ L4_2.fq contained half of the full sequence in a total of 36,359,668 raw reads with 150bp each. PA1_DKDN220024683-1A_HNVCCDSX5_L4_1.fq composed of the 1 st -150 th base position meanwhile the 151 th -300 th base position composed of from PA1_DKDN220024683-1A_HNVCCDSX5_L4_2.fq of each sequence.
In this manuscript, characteristics and quality of dataset is presented. Firstly, the dataset's quality score scored between Q30 to Q40, whereby Q30 showed 99.9% of the correct base meanwhile Q40 showed 99.99% of correct base ( Fig. 1 ). Secondly, the rate of the single base error rate distribution along reads is under 0.04% ( Fig. 2 ). The total GC content represented at 36.69% ( Fig. 3 ). Next, the sequencing data filtration can be seen that out of the 36,359,668 raw reads, in percentage, 99.24% are clean reads and 0.76% reads related to the adapter sequence ( Fig. 4 ). Lastly, the forward adapter sequence is 5 -AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGT GGTCGCCGTATCATT-3', and the reverse adapter sequence is 5 -GATCGGAAGAGCACACGTCTGAA CTCCAGTCACGGATGACTATCTCGTATGCCGTCTTCTGCTTG-3 .

Sampling and DNA extraction
The American cockroach, Periplaneta americana , was collected using a baited glass jar with beer-soaked bread and was left in a sewer manhole shaft overnight from late evening till the next morning [4] . After collection, the P. americana was then freeze killed and stored in 95% ethanol under -20 °C. An individual of P. americana was used for the genomic DNA (gDNA) extraction. The gDNA extraction was extracted solely from the leg tissue in order to reduce the possibility of DNA extraction contamination from endosymbionts [6] . The gDNA extraction was conducted using HiYield Plus TM Genomic DNA Mini Kit (Blood/Tissue/Cultured Cells) (Real Biotech Corp., Taipei, Taiwan) according to the manufacturer's instruction. The leg tissues were vortexed by using MX-S Dragon Lab Single-head Vortex Mixer in lysis buffer with Proteinase K and incubated for 1 hour at 60 °C by using MINIC-100 Mini Dry Bath. Elution was then be carried out twice using 50 μL elution buffers after the DNA binds to the filter column through an ethanol wash to get a total of 100 μL gDNA solution [7] . The extracted gDNA was then quantified by using OPTIZEN TM NanoQ Lite Microvolume Spectrophotometer.

Library preparation and sequencing
The general workflow of library construction is as shown in Fig. 5 . The genomic DNA was randomly sheared into short fragments and the fragments obtained were then end repaired, A-tailed, and further ligated with Illumina adapters for sequencing. Next, the fragments with adapters were amplified, size selected, and purified.
Subsequently after that, the library was checked with Qubit and real-time PCR for quantification and bioanalyzer for size distribution detection. The quantified libraries will be pooled and sequenced on Illumina platforms, according to effective library concentration and data amount required [7] .

Genome assembly
The Illumina sequencing data of Periplaneta americana were assembled, and an assembly with a total length of 15,0 64,6 61 bp and a contig N50 of 2089 bp was yielded ( Table 1 ).

Functional annotation of protein coding genes
Predicted protein sequences from this data were functionally annotated to EggNog mapper (Evolutionary Genealogy of Genes: Non-supervised Orthologous Groups) with a minimum Evalue of 0.001 whereby 423 queries were scanned [8] . The functional annotation of genes was by performing mapping against public database, COG, the Clusters of Orthologous Groups ( Table 2 ).

Ethics Statements
Our work does not involve studies with humans. Our work involves studies with animals, specifically the American Cockroaches, Periplaneta americana .

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data Availability
Raw genome sequence of Periplaneta americana (Original data) (NCBI-GeneBank