A Chromosome-Level Genome Assembly and Annotation for Oecanthus rufescens (Orthoptera: Oecanthidae)

Abstract Oecanthus is a genus of crickets known for its distinctive chirping and distributed across the major zoogeographical regions worldwide. This study focuses on Oecanthus rufescens and examines its genome through genome sequencing and bioinformatic analysis. A high-quality chromosome-level genome of O. rufescens was obtained. The genome size is 877.9 Mb, comprising ten pseudo-chromosomes and 70 other sequences, with a GC content of 41.38% and an N50 value of 157,110,771 bp, indicating a high level of continuity. BUSCO assessment shows that the assembly is highly complete, with 98.4% of BUSCO genes detected (of which 96.8% are single-copy and 1.6% are duplicated). Comprehensive genome annotation was also performed, identifying approximately 310 Mb of repetitive sequences, accounting for 35.3% of the total genome sequence, as well as 15,481 tRNA genes, 4,082 rRNA genes, and 1,212 other noncoding genes. Furthermore, 15,031 protein-coding genes were identified, and BUSCO assessment of the annotated gene set detected 97.8% of BUSCO genes (of which 96.3% are single-copy and 1.6% are duplicated).


Introduction
The Grylloidea currently consists of around 6,300 extant species, making it the third-largest group within the Orthoptera (Cigliano et al. 2024). The genus Oecanthus Serville, 1831, classified within the family Oecanthidae (Grylloidea), currently includes 78 species worldwide. Oecanthus species are characterized by their slender bodies, thin and translucent wings, and typically yellow or green coloration, and are commonly found on tree leaves, grasses, and shrubs. The genus is distributed widely across the world's major zoogeographical regions, suggesting a long global evolutionary history and a stable capacity for environmental adaptation. One of their distinctive features is that males produce sound by rubbing their wings together, with various chirping types such as calling, courtship, aggression, warning, and triumph (Wagner 1996; Hirtenlehner and Römer 2014). Although the morphology of this cricket group is relatively conserved, its chirping is diverse, which enables researchers to distinguish cricket species by their calls. Nevertheless, the majority of studies on chirping have concentrated on taxonomy, with little attention paid to the genetic level (Toms and Otte 1988; Collins and Schneider 2020; Collins et al. 2021). By mapping and analyzing the genes and regulatory networks involved in chirping, it is possible to gain a deeper understanding of the genetic control of chirping and the genetic mechanisms underlying its variability. Furthermore, it is possible to identify which genes are activated or repressed during specific chirping behaviors. This information helps establish a direct link between behavioral phenotypes and molecular mechanisms, thereby further elucidating the genetic regulation of behavior. A high-quality genome is therefore the foundation and prerequisite for such research.
Currently, the National Center for Biotechnology Information (NCBI) has publicly released genome data for six species within the Grylloidea, including four species from the Gryllidae: Acheta domesticus Linnaeus, 1758 (Gupta et al. 2020), Gryllus bimaculatus De Geer, 1773 (Ylla et al. 2021), Teleogryllus occipitalis Serville, 1838 (Kataoka et al. 2020), and Teleogryllus oceanicus Le Guillou, 1841 (Kataoka et al. 2022). Additionally, two species from the Trigonidiidae are included: Apteronemobius asahinai Yamasaki, 1979 (Satoh et al. 2021), and Nudilla kohalensis Otte, 1994 (Blankers et al. 2018; Ylla et al. 2021). However, only the genome of G. bimaculatus has been assembled at the chromosome level. This study aims to provide high-quality chromosome-level whole-genome data for the Oecanthidae. It also aims to elucidate the genomic structure and genetic diversity of Oecanthus rufescens, providing a new data foundation for further exploration of the evolutionary history, interspecies relationships, and biological characteristics of crickets.

Genome Assembly and Assessment
First, we aligned the genome sequencing data against the NCBI Nucleotide Database. The results indicated that the data were not contaminated with exogenous DNA (refer to supplementary tables S1 to S3, Supplementary Material online). After quality control, 45.69 Gb of high-quality data were retained. The preliminary assembly was 1,260,246,573 bp in size and consisted of 1,052 contigs. The longest contig was 23,011,083 bp, the average contig length was 1,197,953 bp, and the N50 was 3,871,008 bp. Redundant (duplicated) sequences were then removed from the preliminary assembly using the sequencing data. This resulted in a draft genome of O. rufescens with a size of 877,827,037 bp, consisting of 403 contigs. The longest contig remained at 23,011,083 bp, while the average length increased to 2,178,231 bp and the N50 to 4,538,698 bp (refer to Table 1).
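The contiguity statistics quoted above (N50 and average contig length) can be illustrated with a minimal Python sketch. The function names `n50` and `mean_length` and the toy contig list are hypothetical and chosen for illustration only; the totals and contig counts are the ones reported in the text, and this is not the software used in the study.

```python
def n50(lengths):
    """Return the N50: the length L such that sequences of length >= L
    together cover at least half of the total assembly length."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("lengths must be non-empty")
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


def mean_length(total_bp, n_sequences):
    """Average sequence length, rounded to the nearest base pair."""
    return round(total_bp / n_sequences)


# Hypothetical toy contig lengths, not the O. rufescens contigs.
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
toy_n50 = n50(toy_contigs)  # 80 + 70 = 150 reaches half of the 300 total

# Totals and contig counts as reported in the text.
preliminary_mean = mean_length(1_260_246_573, 1_052)
draft_mean = mean_length(877_827_037, 403)
```

With the reported totals, both averages reproduce the values given in the paragraph (1,197,953 bp and 2,178,231 bp).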
We used 146.08 Gb of Hi-C sequencing data to scaffold the 403 contigs of the draft genome, raising the assembly to the chromosome level. This yielded a chromosome-level assembly of 877,902,837 bp in 80 scaffolds, ten of which are pseudo-chromosomes. The longest pseudo-chromosome measured 163,837,246 bp, the average scaffold length was 10,973,785 bp, and the N50 was 157,110,771 bp. The ten pseudo-chromosomes were built from 372 contigs of the draft genome, giving an anchoring rate of 92.31%. The pseudo-chromosomes range in length from 18,071,128 bp to 163,837,246 bp. Chromosome 2 had the highest number of anchored contigs (81), while chromosome 8 had the fewest (18) (refer to supplementary table S4, Supplementary Material online and Fig. 1a and b).
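As a quick arithmetic check of the scaffold-level figures, the short sketch below recomputes the average scaffold length and the share of draft contigs placed on pseudo-chromosomes. The constant names are hypothetical. The contig-count ratio happens to equal the reported 92.31%; the text does not state whether the anchoring rate was computed by contig count or by anchored length, so this is an observation rather than a statement of the authors' method.

```python
TOTAL_BP = 877_902_837    # chromosome-level assembly size from the text
N_SCAFFOLDS = 80          # ten pseudo-chromosomes plus 70 other sequences
DRAFT_CONTIGS = 403       # contigs in the draft genome
ANCHORED_CONTIGS = 372    # contigs placed on the pseudo-chromosomes

# Average scaffold length, rounded to the nearest base pair.
mean_scaffold_bp = round(TOTAL_BP / N_SCAFFOLDS)

# Percentage of draft contigs that were anchored (count-based, an assumption).
contig_anchoring_pct = round(100 * ANCHORED_CONTIGS / DRAFT_CONTIGS, 2)
```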
The BUSCO assessment shows that 98.4% of the genes in the insecta_odb10 gene set were detected, of which 96.8% are single-copy and 1.6% are multi-copy, with fragmented genes accounting for 0.8% (refer to supplementary fig. S4, Supplementary Material online). These results indicate that the final chromosome-level genome of O. rufescens is highly complete.
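The BUSCO categories are related by simple arithmetic: the complete fraction is the sum of the single-copy and multi-copy fractions, and whatever is neither complete nor fragmented is missing. The sketch below applies this to the percentages reported above; the function name `busco_summary` is hypothetical, and the missing fraction is derived here rather than quoted from the source.

```python
def busco_summary(single_pct, duplicated_pct, fragmented_pct):
    """Derive the complete and missing percentages from BUSCO categories."""
    complete = round(single_pct + duplicated_pct, 1)
    missing = round(100 - complete - fragmented_pct, 1)
    return {"complete": complete, "missing": missing}


# Genome-level values reported in the text (insecta_odb10).
genome_busco = busco_summary(single_pct=96.8, duplicated_pct=1.6,
                             fragmented_pct=0.8)
```

Under these inputs the complete fraction is 98.4%, matching the text, and the implied missing fraction is 0.8%.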

Genome Structure Annotation
Approximately 310 Mb of the O. rufescens genome was annotated as repetitive, constituting 35.3% of the entire genome. The largest class of repetitive sequences is Long Interspersed Nuclear Elements (LINEs), amounting to 21.6% of the genome. This is followed by DNA transposons, Short Interspersed Nuclear Elements (SINEs), Long Terminal Repeats (LTRs), and Penelope elements, which constitute 6.32%, 3.56%, 1.78%, and 0.33% of the genome, respectively (refer to supplementary table S5, Supplementary Material online).
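To make the repeat percentages concrete, the following sketch converts each class to megabases and computes how much of the 35.3% total is not covered by the five listed classes. The variable names are hypothetical; the 877.9 Mb genome size is taken from the Abstract, and interpreting the remainder as other or unclassified repeats is an assumption, since the text does not break it down.

```python
GENOME_MB = 877.9          # genome size from the Abstract
TOTAL_REPEAT_PCT = 35.3    # total repeat content from the text

# Percent of the genome per repeat class, as reported in the text.
class_pct = {"LINE": 21.6, "DNA": 6.32, "SINE": 3.56,
             "LTR": 1.78, "Penelope": 0.33}

# Megabases per class and for all repeats combined.
class_mb = {name: round(GENOME_MB * pct / 100, 1)
            for name, pct in class_pct.items()}
total_repeat_mb = round(GENOME_MB * TOTAL_REPEAT_PCT / 100, 1)

# Share of the genome in repeats outside the five listed classes
# (assumed here to be other or unclassified repeats).
listed_pct = round(sum(class_pct.values()), 2)
remainder_pct = round(TOTAL_REPEAT_PCT - listed_pct, 2)
```

The five listed classes sum to 33.59% of the genome, leaving 1.71% for repeats not itemized in the paragraph.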

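In addition, 15,481 tRNA genes, 4,082 rRNA genes, and 1,212 other noncoding genes were identified (refer to supplementary table S6, Supplementary Material online).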
A total of 15,031 protein-coding genes were annotated, with an average gene length of 16,406 bp (refer to supplementary tables S7 and S8, Supplementary Material online). Further analysis of gene structure showed that the cDNA sequences total 23.67 Mb, with the longest cDNA measuring 39,024 bp and an average length of 1,576 bp. The protein sequences total 7.88 million amino acids (aa). The longest protein is 13,007 aa, and the average protein length is 524 aa (refer to supplementary table S8, Supplementary Material online).
The 15,031 protein-coding genes of O. rufescens contain a total of 106,993 exons. Collectively, these exons span 23.68 Mb, with an average length of 221 bp. A total of 91,980 introns are present, measuring 222.63 Mb in combined length and averaging 2,420 bp each. Introns are thus, on average, about 11 times longer than exons. On average, each gene contains seven exons and six introns. The O. rufescens genome also contains 15,093 intergenic regions, which span a total of 631.6 Mb with an average length of 41,847 bp (refer to supplementary table S8, Supplementary Material online).
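The exon and intron statistics above follow directly from gene-model coordinates. The sketch below shows one way to derive them for a single gene and then recomputes the reported averages from the reported totals. The helper names `intron_intervals` and `span` and the toy gene are hypothetical; coordinates are assumed to be 1-based and inclusive, as in GFF3.

```python
def intron_intervals(exons):
    """Given exon (start, end) pairs (1-based, inclusive) for one
    transcript, return the intron intervals between adjacent exons."""
    ordered = sorted(exons)
    return [(prev_end + 1, next_start - 1)
            for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])]


def span(intervals):
    """Total number of bases covered by inclusive intervals."""
    return sum(end - start + 1 for start, end in intervals)


# Hypothetical three-exon gene, for illustration only.
toy_exons = [(100, 199), (500, 649), (1000, 1099)]
toy_introns = intron_intervals(toy_exons)
toy_exon_bp = span(toy_exons)
toy_intron_bp = span(toy_introns)

# Averages recomputed from the totals reported in the text.
mean_exon_bp = round(23.68e6 / 106_993)
mean_intron_bp = round(222.63e6 / 91_980)
mean_exons_per_gene = round(106_993 / 15_031)
```

Applied to the reported totals, this gives an average exon length of 221 bp, an average intron length of 2,420 bp, and about seven exons per gene, in line with the paragraph.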
We then counted the protein-coding genes located on each of the ten pseudo-chromosomes and on the other sequences; chromosome 1 harbors the most, with 3,060 genes (refer to supplementary fig. S3a, Supplementary Material online). The length and proportion of exons, introns, and intergenic regions were also calculated separately for the ten pseudo-chromosomes and the other sequences (refer to supplementary fig. S3b, Supplementary Material online). The proportions of exons, introns, and intergenic regions are consistent across the ten pseudo-chromosomes, except for chromosome 10, which has a higher exon proportion of 4.21% compared with the average of 2.73%. The remaining sequences are distinct, with exons accounting for only 1.42% of their length (refer to supplementary fig. S3b, Supplementary Material online and supplementary table S9, Supplementary Material online).
The BUSCO assessment of protein-coding genes showed that 97.8% of the genes in the insecta_odb10 gene set were successfully detected, with 96.3% being single-copy, 1.6% multi-copy, and 0.5% fragmented (refer to supplementary fig. S5, Supplementary Material online).

Genome Function Annotation
The 15,031 protein-coding genes were functionally annotated against eight databases. The eggNOG database annotated 11,801 genes, the KEGG database annotated 7,139 genes, the Pfam database annotated 11,027 genes, the KOG database annotated 8,692 genes, the Swissprot database annotated 9,513 genes, the GO database annotated 9,397 genes, the NR database annotated 12,406 genes, and the TrEMBL database annotated 12,364 genes (refer to supplementary fig. S6, Supplementary Material online). In total, 12,823 genes were annotated by at least one database, and 5,769 genes were annotated by all eight databases. Additionally, we mapped the second- and third-generation transcriptome data to the genome. All mapping rates were above 96%, indicating the high quality of the genome annotation (refer to supplementary table S10, Supplementary Material online). Ylla et al. (2021) reported high-quality genome assemblies for G. bimaculatus and N. kohalensis, with 17,871 and 12,767 annotated protein-coding genes, respectively. These figures are comparable to the 15,031 genes annotated here.
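The counts of genes annotated by at least one database and by all databases correspond to the union and the intersection of the per-database gene sets. A minimal sketch with hypothetical gene identifiers and only three of the eight databases is given below; it illustrates the set logic and does not reproduce the study's data or pipeline.

```python
# Hypothetical gene IDs per database, for illustration only.
annotations = {
    "eggNOG": {"g1", "g2", "g3"},
    "KEGG": {"g1", "g3"},
    "Pfam": {"g1", "g2", "g4"},
}

# Genes with a hit in at least one database (union of all sets).
annotated_by_any = set.union(*annotations.values())

# Genes with a hit in every database (intersection of all sets).
annotated_by_all = set.intersection(*annotations.values())

# Per-database counts, analogous to the numbers listed in the text.
per_database_counts = {name: len(genes) for name, genes in annotations.items()}
```

In the study itself, the analogous union and intersection over eight databases contain 12,823 and 5,769 genes, respectively.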

Sampling and DNA/RNA Extraction
A total of 30 female O. rufescens were collected in 2021 on the campus of Shaanxi Normal University. All specimens were kept alive and transferred to the laboratory for further processing. DNA was extracted with the QIAGEN DNeasy Blood & Tissue Kit, and RNA was extracted with the TRIzol method.

Library Construction and Genome Sequencing
Five sequencing libraries were constructed, divided into two categories. The second-generation libraries included genome, Hi-C, and transcriptome libraries, all sequenced on the Illumina NovaSeq 6000 platform in PE150 mode with a 350 bp insert size. The genome library involved DNA fragmentation, end repair, adapter ligation, purification, and PCR amplification. The Hi-C library involved fixation, enzyme digestion, biotin labeling, protein digestion, sonication, biotin capture, adapter ligation, and PCR amplification. The transcriptome library involved reverse transcription of mRNA to cDNA, followed by fragmentation and repair. The third-generation libraries included genome and transcriptome libraries, both sequenced on the PacBio Sequel IIe platform in HiFi mode. The genome library involved DNA extraction, enzyme digestion, end repair, adapter ligation, and purification. The transcriptome library involved RNA extraction, end repair, adapter ligation, and purification.
For genome size estimation, we used the "kmerfreq" subcommand of GCE v1.0.2 (Liu et al. 2013). Frequencies of 17-mers were generated from high-quality PE reads. The genome size was calculated using the formula $G = N_{k\text{-mer}} / D_{\text{average } k\text{-mer}}$, where $N_{k\text{-mer}}$ is the total number of k-mers and $D_{\text{average } k\text{-mer}}$ is the average k-mer depth (Guo et al. 2015). The genome size was also determined by flow cytometry, using male Periplaneta americana (1C = 3.41 pg) as the internal standard. This process followed the protocols established by Mao et al. (2020) and Gregory and Johnston (2008). The genome size was calculated using the formula $\text{Genome size}_{\text{sample}} = \text{Genome size}_{\text{internal standard}} \times (\text{sample 2C mean peak position} / \text{internal standard 2C mean peak position})$ (Lower et al. 2017).
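Both genome size formulas are simple ratios and can be written as short functions. The sketch below is illustrative only: the function names and all numeric inputs except the 3.41 pg standard are hypothetical, and it does not reproduce GCE or the flow cytometry analysis used in the study.

```python
def kmer_genome_size(total_kmers, average_kmer_depth):
    """k-mer estimate: G = N_kmer / D_average_kmer (result in bp)."""
    return total_kmers / average_kmer_depth


def flow_genome_size(standard_size, sample_peak, standard_peak):
    """Flow cytometry estimate: the standard's genome size scaled by the
    ratio of the sample and standard 2C mean peak positions. The result
    is in the same unit as standard_size (pg here)."""
    return standard_size * (sample_peak / standard_peak)


# Hypothetical inputs: 45 billion 17-mers at an average depth of 50.
toy_kmer_estimate_bp = kmer_genome_size(45_000_000_000, 50)

# Hypothetical peak positions; 3.41 pg is the P. americana 1C value
# given in the text.
toy_flow_estimate_pg = flow_genome_size(3.41, sample_peak=100,
                                        standard_peak=400)
```

With these toy inputs the k-mer estimate is 900 Mb and the flow cytometry estimate is about 0.85 pg; neither value is a result from the study.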

Fig. 1. (a) A Circos map depicting the characteristics of the O. rufescens genome. From the inner to the outer circle, the map displays the distribution of the average GC content, the distribution of the average repetitive sequence content, the distribution of the average number of protein-coding genes, and the lengths of the ten pseudo-chromosomes (sliding window size of 1 Mb; chr1 to chr10 denote the ten pseudo-chromosomes). (b) Heatmap of the interaction frequency of genomic fragments of O. rufescens.

Table 1
Summary statistics for the O. rufescens genome assembly and annotation