A Chromosome-Level Genome Assembly of the Parasitic Wasp Chelonus formosanus Sonan 1932 (Hymenoptera: Braconidae)

Abstract Chelonus formosanus Sonan 1932 (Hymenoptera: Braconidae) is a wasp capable of parasitizing a variety of lepidopteran pests at the “egg-larval” stage which distributes throughout Taiwan, Guangdong, Zhejiang, and Hainan provinces of China. This wasp has been successfully used to control pests such as Spodoptera litura Fabricius, 1775, Spodoptera frugiperda (JE Smith, 1797), Spodoptera exigua (Hübner, 1808), and Helicoverpa armigera (Hübner, 1808). So far, there is only one genome assembled from the Chelonus genus [Chelonus insularis (Cresson, 1865)] and it is fragmented with 455 scaffolds. Here, we report a chromosome-level genome assembly of C. formosanus, which was sequenced using PacBio, Illumina, and Hi-C technologies. The long reads were 35.4 Gb (∼150× coverage) with an average length of 15.23 kb. The size of the genome assembly was 139.59 Mb. More than 99.46% of the assembled sequences were anchored to seven pseudochromosomes (138.84 Mb). The Benchmarking University Single-Copy Orthologs (BUSCO) assessment results showed 99.0% of the 1,367 genes (insect_odb10 database) were completely present. We annotated 11,242 protein-coding genes including 98.6% of BUSCO complete genes that were recovered. Nearly one-fourth of the genome assembly (22.25%) was annotated as repetitive sequences and 324 noncoding RNAs were predicted. There were 58 gene families found with significant expansion including allelopathic families (odorant receptors and ionotropic receptors), which may play a crucial role in efficiently locating a wide range of hosts. This high-quality genome assembly and annotation could provide a highly valuable resource of parasitic wasp for the biological control of Lepidoptera pest.


Introduction
The cosmopolitan genus Chelonus Panzer, 1806, harbors near 360 known species, which are ovo-larval endoparasitoids of Lepidoptera (Zhang et al. 2006). Chelonus normally regulate the metamorphic process to kill host larva during their final instar (Jones 1985;Zhang et al. 2006). The genus Chelonus have numerous associations with many Spodoptera species which include some important agricultural pests in the world (Jones 1985). So far, we found only one whole-genome assembly sequenced from the Chelonus genus (Chelonus insularis Cresson, 1865) was deposited on NCBI and it is fragmented with 455 scaffolds. To increase genomic resource from this insect genus and provide chromosomal information at the same time, we sequenced and assembled the whole genome of Chelonus formosanus Sonan using PacBio, Illumina, and Hi-C technologies. We also annotated protein-coding genes and analyzed the evolution of gene families across 16 species in different orders, including C. formosanus and C. insularis.

Genome Assembly
We obtained a total of 35.4 Gb PacBio long ($150 Â coverage) and 36.43 Gb Illumina short reads. The average length and N50 length of the long reads were 15.23 and 17.94 kb, respectively. The kmer analysis predicted genome size being 139.59 Mb, and it also indicated no significant heterozygosity and approximately 18 Mb (12.95%) repetitive sequences of the genome (supplementary fig. S2, Supplementary Material online). The genome assembly size, GC content, and Benchmarking University Single-Copy Orthologs (BUSCO) assessment results are comparable to the genome assembly of the closely related species C. insularis (table 1). However, our genome assembly was more complete with smaller number of scaffolds and more contiguous with fewer gaps compared with C. insularis. The mapping-back rates from the Illumina DNA and RNA sequences as well as the PacBio raw reads were 98.29%, 96.92%, and 96.02%, respectively. Overall, our C. formosanus genome scaffolds have recovered most of sequencing reads and is suitable for further analysis. According to the longrange linked reads from Hi-C data, we assigned 138.84 Mb of the assembly to the seven pseudochromosomes (supplementary fig. S2, Supplementary Material online). A chromosomal synteny analysis between the C. formosanus and Aphidius gifuensis Ashmaed, 1906 chromosomes showed limited level of conservation. We also noticed that there was no indication of conservation between the chromosome one of C. formosanus and the A. gifuensis genome ( fig. 1a).
To investigate this further, more chromosomal conservation analysis with other closely related species may be applied when the chromosome-level genome assemblies of those species became available.
A total of 4.42-Gb RNA-Seq reads were imported into the gene prediction program MAKER as biological evidence for the protein-coding gene prediction. The MAKER process predicted 11,242 protein-coding gene models, which was comparable to that of C. insularis (11,574). The average gene and protein-coding region lengths were 4,350 and 1,593 bp, respectively. The average exon length and the average number of exons per gene were 355.13 bp and 5.73, respectively. The average intron length was 522.85 bp. The BUSCO assessment result showed 98.6% complete genes were captured, and 0.4% and 1.0% of the genes were fragmented and missing, respectively. There were 9,019 (80.23%) protein-coding genes identified with protein domains, which were then assigned with 7,822 GO terms, 6,157 KEGG KO terms, 2,020 enzyme codes, 3,576 KEGG pathways, 2,364 Reactome pathways, and 9,071 COG categories (supplementary figs. S3 and S4, Supplementary Material online).  1b). A phylogenetic tree with the bootstrap support value of 100/100 was reconstructed based on the 1,581 genes after 121 single-copy genes were removed ( fig. 1b). The topology of this phylogeny shows consistency with the previous phylogenetic tree constructed by Peters et al. (2017). For example, parasitoid Apocrita belongs to a monophyletic group and forms a sister group to Orussoidea (Orussus abietinus), and Aculeata, Chalcidoidea, and Ichonoidea are all monophyletic groups (Gauthier et al. 2021). As expected, our analysis showed C. formosanus was closely clustered with C. insularis ( fig. 1b) and our calculation indicated the two species diverged approximately 6 Ma. We identified 355 expanded (58 significantly expanded) and 383 contracted (28 significantly contracted) gene families from the C. formosanus annotated gene models ( fig. 1c). There was a rapid expansion of the allelopathic families (odorant receptors and ionotropic receptors) and a digestionrelated family (trypsin) (fig. 1c). The gene ontology enrichment results also show a rapid expansion of the gene families belonging to the GO terms of olfactory receptor (OR), odorant binding, and sensory perception of smell (supplementary fig. S4, Supplementary Material online). ORs play an important role in locating host during parasitic process (Gauthier et al. 2021). The expansion of these genes we found in the C. formosanus genome might explain its wide range of insect hosts. In the phylogeny, the node values on the tree represent the number of expanded, contracted, and rapidly evolving families for each clade or species. Statistics of orthology inference result: "1:1:1" indicates single-copy orthologs; "N:N:N" indicates multicopy orthologs; "Braconidae" indicates orthologs are specific to Braconidae; "Others" indicates unclassified orthologs; "Unassigned" indicates orthologs that cannot be assigned into any of the orthogroups. (c) The top 15 significantly expanded gene families with numbers of genes in each family shown above each bar.

Sample Collection and Sequencing
Chelonus formosanus specimens were collected in June 2020 within the Guilinyang Economic Development Zone, Haikou City, Hainan Province, China (20.0521 N,110.2067 E) and then reared with Spodoptera frugiperda for more than five generations under the laboratory conditions of 26 6 1 C, 70 6 5% RH, and a photoperiod of 14:10 (L:D) h. Male adults were used for genome sequencing: four individuals for Illumina, 20 for PacBio, and five for Hi-C. Four males were used for RNA-Seq (supplementary fig. S1, Supplementary Material online).
The high-quality DNA was extracted using the QIAGEN DNeasy Blood & Tissue kit. For the PacBio sequencing, a 20kb insert size library was constructed using the SMRTbell Template Prep Kit 2.0. For Illumina DNA sequencing, a library with an insert size of 350 bp was constructed using the TruSeq DNA PCR-free kit. The Hi-C library construction (restriction enzyme: MboI) was performed by Frasergen Co., Ltd (Wuhan, China). RNA was extracted using the TRIzolTM Reagent kit and the RNA libraries were constructed using the TruSeq RNA v2 kit. The Illumina and PacBio sequencing were performed on NovaSeq 6000 and PacBio Sequel II, respectively, at the Beijing Berry Genomics Co., Ltd (Beijing, China).

Genome Assembly
The quality control of the Illumina short reads was performed using BBTools suite v38.49 (Bushnell 2014): the script "Clumpify.sh" was used to remove duplicated sequences; "bbduk.sh" was used for trimming sites with base quality scores below 20 (>Q20) and poly-A/G/C ends with their lengths being more than 10 bp, filtering sequences with lengths below 15 bp, and correcting bases according to the sequence overlap regions (qtrim¼rl trimq ¼ 20 minlen ¼ 15 ecco¼tmaxns ¼ 5 trimpolya ¼ 10 trimpolyg ¼ 10 trimpolyc ¼ 10). The k-mer frequency was calculated using the "khist.sh" script from the BBTools suite (kmer: 21). A genomic survey based on the k-mer distribution frequency was performed using Genomescope v2.0 (Vurture et al. 2017) with the parameters of "-k 21 -p 2 -m 10,000." The long-read assembler Flye v2.8.1 (Kolmogorov et al. 2019) was used to generate a preliminary genome assembly with the parameters "-i 2 -m 1,000" (two rounds of longread polishing with a minimum overlap length of 1,000 bases between sequences). The Illumina reads were then aligned to the preliminary assembly using Minimap2 v2.17 (Li 2018) with default parameters and the alignments were used for the two consecutive rounds of short-read polishing with NextPolish v1.3.0 (Hu et al. 2020). The haplotigs and overlaps in the genome assembly were filtered based on the read depth using Purgedups v1.0.1 (Guan et al. 2020) with the minimum alignment score of 70 (-a 70). The remaining contigs were then assigned to pseudochromosomes based on the contact information from the read alignments of the Hi-C data: first, the raw Hi-C reads were quality-assessed and then removed unusable reads using Juicer v1.6.2 (Durand et al. 2016); second, the pseudochromosomal assignment was performed using 3D-DNA v180922 (Dudchenko et al. 2017); third, the assignment errors were corrected using Juicebox v1.11.08 based on the Hi-C contact maps (Durand et al. 2016). The contaminant sequences were removed using BLASTþ (BlastN) v2.7.1 (Camacho et al. 2009) based on the homological search against the NCBI nucleotide (nt; downloaded on 31 st of December 2020) and UniVec databases. Genome completeness was assessed using BUSCO v3.0.2 pipeline (Waterhouse et al. 2018) by searching against the database of insect_odb10 database (n ¼ 1,367).
To construct a chromosomal synteny between the C. formosanus and A. gifuensis pseudochromosomes, A BlastP-like alignment method was performed using Mmseq2 v11-e1a1c with default parameters for aligning protein sequences. The generated "all.blast" file and the integrated "all.gff" file were imported in MCScanX to perform collinearity analysis. A circus plot was created using Tbtools v1.0692 (Chen et al. 2020).

Genome Annotation
The genome assembly was annotated for repetitive sequences, protein-coding genes, and ncRNAs. To annotate repeats, a de novo repeat library was constructed using RepeatModeler v2.0.1 (Flynn et al. 2020) with the LTR search process (-LTRStruct). It was then combined with Dfam3.3 (Hubley et al. 2016) and the RepBase-20181026 databases (Bao et al. 2015) to form a custom library, which was used as input for RepeatMasker v4.0.9 (Smit et al. 2013(Smit et al. -2015 to search for repeats and generate a repeatmasked assembly. To annotate ncRNAs, the rRNAs, snRNAs, and miRNAs were identified based on the alignment with the Rfam library using Infernal v1.1.2 (Nawrocki and Eddy 2013), and the tRNAs were predicted using tRNAscan-SE v2.0.6 (Chan and Lowe 2019) and then filtered lowconfident sequences using the "EukHighConfidenceFilter" script.

Supplementary Material
Supplementary data are available at Genome Biology and Evolution online.