Chromosome level genome assembly of oriental armyworm Mythimna separata

The oriental armyworm, Mythimna separata, is an extremely destructive polyphagous pest with a broad host range that seriously threatens the safety of agricultural production. Here, a high-quality chromosome-level genome was assembled using Illumina, PacBio HiFi long sequencing, and Hi-C scaffolding technologies. The genome size was 706.30 Mb with a contig N50 of 22.08 Mb, and 99.2% of the assembled sequences were anchored to 31 chromosomes. In addition, 20,375 protein-coding genes and 258.68 Mb transposable elements were identified. The chromosome-level genome assembly of M. separata provides a significant genetic resource for future studies of this insect and contributes to the development of management strategies.


Background & Summary
The oriental armyworm, Mythimna separata (Lepidoptera, Noctuidae), is a notorious polyphagous pest that is widely distributed in Asia, Australia, New Zealand, and several Pacific islands [1][2][3] (Fig. 1a).This pest has a wide host range and poses a serious threat to the production of crops, particularly rice, maize, and wheat 4 (Fig. 1b).The outbreak of M. separata in China from 2012 to 2013 threatened 1743.7 million hectares of farmland 5 , and this threat has continued in recent years [6][7][8] .This situation also occurs in other countries and regions where M. separata infestations are present 9 .In recent years, with the changes in global climate, crop planting structure, variety distribution, and cultivation system, M. separata has shown new characteristics in adaptability, breakout, and damage 10,11 .Due to its gregariousness, migration capability, polyphagy, and gluttony, M. separata was included in the list of first-class crop diseases and insect pests by the Chinese Ministry of Agriculture and Rural Affairs in 2020.
Previous studies have shown that polyphagous insects respond to toxic secondary metabolites produced by different host plants by inducing changes in the expression of genes related to detoxifying enzymes.Such changes may enhance the ability of polyphagous insects to adapt to host plants and develop resistance against pesticides 12 .However, the scarcity of genomic resources prevents the above hypothesis from being verified in M. separata.Although several M. separata genome assemblies were published in 2022 and 2023 [13][14][15][16] , there are significant differences in the assembly method and quality of these genome assemblies.Hence, a high-quality chromosomal level genome is necessary to offer genetic resources and delve into the molecular mechanism of detoxification and host adaptation of M. separata, which will aid in providing theoretical support for optimizing management strategies for M. separata.
In the present study, we assembled a high-quality chromosome-level genome of M. separata by using a combination of Illumina short reads, PacBio high fidelity (HiFi) reads, and high-throughput chromosome conformation capture (Hi-C) data (Table 1).The genome assembly consisted of 172 contigs with a total length of 706.30Mb, of which the contig N50 was 22.08 Mb.In addition, 99.2% of the draft assembly (700.63Mb) was anchored to 31 chromosomes with a scaffold N50 of 23.00 Mb.We also identified 258.68 Mb of tandem repeats, accounting for 36.63% of the genome assembly.A total of 20,375 protein-coding genes were obtained, of which 98.53% were annotated.The results of phylogenetic analysis revealed that M. separata was diverged from Helicoverpa armigera approximately 25.91 Mya.Furthermore, 594 expanded gene families and 1329 contracted gene families were identified in M. separata genome.The high-quality chromosome-level genome assembly of M. separata will provide a genetic basis for further research on this polyphagous pest.

Methods
Sample collection and genome sequencing.M. separata was collected from maize fields in Anyang City, Henan Province, China, and was subsequently reared in climate incubators at a temperature of 26 ± 1 °C with a relative humidity of 70% and a photoperiod of 14 h L:10 h D 17 .Genomic DNA was extracted from a single surface-sterilized male pupa using the QIAamp DNA Mini Kit (QIAGEN) for both Illumina and PacBio HiFi sequencing to prevent contamination from other individuals and microorganisms.For Hi-C sequencing, genomic DNA was extracted from a single male adult.Total RNA was extracted from adults using the TRIzol kit for transcriptome sequencing.The purity and integrity of genomic DNA and RNA were validated by the NanoDrop 2000C spectrophotometer (Thermo, Wilmington, DE, USA) and agarose gel electrophoresis (1.5%).
The paired-end libraries with a 350 bp inserted fragment were constructed and sequenced on the Illumina NovaSeq6000 platform following the manufacturer's instruction.After removing adapter sequences and low-quality reads with HTQC (v1.92.310) software 18 , a total of 58.72 Gb clean reads were obtained for subsequent analyses.For PacBio HiFi sequencing, genomic DNA was sheared into ~15 Kb fragments using g-Tubes (Covaris, Woburn, MA, USA) and purified using 0.45 × AMPure PB beads (Beckman Coulter, Brea, CA, USA) for constructing SMRT bell libraries.Size selection was performed using the Sage ELF system (Sage Science, Beverly, MA) to collect SMRT bell libraries of 15-18 Kb.After annealing primers and binding Sequel II DNA polymerase to SMRT bell templates, sequencing was performed using 8 M SMRT cells on the Sequel II System (Biomarker Technologies Co., LTD, Beijing, China).A total of 986.03 Gb subreads were obtained and utilized to generate PacBio HiFi reads via the circular consensus sequencing (CCS) mode.Finally, a total of 70.62 Gb of CCS reads were produced, with an average read length of 16.67 kb, resulting in 99.98X coverages of the M. separata genome.The Hi-C library was constructed following the standard library preparation protocol 19 and sequenced on the Illumina NovaSeq6000 platform, and 76.08 Gb of 150-bp paired-end clean reads were obtained.Genome survey and assembly.Genome survey was essential to estimate the main characteristics, including genome size, repetitive sequence content and heterozygosity.The k-mer (K = 19) frequencies were constructed based on Illumina clean short-reads using Jellyfish (v2.2.10) 20 and used to perform genome survey by GenomeScope (v2.0) 21.The estimated genome scale of M. separata was 662.64 Mb, with a repetitive content of 39.00% and a heterozygosity of 0.76% (Fig. 2a).Subsequently, CCS reads were submitted to Hifiasm (v0.15.1) 22 and assembled with default parameters.After filtering haplotypic duplicates using purge_dups 23 with parameters of '−2 -T cutoffs -c PB.base.cov', the M. separata genome assembly was generated.The assembly consisted of 172 contigs with a total length of 706.30Mb and a contig N50 of 22.08 Mb.The clean Hi-C reads were aligned to the draft genome assembly using BWA (0.7.10) 24 with default parameters.The uniquely aligned read pairs were further processed using HiC-Pro (v2.10.0) 25 to assess and eliminate the invalid read pairs, including dangling-end, re-ligation, self-cycle, and dumped pairs.A total of 88,824,108 valid interaction pairs for scaffold correction were used to cluster, order, and orient contigs onto chromosomes using LACHESIS (v2e27abb) 26 with default parameters.Finally, 147 scaffolds were anchored to 31 chromosomes with a scaffold N50 of 23.00 Mb, covering a span of 700.63 Mb and representing 99.2% of the draft genome assembly (Fig. 2b,c, Table 2).In addition, the mitochondrial genome of M. separata was assembled through mitoZ 27 and NOVOplasty 28 , and subsequently annotated using MITOS 29 and GeSeq 30 (Fig. 3a, Table 3).

Genomic repeat annotation.
Repeat sequences mainly include tandem repeats and interspersed repeats, with the latter mainly being transposable elements (TE).The repeat sequences of TE were annotated using a combination of homology-based and de novo approaches.We initially customized a de novo repeat library using RepeatModeler 31 and LTR_retriever (v2.8) 32 based on the assembly sequences with default parameters.The predicted repeats were subsequently classified using PASTEClassifier (v1.0) 33 , and the results were combined with databases of Repbase 34 , REXdb (v3.0) 35 , and Dfam (v3.2) 36 to construct a species-specific TE library without redundancy.TE sequences were identified by homology search against the library using RepeatMasker (v4.10) 37 .A total of 258.68 Mb TE sequences were obtained, accounting for 36.63%genome assembly.In addition, 23.64 Mb (3.35%) tandem repeats were identified using MISA (v2.1) 38 and NCRF 39 (Table 4).Gene prediction and functional annotation.Three approaches, including de novo prediction, homolog-based and transcriptome-based methods, were combined to perform gene prediction after eliminating the interference of repeat sequences in the M. separata genome.The de novo gene models were predicted using two ab initio gene-prediction software tools of Augustus (v2.4) 40 and SNAP 41 with default parameters.Homology-based gene prediction was conducted using GeMoMa (v1.7) 42 against the protein sequences of For transcriptome-based gene prediction, the RNA-seq reads were assembled into unigenes using Trinity (v2.11) 43 , and resulting unigenes were then used to identify protein-coding genes via PASA (v2.0.2) 44 .Finally, gene models obtained from these three methods were integrated into a unified gene set using EVidenceModeler (v1.1.1) 45with default parameters.As a result, 20,375 protein-coding genes were identified from M. separata genome (Fig. 3b).

Phylogenetic analysis.
The protein sequences of seventeen insects, including eight Lepidoptera insects and nine others associated with Diptera, Coleoptera, Hymenoptera, Hemiptera, and Odonata, were downloaded from NCBI for phylogenetic analysis (Table 6).The orthologous gene families were detected using OrthoFinder (v2.4.0) 47 and annotated based on the PANTHER 48 database.The single-copy orthologous genes were aligned using MAFFT (v7.205) 49 , and ambiguously aligned regions were removed by applying Gblocks (v0.91b) 50 with default parameters.The phylogenetic trees were constructed by IQ-TREE (v1.6.10) 51with 1000 bootstrap replicates and the best model of LG + F + I + G4.The divergence time between different species was estimated using MCMCtree (PAML 52 package) based on the fossil records acquired from TimeTree database (http://www.timetree.org/).Furthermore, the results obtained from phylogenetic trees, which included divergence time, were employed to identify the expansion and contraction of gene families using CAFE (v5.0) 53 with a p-value threshold ≤ 0.05.

Genome synteny analysis.
In order to perform genome synteny analysis of M. separata with Spodoptera frugiperda, the similar gene pairs were identified using Diamond (v0.9.29) 54   in synteny blocks were obtained by MCScanX 55 , and synteny blocks were then visualized across chromosomes using CIRCOS (v 0.69-9) 56 .Only one fission event was identified between M. separata and Spodoptera frugiperda, which suggested a high degree of concordance between them (Fig. 5a).

Data Records
The raw data of Illumina sequencing, PacBio HiFi sequencing and Hi-C sequencing of the Mythimna separata genome was deposited at the NCBI SRA database with the accession number of SRP433040 57 .The final assembled Mythimna separata genome has been submitted to NCBI under accession number GCA_030763345.1 58 .
The annotation files of the Mythimna separata genome have been deposited at figshare 59 .

Technical Validation
evaluation of the genome assembly.The integrity and accuracy of genome assembly were verified from three aspects: Firstly, the clean reads acquired from Illumina sequencing were aligned against the genome assembly using BWA 24 .The results revealed that 99.26% of Illumina reads were aligned to the genome assembly.Secondly, the Core Eukaryotic Genes Mapping Approach (CEGMA) database contained 458 conserved core eukaryotic genes, of which 431 (94.10%) were identified in M. separata genome.Finally, the completeness of Species Table 6.Download link of 17 insect genomes used for phylogenetic analysis.the genome assembly was evaluated using BUSCO (v4) 20 with parameters of '-m prot -f -l eukaryota_odb9' , and 98.74% of the conserved core BUSCOs were identified in the genome of M. separata.These results showed that we obtained the high-quality M. separata genome assembly.Meanwhile, the contig N50 in our assembly was 22.08 Mb, which was significantly higher than the 7.31 Mb in recent assembly version of M. separata 13 .The scaffold N50 in our assembly was improved to 23.00 Mb, which was slightly higher than the 22.68 Mb in other recent assembly version of M. separata 15 (Table 7).
To assess the quality of chromosome assembly, the assembly was sheared into 100 kb bins, and the intensity of the interaction pairs was used to plot heatmaps.The Hi-C heatmap showed that the intensity of interaction along the diagonals was obviously higher than that at non-diagonal positions in 31 distinct chromosomes.
evaluation of gene prediction.The BUSCO analysis was also used to assess the results of gene prediction.
The 98.74% (942/954) of the BUSCOs were identified from the predicted gene set of our genome, which was slightly higher than the 98% and 98.2% in other recent M. separata assembly version 13,15 .Meanwhile, 83.84% of the RNA-seq data were aligned to the predicted exons (Fig. 5b).These results confirmed the completeness and accuracy of gene prediction across M. separata genome.In addition, 20,375 protein-coding genes were identified in our genome assembly, which was significantly more than the 17549 protein-coding genes in the best available recent reference genome assembly version of M. separata 15 .We further compared the set of protein-coding genes in the two genome assemblies using local BLASTN with E-values < 10 −5 .A total of 16,398 protein-coding genes were identified in both two genome assemblies, and 2,828 protein-coding genes were identified only in our genome assembly.

Genome assembly
This study Jiang et al. 13 Kakeru et al. 14  Fig. 6 Phylogenetic tree of M. spearata together with 17 other insects.The maximum likelihood phylogenomic tree was calculated based on 565 single-copy genes.The numbers of expanded gene families (green) and contracted gene families (red) are displayed to the right of each species branch.The coloured histogram indicates that genes of each species were categorized into five groups: 1:1:1 (single copy orthologous genes in common gene families); N: N: N (multiple copy orthologous genes in common gene common gene families); Specific (genes from unique gene families from each species); Other (genes that do not belong to any of the above ortholog categories); Unclustered (genes which are not clustered into any family).

Comparative genomic analysis.
A total of 27,002 orthologous gene families were identified from the 18 insect species, of which 565 single-copy orthologous gene families were used for phylogenetic analysis (Table S1).

Fig. 2
Fig. 2 Genome assembly of M. separata.(a) Genome scope profiles of 19-mer analysis.(b) Circle genome landscape of M. separata.Circle I represents chromosomes, while circles II-IV indicate repeat density, gene density, and GC content of each respective chromosome.(c) Hi-C interactive heatmap of M. separata.Color indicates the intensity of the interaction signal.The darker the color, the higher the intensity.

Fig. 3
Fig. 3 Mitochondrial genome assembly and protein-coding gene prediction of M. separata.(a) Circular map of M. separata mitochondrial genome.Gene map presents 37 annotated genes of different functional groups.(b) Venn diagrams of protein-coding genes obtained from three prediction methods.

Fig. 4
Fig. 4 Divergence time and distribution of detoxification and chemosensory genes in M. spearata and other eight lepidopteran insects.The branch node values indicate the inferred divergence time between species.The numbers in the right cells indicate the scale of the corresponding gene family in each species.The darker the background color of cells, the more the genes encoded in the corresponding species.

Fig. 5
Fig. 5 Genome synteny and verification of protein-coding genes of M. separata genome.(a) Whole-genome synteny between M. spearata and Spodoptera frugiperda.(b) RNA-seq clean data verified the accuracy of protein-coding gene prediction.

Table 1 .
Statistics of sequencing data of M. separata genome.

Table 2 .
Statistics of Hi-C assembly results.

Table 4 .
Statistics of repeat elements of M. separata genome.

Table 5 .
with default parameters.All genes Statistics of functional annotation in M. separata genome.

Table 7 .
Comparative statistic of five M. separata genome assemblies.