Chromosome-level genome assembly of predatory Arma chinensis

Arma chinensis is a natural enemy that preys on various species and can suppress agricultural and forest pests in the orders Lepidoptera and Coleoptera. Here, we report a chromosome-level genome assembly of A. chinensis generated using PacBio and Hi-C technologies. The assembled genome was 986 Mb, with a contig N50 of 2.40 Mb, scaffold N50 of 134.98 Mb, and BUSCO completeness of 96.10%. Hi-C data were used to anchor the assembly onto seven chromosomes. Approximately 496.2 Mb of sequence was annotated as repetitive elements, constituting 51.15% of the genome. We functionally annotated 84.79% of the 20,853 predicted protein-coding genes. This high-quality A. chinensis genome provides a novel genomic resource for future research on Pentatomidae insects.


Background & Summary
Arma chinensis is a true bug that belongs to the suborder Heteroptera and the family Pentatomidae, which encompasses the stink bugs. It is distributed primarily in China, Mongolia, the Korean Peninsula, Japan, and other East Asian regions 1. Hemimetabolous A. chinensis has three major developmental life stages (egg, nymph, and adult), with the nymphal stage divided into five instars (Fig. 1). Arma chinensis preys on many species and can suppress agricultural and forest pests in the orders Lepidoptera and Coleoptera 1,2. Like most terrestrial predatory arthropods, A. chinensis uses extraoral digestion for relatively large prey, extracting and concentrating nutrients from its prey through refluxing and non-refluxing while injecting hydrolytic enzymes 3-5. The utilization of nutrition from its prey or artificial diets has been evaluated using bioassays 6,7, nutrigenomics 8,9 and metabolomics 10. The chemoreception 11,12 and aggregation-sex pheromones 13 of A. chinensis have been functionally characterized and verified. In addition, A. chinensis is highly tolerant of heat 14, starvation 15 and drought 16, revealing ecophysiological adaptation to extreme environmental conditions. It is also more tolerant of insecticidal pyrethroids than its prey 17, suggesting potential compatibility with chemical insecticides in pest management programs. Although the biological control applications and physiological characteristics of A. chinensis have been extensively studied, the lack of genome data has hindered deeper understanding of gene function in this species. Therefore, a high-quality genome of this species is needed to facilitate further exploration of the genetic and molecular mechanisms of Pentatomidae insects.
Herein, we constructed a high-quality chromosome-level reference genome for A. chinensis using PacBio long-read sequencing and Hi-C sequencing. The assembled genome is 986 Mb, with a contig N50 of 2.40 Mb and a scaffold N50 of 134.98 Mb. Using the Hi-C data, the contigs were further clustered and ordered into seven chromosomes. Approximately 496.2 Mb of sequence was annotated as repetitive elements, constituting 51.15% of the genome. We predicted 20,853 protein-coding genes, of which 84.79% were functionally annotated. We also sequenced the developmental transcriptome of A. chinensis. This A. chinensis genome provides a novel genomic resource for future research on Pentatomidae insects.
Sample collection. Arma chinensis individuals were collected from a population reared in our laboratory in Beijing, China, for > 60 generations. The insects were fed Antheraea pernyi pupae and reared at 26 ± 1 °C and 60 ± 5% relative humidity under a 14-h light:10-h dark photoperiod. We sequenced the genome of female progeny that had been successively inbred for nine generations to reduce background noise. The surfaces of the insects were cleaned with 75% ethanol. Gut contents were removed to eliminate contaminants, then the specimens were stored in liquid nitrogen.
Nucleic acid extraction and sequencing. Genomic DNA was extracted from A. chinensis tissues using DNeasy Blood & Tissue Kits (QIAGEN, Hilden, Germany). The integrity of the DNA was determined using an Agilent 4200 Bioanalyzer (Agilent Technologies, Palo Alto, CA, USA). Genomic DNA (8 μg) was sheared using g-Tubes (Covaris, Woburn, MA, USA) and concentrated with AMPure PB magnetic beads (Beckman Coulter, Brea, CA, USA). We constructed libraries using the Single Molecule Real Time (SMRT) bell template prep Kit 2.0 (Pacific Biosciences, Menlo Park, CA, USA). The libraries were size-selected on a BluePippin™ system (Sage Science, Inc., Beverly, MA, USA) with a ≥ 15 kb cutoff, followed by primer annealing and binding of the SMRTbell templates to polymerases with a DNA/Polymerase Binding Kit (Pacific Biosciences), then sequenced on a Sequel platform (Pacific Biosciences). A total of 11 SMRT cells were run. Size-selected SMRTbell libraries were prepared with a minimum fragment length of 10-20 kb. Medium- and large-insert libraries were sequenced using a PacBio Sequel system (Pacific Biosciences).

Genome estimation and contig assembly. The genome was surveyed using a k-mer-based method. The estimated size of the A. chinensis genome was ~826 Mb, with a heterozygosity of 1.01% and a repetitive sequence ratio of 32.25%, which suggests high heterozygosity and repetitive content (Fig. 2). We sequenced and assembled the genome of A. chinensis using SMRT (Pacific Biosciences) 20 and Hi-C sequencing. We used 130× coverage of SMRT sequences (128.0 Gb) for initial contig assembly. Canu (v2.0) and SmartDenovo (v1.0) produced assemblies of 2.01 and 1.01 Gb with contig N50 sizes of 0.20 and 0.87 Mb, respectively (Supplementary Tables 1 and 2). Finally, we merged the assemblies using Quickmerge (v0.3), resulting in a 1.02 Gb assembly with a contig N50 of 2.33 Mb (Supplementary Table 3). This assembly was larger than the estimated genome size of 826 Mb. Considering that the excess might be driven by underlying heterozygosity, we reduced the assembly size to 986 Mb by scaffolding with Redundans, which also increased the contig N50 to 2.40 Mb (Supplementary Table 4). We aligned the third-generation (PacBio) subreads.bam to the assembly by calling blasr in SMRT Link 5.0 with the optional parameters bam; bestn, 5; minMatch, 18; nproc, 4; minSubreadLength, 1,000; minAlnLength, 500; minPctSimilarity, 70; minPctAccuracy, 70; hitPolicy, randombest; and randomSeed, and then applied Arrow correction to the assembly. This yielded a genome sequence corrected with third-generation data, which was further corrected with second-generation (Illumina) data using Pilon v1.22 with default parameters 21.
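The k-mer survey described above can be illustrated with a minimal Python sketch. The source does not name the survey software, so this is not the authors' pipeline: the function names, the `min_depth` error cutoff, and the toy histogram are all assumptions made for illustration. The sketch also ignores reverse complements and heterozygous or repeat peaks, which dedicated tools model explicitly. The underlying idea is that genome size is approximately the total number of k-mer occurrences divided by the depth of the main peak in the k-mer frequency distribution.

```python
from collections import Counter


def kmer_depth_histogram(reads, k):
    """Count each k-mer in the reads, then tally how many distinct
    k-mers occur at each depth (depth -> number of distinct k-mers)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return Counter(counts.values())


def estimate_genome_size(histogram, min_depth=2):
    """Estimate genome size as total k-mer occurrences / main peak depth.

    Depths below min_depth (a hypothetical cutoff) are treated as
    sequencing-error k-mers and ignored.
    """
    kept = {d: n for d, n in histogram.items() if d >= min_depth}
    peak_depth = max(kept, key=kept.get)          # depth with most distinct k-mers
    total_kmers = sum(d * n for d, n in kept.items())
    return total_kmers / peak_depth


# Toy histogram (made-up numbers): 500 error k-mers seen once,
# 1,000 genomic k-mers each seen 20 times.
toy_histogram = Counter({1: 500, 20: 1000})
print(estimate_genome_size(toy_histogram))  # 1000.0
```

With real data, the histogram would come from the Illumina short reads at k = 17 (Fig. 2), and a heterozygosity of about 1% would typically produce a secondary peak at half the main depth that this simple estimator does not account for.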
Karyotype analysis of A. chinensis. We assessed the correctness of the A. chinensis genome assembly using Hi-C data and counted the chromosomes in fixed, stained preparations. Briefly, the lateral margin of the abdomen of adult males was cut, and the insects were immersed in water for 30 min, fixed for 14 h in methyl alcohol-acetic acid (3:1 v/v), and stored in 70% alcohol at 4 °C. The gonads were dissected in 70% ethanol and crushed in a drop of 45% acetic acid. The coverslips were removed using dry ice as a decoverslipping agent 22, then the slides were dehydrated in fresh fixative and air-dried.
We used the Feulgen-Giemsa method 23 and an Olympus OV100 microimaging system (Olympus, Tokyo, Japan) for standard karyotype analysis. Karyotypes and male meiosis were evaluated in A. chinensis based on slides prepared from male gonads (Supplementary Figure 1). Analyses of metaphase I (Supplementary Figure 1g), anaphase (Supplementary Figure 1h), and metaphase II (Supplementary Figure 1i) revealed that A. chinensis possessed a diploid chromosome set (2n = 14) comprising six autosomal pairs and two sex chromosomes.

Chromosome-scale assembly of A. chinensis. Raw image data from Illumina high-throughput sequencing were converted by base calling into raw sequence reads, which contained adapters and low-quality bases. We avoided alignment errors by filtering and trimming the raw reads to create clean reads. The Hi-C reads were aligned using Bowtie2 (v2.0.5) to orient the primary contigs along the chromosomes 24. Clean reads were first aligned to the reference genome using the Bowtie2 end-to-end algorithm. Unmapped reads primarily comprised chimeric fragments spanning ligation junctions. For these reads, HiC-Pro (v2.7.8) detected the ligation site using an exact matching procedure and aligned the 5′ fraction of the read back to the genome 25. The results of both mapping steps were merged into a single alignment file. Low-mapping-quality reads, multiple hits, and singletons were filtered out. Duplicates were removed, and reads that were uniquely mapped to the reference genome were retained. Clustering, ordering, and orientation proceeded using the LACHESIS assembly package (https://github.com/shendurelab/LACHESIS) 26. Scaffolds were clustered into N groups using an agglomerative hierarchical clustering algorithm. The longest acyclic spanning tree ("trunk") was built based on relationships between the normalized Hi-C interactions. Scaffolds excluded from the trunk were reinserted at sites that maximized the linkages between adjacent scaffolds. For each chromosomal cluster, we obtained the exact scaffold order of the internal groups and traversed all directions of the scaffolds using a weighted directed acyclic graph to predict the orientation of each scaffold.
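The agglomerative hierarchical clustering step described above can be sketched as follows. This is an illustrative toy, not LACHESIS itself: the function name `cluster_scaffolds`, the use of average linkage, and the 4 × 4 link matrix are assumptions, and the real package normalizes Hi-C link counts and handles many details omitted here. The sketch starts with every scaffold in its own cluster and repeatedly merges the two clusters with the highest mean link density until the requested number of groups remains.

```python
def cluster_scaffolds(links, n_groups):
    """Average-linkage agglomerative clustering of scaffolds.

    links[i][j] is a symmetric, normalized Hi-C link density between
    scaffolds i and j (hypothetical input format). Returns n_groups
    clusters, each a sorted list of scaffold indices.
    """
    clusters = [[i] for i in range(len(links))]

    def average_link(a, b):
        # Mean link density over all scaffold pairs spanning the two clusters.
        return sum(links[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > n_groups:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                score = average_link(clusters[x], clusters[y])
                if best is None or score > best[0]:
                    best = (score, x, y)
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]  # merge y into x
        del clusters[y]                          # y > x, so index x stays valid
    return [sorted(c) for c in clusters]


# Toy link matrix (made-up values): scaffolds 0 and 1 interact strongly,
# as do scaffolds 2 and 3.
toy_links = [
    [0, 9, 1, 0],
    [9, 0, 0, 1],
    [1, 0, 0, 8],
    [0, 1, 8, 0],
]
print(cluster_scaffolds(toy_links, n_groups=2))  # [[0, 1], [2, 3]]
```

For A. chinensis, the number of groups would be set to seven, matching the haploid chromosome number implied by the 2n = 14 karyotype.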
Starting with the draft assembly, Hi-C data were used to correct mis-joins, scaffold the contigs, and merge overlaps, generating an assembled A. chinensis genome with chromosome-length scaffolds. Finally, 1,357 contigs/620 scaffolds (97.70%) were clustered into seven groups (Figs. 3 and 4), consistent with the karyotype analysis of A. chinensis. The 1,357 clustered contigs corresponded to a length of 967.93 Mb (99.77% of the length of the corrected contigs [970.20 Mb] and 97.70% of the total number of contigs [1,389] according to LACHESIS; Supplementary Tables 5 and 6). These results showed that the assembled A. chinensis draft genome has a high level of continuity and completeness.
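Continuity is reported throughout this work as contig N50 and scaffold N50. As a reminder of what the statistic measures, the following minimal Python sketch computes N50 from a list of sequence lengths: the length L such that sequences of length L or longer contain at least half of the total assembled bases. The function name and the toy lengths are illustrative only and are not taken from the assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Sequences are visited from longest to shortest; the N50 is the
    length of the sequence at which the running total first reaches
    half of the total assembly length.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("no sequence lengths supplied")


# Toy example (made-up lengths): total = 20, and 6 + 5 = 11 >= 10.
print(n50([2, 3, 4, 5, 6]))  # 5
```

Applied to contig lengths, this gives the contig N50; applied to the lengths of the chromosome-scale scaffolds, it gives the scaffold N50.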
Repeat identification and non-coding RNA annotation. We used homologous sequence and de novo repeat identification to annotate repeat elements in A. chinensis. First, RepeatMasker (v4.09) and RepeatProteinMask (v4.09) identified tandem and interspersed repeats according to their sequence similarity with the repeats deposited in RepBase (http://www.girinst.org/repbase/) 27. Subsequently, RepeatModeler (open-1.0.11) built a de novo repeat database using the NCBI BLAST engine (-engine NCBI), and the repeat elements were annotated against this database using RepeatMasker (v4.09). Tandem repeats were also predicted and annotated directly using the TRF software built into RepeatMasker (v4.09). Finally, we identified ~496.2 Mb of sequence as repeat elements in A. chinensis, constituting ~51.15% of the genome (Supplementary Table 7). We found that DNA transposons accounted for 5.08% of the genome, whereas long interspersed nuclear elements
We collected fresh samples of eggs, mixed first-, second-, third-, and fourth-instar nymphs, mixed female adults, and mixed male adults (n = 24 samples; n = 3 biological repeats/developmental stage). Total RNA was extracted using TRIzol reagent as described by the manufacturer (Invitrogen, Carlsbad, CA, USA). The integrity of the total RNA was determined by 1% agarose gel electrophoresis, and total RNA was quantified using 2100 RNA Nano 6000 Assay Kits (Agilent Technologies, Santa Clara, CA, USA). We prepared RNA-Seq libraries using TruSeq RNA sample preparation kits (Illumina, San Diego, CA, USA) and sequenced them on a HiSeq PE150 platform (Illumina). Raw RNA-seq reads were processed to remove adapters and low-quality sequences using SeqTk (https://github.com/lh3/seqtk). Cleaned reads were used to generate a de novo RNA-seq assembly using the Trinity program with default parameters 38. The reads were also mapped to the A. chinensis genome using Hisat2 (v2.0.4) 39.

Data Records
Pacific Biosciences, Illumina, and Hi-C sequencing data were deposited in the NCBI GenBank under accession number JAGJRN000000000 40. Developmental transcriptome data for eggs, nymphs, and adults were deposited in the NCBI Sequence Read Archive under accession number PRJNA1123459 41.

Technical Validation
Genome assembly assessment. We assessed the completeness of the genome assembly against Benchmarking Universal Single-Copy Orthologs (BUSCO). The A. chinensis gene set and genome had 96.1% complete (C) and 2.8% missing (M) BUSCOs (Supplementary Table 12). The distribution of GC-depth indicated that the assembled A. chinensis genome did not contain any visible bacterial contamination (Supplementary Figure 4). Therefore, we concluded that the A. chinensis dataset was comprehensive enough for further downstream analysis.
Chromosomal clustering assessment. The basic principles of Hi-C analysis are that intra-chromosome contacts are stronger than inter-chromosome contacts and that interactions weaken as distance increases. Consequently, interactions near the diagonal of the heat map are stronger than those located further from it, and neighboring bins are closely related. We separated the chromosomes predicted by LACHESIS into bins of equal length (1 Mb or 500 kb) and constructed a heat map based on the interaction signals revealed by valid mapped read pairs between bins. A heat map that fails to conform to these rules would indicate errors in the assembly.
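The binning and the distance-decay rule described above can be sketched in a few lines of Python. This is a simplified illustration for a single chromosome, not the software used in this study: the function names, the input format (a list of position pairs), and the toy read pairs are assumptions. The first function counts valid read pairs per pair of bins, and the second averages the signal along each diagonal so that the expected weakening with distance can be checked.

```python
def contact_matrix(pairs, chrom_length, bin_size):
    """Count read pairs per pair of bins on one chromosome.

    pairs is an iterable of (pos1, pos2) mapping positions in bp
    (hypothetical input format). Returns a symmetric square matrix.
    """
    n_bins = -(-chrom_length // bin_size)  # ceiling division
    matrix = [[0] * n_bins for _ in range(n_bins)]
    for pos1, pos2 in pairs:
        i, j = pos1 // bin_size, pos2 // bin_size
        matrix[i][j] += 1
        if i != j:
            matrix[j][i] += 1  # keep the matrix symmetric
    return matrix


def mean_signal_by_distance(matrix):
    """Average contact count at each bin separation d = |i - j|."""
    n = len(matrix)
    return [
        sum(matrix[i][i + d] for i in range(n - d)) / (n - d)
        for d in range(n)
    ]


# Toy data (made-up positions): a 4 Mb chromosome in 1 Mb bins.
toy_pairs = [
    (100_000, 200_000), (1_100_000, 1_900_000),
    (2_500_000, 2_600_000), (3_100_000, 3_200_000),  # same-bin contacts
    (900_000, 1_100_000), (2_900_000, 3_000_000),    # adjacent bins
    (500_000, 2_500_000),                            # two bins apart
]
matrix = contact_matrix(toy_pairs, chrom_length=4_000_000, bin_size=1_000_000)
profile = mean_signal_by_distance(matrix)

# A well-ordered chromosome should show signal that does not increase
# as the separation between bins grows.
decays = all(profile[d] >= profile[d + 1] for d in range(len(profile) - 1))
print(decays)  # True
```

In a misassembled region, a block of strong off-diagonal signal would break this monotonic decay, which is the visual cue looked for in the heat map.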

Fig. 1 Seven stages of the Arma chinensis life cycle. Eggs proceed through five nymphal instar stages, with final differentiation into adult males and females.

Fig. 2 K-mer frequency distribution curve (k = 17) of Illumina short reads of the A. chinensis genome. The x-axis shows k-mer depth and the y-axis shows the frequency of k-mers at that depth.

Fig. 4 Circos plot of the A. chinensis genome profile. (A) Chromosome number and length. (B) Non-coding RNAs: yellow, tRNA; purple, other ncRNAs. (C) Abundance of repetitive sequences. Darker blue indicates greater abundance. (D) Abundance of genes. Darker green indicates greater abundance. (E) Transcriptome gene expression calculated as log2 FPKM. Red and blue, upregulated and downregulated expression, respectively. (F) GC content (calculated in 10-kb windows).