The genome sequence of the Lesser Swallow Prominent, Pheosia gnoma (Fabricius, 1777)

We present a genome assembly from an individual male Pheosia gnoma (the Lesser Swallow Prominent; Arthropoda; Insecta; Lepidoptera; Notodontidae). The genome sequence is 271.3 megabases in span. Most of the assembly is scaffolded into 31 chromosomal pseudomolecules, including the Z sex chromosome. The mitochondrial genome has also been assembled and is 17.0 kilobases in length. Gene annotation of this assembly on Ensembl identified 11,628 protein-coding genes.


Background
The Lesser Swallow Prominent, Pheosia gnoma (Fabricius, 1777), is a Palearctic species of moth. It is similar in appearance to the Swallow Prominent (Pheosia tremula) (Boyes et al., 2021), but is distinguished by a shorter, white wedge-shaped streak at the tornus of the forewing (Kimber, 2023). Pheosia gnoma is widespread and common across southern counties of the British Isles, presenting with a paler-headed phenotype in contrast to localised brown-headed northern populations. Adults fly in two generations, from late April to June and again in August. P. gnoma has been recorded in a variety of habitats, particularly woodland, heathland, moorland, parks and gardens. The larvae feed on silver and downy birch (Betula) and overwinter underground as pupae; adults come to light in small numbers (Barbour et al., 1998).
As the third largest insect order in the world, Lepidoptera are widely used in the study of speciation. Much research has also focused on co-evolutionary dynamics with their host plants and how populations and distributions are changing in relation to climate change (Chen et al., 2022). The genome of P. gnoma was sequenced as part of the Darwin Tree of Life Project, a collaborative effort to sequence all named eukaryotic species in the Atlantic Archipelago of Britain and Ireland. Here we present a complete chromosome-level genome sequence for P. gnoma, based on one male specimen from Wytham Woods, Oxfordshire, UK. The genome assembly of P. gnoma will contribute to resolving higher-level phylogenetic relationships and better understanding the reasons underpinning species diversification and morphological evolution.

Genome sequence report
The genome was sequenced from one male Pheosia gnoma (Figure 1) collected from Wytham Woods, Oxfordshire, UK (latitude 51.77, longitude -1.31). A total of 60-fold coverage in Pacific Biosciences single-molecule HiFi long reads and 148-fold coverage in 10X Genomics read clouds were generated. Primary assembly contigs were scaffolded with chromosome conformation Hi-C data. Manual assembly curation corrected 60 missing joins or mis-joins and removed 6 haplotypic duplications, reducing the scaffold number by 52.5% and increasing the scaffold N50 by 8.88%.
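As a rough illustration of what the fold-coverage figures imply, the sketch below converts coverage into an approximate sequenced-base yield. It assumes coverage is expressed relative to the 271.3 Mb assembly span reported in this note; the actual calculation may have used an estimated genome size instead, so the results are indicative only. The function name `sequenced_bases` is hypothetical.

```python
# Illustrative arithmetic only: approximate data yield implied by fold coverage.
# Assumption: coverage is relative to the 271.3 Mb assembly span.

GENOME_SPAN_BP = 271.3e6  # assembly span reported in this note


def sequenced_bases(coverage_fold: float, genome_span_bp: float) -> float:
    """Return the approximate number of sequenced bases for a given coverage."""
    if coverage_fold < 0 or genome_span_bp <= 0:
        raise ValueError("coverage must be >= 0 and genome span must be > 0")
    return coverage_fold * genome_span_bp


hifi_bases = sequenced_bases(60, GENOME_SPAN_BP)    # PacBio HiFi, 60-fold
tenx_bases = sequenced_bases(148, GENOME_SPAN_BP)   # 10X read clouds, 148-fold

print(f"HiFi yield: ~{hifi_bases / 1e9:.1f} Gb")
print(f"10X yield:  ~{tenx_bases / 1e9:.1f} Gb")
```

Under this assumption the HiFi data amount to roughly 16 Gb and the 10X data to roughly 40 Gb.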
The final assembly has a total length of 271.3 Mb in 38 sequence scaffolds with a scaffold N50 of 9.8 Mb (Table 1). Most (99.93%) of the assembly sequence was assigned to 31 chromosomal-level scaffolds, representing 30 autosomes and the Z sex chromosome. Chromosome-scale scaffolds confirmed by the Hi-C data are named in order of size (Figure 2–Figure 5; Table 2). While not fully phased, the assembly deposited is of one haplotype. Contigs corresponding to the second haplotype have also been deposited. The mitochondrial genome was also assembled and can be found as a contig within the multifasta file of the genome submission.
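For readers unfamiliar with the scaffold N50 statistic quoted above, the following minimal sketch shows the standard definition: the length of the scaffold at which the cumulative length, counted from the longest scaffold downwards, first reaches half of the total assembly length. The scaffold lengths in the example are made-up toy values, not the lengths from this assembly, and the function name `n50` is illustrative.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of sequence lengths."""
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Stop at the first scaffold where the cumulative length
        # covers at least half of the total assembly length.
        if running * 2 >= total:
            return length
    raise ValueError("lengths must be non-empty")


# Toy example (hypothetical lengths in Mb, not from this assembly):
toy_scaffolds = [10, 8, 6, 4, 2]
print(n50(toy_scaffolds))
```

In the toy example the total is 30, so the threshold of 15 is reached on the second-longest scaffold and the N50 is 8.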
Metadata for specimens, spectral estimates, sequencing runs, contaminants and pre-curation assembly statistics can be found at https://links.tol.sanger.ac.uk/species/988018.

Sample acquisition and nucleic acid extraction
A male Pheosia gnoma specimen (individual ilPheGnom1, specimen Ox000389) was collected from Wytham Woods, Oxfordshire (biological vice-county Berkshire), UK (latitude 51.77, longitude -1.31) on 22 May 2020. The specimen was taken from woodland habitat by Douglas Boyes (University of Oxford) using a light trap. The specimen was identified by the collector and snap-frozen on dry ice.
DNA was extracted at the Tree of Life laboratory, Wellcome Sanger Institute (WSI). The ilPheGnom1 sample was weighed and dissected on dry ice with head and thorax tissue set aside for Hi-C and RNA sequencing. Abdomen tissue was cryogenically disrupted to a fine powder using a Covaris cryoPREP Automated Dry Pulveriser, receiving multiple impacts. High molecular weight (HMW) DNA was extracted using the Qiagen MagAttract HMW DNA extraction kit. Low molecular weight DNA was removed from a 20 ng aliquot of extracted DNA using the 0.8X AMPure XP purification kit prior to 10X Chromium sequencing; a minimum of 50 ng DNA was submitted for 10X sequencing. HMW DNA was sheared into an average fragment size of 12-20 kb in a Megaruptor 3 system with speed setting 30. Sheared DNA was purified by solid-phase reversible immobilisation using AMPure PB beads with a 1.8X ratio of beads to sample to remove the shorter fragments and concentrate the DNA sample. The concentration of the sheared and purified DNA was assessed using a Nanodrop spectrophotometer and a Qubit Fluorometer with the Qubit dsDNA High Sensitivity Assay kit. Fragment size distribution was evaluated by running the sample on the FemtoPulse system.
RNA was extracted from head and thorax tissue of ilPheGnom1 in the Tree of Life Laboratory at the WSI using TRIzol, according to the manufacturer's instructions. RNA was then eluted in 50 μl RNAse-free water and its concentration assessed using a Nanodrop spectrophotometer and a Qubit Fluorometer with the Qubit RNA Broad-Range (BR) Assay kit. RNA integrity was analysed using the Agilent RNA 6000 Pico Kit and Eukaryotic Total RNA assay.

Sequencing
Pacific Biosciences HiFi circular consensus and 10X Genomics read cloud DNA sequencing libraries were constructed. Hi-C data were also generated from head and thorax tissue of ilPheGnom1 using the Arima2 kit and sequenced on the HiSeq X Ten instrument.

Genome assembly, curation and evaluation
Assembly was carried out with HiCanu (Nurk et al., 2020) and haplotypic duplication was identified and removed with purge_dups (Guan et al., 2020). One round of polishing was performed by aligning 10X Genomics read data to the assembly with Long Ranger ALIGN and calling variants with FreeBayes (Garrison & Marth, 2012). The assembly was then scaffolded with Hi-C data (Rao et al., 2014) using SALSA2 (Ghurye et al., 2019). The assembly was checked for contamination and corrected using the gEVAL system (Chow et al., 2016) as described previously (Howe et al., 2021). Manual curation was performed using gEVAL, HiGlass (Kerpedjiev et al., 2018) and Pretext (Harry, 2022). The mitochondrial genome was assembled using MitoHiFi (Uliano-Silva et al., 2022), which runs MitoFinder (Allio et al., 2020) or MITOS (Bernt et al., 2013) and uses these annotations to select the final mitochondrial contig and to ensure the general quality of the sequence. To evaluate the assembly, MerquryFK was used to estimate consensus quality (QV) scores and k-mer completeness (Rhie et al., 2020). The genome was analysed within the BlobToolKit environment (Challis et al., 2020) and BUSCO scores (Manni et al., 2021; Simão et al., 2015) were calculated. Table 3 contains a list of software tool versions and sources.
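To make the consensus quality (QV) score more concrete, the sketch below follows the k-mer based estimate described for Merqury by Rhie et al. (2020): k-mers present in the assembly but absent from the read set are treated as evidence of base errors, and the resulting per-base error rate is expressed on the Phred scale. It is assumed here that MerquryFK applies the same formula; the function name and the counts in the example are hypothetical and are not taken from this assembly.

```python
import math


def kmer_qv(asm_only_kmers: int, asm_total_kmers: int, k: int) -> float:
    """Estimate a Phred-scaled consensus QV from k-mer counts.

    asm_only_kmers:  k-mers found in the assembly but not in the reads
    asm_total_kmers: all k-mers found in the assembly
    k:               k-mer length
    """
    if k < 1 or not 0 < asm_only_kmers <= asm_total_kmers:
        raise ValueError("need k >= 1 and 0 < asm_only_kmers <= asm_total_kmers")
    # Fraction of assembly k-mers supported by the reads.
    shared_fraction = 1 - asm_only_kmers / asm_total_kmers
    # Probability that a single base is correct, assuming independent errors.
    per_base_correct = shared_fraction ** (1 / k)
    error_rate = 1 - per_base_correct
    return -10 * math.log10(error_rate)


# Hypothetical counts for illustration only (not from this assembly):
print(f"QV = {kmer_qv(asm_only_kmers=5_000, asm_total_kmers=270_000_000, k=31):.1f}")
```

A higher QV corresponds to a lower estimated error rate; for example, QV 30 is one error per 1,000 bases and QV 60 is one error per 1,000,000 bases.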

Genome annotation
The Ensembl gene annotation system (Aken et al., 2016) was used to generate annotation for the Pheosia gnoma assembly (GCA_905404115.1). Annotation was created primarily through alignment of transcriptomic data to the genome, with gap filling via protein-to-genome alignments of a select set of proteins from UniProt (UniProt Consortium, 2019).