The genome sequence of a soldier beetle, Podabrus alpinus (Paykull, 1798)

We present a genome assembly from an individual female Podabrus alpinus (soldier beetle; Arthropoda; Insecta; Coleoptera; Cantharidae). The genome sequence is 777 megabases in span. Most of the assembly is scaffolded into seven chromosomal pseudomolecules, including the assembled X sex chromosome. The mitochondrial genome has also been assembled and is 18.8 kilobases in length. Gene annotation of this assembly on Ensembl identified 30,955 protein-coding genes.


Background
Podabrus alpinus belongs to the beetle family Cantharidae, commonly known as soldier beetles. Although quite closely related to click beetles (Elateridae) (Kusy et al., 2021), soldier beetles are morphologically very distinct and form phenotypically characteristic lineages along with some other soft-bodied elateroids, i.e., net-winged beetles (Lycidae) and fireflies (Lampyridae), but not with Telegeusidae: Omethinae, which had been placed in soldier beetles until quite recently (Bocakova et al., 2007; Crowson, 1972). The adult P. alpinus is medium-sized, elongate, and dorsoventrally depressed, with feeble sclerotization of the body, especially the abdomen and elytra. Although highly variable in size and colouration, both light and dark forms are readily recognised among UK Cantharidae by the distinct 'neck' and prominent eyes.
Podabrus alpinus is widespread and common throughout the northern Palearctic region from western Europe to East Asia (Kazantsev & Brancucci, 2007). It is also common across much of the UK, except for the east and south-west of England (NBN Atlas, no date). The species is associated with open woodland habitats, especially those with pine trees (Pinaceae), although across Europe it is more strongly associated with upland habitats (Dvořák, 2010). The beetles are familiar to the public because they are conspicuous when sitting on leaves and flowers in forested areas.
Adults are predators that feed on soft-bodied insects, but they have also been observed feeding on pollen. Adults are active over a relatively short period from mid-May to August, and can often be seen resting on umbellifer flowers in warmer weather during the day. In the late afternoons and evenings, they become more active and dispersed. The biology of the larvae is poorly documented, but they are believed to be predators living in upper soil layers and organic debris, in line with other members of the genus Podabrus (Fitton et al., 1976).
Most Cantharidae employ chemical defences and aposematic colouration; as a result, Cantharidae are often members of extensive mimicry rings. The reference genome of P. alpinus could therefore help us understand the genomic basis of chemical defence and mimicry. P. alpinus is one of the UK species targeted for assembly by the Darwin Tree of Life project, and this is the first high-quality genome assembled for the species.

Genome sequence report
The genome was sequenced from one female Podabrus alpinus specimen (Figure 1) collected from Wytham Woods (51.768, -1.34). A total of 27-fold coverage in Pacific Biosciences single-molecule HiFi long reads and 43-fold coverage in 10X Genomics read clouds were generated. Primary assembly contigs were scaffolded with chromosome conformation Hi-C data. Manual assembly curation corrected 101 missing joins and misjoins and removed 13 haplotypic duplications, reducing the assembly length by 0.73% and the scaffold number by 10.66%, and increasing the scaffold N50 by 12.32%.
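The curation percentages can be related to the final assembly figures reported in this section (777.2 Mb, 243 scaffolds, scaffold N50 of 103.3 Mb) by inverting each percentage change. The following is a minimal sketch that assumes each percentage is expressed relative to the pre-curation value; the pre-curation numbers it produces are back-calculated estimates, not values reported in this note, and the function name is illustrative.

```python
def pre_curation_value(final_value: float, percent_change: float) -> float:
    """Invert a percentage change.

    Assumes final = initial * (1 + percent_change / 100), i.e. the
    percentage is expressed relative to the pre-curation value.
    """
    return final_value / (1.0 + percent_change / 100.0)


# Final (post-curation) figures and percentage changes taken from the text.
# Reductions are negative changes; the N50 increase is a positive change.
estimates = {
    "assembly length (Mb)": pre_curation_value(777.2, -0.73),
    "scaffold count": pre_curation_value(243, -10.66),
    "scaffold N50 (Mb)": pre_curation_value(103.3, 12.32),
}

for name, value in estimates.items():
    print(f"estimated pre-curation {name}: {value:.1f}")
```

Under this assumption the scaffold count before curation comes out at roughly 272, which is consistent with a reduction of about 29 scaffolds to the final 243.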
The final assembly has a total length of 777.2 Mb in 243 sequence scaffolds with a scaffold N50 of 103.3 Mb (Table 1). Most (95.62%) of the assembly sequence was assigned to seven chromosomal-level scaffolds, representing six autosomes and the X sex chromosome. Chromosome-scale scaffolds confirmed by the Hi-C data are named in order of size (Figure 2-Figure 5; Table 2). The orientation of chromosome 2 in the region 5.5-10.6 Mb is uncertain. The assembly has a BUSCO v5.3.2 (Manni et al., 2021) completeness of 98.5% (single 93.5%, duplicated 5%) using the endopterygota_odb10 reference set (n = 2,124). While not fully phased, the assembly deposited is of one haplotype. Contigs corresponding to the second haplotype have also been deposited.
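For readers unfamiliar with the scaffold N50 statistic quoted above, it is the length of the shortest scaffold in the smallest set of longest scaffolds that together cover at least half of the assembly. The sketch below illustrates the standard definition on made-up scaffold lengths; the toy values are not taken from this assembly, and the function is illustrative rather than the tool used to produce Table 1.

```python
def scaffold_n50(lengths):
    """Return the N50 of a collection of scaffold lengths.

    Scaffolds are visited from longest to shortest; the N50 is the length
    of the scaffold at which the running total first reaches half of the
    total assembly length.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    half_total = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half_total:
            return length


# Hypothetical scaffold lengths in Mb (total 515 Mb, half is 257.5 Mb).
toy_lengths_mb = [150, 120, 100, 80, 50, 10, 5]
print(scaffold_n50(toy_lengths_mb))  # 150 + 120 = 270 >= 257.5, so prints 120
```

Because the statistic depends only on the longest scaffolds, an assembly with a handful of chromosome-scale scaffolds and many short unplaced ones can still have a large N50.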

Genome annotation report
The P. alpinus genome assembly (icPodAlpi1.1) was annotated using the Ensembl rapid annotation pipeline (Table 1; https://rapid.ensembl.org/Podabrus_alpinus_GCA_932274525.1/).
DNA was sheared into an average fragment size of 12-20 kb in a Megaruptor 3 system with speed setting 30. Sheared DNA was purified by solid-phase reversible immobilisation using AMPure PB beads with a 1.8X ratio of beads to sample to remove the shorter fragments and concentrate the DNA sample. The concentration of the sheared and purified DNA was assessed using a Nanodrop spectrophotometer and a Qubit Fluorometer with the Qubit dsDNA High Sensitivity Assay kit. Fragment size distribution was evaluated by running the sample on the FemtoPulse system.

Sequencing
Pacific Biosciences HiFi circular consensus and 10X Genomics read cloud DNA sequencing libraries were constructed according to the manufacturers' instructions. DNA sequencing was performed by the Scientific Operations core at the WSI on Pacific Biosciences SEQUEL II (HiFi) and HiSeq X Ten (10X) instruments. Hi-C data were also generated from tissue of icPodAlpi1 using the Arima v2 kit and sequenced on the HiSeq X Ten instrument.

Genome assembly
Assembly was carried out with Hifiasm (Cheng et al., 2021) and haplotypic duplication was identified and removed with purge_dups (Guan et al., 2020). One round of polishing was performed by aligning 10X Genomics read data to the assembly with Long Ranger ALIGN and calling variants with freebayes (Garrison & Marth, 2012). The assembly was then scaffolded with Hi-C data (Rao et al., 2014) using YaHS (Zhou et al., 2022). The assembly was checked for contamination as described previously (Howe et al., 2021). Manual curation was performed using HiGlass (Kerpedjiev et al., 2018) and Pretext (Harry, 2022). The mitochondrial genome was assembled using MitoHiFi (Uliano-Silva et al., 2022), which performed annotation using MitoFinder (Allio et al., 2020). The genome was analysed and BUSCO scores generated within the BlobToolKit environment (Challis et al., 2020).
The genome sequence is released openly for reuse. The P. alpinus genome sequencing initiative is part of the Darwin Tree of Life (DToL) project. All raw sequence data and the assembly have been deposited in INSDC databases. Raw data and assembly accession identifiers are reported in Table 1.
