The genome sequence of the wood white butterfly, Leptidea sinapis (Linnaeus, 1758)

We present a genome assembly from an individual male Leptidea sinapis (the wood white; Arthropoda; Insecta; Lepidoptera; Pieridae). The genome sequence is 686 megabases in span. The majority (99.99%) of the assembly is scaffolded into 48 chromosomal pseudomolecules, with three Z sex chromosomes assembled. Gene annotation of this assembly on Ensembl has identified 14,800 protein coding genes.


Background
The wood white butterfly (Leptidea sinapis) is recognized by its white wings with dark apical spots on the forewings and its distinctively slow flight (Thomas & Lewington, 2016). The preferred habitats are forest openings and meadows where herbaceous host plants from the family Fabaceae are present (Friberg et al., 2008; Wiklund, 1977). The distribution range covers a major part of the western Palearctic, the African continent excluded. Within Britain and Ireland, wood whites are restricted to fragmented, sheltered areas in southern Wales and England and a small region around Burren in western Ireland (Thomas & Lewington, 2016). As a consequence of considerable population declines over the last decades, the wood white was included in the UK Biodiversity Action Plan in 2007, but the species has likely been under-surveyed (Jeffcoate & Joy, 2011).
The wood white has long been the subject of ecological studies, investigating, for example, interaction with recently discovered cryptic and sympatric sister species and habitat preference variation (Friberg et al., 2008; Friberg & Wiklund, 2009; Wiklund, 1977). Due to the presence of a striking chromosome number cline across the distribution range (Dincă et al., 2011; Lukhtanov et al., 2020), the wood white has also developed into a model species for understanding the mechanistic underpinnings and evolutionary consequences of rapid karyotype evolution (Lukhtanov et al., 2020; Šíchová et al., 2015; Talla et al., 2019). Previous genomic and cytogenetic research has revealed a drastically expanded and unusually repeat-rich genome compared to most studied butterflies (Talla et al., 2017), and the presence of an unexpected sex-chromosome system (Šíchová et al., 2015). Existing genomic resources have also paved the way for investigating, for example, the genetic basis of local adaptation (Leal et al., 2018; Näsvall et al., 2021) and expression dynamics of sex-linked and autosomal genes (Höök et al., 2019). We foresee that the Darwin Tree of Life assembly presented here will be an important tool for forthcoming research on chromosome number dynamics, the association between structural rearrangements and reproductive isolation, the genetic basis of adaptive traits and the mechanistic underpinnings of microevolutionary processes in butterflies.

Genome sequence report
The genome was sequenced from a single male L. sinapis (Figure 1) collected from Somiedo, Pigueces, Asturias, Spain (latitude 43.1489, longitude -6.3127). A total of 36-fold coverage in Pacific Biosciences single-molecule circular consensus (HiFi) long reads and 55-fold coverage in 10X Genomics read clouds were generated. Primary assembly contigs were scaffolded with chromosome conformation Hi-C data. Manual assembly curation corrected 20 missing joins or misjoins and removed 12 haplotypic duplications, reducing the assembly length by 1.30% and the scaffold number by 18.33%.
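The coverage and curation figures in this paragraph can be sanity-checked with simple arithmetic. The Python sketch below is illustrative only: the function names are hypothetical, the pre-curation values are back-calculated from the reported percentages rather than taken from the text, and the sequencing yields assume that fold coverage is expressed relative to the 686 Mb assembly span reported for the final assembly.

```python
def before_reduction(final_value: float, percent_reduction: float) -> float:
    """Return the value that, reduced by `percent_reduction` percent, gives `final_value`."""
    return final_value / (1.0 - percent_reduction / 100.0)


def approx_yield_gb(fold_coverage: float, span_mb: float) -> float:
    """Approximate sequencing yield in gigabases for a given fold coverage.

    Assumption: coverage is measured against the assembly span.
    """
    return fold_coverage * span_mb / 1000.0


# Figures reported in the text for the final assembly.
FINAL_SCAFFOLDS = 49
FINAL_SPAN_MB = 686

# Inferred (not reported) pre-curation values.
scaffolds_before = before_reduction(FINAL_SCAFFOLDS, 18.33)
span_before_mb = before_reduction(FINAL_SPAN_MB, 1.30)

# Approximate data volumes implied by the stated coverage.
hifi_gb = approx_yield_gb(36, FINAL_SPAN_MB)
tenx_gb = approx_yield_gb(55, FINAL_SPAN_MB)

print(f"Implied scaffolds before curation: {scaffolds_before:.1f}")
print(f"Implied span before curation: {span_before_mb:.1f} Mb")
print(f"Approximate HiFi yield: {hifi_gb:.1f} Gb")
print(f"Approximate 10X yield: {tenx_gb:.1f} Gb")
```

Under these assumptions, an 18.33% reduction ending at 49 scaffolds is consistent with roughly 60 scaffolds before curation (11 of 60 removed or merged).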
The final assembly has a total length of 686 Mb in 49 sequence scaffolds with a scaffold N50 of 14.4 Mb. The majority (99.99%) of the assembly sequence was assigned to 48 chromosomal-level scaffolds, representing 45 autosomes (numbered by sequence length) and three Z sex chromosomes (Figure 2-Figure 5; Table 2). The assembly has a BUSCO v5.
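For readers less familiar with the scaffold N50 statistic quoted above: it is the length L such that scaffolds of length L or longer together contain at least half of the total assembly length. The following Python sketch computes it for a list of scaffold lengths. The toy lengths are hypothetical and are not taken from this assembly; the function name is illustrative.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of sequence lengths.

    Sequences are visited from longest to shortest; the N50 is the length
    of the sequence at which the running total first reaches half of the
    overall length.
    """
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one sequence length is required")
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Compare 2 * running with total to avoid fractional halves.
        if 2 * running >= total:
            return length
    raise AssertionError("unreachable: running total always reaches the total")


# Hypothetical scaffold lengths (arbitrary units), for illustration only.
toy_lengths = [10, 8, 6, 4, 2]
toy_n50 = n50(toy_lengths)
print(toy_n50)
```

In the toy example the total length is 30; the two longest scaffolds (10 + 8 = 18) are the first to cover at least 15, so the N50 is 8.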

Methods
Sample acquisition and nucleic acid extraction
Two male L. sinapis specimens (ilLepSina1, genome assembly, Hi-C; ilLepSina2, RNA-Seq) were collected with a net from Somiedo, Pigueces, Asturias, Spain (latitude 43.1489, longitude -6.3127) by Konrad Lohse, University of Edinburgh, who also identified the samples. The samples were frozen at -80°C.
DNA was extracted at the Scientific Operations Core, Wellcome Sanger Institute. The ilLepSina1 sample was weighed and dissected on dry ice with tissue set aside for Hi-C sequencing. Whole organism tissue was disrupted by manual grinding with a disposable pestle. Fragment size analysis of 0.01-0.5 ng of DNA was then performed using an Agilent FemtoPulse. High molecular weight (HMW) DNA was extracted using the Qiagen MagAttract HMW DNA extraction kit. Low molecular weight DNA was removed from a 200-ng aliquot of extracted DNA using 0.8X AMPure XP purification kit prior to 10X Chromium sequencing; a minimum of 50 ng DNA was submitted for 10X sequencing. HMW DNA was sheared to an average fragment size of 12-20 kb in a Megaruptor 3 system with speed setting 30. Sheared DNA was purified by solid-phase reversible immobilisation using AMPure PB beads with a 1.8X ratio of beads to sample to remove the shorter fragments and concentrate the DNA sample. The concentration of the sheared and purified DNA was assessed using a Nanodrop spectrophotometer and a Qubit Fluorometer with the Qubit dsDNA High Sensitivity Assay kit. Fragment size distribution was evaluated by running the sample on the FemtoPulse system.
RNA was extracted from whole organism tissue of ilLepSina2 in the Tree of Life Laboratory at the WSI using TRIzol, according to the manufacturer's instructions. RNA was then eluted in 50 μl RNase-free water and the RNA concentration assessed using a Nanodrop spectrophotometer and a Qubit Fluorometer with the Qubit RNA Broad-Range (BR) Assay kit. The integrity of the RNA was analysed using the Agilent RNA 6000 Pico Kit and Eukaryotic Total RNA assay.

Sequencing
Pacific Biosciences HiFi circular consensus and 10X Genomics read cloud DNA sequencing libraries were constructed according to the manufacturers' instructions. Poly(A) RNA-Seq libraries were constructed using the NEB Ultra II RNA Library Prep kit. DNA and RNA sequencing was performed by the Scientific Operations core at the WSI on Pacific Biosciences SEQUEL II (HiFi), Illumina HiSeq X (10X) and Illumina HiSeq 4000 (RNA-Seq) instruments. Hi-C data were also generated from the whole organism of ilLepSina1 using the Arima v2 Hi-C kit and sequenced on an Illumina NovaSeq 6000 instrument.

Genome assembly
Assembly was carried out with Hifiasm (Cheng et al., 2021); haplotypic duplication was identified and removed with purge_dups (Guan et al., 2020). One round of polishing was performed by aligning 10X Genomics read data to the assembly with longranger align and calling variants with freebayes (Garrison & Marth, 2012). The assembly was then scaffolded with Hi-C data (Rao et al., 2014) using SALSA2 (Ghurye et al., 2019). The assembly was checked for contamination and corrected using the gEVAL system (Chow et al., 2016) as described previously (Howe et al., 2021). Manual curation (Howe et al., 2021) was performed using gEVAL, HiGlass (Kerpedjiev et al., 2018) and Pretext. The mitochondrial genome was assembled using MitoHiFi (Uliano-Silva et al., 2021), which performs annotation using MitoFinder (Allio et al., 2020). The genome was analysed and BUSCO scores generated within the BlobToolKit environment (Challis et al., 2020). Table 3 contains a list of all software tool versions used, where appropriate.

Genome annotation
The Ensembl gene annotation system (Aken et al., 2016) was used to generate annotation for the Leptidea sinapis assembly (GCA_905404315.1). Annotation was created primarily through alignment of transcriptomic data to the genome, with gap filling via protein-to-genome alignments of a select set of proteins from UniProt (UniProt Consortium, 2019).

Data availability
European Nucleotide Archive: Leptidea sinapis (wood white). Accession number PRJEB43801; https://identifiers.org/ena.embl/PRJEB43801 (Wellcome Sanger Institute, 2022). The genome sequence is released openly for reuse. The L. sinapis genome sequencing initiative is part of the Darwin Tree of Life (DToL) project. All raw sequence data and the assembly have been deposited in INSDC databases. Raw data and assembly accession identifiers are reported in Table 1.
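As a convenience for readers who wish to list the raw sequencing runs under this project accession, the Python sketch below builds a query URL for the ENA Portal API file report. It is not part of the original study: the endpoint, the `read_run` result type and the field names are assumptions based on our understanding of the public ENA Portal API and should be checked against the current ENA documentation. The function name is hypothetical, and the code only constructs the URL; it does not contact the server.

```python
from urllib.parse import urlencode

# Assumed base URL of the ENA Portal API file report endpoint.
ENA_FILEREPORT = "https://www.ebi.ac.uk/ena/portal/api/filereport"


def build_filereport_url(accession: str, fields: list[str]) -> str:
    """Build a file report URL listing the sequencing runs under an accession."""
    query = {
        "accession": accession,
        "result": "read_run",        # assumed result type for raw read runs
        "fields": ",".join(fields),  # comma-separated column names
        "format": "tsv",
    }
    # Keep commas unescaped so the field list stays readable in the URL.
    return f"{ENA_FILEREPORT}?{urlencode(query, safe=',')}"


# Project accession from the Data availability statement.
url = build_filereport_url("PRJEB43801", ["run_accession", "fastq_ftp"])
print(url)
```

The resulting URL can be opened in a browser or fetched with any HTTP client to obtain a tab-separated table of runs, assuming the endpoint behaves as described.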

Author information
Members of the Wellcome Sanger Institute Tree of Life programme are listed here: https://doi.org/10.5281/zenodo.6866293.