The genome sequence of the pale mottled willow, Caradrina clavipalpis (Scopoli, 1763)

We present a genome assembly from an individual male Caradrina clavipalpis (pale mottled willow; Arthropoda; Insecta; Lepidoptera; Noctuidae). The genome sequence is 474 megabases in span. The entire assembly (100%) is scaffolded into 31 chromosomal pseudomolecules with the Z sex chromosome assembled. The complete mitochondrial genome was also assembled and is 15.6 kilobases in length.


Background
The pale mottled willow, Caradrina clavipalpis (Scopoli, 1763), is a widespread noctuid moth of grassland and gardens found across the western Palaearctic from Europe to Sri Lanka. It is resident in the British Isles, but its population is believed to be boosted by immigration, as large numbers of individuals have been recorded on nights with known influxes of migrants. The species has declined in Scotland and northern England, although its British population overall seems to be stable (Randle et al., 2019).
The adult moth is attracted to light and sugar, and also feeds at flowers. It is thought to have two generations each year in the UK, with adults on the wing in May-July and again in August-October. The adult moth is quite small, with a forewing length of 12-15 mm. It has mottled forewings, with a series of dashes on the leading edge. The hindwing is pearly white.
The larvae feed on the grain of cereal crops (Gramineae), both in the field and in storage, and also on plantains (Plantago spp.). Historic records from coal mines described the larvae as living on the fodder of the pit-ponies (Heath & Emmett, 1983). There are also records of the adult being infested with the mite Cheletomorpha lepidopterum, which is found in hay bales; a previous name for this moth was the hay moth (Forgham, 2015). Larvae pupate in autumn in a robust cocoon underground, from which they emerge in spring. This early generation gives rise to the second generation later in the year (Heath & Emmett, 1983).
The genome of C. clavipalpis was sequenced as part of the Darwin Tree of Life Project, a collaborative effort to sequence all of the named eukaryotic species in the Atlantic Archipelago of Britain and Ireland. Here we present a chromosomally complete genome sequence for C. clavipalpis, based on one specimen (ilCarClav1) from Wytham Woods, Berkshire, UK.

Genome sequence report
The genome was sequenced from a single male C. clavipalpis collected from Wytham Woods, Berkshire, UK (Figure 1). A total of 43-fold coverage in Pacific Biosciences single-molecule HiFi long reads and 35-fold coverage in 10X Genomics read clouds were generated. Primary assembly contigs were scaffolded with chromosome conformation Hi-C data. Manual assembly curation corrected 1 misjoin, reducing the assembly size by 0.35% and the scaffold number by 6.06%. The final assembly has a total length of 474 Mb in 31 sequence scaffolds with a scaffold N50 of 16.8 Mb (Table 1). The entire assembly sequence (100%) was assigned to 31 chromosomal-level scaffolds, representing 30 autosomes (numbered by sequence length) and the Z sex chromosome (Figure 2-Figure 5; Table 2). The assembly has a BUSCO v5.3.2 (Manni et al., 2021) completeness of 98.8% (single 98.5%, duplicated 0.3%) using the lepidoptera_odb10 reference set (n=5,286). While not fully phased, the assembly deposited is of one haplotype. Contigs corresponding to the second haplotype have also been deposited.
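The summary statistics in this report can be checked with a few lines of arithmetic. The Python sketch below is illustrative only and is not part of the assembly pipeline: the function names and the toy scaffold lengths are hypothetical, and the pre-curation scaffold count of 33 is an inference from the reported 6.06% reduction to 31 scaffolds, not a figure stated in the text.

```python
def scaffold_n50(lengths):
    """Return the N50: the largest length L such that scaffolds of
    length >= L together cover at least half of the total span."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("lengths must be non-empty")


def percent_reduction(before, after):
    """Percentage by which a count or size dropped from before to after."""
    return 100.0 * (before - after) / before


# Toy scaffold lengths in Mb (hypothetical, NOT the real assembly).
toy_lengths = [20, 18, 17, 15, 10, 5]
toy_n50 = scaffold_n50(toy_lengths)

# Assumption: 33 scaffolds before curation, inferred from the reported
# 6.06% reduction to the final 31 scaffolds.
scaffold_drop = round(percent_reduction(33, 31), 2)

# BUSCO completeness is the sum of single-copy and duplicated percentages.
busco_complete = round(98.5 + 0.3, 1)

print(toy_n50, scaffold_drop, busco_complete)
```

With the real scaffold lengths from the deposited assembly, the same `scaffold_n50` function would be expected to reproduce the reported scaffold N50 of 16.8 Mb.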

Sample acquisition and nucleic acid extraction
A single male C. clavipalpis specimen (ilCarClav1) was collected using a light trap from Wytham Woods, Berkshire, UK (latitude 51.772, longitude -1.338) by Douglas Boyes (University of Oxford). The specimen was identified by Douglas Boyes and snap-frozen on dry ice.
DNA was extracted at the Tree of Life laboratory, Wellcome Sanger Institute. The ilCarClav1 sample was weighed and dissected on dry ice with head tissue set aside for Hi-C sequencing. Thorax tissue was disrupted using a Nippi Powermasher fitted with a BioMasher pestle. Fragment size analysis of 0.01-0.5 ng of DNA was then performed using an Agilent FemtoPulse. High molecular weight (HMW) DNA was extracted using the Qiagen MagAttract HMW DNA extraction kit. Low molecular weight DNA was removed from a 200-ng aliquot of extracted DNA using a 0.8X AMPure XP purification kit prior to 10X Chromium sequencing; a minimum of 50 ng DNA was submitted for 10X sequencing. HMW DNA was sheared into an average fragment size of 12-20 kb in a Megaruptor 3 system with speed setting 30. Sheared DNA was purified by solid-phase reversible immobilisation using AMPure PB beads with a 1.8X ratio of beads to sample to remove the shorter fragments and concentrate the DNA sample. The concentration of the sheared and purified DNA was assessed using a Nanodrop spectrophotometer and a Qubit Fluorometer with the Qubit dsDNA High Sensitivity Assay kit. Fragment size distribution was evaluated by running the sample on the FemtoPulse system.
RNA was extracted from abdomen tissue of ilCarClav1 in the Tree of Life Laboratory at the WSI using TRIzol, according to the manufacturer's instructions. RNA was then eluted in 50 μl RNAse-free water and its concentration assessed using a Nanodrop spectrophotometer and a Qubit Fluorometer with the Qubit RNA Broad-Range (BR) Assay kit. The integrity of the RNA was analysed using the Agilent RNA 6000 Pico Kit and Eukaryotic Total RNA assay.
The genome sequence is released openly for reuse. The C. clavipalpis genome sequencing initiative is part of the Darwin Tree of Life (DToL) project. All raw sequence data and the assembly have been deposited in INSDC databases.
The genome will be annotated using the RNA-Seq data and presented through the Ensembl pipeline at the European Bioinformatics Institute. Raw data and assembly accession identifiers are reported in Table 1.

Xueyan Li
Chinese Academy of Sciences, Kunming, China
This data note presents a high-quality, high-integrity genome assembly of the pale mottled willow, Caradrina clavipalpis, which updates the genome repository for Lepidoptera. The data provide a genetic foundation for further studies on the systematics and complex phenotypic evolution of moths, and of Lepidoptera more broadly. Some problems are worthy of further clarification:
1. Table 1 lists PolyA RNA-Seq Illumina data, which the article also mentions; providing annotation results for this genome would be more conducive to the extensive use of the data.
2. The abstract mentions the mitochondrial genome, but there are no corresponding results in the main text.

The data note is well structured and, based on the described methods and the publicly available sequencing data, also reproducible by the scientific community. The primary assembly already has very high contiguity at the contig level. Only two scaffolds consist of more than one contig, and most scaffolds have the telomere sequence motif at both ends.

Methods:
The genome assembly methods are a bit short, and it could be hard to reproduce the assembly without further documentation of the program arguments used. Even if all tools were run in default mode, this would be worth mentioning. Some of the tools used, such as purge_dups, themselves depend on other programs such as minimap2, but those could not be found in Table 3. I could not find any information on whether all variants from Freebayes were used for the error polishing or whether a variant filtering step was applied beforehand.
I have only some minor comments and suggestions:
○ I could not find the Hi-C read coverage in the Data Note.
○ The plots in Figures 3 and 4 are not very informative for such a highly contiguous assembly.
○ Additional statistics about the sequencing data (e.g. read length N50 of HiFi reads and 10X read clouds, k-mer-based genome size and heterozygosity estimates) and the final assembly (e.g. Merqury QV scores, repeat content) would be nice to see as well.
○ Error-polishing strategies for HiFi-based assemblies are currently under debate. I assume you applied bcftools consensus on the filtered Freebayes VCF files, similar to the VGP assembly pipeline (https://github.com/VGP/vgp-assembly/tree/master/pipeline). How big was the improvement in terms of QV score? Were the alternate contigs included in the longranger alignment step?
○ RNA-seq was produced but it was not used for an annotation. Why?
○ It is great that you also included the ccs bam file, including the kinetics information. A further great feature would be to make the potential methylation sites (5mC in CpG islands) available for both assemblies.

Is the rationale for creating the dataset(s) clearly described? Yes
Are the protocols appropriate and is the work technically sound? Yes
Are sufficient details of methods and materials provided to allow replication by others? Yes
Are the datasets clearly presented in a useable and accessible format? Yes