The genome sequence of the Mullein moth, Shargacucullia verbasci (Linnaeus, 1758)

We present a genome assembly from an individual female Shargacucullia verbasci (the Mullein moth; Arthropoda; Insecta; Lepidoptera; Noctuidae). The genome sequence is 422.7 megabases in span. Most of the assembly is scaffolded into 32 chromosomal pseudomolecules, including the W and Z sex chromosomes. The mitochondrial genome has also been assembled and is 15.32 kilobases in length.


Background
The Mullein moth, Shargacucullia verbasci, is a member of the family Noctuidae with a wide geographic distribution across the northern Palaearctic. The moth has been recorded from Denmark and Sweden in the north to Portugal, Spain, Italy and Greece in the south; there are scattered records further east from Russia and Tajikistan (GBIF Secretariat, 2022). In Britain, the species is most frequent in the south of England and across Wales, with no verified records from Scotland and Northern Ireland (GBIF Secretariat, 2022; NBN Atlas Partnership, 2021). The moth was recorded in Ireland until 1952 before being declared locally extinct; it was rediscovered in 2021 (Parnell, 2021).
The adult moth is variable in size (wingspan 44-56 mm) with narrow, pointed, cream-coloured forewings edged in rich chocolate brown. The larval stage is more frequently encountered due to its diurnal feeding habit and conspicuous colouration, comprising white ground colouration with bands of bright yellow overlain with a regular pattern of black spots (Figure 1). The usual larval food plant is mullein (Verbascum sp.), with small numbers of larvae capable of stripping a plant completely of leaves; feeding has also been reported on figwort (Scrophularia sp.) and Buddleia (Bretherton et al., 1983). The species has been proposed as a potential biological control agent for invasive Verbascum in North America (Maw, 1980; Plant Conservation Alliance, 2005). In southern England, the adult moth is on the wing in April and May with the resultant larvae feeding through June and July; pupation occurs in a tough cocoon in the soil, with the pupal stage lasting up to 4 or 5 years (Brooks, 1991).
A complete genome sequence for Shargacucullia verbasci will facilitate research into the evolution of food plant specificity in Lepidoptera and the molecular basis of pupal diapause. The genome of S. verbasci, based on one specimen from Saffron Walden, UK, was sequenced as part of the Darwin Tree of Life Project, a collaborative effort to sequence all named eukaryotic species in the Atlantic Archipelago of Britain and Ireland.

Genome sequence report
The genome was sequenced from one Shargacucullia verbasci larva collected from Saffron Walden, UK (52.02, 0.25). The specimen was determined to be female based on its karyotype post sequencing. A total of 41-fold coverage in Pacific Biosciences single-molecule HiFi long reads was generated. Primary assembly contigs were scaffolded with chromosome conformation Hi-C data. Manual assembly curation corrected 14 missing joins or misjoins and removed 5 haplotypic duplications, reducing the scaffold number by 19.05%.
The final assembly has a total length of 422.7 Mb in 33 sequence scaffolds with a scaffold N50 of 14.5 Mb (Table 1). Most (99.98%) of the assembly sequence was assigned to 32 chromosomal-level scaffolds, representing 30 autosomes and the W and Z sex chromosomes. Chromosome-scale scaffolds confirmed by the Hi-C data are named in order of size (Figure 2-Figure 5; Table 2). While not fully phased, the assembly deposited is of one haplotype. Contigs corresponding to the second haplotype have also been deposited. The mitochondrial genome was also assembled and can be found as a contig within the multifasta file of the genome submission.
Metadata for specimens, spectral estimates, sequencing runs, contaminants and pre-curation assembly statistics can be found at https://links.tol.sanger.ac.uk/species/987469.
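The scaffold N50 and N90 values quoted for the assembly can in principle be recomputed from a list of scaffold lengths. The following is a minimal Python sketch, assuming the usual definition of Nx (the length of the shortest scaffold in the smallest set of longest scaffolds that together cover at least x% of the assembly span). The function name `nx` and the toy lengths are illustrative assumptions and are not taken from the authors' pipeline.

```python
def nx(lengths, fraction=0.5):
    """Return the Nx statistic (N50 by default) for a list of scaffold lengths.

    Assumed definition: sort scaffolds from longest to shortest, accumulate
    their lengths, and report the length of the scaffold at which the running
    total first reaches `fraction` of the total span.
    """
    if not lengths:
        raise ValueError("at least one scaffold length is required")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in the interval (0, 1]")

    ordered = sorted(lengths, reverse=True)
    threshold = fraction * sum(ordered)

    running = 0
    for length in ordered:
        running += length
        if running >= threshold:
            return length
    # Only reachable through floating-point rounding when fraction == 1.
    return ordered[-1]


if __name__ == "__main__":
    # Toy scaffold lengths (hypothetical, not from the ilShaVerb5.1 assembly).
    toy = [8, 8, 4, 3, 3, 2, 2, 2]
    print("N50:", nx(toy))
    print("N90:", nx(toy, 0.9))
```

Applied to the real scaffold lengths (for example, read from a FASTA index), the same function should give values comparable to those reported in Table 1, provided the same definition was used.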

Sample acquisition and nucleic acid extraction
A female Shargacucullia verbasci (specimen ID SAN0001261, ToLID ilShaVerb5) larva was collected from Saffron Walden, UK (latitude 52.02, longitude 0.25) on 2020-06-02. The specimen was taken from a Verbascum plant in a garden by Mara Lawniczak (Wellcome Sanger Institute). This specimen was collected during the Covid19 lockdown and processed in a makeshift laboratory in ML's bathroom. Using a scalpel to remove the head and forceps to dissect the specimen, the gut and its contents were removed to prevent excessive food plant material from being sequenced. The remainder of the caterpillar was processed into lentil-sized pieces using a scalpel on a petri dish sitting on dry ice. The specimen was identified by Liam Crowley (University of Oxford) and preserved on dry ice.
The specimen was prepared for DNA extraction at the Tree of Life laboratory, Wellcome Sanger Institute (WSI). The ilShaVerb5 sample was weighed on dry ice with tissue set aside for Hi-C sequencing. Tissue of the whole organism was cryogenically disrupted to a fine powder using a Covaris cryoPREP Automated Dry Pulveriser, receiving multiple impacts. DNA was extracted at the WSI Scientific Operations core using the Qiagen MagAttract HMW DNA kit, according to the manufacturer's instructions.
RNA was extracted from tissue from the whole organism of ilShaVerb2 in the Tree of Life Laboratory at the WSI using TRIzol, according to the manufacturer's instructions. RNA was then eluted in 50 μl RNAse-free water and its concentration assessed using a Nanodrop spectrophotometer and Qubit Fluorometer using the Qubit RNA Broad-Range (BR) Assay kit. Analysis of the integrity of the RNA was done using Agilent RNA 6000 Pico Kit and Eukaryotic Total RNA assay.

Sequencing
Pacific Biosciences HiFi circular consensus DNA sequencing libraries were constructed according to the manufacturers' instructions.
A Hi-C map for the final assembly was produced using bwa-mem2 (Vasimuddin et al., 2019) in the Cooler file format (Abdennur & Mirny, 2020). To assess the assembly metrics, the k-mer completeness and QV consensus quality values were calculated in Merqury (Rhie et al., 2020). This work was done using Nextflow (Di Tommaso et al

Wellcome Sanger Institute - Legal and Governance
The materials that have contributed to this genome note have been supplied by a Tree of Life collaborator. The Wellcome Sanger Institute employs a process whereby due diligence is carried out proportionate to the nature of the materials themselves, and the circumstances under which they have been/are to be collected and provided for use. The purpose of this is to address and mitigate any potential legal and/or ethical implications of receipt and use of the materials as part of the research project, and to ensure that in doing so we align with best practice wherever possible. The overarching areas of consideration are:
• Ethical review of provenance and sourcing of the material

Table 3. Software tools: versions and sources.

In this article, the authors describe the sequencing and assembly of the Shargacucullia verbasci genome using DNA from a single larval specimen collected in the UK. The primary genome sequence assembly includes proposed chromosomal pseudomolecule sequences for 30 autosomes, the Z sex chromosome, the W sex chromosome and a complete mitochondrial genome. On the basis of the genome assembly, which included a W chromosome, the sex of the specimen was inferred to be female. On the whole, this is a useful contribution to the scientific literature, but please see my comments below, especially regarding the identification of the specimen, the preparation of the specimen for nucleic acid isolation, the role of the RNAseq dataset in the project, and details of mitogenome assembly.

Some suggestions to the authors:
1. Specimen identification: The individual researcher who did the specimen identification was named, but the keys/species descriptions consulted, or the morphological characters used for the identification, have not been included in the manuscript. Further, the image included in Figure 1 is not of the specimen that was sequenced, and the Methods appear to describe the removal of the head of the caterpillar, the removal of the gut, and the processing of the remainder of the specimen into "lentil sized pieces" prior to specimen identification. Morphology-based species identification based on a minced caterpillar body seems problematic.
2. "lentil-sized pieces": I am not a lentil expert, but I will note that dried lentils vary between 3 and 9 mm in length, with geographic variation in human cultural preferences for the size of preferred varieties (larger in Europe and the Middle East, smaller in South and East Asia). Cooked lentils are presumably larger than dried lentils. To which are the authors referring? All of this is to say, as fond of lentils as any of us may be, it would be far more informative for anyone hoping to repeat the nucleic acid protocol if the authors specified the approximate diameter of the pieces using metric measurements.
3. Sequencing: Included in this manuscript is a description of the creation of an RNAseq dataset. The authors do not describe how this dataset was used in the genome assembly. If it wasn't used, it's not clear why the RNAseq dataset was created or included in this study.
4. The authors describe how "The mitochondrial genome was assembled using MitoHiFi (Uliano-Silva et al. 2022) which runs MitoFinder (Allio et al. 2020) or MITOS (Bernt et al. 2013) and uses these annotations to select the final mitochondrial contig…". The authors do not describe which of the algorithms generated the selected contig for the mitogenome.
The paper is a very basic description of the process of generating and assembling the genome sequence of the Mullein moth. It is clearly presented, citing the appropriate papers mostly for software tools to perform the assembly. The sequencing effort was extensive (41x coverage of PacBio HiFi) and represents the state-of-the-art for single-pass de novo assembly. Relatively little hand work was involved in assembling chromosome level scaffolds.
• Is the study design appropriate and does the work have academic merit?
Yes. Everything in the paper is clear and seems accurate. The only thing that diminishes the academic merit is that the authors do not push the analysis very far. It seems odd to produce this sequence and not also present first-pass annotation of the coding genes, transposable elements, and other structural aspects. Lepidoptera are holocentric, and it seems some discussion of the structure of regions that might be involved in chromosome pairing would be interesting.
• Are sufficient details of methods and analysis provided to allow replication by others?
Generally, the paper is very clear about the procedures (down to use of the bathroom for the dissection).One detail that was missing was the procedure for isolating large fragment DNA and the distribution of fragment sizes that entered the library preparation.
• If applicable, is the statistical analysis and its interpretation appropriate?
The only statistical analysis was the BUSCO and BlobToolKit analysis of features of the assembly, including contig size, GC content, etc. Very basic stuff, and there was no interpretation really. I found the BlobToolKit graphics to be less than satisfying, and simple plots of histograms of contig size, GC content, level of polymorphism seen in comparing the portions with phased haplotypes, etc. would have been more informative.
• Are all the source data underlying the results available to ensure full reproducibility?
Yes, all data have been appropriately deposited.
• Are the conclusions drawn adequately supported by the results?
The paper is a very straightforward presentation of a genome assembly project, derived from a single individual moth larva. The sequencing effort was quite extensive (41x coverage of PacBio HiFi) and so the standard tools, like hifiasm, were adequate to produce a quite impressive genome assembly. In short, all the conclusions that are drawn are well supported. It seems strange to have stopped so short of drawing more interesting conclusions, but perhaps the idea is to get the completed genome assembly into the public domain as quickly as possible.

Are sufficient details of methods and materials provided to allow replication by others? Yes
Are the datasets clearly presented in a useable and accessible format? Yes
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Population genetics
I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.
○ There is redundancy between the end of "Background" and the first sentence under "Genome sequencing report" (these paragraphs follow each other). Some redundancy with beginning of Methods too.
○ I assume the numbers in the first line of "Genome sequence report" (52.02, 0.25) are GPS coordinates, but I would specify this. This is also redundant with the first line of the Methods section, so I would revise to keep just one.

○ Second paragraph of "Genome sequencing report". As I understand it, the stats you give are for the first haplotype. Is this correct? It would perhaps make this more clear if you give the stats for the second haplotype as well. And you say both have been deposited. Do they have accessions that can be stated in the text? Were they submitted together?
Reviewer Expertise: Evolution, ecology and genomics
I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

Figure 2. Genome assembly of Shargacucullia verbasci, ilShaVerb5.1: metrics. The BlobToolKit Snailplot shows N50 metrics and BUSCO gene completeness. The main plot is divided into 1,000 size-ordered bins around the circumference with each bin representing 0.1% of the 422,739,721 bp assembly. The distribution of scaffold lengths is shown in dark grey with the plot radius scaled to the longest scaffold present in the assembly (20,510,064 bp, shown in red). Orange and pale-orange arcs show the N50 and N90 scaffold lengths (14,523,181 and 9,713,589 bp), respectively. The pale grey spiral shows the cumulative scaffold count on a log scale with white scale lines showing successive orders of magnitude. The blue and pale-blue area around the outside of the plot shows the distribution of GC, AT and N percentages in the same bins as the inner plot. A summary of complete, fragmented, duplicated and missing BUSCO genes in the lepidoptera_odb10 set is shown in the top right. An interactive version of this figure is available at https://blobtoolkit.genomehubs.org/view/Shargacucullia%20verbasci/dataset/CANOAR01/snail.
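The size-ordered binning described in this caption can be sketched in Python. This is an illustrative simplification and not BlobToolKit's own code: it assumes that each bin is represented by the length of the scaffold containing the bin's start coordinate when scaffolds are laid end to end from longest to shortest. The function name `snail_bins` and the toy lengths are hypothetical.

```python
from bisect import bisect_right
from itertools import accumulate


def snail_bins(lengths, n_bins=1000):
    """Return one scaffold length per bin for a snail-plot-style layout.

    Scaffolds are sorted from longest to shortest and concatenated. The span
    is split into `n_bins` equal bins, and each bin reports the length of the
    scaffold that contains the bin's start coordinate (an assumed convention).
    """
    if not lengths:
        raise ValueError("at least one scaffold length is required")
    if n_bins < 1:
        raise ValueError("n_bins must be a positive integer")

    ordered = sorted(lengths, reverse=True)
    ends = list(accumulate(ordered))  # cumulative end coordinate per scaffold
    span = ends[-1]

    bins = []
    for i in range(n_bins):
        start = i * span // n_bins  # integer start coordinate of bin i
        # Number of scaffolds ending at or before `start` is the index of
        # the scaffold that covers `start`.
        idx = bisect_right(ends, start)
        bins.append(ordered[idx])
    return bins


if __name__ == "__main__":
    # Toy example (hypothetical lengths): 10 bp span split into 10 bins.
    print(snail_bins([6, 3, 1], n_bins=10))
```

With 1,000 bins, each bin covers 0.1% of the span, which matches the layout described in the caption; the first bin always reports the longest scaffold.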

Figure 5. Genome assembly of Shargacucullia verbasci, ilShaVerb5.1: Hi-C contact map of the ilShaVerb5.1 assembly, visualised using HiGlass. Chromosomes are shown in order of size from left to right and top to bottom. An interactive version of this figure may be viewed at https://genome-note-higlass.tol.sanger.ac.uk/l/?d=fQ4fdfAvQrWrYfcAej1e_Q.

Is the rationale for creating the dataset(s) clearly described? Partly
Are the protocols appropriate and is the work technically sound? Yes
Are sufficient details of methods and materials provided to allow replication by others? Partly
Are the datasets clearly presented in a useable and accessible format? Yes
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Evolutionary biology of insects, phylogenomics
I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.
Reviewer Report 19 August 2024
https://doi.org/10.21956/wellcomeopenres.21883.r88912
© 2024 Clark A. This is an open access peer review report distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Andrew G Clark
Cornell University, Ithaca, NY, USA
• Is the work clearly and accurately presented and does it engage with the current literature?

○ On the bottom of page 6 you mention contamination and correction as previously described. Could you please very briefly describe what this entailed? You mention BlobToolKit further down. Perhaps this belongs up here?
○ What data did you give to MitoHiFi? The assembly or the raw reads?
○ At the end of the methods, you mention BUSCO, reference used. Could you please include these details? Even if they are in the figure it would be nice to have them in the main text too.
Is the rationale for creating the dataset(s) clearly described? Yes
Are the protocols appropriate and is the work technically sound? Yes
Are sufficient details of methods and materials provided to allow replication by others? Yes
Are the datasets clearly presented in a useable and accessible format? Yes
Competing Interests: No competing interests were disclosed.