The genome sequence of a drosophilid fruit fly, Drosophila limbata von Roser 1840

We present a genome assembly from an individual male Drosophila limbata (drosophilid fruit fly; Arthropoda; Insecta; Diptera; Drosophilidae). The genome sequence is 233.5 megabases in span. Most of the assembly is scaffolded into 6 chromosomal pseudomolecules. The mitochondrial genome has also been assembled and is 16.09 kilobases in length.


Background
Drosophila limbata von Roser 1840 is a medium-sized (ca. 3.0-3.5 mm) yellowish-brown drosophilid 'fruit fly' (Figure 1A and 1B). It is one of around 30 British and Irish species of Drosophila (Chandler, 2023), and is a member of the quinaria species group within the subgenus Drosophila (Bächli et al., 2004). Flies are superficially similar in appearance to their close relative Drosophila kuntzei (Bächli et al., 2004), but can be separated on the shape of the abdominal bands, by dissection of the terminalia, and (in wild flies) by their overall darker brown colouration (Figure 1A). Unlike most other members of the quinaria group, which are predominantly fungus specialists (Scott Chialvo et al., 2019), D. limbata uses decaying plant matter as a substrate, including several species of Cucurbitaceae and Apiaceae (Hummel et al., 1979; Offenberger & Klarenberg, 1992; van Alphen et al., 1991). Although Drosophila limbata has been maintained in laboratory culture, the species seems to have been remarkably little studied, with just a handful of papers discussing such disparate topics as population dynamics (Hummel et al., 1979), parasitism (Gillis & Hardy, 1997; van Alphen et al., 1991), alcohol tolerance (Mercot et al., 1994), and courtship song (Neems et al., 1997).
In nature, D. limbata is broadly distributed across the Palearctic, from the West of Ireland to the East of Russia, and from Crete in the south to central Finland in the north (Bächli, 2024). Relatively few records are available for the UK (GBIF Secretariat, 2024), and the species was not reported either from Scotland by Basden in 1950-52 (43,629 flies examined; Basden, 1955) or from a survey of Southern England by Dyson-Hudson in 1952-53 (18,535 flies examined in the survey, although a total of eight D. limbata were reported to have been caught separately; Dyson-Hudson, 1954). Nevertheless, the adults can be seen across much of the year (GBIF Secretariat, 2024), and the species is not reported to be threatened. It thus seems likely that the scarcity of UK records reflects the challenge of identification, and the failure of these flies to come to fruit baits.
Here we present a chromosomally complete genome sequence for Drosophila limbata, derived from the DNA of three male offspring of a wild female that was collected from courgette and squash plants at Cherry Gardens Farm, East Sussex, as part of the Darwin Tree of Life Project. This project is a collaborative effort to sequence all named eukaryotic species in the Atlantic Archipelago of Britain and Ireland. The genome sequence will help to resolve relationships among the Drosophilidae and will further build on the value of this family as a model clade for comparative genomics and molecular evolution.

Genome sequence report
The genome was sequenced from a male Drosophila limbata (Figure 1) reared at the Institute of Ecology and Evolution, University of Edinburgh. A total of 107-fold coverage in Pacific Biosciences single-molecule HiFi long reads was generated. Primary assembly contigs were scaffolded with chromosome conformation Hi-C data. Manual assembly curation corrected 28 missing joins or mis-joins and removed 5 haplotypic duplications, reducing the scaffold number by 1.92% and decreasing the scaffold N50 by 6.57%.
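As a rough worked example of the figures in this report, the sketch below converts fold coverage into an approximate HiFi yield and back-calculates the scaffold count before curation. It assumes that coverage is expressed relative to the 233.5 Mb assembly span and that the 1.92% reduction is relative to the pre-curation count, using the 510 scaffolds reported for the final assembly; neither assumption is stated in the text, and the function names are purely illustrative.

```python
def approx_yield_bases(fold_coverage: float, genome_span_bp: float) -> float:
    """Approximate total sequenced bases implied by a fold-coverage figure."""
    return fold_coverage * genome_span_bp


def count_before_reduction(count_after: int, percent_reduction: float) -> float:
    """Back-calculate a starting count from a final count and a percentage reduction."""
    return count_after / (1.0 - percent_reduction / 100.0)


if __name__ == "__main__":
    span_bp = 233.5e6  # assembly span reported in the text (assumed denominator)
    hifi_bases = approx_yield_bases(107, span_bp)
    print(f"Approximate HiFi yield: {hifi_bases / 1e9:.1f} Gb")

    scaffolds_before = count_before_reduction(510, 1.92)
    print(f"Implied scaffolds before curation: {scaffolds_before:.0f}")
```

Under these assumptions the reported values are mutually consistent: about 25 Gb of HiFi data, and roughly 520 scaffolds before curation.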
The final assembly has a total length of 233.5 Mb in 510 sequence scaffolds with a scaffold N50 of 29.2 Mb (Table 1). The snail plot in Figure 2 provides a summary of the assembly statistics, while the distribution of assembly scaffolds on GC proportion and coverage is shown in Figure 3. The cumulative assembly plot in Figure 4 shows curves for subsets of scaffolds assigned to different phyla. Most (71.2%) of the assembly sequence was assigned to 6 chromosomal-level scaffolds, representing 5 autosomes and the X sex chromosome. Chromosome-scale scaffolds confirmed by the Hi-C data are named in order of size (Figure 5; Table 2). The X chromosome was identified based on PacBio read coverage. We expected to find a Y chromosome, but this could not be identified and is likely in the unplaced contigs. The order and orientation of contigs along Chromosome 6 between 1.6 Mb and 7.4 Mb is uncertain. While not fully phased, the assembly deposited is of one haplotype. Contigs corresponding to the second haplotype have also been deposited. The mitochondrial genome was also assembled and can be found as a contig within the multifasta file of the genome submission. The estimated Quality Value (QV) of the final assembly is 59.6 with k-mer completeness of 100.0%, and the assembly has a BUSCO v completeness of 97.4% (single = 96.9%, duplicated = 0.5%), using the diptera_odb10 reference set (n = 3,285).
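To make the summary statistics above concrete, the following sketch shows how a scaffold N50 (or N90) can be computed from a list of scaffold lengths, and how a Phred-scaled QV maps to a per-base error probability. It is a minimal illustration using the standard definitions, not the code used to produce Table 1; the function names and the toy lengths in the usage note are hypothetical.

```python
from typing import Iterable


def nx_length(lengths: Iterable[int], fraction: float = 0.5) -> int:
    """Return the Nx length: the length L such that scaffolds of length >= L
    together cover at least `fraction` of the total assembly length."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("no scaffold lengths given")
    threshold = fraction * sum(ordered)
    running = 0
    for length in ordered:
        running += length
        if running >= threshold:
            return length
    return ordered[-1]


def qv_to_error_rate(qv: float) -> float:
    """Convert a Phred-scaled consensus quality value to a per-base error probability."""
    return 10 ** (-qv / 10)
```

For a toy assembly with scaffold lengths 40, 30, 20 and 10, `nx_length` gives an N50 of 30 and an N90 (`fraction=0.9`) of 20. Applied to the reported QV of 59.6, `qv_to_error_rate` gives an error probability of about 1.1 × 10⁻⁶, that is, on the order of one consensus error per 900,000 bases.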

Sample acquisition and nucleic acid extraction
The Drosophila limbata specimens used in the genome assembly were first-generation male progeny from a wild-collected female.

Sequencing
Pacific Biosciences HiFi circular consensus DNA sequencing libraries were constructed according to the manufacturers' instructions. Poly(A) RNA-Seq libraries were constructed using the NEB Ultra II RNA Library Prep kit. DNA and RNA sequencing was performed by the Scientific Operations core at the WSI on Pacific Biosciences Sequel IIe (HiFi) and Illumina NovaSeq 6000 (RNA-Seq) instruments. Hi-C data were also generated from specimen idDroLimb3 using the Arima2 kit and sequenced on the Illumina NovaSeq 6000 instrument.

Final assembly evaluation
The final assembly was post-processed and evaluated with the three Nextflow (Di Tommaso et al. [...] report, computes k-mer completeness and QV consensus quality values with FastK and MerquryFK, and a completeness assessment with BUSCO (Manni et al., 2021).
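The sanger-tol/blobtoolkit pipeline is a Nextflow port of the previous Snakemake Blobtoolkit pipeline (Challis et al., 2020). It aligns the PacBio reads with SAMtools and minimap2 (Li, 2018) and generates coverage tracks for regions of fixed size. In parallel, it queries the GoaT database (Challis et al., 2023) to identify all matching BUSCO lineages to run BUSCO (Manni et al., 2021). For the three domain-level BUSCO lineages, the pipeline aligns the BUSCO genes to the UniProt Reference Proteomes database (Bateman et al., 2023) with DIAMOND (Buchfink et al., 2021) blastp. The genome is also split into chunks according to the density of the BUSCO genes from the closest taxonomic lineage, and each chunk is aligned to the UniProt Reference Proteomes database with DIAMOND blastx. Genome sequences that have no hit are then chunked with seqtk and aligned to the NT database with blastn (Altschul et al., 1990). All those outputs are combined with the blobtools suite into a blobdir for visualisation.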
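The idea of a coverage track over regions of fixed size can be illustrated with a short sketch. The code below is not taken from the sanger-tol/blobtoolkit pipeline, which relies on dedicated alignment and coverage tools; it is a simplified stand-in that assumes read alignments are already available as 0-based, half-open intervals on a single sequence, and the function name and window size are hypothetical.

```python
from typing import List, Tuple


def windowed_mean_depth(
    seq_length: int,
    alignments: List[Tuple[int, int]],
    window: int,
) -> List[float]:
    """Mean read depth in consecutive fixed-size windows along one sequence.

    `alignments` holds 0-based, half-open (start, end) intervals.
    The last window may be shorter than `window`.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    n_windows = (seq_length + window - 1) // window
    covered = [0] * n_windows  # aligned bases falling in each window
    for start, end in alignments:
        # Clip each alignment to the sequence bounds.
        start, end = max(0, start), min(seq_length, end)
        pos = start
        while pos < end:
            idx = pos // window
            window_end = min((idx + 1) * window, end)
            covered[idx] += window_end - pos
            pos = window_end
    depths = []
    for idx, bases in enumerate(covered):
        size = min(window, seq_length - idx * window)
        depths.append(bases / size)
    return depths
```

For example, on a 10 bp sequence with alignments covering positions 0-6 and 2-10 and a window of 4 bp, the function returns mean depths of 1.5, 1.5 and 1.0 for the three windows.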
All three pipelines were developed using the nf-core tooling (Ewels et al., 2020), use MultiQC (Ewels et al., 2016), and make extensive use of the Conda package manager, the Bioconda initiative (Grüning et al., 2018) [...]

1. In the background, the authors provide a brief description of the organism and its geographical distribution.
2. In Figure 3, the label on the Y-axis which reads "ERR1220…" is completely meaningless. This should be changed to something meaningful to help the reader make sense of this figure.
3. The authors mention that 'It aligns the PacBio reads with SAMtools and minimap2..'. For the sake of correctness, please verify that SAMtools was indeed used for alignment and, if needed, edit this statement to reflect accurately what was done and what was used.
4. There seems to be a small typing error where it says "taxonomically lineage, and …". Please check and correct that statement if needed.

Overall, the authors provide a high-quality genome and also assign much of it to chromosomes. Although much work remains to order the scaffolds, complete the gaps, and structurally and functionally annotate the genome, the current work will indeed be valuable to the whole community. I therefore recommend the indexing of this genome so that this resource becomes widely accessible to the scientific community.

Is the rationale for creating the dataset(s) clearly described? Yes
Are the protocols appropriate and is the work technically sound? Yes
Are sufficient details of methods and materials provided to allow replication by others? Yes
Are the datasets clearly presented in a useable and accessible format? Yes
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Genomics and molecular biology
I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard; however, I have significant reservations, as outlined above.
Reviewer Expertise: Genomics, population genetics, evolutionary biology
I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

Figure 1. Drosophila limbata specimens. A: Wild-collected male (above) and female (below) Drosophila limbata presented with a 3 mm scale bar. B: The four lab-reared brothers selected for sequencing: specimen ID SAN00001918, ToLID idDroLimb2 (second from left) used for PacBio sequencing, specimen ID SAN00001919, ToLID idDroLimb3 (second from right) used for Hi-C sequencing, and specimen ID SAN00001920, ToLID idDroLimb4 (right) used for RNA sequencing. C: The vegetable patch from which the mother of the sequenced flies was collected on 2021-09-05 (Cherry Gardens Farm, East Sussex, England; 51.0994 N, 0.1639 E).

Figure 2. Genome assembly of Drosophila limbata, idDroLimb2.1: metrics. The BlobToolKit snail plot shows N50 metrics and BUSCO gene completeness. The main plot is divided into 1,000 size-ordered bins around the circumference with each bin representing 0.1% of the 233,538,449 bp assembly. The distribution of scaffold lengths is shown in dark grey with the plot radius scaled to the longest scaffold present in the assembly (37,002,035 bp, shown in red). Orange and pale-orange arcs show the N50 and N90 scaffold lengths (29,161,486 and 202,746 bp), respectively. The pale grey spiral shows the cumulative scaffold count on a log scale with white scale lines showing successive orders of magnitude. The blue and pale-blue area around the outside of the plot shows the distribution of GC, AT and N percentages in the same bins as the inner plot. A summary of complete, fragmented, duplicated and missing BUSCO genes in the diptera_odb10 set is shown in the top right. An interactive version of this figure is available at https://blobtoolkit.genomehubs.org/view/Drosophila_limbata/dataset/GCA_963924055.1/snail.
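The size-ordered binning described in this caption can be sketched as follows. This is an illustrative reconstruction of the idea only, assuming that each bin takes the length of the scaffold reached at the end of that slice of the assembly; it is not BlobToolKit's implementation, and the function name is hypothetical. With 1,000 bins, each bin of this assembly would span about 233.5 kb.

```python
from typing import List


def snail_bins(lengths: List[int], n_bins: int = 1000) -> List[int]:
    """For each of `n_bins` equal slices of the assembly, return the length of the
    scaffold reached at the end of that slice, with scaffolds ordered longest first."""
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    if total == 0:
        raise ValueError("assembly has zero length")
    radii = []
    idx = 0
    running = ordered[0]  # cumulative length up to and including scaffold `idx`
    for b in range(1, n_bins + 1):
        target = total * b / n_bins
        # Advance to the scaffold that contains the end of this slice.
        while running < target and idx < len(ordered) - 1:
            idx += 1
            running += ordered[idx]
        radii.append(ordered[idx])
    return radii
```

In this representation, the value at the bin marking 50% of the assembly is the scaffold N50 and the value at 90% is the N90, which is why both can be read directly from the arcs on the plot.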

Figure 3. Genome assembly of Drosophila limbata, idDroLimb2.1: BlobToolKit GC-coverage plot. Sequences are coloured by phylum. Circles are sized in proportion to sequence length. Histograms show the distribution of sequence length sum along each axis. An interactive version of this figure is available at https://blobtoolkit.genomehubs.org/view/Drosophila_limbata/dataset/GCA_963924055.1/blob.

Figure 4. Genome assembly of Drosophila limbata, idDroLimb2.1: BlobToolKit cumulative sequence plot. The grey line shows cumulative length for all sequences. Coloured lines show cumulative lengths of sequences assigned to each phylum using the buscogenes taxrule. An interactive version of this figure is available at https://blobtoolkit.genomehubs.org/view/Drosophila_limbata/dataset/GCA_963924055.1/cumulative.

Figure 5. Genome assembly of Drosophila limbata, idDroLimb2.1: Hi-C contact map of the idDroLimb2.1 assembly, visualised using HiGlass. Chromosomes are shown in order of size from left to right and top to bottom. An interactive version of this figure may be viewed at https://genome-note-higlass.tol.sanger.ac.uk/l/?d=A6WVNgRTTCy8cVHqR08cvA.

Report 02 September 2024
https://doi.org/10.21956/wellcomeopenres.24882.r94744
© 2024 Zhou Q. This is an open access peer review report distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Qingsong Zhou, Chinese Academy of Sciences, Beijing, China
Obbard et al. present a genome assembly of D. limbata using Pacific Biosciences HiFi long reads. The background of the species D. limbata is well described. Additionally, the chromosomes were scaffolded with Hi-C data into five autosomes and one sex chromosome (X). The genome size falls within the range of Drosophila genomes (130-257 Mb), and the scaffold N50 of 29.2 Mb indicates that the genome assembly is of high quality. The methodological aspects of this paper are described in detail. However, I find the presentation of the Hi-C contact map in Figure 5 unclear, making it difficult to distinguish the five autosomes. Furthermore, it may be beneficial to combine Figures 2-4 into a single figure to enhance the manuscript's conciseness and clarity.
Is the rationale for creating the dataset(s) clearly described? Yes
Are the protocols appropriate and is the work technically sound? Yes
Are sufficient details of methods and materials provided to allow replication by others? Yes
Are the datasets clearly presented in a useable and accessible format? Yes
Competing Interests: No competing interests were disclosed.

Table 3. Software tools: versions and sources.

Wellcome Sanger Institute - Legal and Governance
The materials that have contributed to this genome note have been supplied by a Darwin Tree of Life Partner. The submission of materials by a Darwin Tree of Life Partner is subject to the 'Darwin Tree of Life Project Sampling Code of Practice', which can be found in full on the Darwin Tree of Life website. By agreeing with and signing up to the Sampling Code of Practice, the Darwin Tree of Life Partner agrees they [...] which they have been/are to be collected and provided for use. The purpose of this is to address and mitigate any potential legal and/or ethical implications of receipt and use of the materials as part of the research project, and to ensure that in doing so we align with best practice wherever possible. The overarching areas of consideration are: