The genome sequence of Gwynne’s mining bee, Andrena bicolor Fabricius, 1775

We present a genome assembly from an individual female Andrena bicolor (Gwynne’s mining bee; Arthropoda; Insecta; Hymenoptera; Andrenidae). The genome sequence is 351.7 megabases in span. Most of the assembly is scaffolded into 5 chromosomal pseudomolecules. The mitochondrial genome has also been assembled and is 21.02 kilobases in length.


Background
Andrena (Euandrena) bicolor Fabricius, 1775 (Andrenidae: Andreninae) is a ground-nesting species that is widely distributed throughout Europe, North Africa and the eastern Palearctic. In Europe it is found in a wide range of habitats from lowlands to the tree line in montane environments (Amiet et al., 2010; Praz et al., 2019). In the UK, the species is bivoltine, with the first generation occurring from March to early June and a second generation flying from mid-June to late August (Falk & Lewington, 2019). It is found across Great Britain, including northern Scotland.
In the Western Palearctic the subgenus Euandrena is species-rich, with more than 70 species recorded (Michez et al., 2019). A. bicolor is extremely polylectic, with contemporary and historic records from Britain indicating that the species forages from more than 20 genera of plants in multiple families (Wood & Roberts, 2017). The larvae of the species are attacked by the cleptoparasitic species Nomada fabriciana (Linnaeus, 1767) (Apidae: Nomadinae) (Falk & Lewington, 2019). In Britain the flight times of the two generations of N. fabriciana mirror those of A. bicolor, its primary host. However, N. fabriciana is expected to also use several other Andrena hosts (Falk & Lewington, 2019).
The genome of Gwynne's mining bee, Andrena bicolor, was sequenced and assembled to chromosome level as part of the Darwin Tree of Life Project.

Genome sequence report
The genome was sequenced from one female Andrena bicolor (Figure 1) collected from Wytham Woods, Oxfordshire, UK (51.77, -1.31). A total of 58-fold coverage in Pacific Biosciences single-molecule HiFi long reads was generated. Primary assembly contigs were scaffolded with chromosome conformation Hi-C data. Manual assembly curation corrected 46 missing joins or mis-joins and removed one haplotypic duplication, reducing the assembly length by 0.16% and the scaffold number by 2.40%, and increasing the scaffold N50 by 168.08%.
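The curation percentages above are relative changes against the pre-curation assembly. As a rough illustration only, the sketch below back-calculates approximate pre-curation values from the final figures reported in this note (351.7 Mb, 325 scaffolds, scaffold N50 of 50.6 Mb). The function name is hypothetical, and the derived values are estimates from rounded inputs, not numbers reported by the authors.

```python
def value_before_change(after: float, percent_change: float) -> float:
    """Invert a relative change: after = before * (1 + percent_change / 100)."""
    return after / (1 + percent_change / 100)


# Final values and curation-induced percent changes, as stated in the text.
length_before = value_before_change(351.7, -0.16)    # assembly length, Mb
scaffolds_before = value_before_change(325, -2.40)   # scaffold count
n50_before = value_before_change(50.6, 168.08)       # scaffold N50, Mb

print(f"length before curation  ~ {length_before:.1f} Mb")
print(f"scaffolds before        ~ {scaffolds_before:.0f}")
print(f"scaffold N50 before     ~ {n50_before:.1f} Mb")
```

Under these assumptions, most of the gain in contiguity comes from joining scaffolds rather than from removing sequence, since the total length changes by only a fraction of a percent.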
The final assembly has a total length of 351.7 Mb in 325 sequence scaffolds with a scaffold N50 of 50.6 Mb (Table 1). The snail plot in Figure 2 provides a summary of the assembly statistics, while the distribution of assembly scaffolds on GC proportion and coverage is shown in Figure 3. The cumulative assembly plot in Figure 4 shows curves for subsets of scaffolds assigned to different phyla. Most (70.27%) of the assembly sequence was assigned to 5 chromosomal-level scaffolds. Chromosome-scale scaffolds confirmed by the Hi-C data are named in order of size (Figure 5; Table 2). While not fully phased, the assembly deposited is of one haplotype. Contigs corresponding to the second haplotype have also been deposited. The mitochondrial genome was also assembled and can be found as a contig within the multifasta file of the genome submission.
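For readers unfamiliar with the scaffold N50 and N90 statistics quoted above, the following is a minimal sketch of how such values are conventionally computed from a list of scaffold lengths. The function name `nx` and the toy lengths are illustrative assumptions; they are not the A. bicolor scaffolds and not the tooling used by the authors.

```python
def nx(lengths: list[int], x: float = 50.0) -> int:
    """Return the Nx length: the length L of the shortest scaffold in the
    smallest set of longest scaffolds that covers at least x% of the assembly."""
    if not lengths:
        raise ValueError("no scaffold lengths given")
    threshold = sum(lengths) * x / 100
    running = 0
    # Walk from the longest scaffold downwards, accumulating length.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= threshold:
            return length
    return min(lengths)  # only reached if x > 100


# Toy scaffold lengths (hypothetical values, arbitrary units).
toy_lengths = [80, 70, 50, 40, 30, 20, 10]
print("N50:", nx(toy_lengths, 50))
print("N90:", nx(toy_lengths, 90))
```

Because a few long scaffolds dominate the total, N50 can be high even when the assembly contains many short scaffolds, which is consistent with an assembly of 325 scaffolds in which five chromosome-scale scaffolds hold most of the sequence.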

Sample acquisition and nucleic acid extraction
A female Andrena bicolor (specimen ID Ox001225, ToLID iyAndBico1) was netted in Wytham Woods, Oxfordshire (biological vice-county Berkshire), UK (latitude 51.77, longitude -1.31) on 2021-04-19. The specimen was collected and identified by Steven Falk (independent researcher). The male specimen used for Hi-C sequencing (specimen ID Ox001274, ToLID iyAndBico2) was netted in the same location on 2021-04-23. This specimen was collected and identified by Liam Crowley (University of Oxford). The specimens were snap-frozen on dry ice.
The workflow for high molecular weight (HMW) DNA extraction at the Wellcome Sanger Institute (WSI) includes a sequence of core procedures: sample preparation, sample homogenisation, DNA extraction, fragmentation, and clean-up. In sample preparation, the iyAndBico1 sample was weighed and dissected on dry ice (Jay et al., 2023). Tissue from the thorax was homogenised using a PowerMasher II tissue disruptor (Denton et al., 2023a). HMW DNA was extracted using the Automated MagAttract v1 protocol (Sheerin et al., 2023). DNA was sheared to an average fragment size of 12-20 kb in a Megaruptor 3 system with speed setting 30 (Todorovic et al., 2023). Sheared DNA was purified by solid-phase reversible immobilisation (Strickland et al., 2023): in brief, the method employs a 1.8X ratio of AMPure PB beads to sample to eliminate shorter fragments and concentrate the DNA. The concentration of the sheared and purified DNA was assessed using a Nanodrop spectrophotometer and a Qubit Fluorometer with the Qubit dsDNA High Sensitivity Assay kit. Fragment size distribution was evaluated by running the sample on the FemtoPulse system. Protocols developed by the WSI Tree of Life laboratory are publicly available on protocols.io (Denton et al., 2023b).

Sequencing
Pacific Biosciences HiFi circular consensus DNA sequencing libraries were constructed according to the manufacturers' instructions. DNA sequencing was performed by the Scientific Operations core at the WSI on a Pacific Biosciences SEQUEL II instrument. Hi-C data were also generated from whole organism tissue of iyAndBico2 using the Arima2 kit and sequenced on the Illumina NovaSeq 6000 instrument.

Genome assembly, curation and evaluation
Assembly was carried out with Hifiasm (Cheng et al., 2021) and haplotypic duplication was identified and removed with purge_dups (Guan et al., 2020). The assembly was then […] A Hi-C map for the final assembly was produced using bwa-mem2 (Vasimuddin et al., 2019) in the Cooler file format (Abdennur & Mirny, 2020). To assess the assembly metrics, the k-mer completeness and QV consensus quality values were calculated in Merqury (Rhie et al., 2020). This work […] et al., 2020) and BUSCO scores (Manni et al., 2021; Simão et al., 2015) were calculated.
Table 3 contains a list of relevant software tool versions and sources.
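The QV consensus quality value mentioned above is a Phred-scaled error rate. As a minimal sketch of what that number means, the snippet below converts a per-base error rate to QV and includes the k-mer-based error estimate described for Merqury by Rhie et al. (2020). The function names and the example k-mer counts are hypothetical, and this is an illustration of the formula rather than Merqury's own code.

```python
import math


def qv_from_error_rate(error_rate: float) -> float:
    """Phred-scaled quality: QV = -10 * log10(per-base error rate)."""
    if not 0 < error_rate <= 1:
        raise ValueError("error_rate must be in (0, 1]")
    return -10 * math.log10(error_rate)


def error_rate_from_kmers(shared: int, total: int, k: int) -> float:
    """Estimate the per-base error rate from k-mer support.

    shared: assembly k-mers that are also found in the read set
    total:  all k-mers in the assembly
    A k-mer is correct only if all k of its bases are correct, so the
    per-base probability of being correct is (shared / total) ** (1 / k).
    """
    p_base_correct = (shared / total) ** (1 / k)
    return 1 - p_base_correct


# Hypothetical counts: 0.1% of assembly 21-mers lack read support.
example_error = error_rate_from_kmers(shared=999_000, total=1_000_000, k=21)
print(f"estimated error rate: {example_error:.2e}")
print(f"estimated QV: {qv_from_error_rate(example_error):.1f}")
```

On this scale, QV 40 corresponds to one expected error per 10,000 bases and QV 50 to one per 100,000.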

Wellcome Sanger Institute - Legal and Governance
The materials that have contributed to this genome […] Further, the Wellcome Sanger Institute employs a process whereby due diligence is carried out proportionate to the nature of the materials themselves, and the circumstances under which they have been/are to be collected and provided for use. The purpose of this is to address and mitigate any potential legal and/or ethical implications of receipt and use of the materials as part of the research project, and to ensure that in doing so we align with best practice wherever possible. The overarching areas of consideration are:
• Ethical review of provenance and sourcing of the material

Natalia Araujo
Unit of Evolutionary Biology & Ecology, Université Libre de Bruxelles, Brussels, Brussels, Belgium

The article reports the assemblies of the nuclear and mitochondrial genomes of Andrena bicolor, a mining bee with a Western Palearctic distribution; no annotation of the reference genome is provided. The nuclear genome spanned 351.7 Mb across 399 contigs, mostly comprised within 5 large scaffolds. The authors used the Darwin Tree of Life pipeline to generate this assembly, a widely accepted methodology, so I have no major concerns about this work. A few minor suggestions for improving the report are:

1. I don't work with Andrena bees, so I was surprised by the small chromosome number reported and the large chromosome sizes (up to 65.38 Mb). Looking at the Hi-C map (Figure 5), it was not clear to me why, during manual curation, scaffolds 1, 2, 4, and 5 were not split, as they show low-support regions that could represent possible misjoins. I then searched for the expected chromosome numbers in Andrena and saw that these bees are expected to have a small chromosome number. I thus concluded that the authors, based on this information, considered that these regions could be explained by specific chromosome conformations in this species. This is reasonable, but I suggest it be clarified further in the text. Is the chromosome number of this species confirmed by other methods? Did the authors consider this information during scaffolding? I understand these are pseudochromosomes, but it would be nice to know how their number fits within the numbers expected for the species or genus.

2. In the abstract and the Genome sequence report sections, it says the genome was assembled from a single female. Actually, the PacBio sequencing used a single female but the Hi-C data was generated using a male. Thus, two specimens were used to produce this assembly and not only one as stated. This should be corrected.

3. The annotation of the mitochondrial genome is mentioned but there is no information about it. Did it comprise all genes? Where can it be accessed?

4. In the MitoHiFi description, in the methods section, the authors describe that this pipeline may use two approaches for annotation (MitoFinder or MITOS). Which one did they use for the current assembly?
Is the rationale for creating the dataset(s) clearly described? Yes
Are the protocols appropriate and is the work technically sound? Yes
Are sufficient details of methods and materials provided to allow replication by others? Yes
Are the datasets clearly presented in a useable and accessible format? Yes
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: bee genomics and evolution, bioinformatics.
I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

Figure 2. Genome assembly of Andrena bicolor, iyAndBico1.1: metrics. The BlobToolKit Snailplot shows N50 metrics and BUSCO gene completeness. The main plot is divided into 1,000 size-ordered bins around the circumference with each bin representing 0.1% of the 351,724,186 bp assembly. The distribution of scaffold lengths is shown in dark grey with the plot radius scaled to the longest scaffold present in the assembly (65,380,999 bp, shown in red). Orange and pale-orange arcs show the N50 and N90 scaffold lengths (50,647,002 and 1,261,894 bp), respectively. The pale grey spiral shows the cumulative scaffold count on a log scale with white scale lines showing successive orders of magnitude. The blue and pale-blue area around the outside of the plot shows the distribution of GC, AT and N percentages in the same bins as the inner plot. A summary of complete, fragmented, duplicated and missing BUSCO genes in the hymenoptera_odb10 set is shown in the top right. An interactive version of this figure is available at https://blobtoolkit.genomehubs.org/view/iyAndBico1_1/dataset/iyAndBico1_1/snail.

Figure 3. Genome assembly of Andrena bicolor, iyAndBico1.1: BlobToolKit GC-coverage plot. Sequences are coloured by phylum. Circles are sized in proportion to sequence length. Histograms show the distribution of sequence length sum along each axis. An interactive version of this figure is available at https://blobtoolkit.genomehubs.org/view/iyAndBico1_1/dataset/iyAndBico1_1/blob.
note have been supplied by a Darwin Tree of Life Partner. The submission of materials by a Darwin Tree of Life Partner is subject to the 'Darwin Tree of Life Project Sampling Code of Practice', which can be found in full on the Darwin Tree of Life website here. By agreeing with and signing up to the Sampling Code of Practice, the Darwin Tree of Life Partner agrees they will meet the legal and ethical requirements and standards set out within this document in respect of all samples acquired for, and supplied to, the Darwin Tree of Life Project.

Figure 4. Genome assembly of Andrena bicolor, iyAndBico1.1: BlobToolKit cumulative sequence plot. The grey line shows cumulative length for all sequences. Coloured lines show cumulative lengths of sequences assigned to each phylum using the buscogenes taxrule. An interactive version of this figure is available at https://blobtoolkit.genomehubs.org/view/iyAndBico1_1/dataset/iyAndBico1_1/cumulative.

Figure 5. Genome assembly of Andrena bicolor, iyAndBico1.1: Hi-C contact map of the iyAndBico1.1 assembly, visualised using HiGlass. Chromosomes are shown in order of size from left to right and top to bottom. An interactive version of this figure may be viewed at https://genome-note-higlass.tol.sanger.ac.uk/l/?d=fXWov8moQemPAtY5Oo7rhg.

[…]: Proposed standards and metrics for defining genome assembly quality" from Rhie et al. (2021).

Table 3. Software tools: versions and sources.
This is an open access peer review report distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.