The genome sequence of the Orange-tailed Mining Bee, Andrena haemorrhoa (Fabricius, 1781)

We present a genome assembly from an individual female Andrena haemorrhoa (the Orange-tailed Mining Bee; Arthropoda; Insecta; Hymenoptera; Andrenidae). The genome sequence is 330.7 megabases in span. Most of the assembly is scaffolded into 7 chromosomal pseudomolecules. The mitochondrial genome has also been assembled and is 16.46 kilobases in length. Gene annotation of this assembly on Ensembl identified 10,908 protein coding genes.


Background
The Orange-tailed Mining Bee, Andrena haemorrhoa, is a widespread and locally common species of mining bee in the UK. It is widely distributed across all but the most southerly regions of Europe and can be one of the most abundant bee species where it occurs (Banaszak-Cibicka & Żmihorski, 2012). It occurs across a wide variety of habitats and is only absent from the most extreme environments in its range. It is a medium-sized (7-10 mm forewing length) mining bee and both sexes are distinctive. Females have a neat pile of short, rich red hairs across the dorsal thorax, with white hairs on the face and sides of the thorax. The abdomen is sparsely haired, with a tuft of red hairs on the tip that gives rise to the common name. The hairs of the males are more brownish-red, with a reddish cluster of hairs on the tip of the abdomen. The hind tibiae and tarsi are orange, and males have a distinctive dark brown spot in the middle of the tibiae.
It is an early, univoltine Andrena species, with a flight period from March into July, leading to the alternative common name: the Early Mining Bee. Males typically emerge slightly earlier than females and may congregate around shrubs. After mating, females excavate nests with a preference for light soils, south-facing banks, short swards and the margins of paths and tracks (Falk & Lewington, 2019). Nesting often occurs in dispersed aggregations in suitable areas. It is widely polylectic, visiting a wide range of spring-flowering plants, especially Rosaceae including hawthorn (Crataegus monogyna) (Wood & Roberts, 2017), and may be an important pollinator of crops such as apple (Kendall & Solomon, 1973).
The complete genome sequence for this species will facilitate studies into the evolution of mining bees, conservation of important pollinator species, reproductive evolution and foraging behaviour.

Genome sequence report
The genome was sequenced from one female Andrena haemorrhoa (Figure 1) collected from Wytham Woods, Oxfordshire, UK (latitude 51.77, longitude -1.34). A total of 77-fold coverage in Pacific Biosciences single-molecule HiFi long reads and 91-fold coverage in 10X Genomics read clouds were generated. Primary assembly contigs were scaffolded with chromosome conformation Hi-C data. Manual assembly curation corrected 27 missing joins or mis-joins and removed 1 haplotypic duplication, reducing the assembly length by 0.24% and the scaffold number by 4.74%, and increasing the scaffold N50 by 122.57%. The final assembly has a total length of 330.7 Mb in 402 sequence scaffolds with a scaffold N50 of 41.0 Mb (Table 1). Most (88.08%) of the assembly sequence was assigned to 7 chromosomal-level scaffolds. Chromosome-scale scaffolds confirmed by the Hi-C data are named in order of size (Figure 2–Figure 5; Table 2). The specimen is a diploid female. While not fully phased, the assembly deposited is of one haplotype. Contigs corresponding to the second haplotype have also been deposited. The mitochondrial genome was also assembled and can be found as a contig within the multifasta file of the genome submission.
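The curation percentages above can be inverted to estimate the pre-curation figures. The following is a minimal sketch, assuming each percentage is expressed relative to the pre-curation value; the helper name `pre_curation_value` is hypothetical, and the results are approximations back-calculated from the rounded numbers reported in this section, not values taken from the assembly records.

```python
def pre_curation_value(final_value: float, percent_change: float) -> float:
    """Invert a percentage change, assuming final = pre * (1 + percent_change / 100)."""
    return final_value / (1.0 + percent_change / 100.0)


# Post-curation values and percentage changes as reported in the text.
final_length_mb = 330.7   # total assembly length, reduced by 0.24%
final_scaffolds = 402     # scaffold count, reduced by 4.74%
final_n50_mb = 41.0       # scaffold N50, increased by 122.57%

# Assumption: reductions are negative changes, increases are positive changes.
pre_length_mb = pre_curation_value(final_length_mb, -0.24)
pre_scaffolds = pre_curation_value(final_scaffolds, -4.74)
pre_n50_mb = pre_curation_value(final_n50_mb, 122.57)

# Span assigned to the 7 chromosome-level scaffolds (88.08% of the assembly),
# and the remainder left in unplaced scaffolds.
chromosomal_span_mb = final_length_mb * 88.08 / 100.0
unplaced_span_mb = final_length_mb - chromosomal_span_mb

print(f"estimated pre-curation length:    {pre_length_mb:.1f} Mb")
print(f"estimated pre-curation scaffolds: {pre_scaffolds:.0f}")
print(f"estimated pre-curation N50:       {pre_n50_mb:.1f} Mb")
print(f"chromosome-assigned span:         {chromosomal_span_mb:.1f} Mb")
print(f"unplaced span:                    {unplaced_span_mb:.1f} Mb")
```

Under these assumptions the scaffold count before curation works out to roughly 422, which is consistent with a reduction of 20 scaffolds to the reported 402.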
Metadata for specimens, spectral estimates, sequencing runs, contaminants and pre-curation assembly statistics can be found at https://links.tol.sanger.ac.uk/species/444401.

Sample acquisition and nucleic acid extraction
The specimen used for DNA and Hi-C sequencing was a female Andrena haemorrhoa (specimen ID Ox000414, ToLID iyAndHaem1), which was netted in Wytham Woods, Oxfordshire (biological vice-county Berkshire), UK (latitude 51.77, longitude -1.34) on 2020-05-22. The specimen used for RNA sequencing was a male Andrena haemorrhoa (specimen ID Ox001075, ToLID iyAndHaem3), collected from the same [...]
RNA was extracted from abdomen tissue of iyAndHaem3 in the Tree of Life Laboratory at the WSI using TRIzol, according to the manufacturer's instructions. RNA was then eluted in 50 μl RNAse-free water and its concentration assessed [...]
A Hi-C map for the final assembly was produced using bwa-mem2 (Vasimuddin et al., 2019) [...]
Table 3. Software tools: versions and sources.

Genome annotation
The Ensembl gene annotation system (Aken et al., 2016) was used to generate annotation for the Andrena haemorrhoa assembly (GCA_910592295.1). Annotation was created primarily through alignment of transcriptomic data to the genome, with gap filling via protein-to-genome alignments of a select set of proteins from UniProt (UniProt Consortium, 2019).

Wellcome Sanger Institute - Legal and Governance
The materials that have contributed to this genome note have been supplied by a Darwin Tree of Life Partner. The submission of materials by a Darwin Tree of Life Partner is subject to the 'Darwin Tree of Life Project Sampling Code of Practice', which can be found in full on the Darwin Tree of Life website here. By agreeing with and signing up to the Sampling Code of Practice, the Darwin Tree of Life Partner agrees they will meet the legal and ethical requirements and standards set out within this document in respect of all samples acquired for, and supplied to, the Darwin Tree of Life Project.
Further, the Wellcome Sanger Institute employs a process whereby due diligence is carried out proportionate to the nature of the materials themselves, and the circumstances under which they have been/are to be collected and provided for use. The purpose of this is to address and mitigate any potential legal and/or ethical implications of receipt and use of the materials as part of the research project, and to ensure that in doing so we align with best practice wherever possible. The overarching areas of consideration are:
• Ethical review of provenance and sourcing of the material

Data availability
European Nucleotide Archive: Andrena haemorrhoa (red-tailed mining bee). Accession number PRJEB45180; https://identifiers.org/ena.embl/PRJEB45180. (Wellcome Sanger Institute, 2021)
The genome sequence is released openly for reuse. The Andrena haemorrhoa genome sequencing initiative is part of the Darwin Tree of Life (DToL) project. All raw sequence data and the assembly have been deposited in INSDC databases. Raw data and assembly accession identifiers are reported in Table 1.

Are sufficient details of methods and materials provided to allow replication by others? Yes
Are the datasets clearly presented in a useable and accessible format? Yes
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Phylogenetics, ecology and genomics of parasitoid wasps, especially Braconidae
I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.
Reviewer Report 28 May 2024 https://doi.org/10.21956/wellcomeopenres.22128.r75008 © 2024 Batista T. This is an open access peer review report distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Thiago Mafra Batista
Centro de Formacao em Ciencias Ambientais, Universidade Federal do Sul da Bahia, Porto Seguro, State of Bahia, Brazil
This study describes the genome of the Andrena haemorrhoa bee, sequenced with 77-fold coverage using PacBio HiFi and 91-fold coverage using 10X Genomics. The assembly boasts an N50 of 41 Mb. The contigs were scaffolded using Hi-C into 7 chromosomal pseudomolecules, accounting for 88.08% of the genome. The genome completeness is 96.8% (BUSCO Hymenoptera_odb10, n = 5,991). The methods are clear, detailed, and reproducible, and metadata is available.

Strengths:
High Coverage: The 77-fold coverage with PacBio HiFi and 91-fold coverage with 10X Genomics significantly enhance the accuracy and reliability of the genome assembly.
Quality Assembly: An N50 of 41 Mb indicates a high-quality assembly, suggesting long and contiguous contigs.

Use of Advanced Technologies:
The effective combination of cutting-edge sequencing technologies such as PacBio HiFi, 10X Genomics, and Hi-C showcases a robust approach to achieving a high-quality genome assembly.

Chromosomal Pseudomolecule Organization:
The assembly into 7 chromosomal pseudomolecules, covering 88.08% of the genome, provides a useful framework for future structural and functional genomic studies.
High Genome Completeness: The 96.8% completeness (BUSCO) demonstrates that the majority of conserved genes are present, essential for the functional integrity of the sequenced genome.
Transparency and Reproducibility: Detailed methods and the availability of metadata ensure that other researchers can replicate the study, promoting scientific transparency.
Weaknesses:
Partial Genome Coverage: While 88.08% of the genome is covered, 11.92% remains unrepresented in the chromosomal pseudomolecules. The study could benefit from a discussion on the missing genomic content and its potential implications.
Suggestions for Improvement: Additional Information in

Trevor Sless
Biology, York University, Toronto, Ontario, Canada
Summary: The submitted paper presents the complete genome sequence of the mining bee Andrena haemorrhoa. This marks the first representative of the genus Andrena (as well as the family Andrenidae) to have a genome assembly released, alongside several other recently sequenced Andrena species from the Wellcome Sanger Institute and Chongqing Normal University.
The combined use of 10X Genomics, HiFi, and Hi-C sequencing technologies is in keeping with the latest standards in the field of genomics and ensures a high-quality genome, as borne out by the resulting assembly statistics described in the paper. The annotation methodology is similarly robust.
All assembly and raw read datasets have been made publicly available as indicated in the data availability section and will be a valuable resource to future studies into the genetics and genomics of Andrena haemorrhoa and related species. In summary, I recommend that the paper be accepted for indexing in Wellcome Open Research, but provide a few small suggestions as detailed below.
Specific Suggestions: Background: "It is widely distributed across all but the most southerly regions of Europe…" Based on GBIF records, Andrena haemorrhoa is widespread throughout the Palearctic including as far east as Japan, while this text seems to suggest it is only found in Europe; please amend.
Methods: "Low molecular weight DNA was removed from a 20 ng aliquot of extracted DNA using the 0.8X AMpure XP purification kit prior to 10X Chromium sequencing; a minimum of 50 ng DNA was submitted for 10X sequencing." Was the purification kit also used on the HMW DNA sent for HiFi sequencing, or only the 10X Chromium material?
Table 1: The "Percentage of assembly mapped to chromosomes" metric stands out slightly as the only one which does not reach the provided benchmark value. Though I realize the scope of this paper does not leave much space for discussion, are there any possible explanations for this (e.g., perhaps microchromosomes/B chromosomes, or methodological factors)?
Figure 3: I feel that this figure would benefit from rescaling the axes to make better use of the available space. Could the outlier point (which I assume represents the mitogenome) be removed or otherwise included as an inset to allow this?
Is the rationale for creating the dataset(s) clearly described? Yes
Are the protocols appropriate and is the work technically sound? Yes

Are sufficient details of methods and materials provided to allow replication by others? Yes
Are the datasets clearly presented in a useable and accessible format? Yes
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Bees, Evolution, Genomics, Phylogenetics
I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

Figure 2. Genome assembly of Andrena haemorrhoa, iyAndHaem1.1: metrics. The BlobToolKit Snailplot shows N50 metrics and BUSCO gene completeness. The main plot is divided into 1,000 size-ordered bins around the circumference with each bin representing 0.1% of the 330,687,150 bp assembly. The distribution of scaffold lengths is shown in dark grey with the plot radius scaled to the longest scaffold present in the assembly (59,446,405 bp, shown in red). Orange and pale-orange arcs show the N50 and N90 scaffold lengths (41,012,438 and 915,300 bp), respectively. The pale grey spiral shows the cumulative scaffold count on a log scale with white scale lines showing successive orders of magnitude. The blue and pale-blue area around the outside of the plot shows the distribution of GC, AT and N percentages in the same bins as the inner plot. A summary of complete, fragmented, duplicated and missing BUSCO genes in the hymenoptera_odb10 set is shown in the top right. An interactive version of this figure is available at https://blobtoolkit.genomehubs.org/view/iyAndHaem1.1/dataset/CAJUZA01/snail.
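The N50 and N90 values quoted in the caption follow the usual definition: with scaffolds sorted from longest to shortest, Nx is the length of the scaffold at which the running total first reaches x% of the assembly span. The sketch below illustrates that definition on a hypothetical toy list of lengths, since the real scaffold lengths are not listed here; the function name `nx_length` is an assumption for illustration and is not part of BlobToolKit. It also computes the bin width implied by the caption (1,000 bins over 330,687,150 bp).

```python
from typing import Sequence


def nx_length(lengths: Sequence[int], fraction: float) -> int:
    """Return the Nx scaffold length for the given fraction (0.5 for N50, 0.9 for N90)."""
    total = sum(lengths)
    running = 0
    # Walk scaffolds from longest to shortest, accumulating span.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= fraction * total:
            return length
    raise ValueError("lengths must be non-empty")


# Hypothetical toy scaffold lengths in bp (NOT the real assembly); total span is 160.
toy_lengths = [50, 40, 30, 20, 10, 5, 5]
toy_n50 = nx_length(toy_lengths, 0.5)
toy_n90 = nx_length(toy_lengths, 0.9)

# Bin width implied by the caption: 1,000 bins, each 0.1% of the assembly span.
assembly_span_bp = 330_687_150
bin_width_bp = assembly_span_bp / 1000

print(f"toy N50 = {toy_n50}, toy N90 = {toy_n90}")
print(f"snailplot bin width = {bin_width_bp:,.2f} bp")
```

In the toy example the running total passes half of the span at the second scaffold and 90% of the span at the fifth, which is why N90 is much smaller than N50, as it is in the real assembly.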

Figure 3. Genome assembly of Andrena haemorrhoa, iyAndHaem1.1: BlobToolKit GC-coverage plot. Scaffolds are coloured by phylum. Circles are sized in proportion to scaffold length. Histograms show the distribution of scaffold length sum along each axis. An interactive version of this figure is available at https://blobtoolkit.genomehubs.org/view/iyAndHaem1.1/dataset/CAJUZA01/blob.
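For readers unfamiliar with the GC axis of such a plot, the following is a minimal sketch of a per-scaffold GC fraction. It assumes GC is computed over unambiguous bases only (A, C, G, T), with N and other symbols ignored; the exact calculation used by BlobToolKit is not described in this note, and the function name `gc_fraction` and the toy sequence are hypothetical.

```python
def gc_fraction(sequence: str) -> float:
    """Fraction of G and C among the A, C, G and T bases of a sequence (case-insensitive)."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    if gc + at == 0:
        raise ValueError("sequence contains no A, C, G or T bases")
    return gc / (gc + at)


# Hypothetical toy scaffold: 2 G, 2 C, 3 A, 3 T and 2 N (ignored).
toy_scaffold = "ATGCGCNNATAT"
toy_gc = gc_fraction(toy_scaffold)
print(f"GC fraction of toy scaffold: {toy_gc:.2f}")
```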

Figure 4. Genome assembly of Andrena haemorrhoa, iyAndHaem1.1: BlobToolKit cumulative sequence plot. The grey line shows cumulative length for all scaffolds. Coloured lines show cumulative lengths of scaffolds assigned to each phylum using the buscogenes taxrule. An interactive version of this figure is available at https://blobtoolkit.genomehubs.org/view/iyAndHaem1.1/dataset/CAJUZA01/cumulative.

Figure 5. Genome assembly of Andrena haemorrhoa, iyAndHaem1.1: Hi-C contact map of the iyAndHaem1.1 assembly, visualised using HiGlass. Chromosomes are shown in order of size from left to right and top to bottom. An interactive version of this figure may be viewed at https://genome-note-higlass.tol.sanger.ac.uk/l/?d=JrLYTlFyTR2idPajNfYFDg.

Table 2: Include a column in Table 2 with information on the percentage of the genome each pseudomolecule represents, cumulatively totaling 88.08%.

Is the rationale for creating the dataset(s) clearly described? Yes
Are the protocols appropriate and is the work technically sound? Yes
Are sufficient details of methods and materials provided to allow replication by others? Yes
Are the datasets clearly presented in a useable and accessible format? Yes
Competing Interests: No competing interests were disclosed.

Reviewer Expertise: Genomics
I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.