The genome sequence of the Large Scabious Mining Bee, Andrena hattorfiana (Fabricius, 1775)

We present a genome assembly from an individual female Andrena hattorfiana (the Large Scabious Mining Bee; Arthropoda; Insecta; Hymenoptera; Andrenidae). The genome sequence is 428.5 megabases in span. Most of the assembly is scaffolded into seven chromosomal pseudomolecules. The mitochondrial genome has also been assembled and is 22.7 kilobases in length. Gene annotation of this assembly on Ensembl identified 11,349 protein coding genes.


Background
Andrena hattorfiana is a species of mining bee commonly found throughout Europe, from the south of the Scandinavian countries to north Africa, and eastwards to the Caucasus, but its population has declined in Europe and the UK due to habitat loss (Dimond, 2015). In contrast to the common European bee that lives collectively in a colony, A. hattorfiana is solitary and constructs nests in the ground, where it lays its eggs and provisions them with a mixture of pollen and nectar (Larsson & Franzén, 2007; Dimond, 2015). Andrena hattorfiana is oligolectic, feeding on pollen from a single family or genus of flowering plants (Cane & Sipes, 2006; Larsson & Franzén, 2007). Specifically, it feeds on pollen from the flowers of Knautia arvensis and Scabiosa columbaria (Jefferson, 2022; Reemer et al., 2012; Varga et al., 2022). Given its limited diet, A. hattorfiana is endangered by a number of factors: pollen competition, insufficient variability in its habitat, the dearth of traditionally managed meadows (Carvell et al., 2006; Falk, 1991; Larsson & Franzén, 2007; Varga et al., 2022), and the effects of climate, which exacerbate habitat loss.
The generation of a reference genome sequence for A. hattorfiana is likely to aid preservation efforts for this vulnerable species, as well as to help in understanding its broader biology. For instance, the reference genome would help in subsequent assessment of whether there is sufficient genetic diversity in the population for it to adapt to changing conditions and whether deleterious genes are becoming fixed in the population (MacDonald, 2021), and in identifying genetic elements that uniquely characterise A. hattorfiana versus other species of bees.

Genome sequence report
The genome was sequenced from one female Andrena hattorfiana specimen (Figure 1) collected from Wytham Woods, Oxfordshire (biological vice-county: Berkshire), UK (latitude 51.77, longitude -1.33). A total of 49-fold coverage in Pacific Biosciences single-molecule HiFi long reads was generated. Primary assembly contigs were scaffolded with chromosome conformation Hi-C data. Manual assembly curation corrected 134 missing joins or mis-joins, reducing the scaffold number by 49.6%, and increasing the scaffold N50 by 196.76%.
The final assembly has a total length of 428.5 Mb in 125 sequence scaffolds with a scaffold N50 of 79.4 Mb (Table 1). Most (98.36%) of the assembly sequence was assigned to seven chromosomal-level scaffolds. Chromosome-scale scaffolds confirmed by the Hi-C data are named in order of size (Figure 2-Figure 5; Table 2). The scaffold order and orientation are uncertain in the following regions: chromosome 1 (61. . The mitochondrial genome was also assembled and can be found as a contig within the multifasta file of the genome submission. The estimated Quality Value (QV) of the final assembly is 64.5 with k-mer completeness of 100%, and the assembly has a BUSCO v5.3.2 completeness of 96.3% (single = 96.1%, duplicated = 0.2%), using the hymenoptera_odb10 reference set (n = 5,991).
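As a rough illustration of what 49-fold coverage implies, the sketch below relates coverage to the number of sequenced bases. It assumes, for illustration only, that coverage is measured against the final assembly span of 428,532,965 bp reported in the Figure 2 caption; the function names are hypothetical and the read yield is not stated in the text.

```python
# Minimal sketch: fold coverage = total sequenced bases / genome size.
# Assumption: the assembly span stands in for the genome size.

def sequenced_bases(genome_size_bp: int, fold_coverage: float) -> float:
    """Total bases needed to reach a given fold coverage."""
    return genome_size_bp * fold_coverage


def fold_coverage(total_bases: float, genome_size_bp: int) -> float:
    """Fold coverage obtained from a given number of sequenced bases."""
    if genome_size_bp <= 0:
        raise ValueError("genome size must be positive")
    return total_bases / genome_size_bp


ASSEMBLY_SPAN_BP = 428_532_965  # from the Figure 2 caption
HIFI_FOLD = 49                  # reported HiFi coverage

approx_yield = sequenced_bases(ASSEMBLY_SPAN_BP, HIFI_FOLD)
print(f"Approximate HiFi yield: {approx_yield / 1e9:.1f} Gb")
```

Under these assumptions the HiFi data would amount to roughly 21 Gb; the actual yield may differ if coverage was computed against an estimated genome size rather than the assembly span.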
Metadata for specimens, spectral estimates, sequencing runs, contaminants and pre-curation assembly statistics can be found at https://links.tol.sanger.ac.uk/species/1126402.
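The scaffold N50 quoted above is the length of the scaffold at which the cumulative length of size-sorted scaffolds first reaches half of the assembly span. The sketch below shows that definition on made-up scaffold lengths; the function name and the toy values are illustrative and are not taken from the assembly.

```python
# Minimal sketch of the Nx statistic (N50, N90) for a list of scaffold lengths.

def nx(lengths: list[int], fraction: float = 0.5) -> int:
    """Return the length L such that scaffolds of length >= L
    together cover at least `fraction` of the total assembly span."""
    if not lengths:
        raise ValueError("no scaffold lengths given")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    target = fraction * sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):  # longest first
        running += length
        if running >= target:
            return length
    return min(lengths)  # guard against floating-point rounding


# Toy example (hypothetical lengths, arbitrary units)
toy = [80, 70, 50, 30, 20]
print("N50:", nx(toy, 0.5))
print("N90:", nx(toy, 0.9))
```

Applied to the real scaffold lengths, the same calculation with fractions 0.5 and 0.9 would give the N50 and N90 values shown in the Figure 2 snail plot.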

Sample acquisition and nucleic acid extraction
A female Andrena hattorfiana specimen (iyAndHatt1) was collected from Wytham Woods, Oxfordshire (biological vice-county: Berkshire), UK (latitude 51.77, longitude -1.33) on 4 August 2020. The specimen was taken from woodland habitat by Steven Falk (independent researcher) by netting. The specimen was identified by Steven Falk and preserved on dry ice.
DNA was extracted at the Tree of Life laboratory, Wellcome Sanger Institute (WSI). The iyAndHatt1 sample was weighed and dissected on dry ice with tissue set aside for Hi-C sequencing. Thorax tissue was disrupted using a Nippi Powermasher fitted with a BioMasher pestle. High molecular weight (HMW) DNA was extracted using the Qiagen MagAttract HMW DNA extraction kit. HMW DNA was sheared into an average fragment size of 12-20 kb in a Megaruptor 3 system with speed setting 30. Sheared DNA was purified by solid-phase reversible immobilisation using AMPure PB beads with a 1.8X ratio of beads to sample to remove the shorter fragments and concentrate the DNA sample. The concentration of the sheared and purified DNA was assessed using a Nanodrop spectrophotometer and a Qubit Fluorometer with the Qubit dsDNA High Sensitivity Assay kit. Fragment size distribution was evaluated by running the sample on the FemtoPulse system.
RNA was extracted from abdomen tissue of iyAndHatt1 in the Tree of Life Laboratory at the WSI using TRIzol, according to the manufacturer's instructions. RNA was then eluted in 50 μl RNAse-free water and its concentration assessed using a Nanodrop spectrophotometer and a Qubit Fluorometer with the Qubit RNA Broad-Range (BR) Assay kit. The integrity of the RNA was analysed using the Agilent RNA 6000 Pico Kit and Eukaryotic Total RNA assay.

Sequencing
Pacific Biosciences HiFi circular consensus DNA sequencing libraries were constructed according to the manufacturers' instructions. Poly(A) RNA-Seq libraries were constructed using the NEB Ultra II RNA Library Prep kit. DNA and RNA … A Hi-C map for the final assembly was produced using bwa-mem2 (Vasimuddin et al., 2019) in the Cooler file format (Abdennur & Mirny, 2020). To assess the assembly metrics, the k-mer completeness and QV consensus quality values were calculated in Merqury (Rhie et al., 2020). This … Table 3 contains a list of relevant software tool versions and sources.
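The QV consensus quality value is Phred-scaled, so it can be converted to an estimated per-base error probability with E = 10^(-QV/10). The sketch below applies that standard relation to the QV of 64.5 reported for this assembly; the function names are illustrative and are not part of Merqury.

```python
import math

# Minimal sketch of the Phred relation between QV and per-base error rate.

def qv_to_error_rate(qv: float) -> float:
    """Estimated per-base error probability for a Phred-scaled QV."""
    return 10 ** (-qv / 10)


def error_rate_to_qv(error_rate: float) -> float:
    """Phred-scaled QV for a per-base error probability in (0, 1]."""
    if not 0 < error_rate <= 1:
        raise ValueError("error rate must be in (0, 1]")
    return -10 * math.log10(error_rate)


REPORTED_QV = 64.5  # from the genome sequence report
error = qv_to_error_rate(REPORTED_QV)
print(f"Estimated error rate: {error:.2e} per base")
print(f"Roughly one error per {1 / error:,.0f} bases")
```

Read this way, a QV of 64.5 corresponds to an estimated consensus error on the order of one base in a few million.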

Genome annotation
The Ensembl gene annotation system (Aken et al., 2016) was used to generate annotation for the Andrena hattorfiana assembly (GCA_944738655.1). Annotation was created primarily through alignment of transcriptomic data to the genome, with gap filling via protein-to-genome alignments of a select set of proteins from UniProt (UniProt Consortium, 2019).

Ethics and compliance issues
The

Miriam Richards
Department of Biological Sciences, Brock University, Saint Catharines, Ontario, Canada
Falk and Tan, The genome sequence of the Large Scabious Mining Bee, Andrena hattorfiana
This paper provides a chromosome-level genome sequence for the solitary andrenid, A. hattorfiana, a common species which nevertheless may be declining across Europe. The genome was sequenced from a single female specimen collected from the UK. The sequencing quality is high, based on comparison to several benchmarks. An initial estimate of genome size (428.5 Mb) places it at the high end for bees, with seven chromosomal-level scaffolds, identifying 11,349 protein-coding genes, 4,338 non-coding genes, and ~27,000 different transcripts. The data were filtered to ensure that the genome sequences were not from parasites or symbionts. Having a genome sequence is likely to be useful in conservation efforts, noting that a reference genome will be useful in future conservation genetic and taxonomic studies. I agree with this statement, and I am happy to see this new bee genome. I suspect that as bee genomes accumulate, many kinds of useful comparative studies will be enabled in other fields as well, including bioinformatics and molecular evolution.
A minor concern with the paper is that there is no mention of voucher specimens, in case future studies raise queries about the identity of the sequenced specimen. I suppose a DNA barcode, inherent in the genome, could be enough to address this issue. A second minor concern is that the bioinformatics pipelines are described, but it would be useful to provide details of options and switches that may have been used with each piece of software. I'm aware that few papers provide sufficient bioinformatics detail to easily replicate the methods! I was pleased to see Table 3, which simplifies the tracking down of software tools.

Is the rationale for creating the dataset(s) clearly described? Yes
Are the protocols appropriate and is the work technically sound? Yes

Are sufficient details of methods and materials provided to allow replication by others? Partly
Are the datasets clearly presented in a useable and accessible format? Yes
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Bee ecology, behaviour, evolution, bee genomes
I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

Lucio Navarro-Escalante
Department of Molecular Biosciences, The University of Texas at Austin, Austin, Texas, USA
This report presents the full genome sequence of the large scabious mining bee, Andrena hattorfiana, an endangered Hymenoptera species. This is a high-quality, chromosome-scale genome assembly produced using state-of-the-art DNA sequencing technologies. The method section sufficiently describes each of the methodological steps followed. The sequencing, assembly and annotation data are clearly described in the report and are available from repositories for the scientific community. Despite the few comments I have below, I support the indexing of this report and its data.

Comments:
Background:
1. Define clearly the species name for the "common European bee".
2. Consider including a small paragraph or some lines highlighting better the relevance of the large scabious mining bee. To my understanding the bee is listed provisionally within the UK "Nationally Rare Species".

Results:
1. I would suggest to delimit and name the putative chromosome boundaries in the Hi-C contact map in Figure 5.

Methods:
1. Describe in more detail how the specimen was taxonomically identified. Did you use morphological or molecular characters (e.g. COI sequences)?
2. Please, add a brief description of the overall method or name the tool used for checking contamination in the genome assembly.
3. There should be an explanation of why only the abdomen was used for RNA isolation and sequencing.
Is the rationale for creating the dataset(s) clearly described? Yes
Are the protocols appropriate and is the work technically sound? Yes

Are sufficient details of methods and materials provided to allow replication by others? Yes
Are the datasets clearly presented in a useable and accessible format? Yes
Competing Interests: No competing interests were disclosed.
I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

Denilce Lopes
Departamento de Biologia Geral, Universidade Federal de Viçosa, Viçosa, State of Minas Gerais, Brazil
This report accurately describes the genome assembly process of Andrena, which seems to be of good quality, with a good N50 value and BUSCO score. My concern with this manuscript and the other Data Notes from this Consortium is that, except for the Background item, the text is practically the same, changing only the name of the species and the sequencing data. I suggest reviewing this for future publications. For bees in particular, few complete genomes have been sequenced and the data will be useful in preservation efforts and other behavioral studies on these insects.
Is the rationale for creating the dataset(s) clearly described?

Michael Orr
Entomologie, Staatliches Museum für Naturkunde Stuttgart, Stuttgart, Baden-Württemberg, Germany
This paper describes the genome generation of a species of vulnerable bee, and so could be useful for future conservation work, as the authors aptly note. I am not myself an expert in genome quality assessment etc., but Darwin Tree of Life has already sequenced many bees and numerous Andrena successfully, so their practices have presumably been very well refined by now. Generally, the parameters I'm familiar with seem reasonable overall and the openness and completeness of the methods and data sourcing look good to me. The genome size is also pretty close to the other Andrena they've sequenced (with one exception I noticed being A. dorsata with <3mb but that'd be more a potential issue with that assembly?), which bodes well. Overall, I appreciate the efforts of Darwin Tree of Life and am glad for the many new questions that can be approached with these genomes.
Overall, I think that this can be approved and that just some small changes and information added would suffice.My comments are generally rather simple and should be easy to incorporate (although without line numbers or a way to upload tracked changes files it's a little cumbersome...).

Specific comments are given below:
Background:
○ First paragraph ascribes their conservation worries to habitat loss but the second paragraph notes specialization may also play a role, maybe best to explain the underlying cause in one place and using the more complete second listing of potential factors. The first mention could just be that it's listed as vulnerable by the IUCN (with a citation thereof).
○ "the common European bee" in the first paragraph is very vague. Please say honey bee explicitly and list Apis mellifera in parentheses.
○ Maybe not necessary, but another interesting use of the genome that could be mentioned in the third paragraph is the relevance for understanding the genomic toolkits of floral specialists in general.

Methods
○ A bit more should be stated under sample acquisition. ID was in the field, correct? Or was it using a leg later sequenced for COI, or pulling COI from the genome sequence? Are there vouchers that were collected at the same site? Where are they deposited if so?
○ Third paragraph on RNA: shouldn't a whole body have been used for RNA? I guess maybe there are limits on collecting these vulnerable species though…? This could be stated somewhere in the methods explicitly as justification. Also, was the gut removed? We usually did this to reduce the number of nontarget reads but I think it's now less necessary as long read tech matures.

Reviewer Expertise: Bee biodiversity, conservation, and systematics.
I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

Figure 2. Genome assembly of Andrena hattorfiana, iyAndHatt1.2: metrics. The BlobToolKit Snailplot shows N50 metrics and BUSCO gene completeness. The main plot is divided into 1,000 size-ordered bins around the circumference with each bin representing 0.1% of the 428,532,965 bp assembly. The distribution of scaffold lengths is shown in dark grey with the plot radius scaled to the longest sequence present in the assembly (91,187,844 bp, shown in red). Orange and pale-orange arcs show the N50 and N90 scaffold lengths (79,350,654 and 36,459,576 bp), respectively. The pale grey spiral shows the cumulative scaffold count on a log scale with white scale lines showing successive orders of magnitude. The blue and pale-blue area around the outside of the plot shows the distribution of GC, AT and N percentages in the same bins as the inner plot. A summary of complete, fragmented, duplicated and missing BUSCO genes in the hymenoptera_odb10 set is shown in the top right. An interactive version of this figure is available at https://blobtoolkit.genomehubs.org/view/iyAndHatt1.2/dataset/CALYFQ02/snail.

Figure 3. Genome assembly of Andrena hattorfiana, iyAndHatt1.2: GC coverage. BlobToolKit GC-coverage plot. Scaffolds are coloured by phylum. Circles are sized in proportion to scaffold length. Histograms show the distribution of scaffold length sum along each axis. An interactive version of this figure is available at https://blobtoolkit.genomehubs.org/view/iyAndHatt1.2/dataset/CALYFQ02/blob.

Figure 4. Genome assembly of Andrena hattorfiana, iyAndHatt1.2: cumulative sequence. BlobToolKit cumulative sequence plot. The grey line shows cumulative length for all scaffolds. Coloured lines show cumulative lengths of scaffolds assigned to each phylum using the buscogenes taxrule. An interactive version of this figure is available at https://blobtoolkit.genomehubs.org/view/iyAndHatt1.2/dataset/CALYFQ02/cumulative.

Figure 5. Genome assembly of Andrena hattorfiana, iyAndHatt1.2: Hi-C contact map. Hi-C contact map of the iyAndHatt1.2 assembly, visualised using HiGlass. Chromosomes are shown in order of size from left to right and top to bottom. An interactive version of this figure may be viewed at https://genome-note-higlass.tol.sanger.ac.uk/l/?d=NVN2MuBATkat5Wk8tNuFqA.

Reviewer Report 03 October 2023 https://doi.org/10.21956/wellcomeopenres.21534.r67328 © 2023 Navarro-Escalante L. This is an open access peer review report distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Is the rationale for creating the dataset(s) clearly described? Yes
Are the protocols appropriate and is the work technically sound? Yes
Are sufficient details of methods and materials provided to allow replication by others? Partly
Are the datasets clearly presented in a useable and accessible format? Yes
Competing Interests: No competing interests were disclosed.

Table 2. Chromosomal pseudomolecules in the genome assembly of Andrena hattorfiana, iyAndHatt1.
INSDC accession    Chromosome    Size (Mb)    GC%
materials that have contributed to this genome note have been supplied by a Darwin Tree of Life Partner. The submission of materials by a Darwin Tree of Life Partner is subject to the Darwin Tree of Life Project Sampling Code of Practice.
Collaboration Agreement or Material Transfer Agreement entered into by the Darwin Tree of Life Partner, Genome Research Limited (operating as the Wellcome Sanger Institute), and in some circumstances other Darwin Tree of Life collaborators.

Yes
Are the protocols appropriate and is the work technically sound? Yes
Are sufficient details of methods and materials provided to allow replication by others? Yes
Are the datasets clearly presented in a useable and accessible format? Yes
Is the rationale for creating the dataset(s) clearly described? Yes
Are the protocols appropriate and is the work technically sound? Yes
Are sufficient details of methods and materials provided to allow replication by others? Yes
Are the datasets clearly presented in a useable and accessible format? Yes
Competing Interests: No competing interests were disclosed.

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.
Reviewer Report 22 August 2023 https://doi.org/10.21956/wellcomeopenres.21534.r65569 © 2023 Orr M. This is an open access peer review report distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.