De Novo Genome Assemblies for Three North American Bumble Bee Species: Bombus bifarius, Bombus vancouverensis, and Bombus vosnesenskii

Bumble bees are ecologically and economically important insect pollinators. Three abundant and widespread species in western North America, Bombus bifarius, Bombus vancouverensis, and Bombus vosnesenskii, have been the focus of substantial research relating to diverse aspects of bumble bee ecology and evolutionary biology. We present de novo genome assemblies for each of the three species using hybrid assembly of Illumina and Oxford Nanopore Technologies sequences. All three assemblies are of high quality with large N50s (> 2.2 Mb), BUSCO scores indicating > 98% complete genes, and annotations producing 13,325 – 13,687 genes, comparing favorably with other bee genomes. Analysis of synteny against the most complete bumble bee genome, Bombus terrestris, reveals a high degree of collinearity. These genomes should provide a valuable resource for addressing questions relating to functional genomics and evolutionary biology in these species.


Bombus bifarius
Bombus vancouverensis Bombus vosnesenskii Illumina Oxford Nanopore Technologies hybrid assembly MaSuRCA Bumble bees (Hymenoptera: Apidae: Bombus) are a widespread and iconic pollinator genus of approximately 250 species globally (Cameron et al. 2007;Cameron and Sadd 2020). Bumble bees provide economic and ecological benefits through their pollination services (Greenleaf and Kremen 2006;Velthuis and van Doorn 2006) and have recently generated significant public interest because, like many pollinators, numerous species have undergone rapid declines across the globe Cameron and Sadd 2020). Bumble bees also have a long history of study with respect to adaptations in traits like thermoregulation, flight biomechanics, host-parasite interactions, and evolution of sociality, and the advent of genomic tools has greatly accelerated our understanding of their biology (Woodard et al. 2015;Lozier and Zayed 2016). However, there remain a limited number of genetic resources to study such adaptive traits and help conservation efforts, with only two annotated reference genomes available on the National Center for Biotechnology Information (NCBI) for B. terrestris and B. impatiens (Sadd et al. 2015) and one additional genome published for B. terricola (Kent et al. 2018). Species-specific reference genomes can be valuable for addressing population genetic questions about novel targets of selection, comparing structural genetic variation within and between species, and more accurate functional genomics studies including transcriptomics (RNAseq) or epigenetics (e.g., bisulfite sequencing) (Fuentes-Pardo and Ruzzante 2017). Additional genomic resources could also prove to be useful from a conservation standpoint, including better understanding of species and population-specific genetic variation (Allendorf 2017).
We present three new assemblies for species within the Bombus subgenus Pyrobombus (Cameron et al. 2007;Williams et al. 2008 (Ghisbain et al. 2020) (Figure 1). These species are among the most common bumble bees in western North America (Koch et al. 2012) and have been the focus of recent research that includes studies investigating gene flow, foraging range, and genetic diversity in natural and agricultural systems Rao and Strange 2012;Jha and Kremen 2013;Jha 2015;Geib et al. 2015;Jackson et al. 2018;Mola et al. 2020), climate-driven local adaptation across spatial-environmental gradients (Jackson et al. 2020), and color pattern variation and speciation Ghisbain et al. 2020).
We employ a combination of long-read (Oxford Nanopore Technologies, ONT, Oxford, UK) and short-read (Illumina, San Diego, CA) sequencing technologies to produce high-quality hybrid assemblies for each of the three species. The assemblies produced here perform well compared to other published Bombus genomes and will provide a valuable resource as reference genomes for research into comparative and evolutionary genetics of Bombus.

DNA Extraction and Sequencing
Individuals used for genome sequencing include a field-collected female B. vancouverensis nearcticus worker (JDL1245, diploid) from Sequoia National Park, CA, a field-collected male B. bifarius (JDL3187, haploid) from Arapaho National Forest, CO, and two males from a laboratory B. vosnesenskii colony (JDL3184-5, both haploid) reared at the USDA-ARS Pollinating Insects Research Unit in Logan, UT (queen from Emigrant Lake, OR) ( Table 1). Whole genomic DNA was extracted from the thorax, and in some cases include abdominal tissue if additional DNA was required for Illumina sequencing, using the Qiagen (Valencia, CA) MagAttract High Molecular Weight kit or DNeasy Blood and Tissue kit (for Illumina sequencing only). Extractions were cleaned using Ampure XP beads (Beckman-Coulter, Pasadena, CA) at a 0.4x concentration to remove small DNA fragments and improve purity.
Long read sequences were obtained using the SQK-LSK109 reaction kit on an ONT Gridion instrument. Libraries were prepared following the manufacturer's protocol. One R9.4.1 flow cell each was used for each specimen. Basecalling was performed with Guppy (v3.30 for B. vancouverensis, v.3.0.3 for other taxa) using default system settings. ONT reads for each species were concatenated into single fastq files and cleaned of adapters and chimeric reads using Porechop (https://github.com/rrwick/Porechop), including the-discard_middle flag. Short read sequence data (all 150 bp paired end reads) were generated by Illumina sequencing. Whole genome library preparation and sequencing for the B. vosnesenskii male (only JDL3184 was used for Illumina) was performed by HudsonAlpha Institute for Biotechnology (Huntsville, AL) on an Illumina Hiseq X, and data for B. bifarius and B.vancouverensis were generated by Psomagen, Inc (Rockville, MD) using the Illumina NovaSeq6000 S4 platform.

Assembly
Preliminary assemblies indicated that reads included common bee symbionts and other bacteria, so we attempted to eliminate many such contaminants by preprocessing long and short read data sets using the bbtools program bbduk (Bushnell 2020). Decontamination used k-mer filtering (k = 29, although preliminary runs suggested altering this parameter made little difference in filtering results) against a reference dataset. This included a default set of contaminant reference fasta files provided with bbtools and supplemented with common bumble bee symbionts and other bacterial genomes detected as possible contaminants in preliminary assemblies, all downloaded from NCBI GenBank (Table S1). We filtered for human contaminants by mapping to a masked hg19 version of the human genome as a reference with the bbtools program bbmap, following recommended protocols in the manual (human contamination was minimal with ,0.0001% of reads removed in all data sets).
The cleaned long and short reads were assembled using the MaSuRCA hybrid assembler (Zimin et al. 2013) with default parameters except for the Jellyfish estimate, which was set at 2.5 · 10 10 (estimated genome size of 250 Mb, based on the closely related Bombus impatiens genome length (Hines 2008, Sadd et al. 2015, multiplied by an estimated coverage of 100x). MaSuRCA v3.3.1 was used for Bombus vosnesenskii and version 3.3.5 was used for other taxa.
We evaluated initial assemblies for contaminants using BlobTools v1.0 (Laetsch et al. 2017) to identify possible contaminants in the draft assembly based on scaffold taxonomic assignment (family level), sequencing coverage, and GC content. To prepare data for BlobTools, the Illumina and ONT reads were aligned to the draft genomes with BWA v0.7.15-r1140 (Li and Durbin 2009) and minimap2 v2.10 (Li 2018), respectively, and resulting files were sorted in SAMtools v1.10 . A reference database for taxonomic assignment of scaffolds was created with blastn v2.9.0 using the NCBI nt database (downloaded on Feb. 18, 2019), with settings following the BlobTools manual (https://blobtools.readme.io/docs/taxonomy-file). Reads not mapping to scaffolds with a BLAST assignment to Hymenoptera were filtered out of the datasets. Remaining reads were used to run a second round of assembly with MaSuRCA, and then re-checked with Blob-Tools. As a final step, we filtered likely mitochondrial contaminants or other artifacts of assembly by excluding residual scaffolds , 10kb, matching families not within the order Hymenoptera, or with GC , 25%.
The genome assemblies were polished using PILON v1.23 (Walker et al. 2014), which uses the accurate Illumina data to correct for small base pair errors and small indels. Illumina reads were mapped to second-round assemblies with BWA and sorted with SAMtools. PILON was run with default parameters and repeated for a total of four sequential rounds of polishing.

Quality assessment and species verification
Basic assembly statistics (genome length, number of scaffolds, GC content, and N50) were generated using QUAST v5.0 (Gurevich et al. 2013). We expected assemblies to be 250 Mb in size with a 38% GC content based on other Bombus genomes (Sadd et al. 2015;Kent et al. 2018). We examine genome completeness using BUSCO v4 (Simão et al. 2015) to search for orthologous genes using the OrthoDB v.10 (Kriventseva et al. 2019) Hymenopteran dataset (n = 5,991 genes) and using the bombus_impateins1 AUGUSTUS (Stanke and Morgenstern 2005) species parameter to optimize gene prediction. BUSCO genes detected in the assemblies were classified as single copy, duplicated, or fragmented. In the case of single or duplicated genes, this indicates that the gene is present and within 95% of its expected length, whereas fragmented genes fall outside of this limit (Simão et al. 2015). We perform basic comparisons of assembly statistics against two existing Bombus genomes (Sadd et al. 2015): Bombus impatiens assembly version 2.2 (BIMP_2.2 Assembly Accession GCF_000188095.3) and Bombus terrestris assembly version 1.0 (Bter_1.0 Assembly Accession GCF_000214255.1).
Finally, using the completed assemblies, we tested that genomes for B. bifarius and B. vancouverensis, two sister species in a morphologically cryptic complex (Ghisbain et al. 2020), were representative of diversity for their respective lineages. To confirm that our new genomes reflect these newly delimited species, we examined two nuclear genes (serrate RNA effector and sodium/ potassium-exchanging ATPase subunit alpha) that were previously determined to produce diagnostic haplotypes for B. bifarius and B. vancouverensis and were employed for species delimitation in samples from throughout the B. bifarius -B. vancouverensis range (Ghisbain et al. 2020). We used blast to identify the relevant regions from the assemblies before aligning with GenBank sequences for the two species (NCBI PopSet 1803131478 for the RNA effector; PopSet 1803131398 for the ATPase) and generated a neighbor-joining distance tree (Jukes Cantor model) in Geneious Prime 2020.1.1 (Biomatters, Aukland NZ).

Annotation
Genomes were submitted to NCBI RefSeq (Rajput et al. 2019) for annotation using the Eukaryotic Genome Annotation pipeline v8.4. This method has been used in other Bombus genomes (www.ncbi.nlm.nih.gov/genome/annotation_euk/all/), and thus provides standardization in the annotation methodology among members of the genus. For the annotations, in addition to transcript resources already on GenBank, we provided available RNAseq data generated from several individuals for each species from other prior and ongoing studies (see Data Availability below).

Synteny analysis
We assessed synteny with the D-GENIES web-based software (Cabanettes and Klopp 2018), using minimap2 for alignment. We visualized synteny between our novel genomes and the previously assembled Bombus terrestris v1.0 genome. We use B. terrestris, which is more divergent from focal taxa than B. impatiens (18-20 million years ago; Hines 2008) but is a near-complete assembly with 18 linkage groups corresponding to the 18 bumble bee chromosomes (Owen et al. 1995;Stolle et al. 2011;Sadd et al. 2015). We restricted analyses to larger scaffolds (.100kb) for clearer visualization. To illustrate an example of one utility for these new assemblies, we performed a more detailed analysis of synteny for one scaffold which contains a high density of SNPs previously associated with signatures of selection relating to color pattern and environmental adaptation (NT_176739.1 in B. impatiens), especially in the region of the Xanthine dehydrogenase/oxidase-like gene in B. bifarius and B. vancouverensis Ghisbain et al. 2020;Jackson et al. 2020). We had previously hypothesized that the outlier behavior across this region could involve a large-scale structural mutation . We examined synteny across homologous scaffolds for our new genomes (identified with blastn) and B. impatiens NT_176739.1 using the MAUVE (Darling et al. 2004) plug-in for Geneious to align scaffolds and compute co-linear blocks.
n■ Table 2 Sequencing statistics. Number of reads (read pairs for Illumina) at each stage in data filtering, including raw data, bacteriafiltered first-assembly data, and filtered data passed to the final second-round assembly. Estimated coverage is based on the number of sequence bases provided to the final assembly and an assumed genome size similar to B. impatiens (245.9 Mb)

Species
Sequencing

Assembly quality
After the preliminary assembly round, more than 95% of the scaffolds had BLAST hits to Apidae and other Hymenoptera (Figure 2). Some scaffolds represented bacterial or other genomes not removed during the pre-assembly decontamination and many small scaffolds did not match any taxonomic group. Such scaffolds represented less than 5% of the total genome assemblies, however (, 0.10 Gb for both Illumina and ONT reads; Figure 2, Table 2). Following filtering with Blob-Tools, the second-round assemblies had many fewer contaminants, with ,0.01% of scaffolds having non-Hymenopteran BLAST hits ( Figure 2). The final assembly lengths (Table 3)    contents were similar to other Bombus at 38% (Sadd et al. 2015) ( Figure 2, Table 3). The assemblies are all highly complete, with N50's of 2.2 -3.06 and BUSCO scores of .98% for the 5,991 genes in the OrthoDB v.10 Hymenoptera lineage dataset. Most of these (.97.7%) were single copy, with ,1% of the reference gene set duplicated or fragmented, and 1-1.3% missing (Figure 2, Table 3). Although haploid males allow sequencing of phased haplotypes, there was not a clear influence of ploidy of starting material (haploid male for B. bifarius, diploid worker for B. vancouverensis, two haploid males for B. vosnesenskii) on the final assembly quality in terms of N50 or BUSCO scores (Table 3). Finally, we also confirm that sequences from B. bifarius and B. vancouverensis assemblies for two genes previously employed as diagnostic evidence for species delimitation (Ghisbain, et al. 2020) were representative of the diversity found in each species ( Figure 3A, B).

Annotation
Gene predictions from the NCBI Eukaryotic Annotation Pipeline resulted in 13,325 -13,687 genes for the new genomes, with statistics for numbers of genes, transcripts, and other features largely consistent across assemblies and with previous Bombus assemblies (Table 4).

Genome comparisons
We analyzed the synteny of genomes with respect to the B. terrestris genome assembly. The new genomes were highly collinear with the 18 B. terrestris linkage groups (Figure 4), although several rearrangements were apparent in each species. Such patterns may reflect some true rearrangements between the focal species assemblies and B. terrestris genome that have occurred over 18-20 million years of divergence but may also be the result of a small number of assembly artifacts. Our focused MAUVE analysis of the region containing a high density of putatively adaptive loci from prior studies (including a possible color-associated gene Xanthine dehydrogenase/oxidase-like; Pimsler et al. 2017;Ghisbain et al. 2020;Jackson et al. 2020) revealed a single collinear block ( Figure 5A, B). This suggests no major rearrangements in the new genomes with respect to the B. impatiens genome previously used as a reference genome for population genomics in these species. Thus, major structural mutations may not explain unusual patterns of variation in the region, although more detailed intraspecific analyses will be necessary to fully rule out a role for segregating rearrangements in local adaptation.
In conclusion, the assemblies for each of the three focal species were highly complete, demonstrating how Illumina and long read ONT sequences can be used to quickly and inexpensively assemble high quality de novo genomes. The newly assembled genomes are intact, with 90% of each assembly contained in scaffolds $ 50kb and N50's . 2.2 Mb, and compare well to the other published bumble bee genomes in terms of overall size, GC content, BUSCO scores, and numbers of predicted genes and other features. Further, the results support previous observations that bumble bee genome structure tends to be conserved over relatively deep timescales, with large-scale n■ Table 4 Annotation statistics from NCBI Eukaryotic genome annotation pipeline for the three focal species genomes in comparison to other Bombus genomes. In the case of the three focal genomes, all are on annotation release 100 whereas B. impatiens and B. terrestris are on 103 and 102, respectively. Details on data used for annotation and comparative statistics are available at the NCBI links given in footnotes a-c  synteny over 18-20 million years of separation (Sadd et al. 2015). We expect these new references will quickly become useful for bumble bee biologists. For example, the species included here have all recently been studied with genomic data to detect the influence of climate and landscape composition on dispersal and local adaptation (Jackson et al. 2018;Jackson et al. 2020;Mola et al. 2020), however, such studies have largely been limited to genome-reduction methods (e.g., RAD-seq) and have required cross-species mapping to available reference genomes. These new assemblies will open the door for more extensive whole genome resequencing to uncover unique genomic variation that may be shaped by environmental conditions for each species (Fuentes-Pardo and Ruzzante 2017). These assemblies will also provide additional data points for multi-genome comparative studies in Bombus and other bees (e.g., Kapheim et al. 2015;Lin et al. 2019). These new resources should thus prove valuable for researchers looking to answer questions relating to diverse aspects of bumble bee evolutionary biology.