A Chromosome-Level Genome Assembly and Evolution Analysis of Andrena camellia (Hymenoptera: Andrenidae)

Abstract Andrena camellia, an effective pollinator of the economically significant crop Camellia oleifera, can withstand the toxic pollen of C. oleifera, making An. camellia crucial for resource conservation and cultivation of C. oleifera. In this study, the whole genome of An. camellia was sequenced on the Oxford Nanopore platform. The assembled genome size was 340.73 Mb including 50 scaffolds (N50 = 47.435 Mb) and 131 contigs (N50 = 17.2 Mb). A total of 11,258 protein-coding genes were annotated; in addition, 1,104 noncoding RNAs were identified. Further analysis shows that some chromosomes of An. camellia have a high level of synteny with those of Apis mellifera, Osmia bicornis, and Andrena minutula. Thus, our reported genome of An. camellia serves as a valuable resource for studying species evolution, behavioral biology, and adaption to toxic pollen of C. oleifera.


Introduction
Currently, in the wasp and bee superfamily Apoidea (Hymenoptera), over 20,000 species of bee have been identified and are widely distributed around the world (Danforth et al. 2013;Ascher and Pickering 2020). The family Andrenidae is a well-established clade of the bee family, containing more than 3,000 species (Ascher and Pickering 2020). Approximately 1,550 species have been described in the Andrena genus (Ascher and Pickering 2020). Although the taxonomy and phylogeny of this genus have been well studied, molecular phylogeny and evolution have rarely been addressed. However, Andrena minutula has been sequenced and assembled in 358 sequence scaffolds with a total length of 380 Mb and a scaffold N50 of 50.2 Mb, and 92.19% assembly sequence was assigned to seven chromosomes (Falk and Blomfield-Smith 2022).

GBE
Andrena camellia is a typical solitary bee, whose foraging behavior can improve the fruit setting rate of Camellia oleifera. Therefore, it was considered as an effective pollinator (Huang et al. 2021;Li et al. 2021). However, studies have shown that the pollen of C. oleifera is toxic to many bee species, which prevents bee colonies from feeding on it (Su et al. 2011). Therefore, we are interested in the possible reasons why C. oleifera pollen is not poisonous to An. camellia. To investigate these issues and generate a needed community resource, we sequenced the entire genome of An. camellia and annotated noncoding RNAs, genomic elements, protein-coding genes (PCGs), and repeat sequences. In addition, an evolution analysis of gene families of several important bee lineages was carried out. In conclusion, this study not only provides important data for an in-depth understanding of the evolutionary history and behavioral biology of An. camellia, but also helps to elucidate its resistance to toxic pollens.

Results and Discussion
Genome Sequencing and Assembly A total of 38.32 Gb second-generation-sequencing clean reads (N50 = 21.01 kb) and 33.28 Gb nanopore clean data (N50 = 6.15 kb) were obtained. In addition, 40.99 Gb data were obtained on the Illumina NovaSeq 6000 platform, including 10.02 Gb transcriptome reads and 30.97 Gb Hi-C reads. The predicted genome size is 332.84 Mb, with low heterozygosity (supplementary table  S1 . 1a).
The genome integrity was assessed using the insect BUSCO reference database (genes = 1,367). The integrity of the initial assembly increased through to the final version, with a nearly complete single-copy set of BUSCO genes, and low duplication rate (table 1). The mapping rates of the second-generation whole-genome data, RNA-seq and the third-generation nanopore ONT raw sequences were 98.59%, 95.26%, and 98.63%, respectively. These results indicate that our genome assembly had high gene content with minimal duplicate content.
Compared with the genomes available in the family Andrenidae, our genome size is smaller than Andrena hattorfiana (428.5 Mb), Andrena fulva (461.7 Mb), and Andrena bucephala (379.8 Mb); larger than Andrena haemorrhoa (330.7 Mb) and Andrena dorsata (277.3 Mb); slightly less than An. minutula (380.4 Mb), though the contig N50 is much longer than most of these previous assemblies (

Genome Annotation
RepeatMasker analysis showed that there are 508,701 repeats in the whole genome, accounting for 44.65% of the content. Further analysis revealed that the highest percentage of repeats was DNA elements (16.03%), followed by unknown (16.75%), long-terminal repeats (LTRs; 6.79%), long interspersed elements (LINEs; 1.08%), and simple repeats (0.73%; supplementary table S2, Supplementary Material online).
Through the MAKER pipeline, 11,258 PCGs were annotated with an average length of 7,029 bp. The average number of exons per gene is 6.9 and the average length is 360.6 bp, whereas the number of introns per gene is 5.9 and the length is 839.2 bp. In addition, there were 6.7 CDS per gene, with the mean length 259.7 bp. BUSCO analysis identified 1,141 single-copy, 5 fragmented, 14 missing, and 207 duplicated BUSCOs (table 1). The genome contains 10,866 genes encoding 9,342 proteins (82.98%). A total of 8,602 and 4,049 PCGs were   Comparative genomic assessment showed a significant correspondence between the chromosomes of An. camellia, Apis mellifera, and Osmia bicornis. For example, chromosome 7 of An. camellia had extensive synteny with chromosomes 14 and 15 of Ap. mellifera, chromosomes 9 and 10 of O. bicornis, chromosome 7 ends and chromosome 8 of Ap. mellifera have homologous segments on chromosome 3 of An. camellia, the terminal differentiation of An. camellia chromosome 6 into Ap. mellifera chromosome 8 and O. bicornis chromosome 10. In comparison with An. minutula, there was a clear one-to-one block between our assembled genome and the An. minutula genome, that is, they was high collinearity and the syntenic regions of An. camellia chromosome 4 were located on chromosome 1, 3, and 5 of An. minutula, which provides evidence for the phylogeny of Apoidea at the chromosomal level ( fig. 1c and d).

DNA and RNA Sequencing
Andrena camellia was collected at the C. oleifera base in Shiti Town, Xiushan, Chongqing (28.43N, 109.13E), and the samples were placed in liquid nitrogen and stored at −80 °C before being sent to Beijing Berry Genomics for sequencing. Species identification was performed using universal primers (LCO1490 and HCO2198) for amplification and blast on NCBI (Folmer et al. 1994). A 1D DNA ligation sequencing kit (SQK-LSK109) was used to extract genome DNA for single-molecule sequencing using Oxford nanopore triple sequencing. For nanopore sequencing, an insert with a length of 40 kb was created, whereas in secondgeneration sequencing based on CTAB method, a 350-bp insert size was created for sequencing at the Beijing Genomics Institute. RNA was extracted and libraries were constructed using TRIzol Reagent and TruSeq RNA v2 kits, respectively; Hi-C library was also constructed for Illumina NovaSeq 6000 sequencing (2 × 150 bp), one specimen each for Nanopore, second-generation, transcriptome, and Hi-C sequencing.
Quality control and trimming of raw data were performed by BBTools v38.82 (Bushnell 2014). Duplicate sequences were moved by clumpify.sh-, whereas bbduk.sh was used for specific quality control, removal of sites with a base quality score below 20 (>Q20), filter sequences below 15 bp in length, trim-off poly-A/G/C tails longer than 10 bp, then compare orthobases using overlapping reads.
We also performed K-mer analysis using GenomeScope v2.0 (Ranallo-Benavidez et al. 2020) to predict genome size, the maximum k-mer coverage cutoff is set to 10,000, the k-mer frequencies are evaluated using khist.sh (BBTools) with the length set to 21 -mer., and the GenomeScope parameter is set to: -k 21 -p 2 -m 10000.
Genome integrity assessment was performed using BUSCO v5.2.2 (Manni et al. 2021) based on the insec-ta_odb10 (n = 1,367) reference database. In addition, we mapped DNA-and RNA-sequencing reads to the assembled genome using Minimap2. SAMtools was used to count the mapping rate to check the raw data utilization and genome assembly integrity.

Genome Annotation
The repeat-sequence database of An. camellia was created using the LTR search program (-LTRStruct) in RepeatModeler v2.0.2a based on ab initio prediction and repeat-sequencespecific structure, which was then merged with the RepBase-20181026 (Bao et al. 2015) and Dfam 3.3 (Hubley GBE et al. 2016) databases. Finally, RepeatMasker v4.1.2pl (Flynn et al. 2020) was used to search for repeat sequences.
To annotate gene function, we adopted two strategies: gene functions were first predicted by comparison with the already existing database UniProtKB, and using the sensitive mode (-very- To annotate rRNA, snRNA, and miRNA, we compared them with well-known noncoding RNA libraries (Rfam database) by utilizing Infernal v1.1.4 (Nawrocki and Eddy 2013). Additionally, tRNA prediction was performed by utilizing tRNAscan-SE v2.0.9 (Chan and Lowe 2019).
The genomes of An. camellia, Ap. Mellifera, and O. bicornis were assessed for chromosome-level gene content and shared gene order; we also compared An. camellia with An. minutula. Pairwise protein sequence analysis was performed using the Blastp in MMseq2 v12-113e3 with parameters of s 7.5 -alignment-mode 3 -num-iterations 4 -e 1e-5 -max-accept 5 (Steinegger and Soding 2017). In addition, this all.blast result file and the integrated comment file all.gff file were used as input files to obtain the covariance analysis results in MCScanX and visualized by TBtools v1.0692 (Chen et al. 2020).