Short reads from honey bee (Apis sp.) sequencing projects reflect microbial associate diversity

High throughput (or ‘next generation’) sequencing has transformed most areas of biological research and is now a standard method that underpins empirical study of organismal biology, and (through comparison of genomes), reveals patterns of evolution. For projects focused on animals, these sequencing methods do not discriminate between the primary target of sequencing (the animal genome) and ‘contaminating’ material, such as associated microbes. A common first step is to filter out these contaminants to allow better assembly of the animal genome or transcriptome. Here, we aimed to assess if these ‘contaminations’ provide information with regard to biologically important microorganisms associated with the individual. To achieve this, we examined whether the short read data from Apis retrieved elements of its well established microbiome. To this end, we screened almost 1,000 short read libraries of honey bee (Apis sp.) DNA sequencing project for the presence of microbial sequences, and find sequences from known honey bee microbial associates in at least 11% of them. Further to this, we screened ∼500 Apis RNA sequencing libraries for evidence of viral infections, which were found to be present in about half of them. We then used the data to reconstruct draft genomes of three Apis associated bacteria, as well as several viral strains de novo. We conclude that ‘contamination’ in short read sequencing libraries can provide useful genomic information on microbial taxa known to be associated with the target organisms, and may even lead to the discovery of novel associations. Finally, we demonstrate that RNAseq samples from experiments commonly carry uneven viral loads across libraries. We note variation in viral presence and load may be a confounding feature of differential gene expression analyses, and as such it should be incorporated as a random factor in analyses.

244 evidence of Apis association (Fig. 2). Of the former, we detected most common gut symbionts 245 known to be associated with honey bees (Kwong & Moran 2016), i.e., Bartonella, 246 Bifidobacterium, Frischella, Gilliamella, Lactobacillus, and Snodgrassella, but not 247 Parasaccharibacter. Of the pathogens screened for, we detected Crithidia/Lotmaria (which are 248 inconsistently annotated in NCBI's nucleotide database and therefore not differentiated here; 249 Schwarz et al. 2015), Nosema, and Spiroplasma (Fig. 2). We did not find evidence for 250 chalkbrood, American foulbrood, or European foulbrood in the investigate DNA libraries 251 (Ascosphaera, Paenibacillus, and Melissococcus, respectively). Sequences from organisms not 252 naturally associated with honey bees included those from fungi (Ascomycota) and plants, that 253 were likely not part of the native microbiome of the sequenced samples. These contaminations 254 were crossed-checked via manual online BLAST searches and were confirmed to represent 'true' 255 hits with high and continuous identities with the respective database sequences.

256
Because the majority of hits in the DNA sequencing libraries were Lactobacilli, we repeated 257 the screening, this time using only Lactobacillus 16S sequences as baits. We found 121 258 Lactobacillus sequences in 40 of the 993 investigated libraries, corresponding to 25 OTUs 259 (estimated with mothur using a 5% cutoff). In our phylogenetic analysis based on 16S rRNA 260 sequences, most of the detected strains clustered within Lactobacillus groups known to be 261 associated with honey bees (Fig. 3a). Of the recovered sequences not clustering within these 262 lineages, three were found to group with other Apis-associated Lactobacilli as sister group to the 263 Lactobacillus coryniformis group (Fig. 3a). Online BLAST searches revealed Fructobacillus 264 species as closest matches based on 16S rRNA sequence.

265
Next, we aimed at recovering draft genome sequences of bee-associated Lactobacilli. We 266 chose an Apis mellifera intermissa sequencing library from which 16S sequences of both L. Manuscript to be reviewed 267 kunkeei and Fructobacillus isolates were detected in our screen. The contigs of a meta-assembly 268 were taxonomically annotated, and reads matching to the respective target taxa were then 269 assembled and annotated separately. For each assembly, we performed a phylogenetic analysis 270 based on all single copy orthologs shared with related genomes (Fig. 3b, c), thus confirming the 271 identity of the strains as L. kunkeei (Fig. 3b) and Fructobacillus (Fig. 3c). Both genomes were 272 highly covered and mostly complete based on the presence of conserved markers (Fig. 3d).
273 Finally, we recovered the genome of a Spiroplasma melliferum strain from another Apis 274 sequencing library (Fig. 4). In the meta-assembly, Spiroplasma and Apis contigs could be clearly 275 separated by coverage and taxonomic annotations (Fig. 4b (Fig. 5a). In many cases, the viral reads made up only a 285 negligible proportion of a sequencing library (Fig. 5b). However, in 75 RNA libraries, more than 286 5%, and up to 54% of the reads were of viral origin, indicating substantial viral loads in the 287 honey bees from which the samples were originally prepared (Fig. 5b) Manuscript to be reviewed 290 and B types of DWV and considerable genetic variation is evident especially in the DWV A 291 samples (Fig. 6). Furthermore, viruses extracted from libraries of the same project were often 292 phylogenetically closely related (Fig. 6), which is not surprising given that samples in one 293 project were commonly from a single geographic location (Table S2).
294 Viral infections usually result in a host response that should be measurable by means of 295 differentially expressed genes (e.g., immunity-related genes). We tested this prediction using 346 or from data that will become available as by-product of future honey bee sequencing projects.

347
In addition to the targeted microbes, a number of taxa that are likely not part of the natural 348 Apis microbiome were detected. For example, we detected Aspergillus in several sequencing 349 libraries that originated from museum material, which likely represents post mortem saprophytic 350 growth. We also retrieved hits to plant sequences which might originate from co-amplified and 351 sequenced pollen DNA (Fig. 2). This 'false discovery' illustrates an important caveat in our 352 approach: the differentiation between host-associated microbes and microbes from other sources 353 may not always be possible, and will be particularly difficult for museum specimens. Though not 354 problematic in the examples we present, the situation is likely more complicated for study of host 355 species with a less well-investigated microbiome, or for symbionts that are very similar to 356 environmental taxa. In these cases, the approach will establish candidates that will then require 357 direct validation. Manuscript to be reviewed 380 honey bees. We also did not find evidence for these taxa in Apis with our screening approach and 381 thus provide further support for their absence based on a large sample size.   1 Table 2: Evidence for differential gene expression of honey bees in response to viral infections 2 across four RNAseq projects (see Table 1). 20 candidate loci were investigated and differential 3 expression was determined with edgeR (see text for details). Manuscript to be reviewed  Manuscript to be reviewed  Table S5 for a complete list.   Table S3). Blue taxon label corresponds to the L. kunkeei strain recovered from 'contaminants' in library SRR1046114. Bootstrap values are given on nodes. See Table S3 for sources of genomes. c) Maximum likelihood tree of Fructobacillus (F.) and Leuconostoc (L.) species based on 435 concatenated single copy orthologs (145,069 amino acid positions).
Tree is rooted with Lactobacillus delbruecki. Numbers on nodes correspond to bootstrap values. Again, blue taxon label denotes the Fructobacillus genome recovered from the 'contaminated' library SRR1046114. Note that the phylogenetic distance between Fructobacillus fructosus and the novel genome is similar to other between-species distances in this tree. See Table S3 for Table S3. Manuscript to be reviewed