A bioinformatics approach to identifying Wolbachia infections in arthropods

Wolbachia is the most widespread endosymbiont, infecting >20% of arthropod species, and capable of drastically manipulating the host’s reproductive mechanisms. Conventionally, diagnosis has relied on PCR amplification; however, PCR is not always a reliable diagnostic technique due to primer specificity, strain diversity, degree of infection and/or tissue sampled. Here, we look for evidence of Wolbachia infection across a wide array of arthropod species using a bioinformatic approach to detect the Wolbachia genes ftsZ, wsp, and the groE operon in next-generation sequencing samples available through the NCBI Sequence Read Archive. For samples showing signs of infection, we attempted to assemble entire Wolbachia genomes, and in order to better understand the relationships between hosts and symbionts, phylogenies were constructed using the assembled gene sequences. Out of the 34 species with positively identified infections, eight species of arthropod had not previously been recorded to harbor Wolbachia infection. All putative infections cluster with known representative strains belonging to supergroup A or B, which are known to only infect arthropods. This study presents an efficient bioinformatic approach for post-sequencing diagnosis and analysis of Wolbachia infection in arthropods.

Currently, all Wolbachia strains are classified as a single species, with further classification into at least sixteen supergroups, A-Q (Lo et al., 2007;Lindsey et al., 2016a). The four most well studied clades are supergroups A-D. Supergroups A and B are monophyletic and are the most common supergroups known to infect arthropods, while supergroups C and D infect filarial nematodes (Gerth et al., 2014). Supergroup G was discovered to be a recombinant between supergroups A and B; thus it is no longer considered a distinct lineage (Baldo & Werren, 2007). Supergroups E-Q infect a variety of hosts including nematodes, springtails, termites, fleas, aphids, and mites (Lo et al., 2002;Casiraghi et al., 2005;Ros et al., 2009;Haegeman et al., 2009;Augustinos et al., 2011;Bing et al., 2014;Glowska et al., 2015).
Wolbachia has a relatively small genome at about 0.9-1.5 Mbp. Historically, Wolbachia infection was diagnosed using 16S rRNA sequences; however, strains range in divergence from 0.2% to 2.6%, and when used independently, 16S provides limited information for inferring phylogenetic relationships (O'Neill et al., 1992). wsp, ftsZ and the groE operon are all protein-encoding genes used for the detection and phylogenetic analysis of Wolbachia (Van Borm et al., 2003). The ftsZ gene is involved in cell division and is highly conserved in unculturable bacteria species (Holden, Brookfield & Jones, 1993), but regions that are relatively higher in divergence make it a candidate for better phylogenetic resolution allowing the distinction between supergroups A and B to become apparent (Werren, 1997a). The wsp gene, which codes for the surface protein WSP in Wolbachia, shows an even higher variability and faster evolutionary rate than 16S or ftsZ and can be used in identifying groups and strains of Wolbachia (Zhou, Rousset & O'Neil, 1998;Braig et al., 1998), but also displays recombination, which can be misleading when used in phylogenetic analyses (Baldo & Werren, 2007). The groE-homologous operon has been noted as another candidate for resolving strain taxonomy (Masui, Sasaki & Ishikawa, 1997). Only a single copy of the operon exists in the genome and it includes the genes that encode the heat shock proteins GroES and GroEL, which are separated by a non-coding intergenic region that is thought to be faster evolving than either of the coding regions (Masui, Sasaki & Ishikawa, 1997).
With the use of antibiotics, Wolbachia infections in some species have been cured and the phenotypic changes that are induced by infection are consequently reversed (Stouthamer, Luck & Hamilton, 1990;Bourtzis et al., 1994;Giordano, Jackson & Robertson, 1997). More recently, Wolbachia has been proposed as a natural solution to controlling the spread of vector-borne diseases like malaria, yellow fever, and dengue (Hoffmann et al., 2011;Walker & Moreira, 2011;Baldini et al., 2014). Arthropods are present in nearly every habitat on Earth and they play important ecological roles in a variety of niches. With an estimated 2.4-10.2 million species of arthropods (Ødegaard, 2000) it is important to quantify the prevalence and distribution of Wolbachia infection.
Wolbachia infections are typically diagnosed via polymerase chain reaction (PCR), using Wolbachia-specific primers. However, PCR-based tests may produce false positives or false negatives, depending on the strain of Wolbachia and the presence of other related bacterial symbionts (Simões et al., 2011). A metagenomics-based approach can also be useful for characterizing microbiomes, including looking for Wolbachia and other symbionts (e.g. Dittmer & Bouchon, 2018), and can even provide whole-genome sequence information for the symbiont (e.g., Salzberg et al., 2005;Richardson et al., 2012;Saha et al., 2012;Campana, Robles García & Tuross, 2015;Derks et al., 2015;Wang & Chandler, 2016;Lindsey et al., 2016b;Gerth & Hurst, 2017). While performing a high-throughput sequencing-based screen for Wolbachia involving hundreds of different species would require a huge sampling effort and could be cost-prohibitive, screening existing sequence datasets generated for other projects offers a powerful opportunity to diagnose novel infections and better characterize variation in symbionts.
Here, using publicly accessible next-generation sequencing data available in the NCBI Sequence Read Archive (SRA), we looked for evidence of Wolbachia infection in a diverse assemblage of arthropod species. We present methods for bioinformatically identifying Wolbachia infections in genomic samples. We then used these sequence data to assemble a draft genome sequence for each Wolbachia isolate and reconstruct the phylogenetic relationships among the identified Wolbachia strains. Using this approach, we uncover novel Wolbachia infections, as well as find possible evidence for horizontal transfer of Wolbachia between hosts and parasites. These results illustrate how existing genetic databases can provide a wealth of information on symbiotic microbes as a byproduct of host sequencing.

Retrieving data
All samples tested are available through the NCBI Sequence Read Archive (SRA) (Table S1). To identify samples for testing, all accession numbers that matched the criteria of Arthropoda genomic DNA were sent to the NCBI Run Selector (as of January 2017). In the Run Selector samples were selected based on the criteria that they were run on an Illumina platform, have a genomic library source, a random library selection and the library layout is paired. Transcriptome samples were excluded because of the possibility that some RNA preparation methods may select against bacterial RNA (e.g., poly-A enrichment Westermann, Gorski & Vogel, 2012) thus increasing the likelihood of false negatives and because assembling Wolbachia genomes would be impossible with these data. Similarly, targeted sequencing (e.g., RAD-seq) samples were excluded due to the possibility that the sequences used for detecting Wolbachia infections might be excluded during the library preparation process. Only paired-end datasets were considered in order to facilitate whole-genome assembly for positive samples, but there were relatively few species (only 22) in the database with single-end datasets that otherwise met our criteria. Every species that had a sample that met our criteria was chosen for sampling. Some species were over-represented in the number of runs that are available in the SRA; depending on the number of samples available in the SRA, an appropriate amount to include in our dataset was determined on a case-by-case basis (Table S2). Fastq-dump v. 2.8.0 from the SRA Toolkit (NCBI SRA) was used to download, at most, 5 × 10 7 reads from each accession.

Diagnosing Wolbachia Infection
Magic-Blast v1.1.0 (NCBI) was used to compare the SRA reads to selected reference wsp, ftsZ, and groE operon sequences isolated from Wolbachia samples that are representative of supergroups A-D (Table 1). A custom R script identified SRA samples where there were matches at least 98 bp in length, ≥95% identity to one or more of the reference genes, and with three or more matching sequence reads. All samples that met these criteria were called Wolbachia positive samples.
To look for previous reports of Wolbachia infection in the species that tested positive, first Google Scholar was used.
[species] + Wolbachia was used for the search terms. If no published results were found, next we used NCBI GenBank with the same search parameters to look for deposited sequences that may be unpublished that would indicate that Wolbachia had been found in the host species previously.

Assembling the Wolbachia gene and genome sequences
From all the samples that tested positive (Table S3) if there were more than 3 samples from one species a maximum of 3 samples were chosen for downstream analysis (Table S4). Velvet v1.2.10 (Zerbino & Birney, 2008) was used to separately assemble the wsp, ftsZ, and groE sequences for each biological sample using the sequence reads that aligned to each gene in the previous step. It was run for kmer values of 21, 31, 41, and 51, using the automatic coverage cutoff flag. To select the optimal assembly of each gene for each sample, we performed BLASTn v2.28 (Altschul et al., 1990), which searched against a database made of each respective reference gene (Table 1). BEDTools v2.25.0 (Quinlan & Hall, 2010) and a custom script was used to parse the BLASTn results for the single longest contig matching each gene from each sample.
To assemble draft genomes for each Wolbachia isolate we identified, an iterative bait-and-assemble approach was used. Independent SRA experiments or runs from the same BioSample were first combined into a single dataset. For each sample, the mirabait tool from MIRA v4.0.2 (Chevreux, Wetter & Suhai, 1999) was then used to extract all reads from the full dataset that shared at least one kmer with at least one of seven reference Wolbachia genomes representing Wolbachia isolates from insects and nematodes (wPip, GCF_000073005.1; wMel, GCF_000008025.1; wNo, GCF_000376585.1; wRi, GCA_000022285.1; wVol, GCF_000530755.1; wCle, GCF_000829315.1; wTpre, GCF_001439985.1), using k = 31. These reads, and their corresponding paired-end partners, were assembled using SPAdes 3.11.1 (Bankevich et al., 2012). All resulting contigs were then aligned to the reference Wolbachia genomes using dc-megablast 2.7.0+ (Camacho et al., 2009), and any contig that matched any of the reference genomes with an e-value of  Table 2 Wolbachia sequences of known origin for phylogenetic analysis. Wolbachia genes used as controls and the species name from which they were isolated. The supergroup of the Wolbachia strain is listed and these genes served as a control during the creation of the phylogeny.  Pickup & Cook (2005) 10 −10 or better, alignment length of at least 100 bp, and percent identity of at least 70%, was retained. This process was then repeated for a total of five iterations, in each cycle using mirabait to identify reads sharing one or more kmers with the last set of assembled contigs, re-assembling these putatively Wolbachia-derived reads, and retaining any of the newly assembled contigs that show similarity to a Wolbachia reference genome in BLAST searches. The quality of each final assembly was evaluated using QUAST v4.4 (Gurevich et al., 2013) and BUSCO v3.0.2b (Simão et al., 2015) with the Bacteria odb9 reference gene set. Finally, we mapped all sequencing reads from each associated BioSample (not just those used for the assembly process) to the corresponding assembly using bwa mem v.0.7.17 (Li, 2013), and then used the sambamba depth command (Tarasov et al., 2015) to extract coverage information for each assembled contig over 400 bp in length, excluding 150 bp from the ends of the contigs (where coverage tends to drop off because reads extending beyond the contig may fail to map successfully).

Phylogenetic analysis
We first constructed phylogenies using the assembled ftsZ and groE sequences, as well as from a concatenated dataset of both genes; wsp was excluded from phylogenetic analyses because of its high frequency of recombination (Baldo & Werren, 2007). Wolbachia gene sequences representing ftsZ and the groE operon from other studies where the supergroup classification was determined were used as control samples; in this analysis, we included only reference sequences where both genes had been sequenced from the same biological sample (Table 2). MAFFT v7.310 (Katoh & Standley, 2013) was used to align the sequences for each respective gene. Samples that lacked sufficient length of matching base pairs (at least 800 bp in total across both genes) were discarded from downstream analysis. GBLOCKS v0.91b (Castresana, 2000) removed the poorly aligned portions of the sequences from each gene alignment using the default parameters. MEGA v7.0 (Kumar, Stecher & Tamura, 2016) was used to construct phylogenies using maximum likelihood. The model for which the phylogenies were constructed was chosen according to MEGA's suggestion for best fit based on the lowest Bayesian information criterion (BIC) ( Table S5). Node support was assessed by bootstrapping with 1,000 replicates. We also constructed phylogenies based on whole-genome data from a subset of the assemblies which appeared the most complete based on the BUSCO assessment. For these phylogenies, we used REALPHY (Bertels et al., 2014), to align genome sequences and identify loci for inclusion in the phylogenetic analysis, using as a reference the seven Wolbachia genomes used in the assembly process and merging the reference alignments with the default parameters. We then performed phylogenetic analysis by maximum likelihood in RAxML v8.2.11 (Stamatakis, 2014) using the TVM+I+G model as selected by ModelTest-NG v0.1.2 (the successor to jModelTest); (Darriba et al., 2012) using AIC. The RAxML analysis included 100 independent replicate searches for the best-scoring tree and 200 bootstrap replicates to assess node support.

Diagnosing Wolbachia infections from publicly available sequence data
A total of 2,545 individual 'runs' from the SRA, representing 288 species and subspecies were tested for Wolbachia (Table S1). Of those, 173 runs from 34 unique species tested positive for the selected reference Wolbachia genes (Table S3). That is, 11.8% of species tested positive for Wolbachia in at least one sample and only 6.8% of all SRA runs tested positive. All samples that tested positive were from samples that are in the class Insecta and representative of five orders: Coleoptera, Diptera, Hymenoptera, Hemiptera, and Lepidoptera. According to our literature search eight of these species have not previously been confirmed to have Wolbachia infections-Bembidion lapponicum, Ceratina calcarata, Delias oraia, Diachasma alloeum, Diploeciton nevermanni, Ecitophyla simulans, Gerris buenoi and Isocolus centaureae (Table 3).

Assembling Wolbachia genomes
In total, we assembled draft genomes for 51 Wolbachia isolates (Table 4), including at least one for each of the 34 unique host species. There were only two cases in which the assembly was substantially smaller than the expected genome size. In one of those (Biorhiza pallida 3), infection was confirmed in independent biological samples, and in the other (Mycopsylla proxima) the small assembly probably resulted from the small size of the input dataset. The rest of these assemblies appeared nearly complete, with total assembly sizes of at least 1 Mb and high numbers of BUSCO reference genes represented by a single gene in the assembly. All assemblies were missing at least 13 of the BUSCO reference genes.
We also sought to determine whether each sample that tested positive was likely to represent an actual Wolbachia infection, or the result of Wolbachia sequences horizontally transferred into the host genome. If the sequencing depth of the Wolbachia-like contigs in the assembly differs substantially from the sequencing depth of the host genome, then horizontal transfer can be ruled out. However, performing whole-genome assembly with every sequence dataset to estimate the sequencing depth of the host genome was computationally time-consuming, and our attempts to estimate sequencing depth more rapidly by counting k-mers in the raw data were unsuccessful in most cases because of low sequencing depth. Therefore, we obtained estimates of the genome size of host species from other sources, such as draft assemblies available at NCBI (Table 4), when available; although draft assemblies can differ substantially in size from actual genome sizes, for   Table 4 Wolbachia genome assemblies Information on Wolbachia draft genome assemblies. Expected host coverage is calculated as (total sequence data/host genome size). ''Evidence of multiple infections'' indicates whether or not the assembly contains signs pointing to multiple, distinct Wolbachia strains within the same biological host sample used for generating the sequence data (though some of these consisted of pooled individuals). BUSCO comp., BUSCO dup., BUSCO frag., and BUSCO missing refer to the number of BUSCO orthologs that were found to be complete and single copy, duplicated, fragmented, and missing from the Wolbachia assembly, out of 148 BUSCOs present in the Bacteria odb9 reference gene set. Grey rows at the bottom of the table were omitted from the whole-genome phylogenetic analysis because the assemblies appeared less complete (as indicated by missing BUSCO genes) or showed evidence of being chimeric or a mixture of two independent strains.       (Table 4). In a few cases, there was evidence of multiple infections in a single sample. This evidence included an unusual number of duplicated BUSCO reference genes in the assembly (e.g., Homalodisca vitripennis 1), the presence of multiple peaks in the coverage distribution histogram (e.g., Callosobruchus chinensis), assembly sizes much larger than previously sequenced Wolbachia genomes (e.g., Dactylopius coccus), or some combination of these (Table 4).

Wolbachia phylogeny
All phylogenetic trees based on individual or concatenated datasets using the ftsZ and groE sequences show two distinct branches representing supergroups A and B ( Fig. 1; Figs. S1-S2). The tree resulting from the concatenated dataset has the most robust bootstrap support for most clades. Positive control samples that were included in the phylogeny cluster with other control samples of the same known supergroup. Of the species where Wolbachia had been previously unidentified, according to all trees, the strains isolated from D. alloeum falls within supergroup A, while the B. lapponicum, I. centaureae, G. buenoi, D. nevermanni and D. oraia isolates all fall within supergroup B (Figs. 1 and 2, Figs. S1-S2).
The phylogeny generated from whole-genome sequencing data (Fig. 2) was similar in overall topology to the trees based on ftsZ and groE, with two clear clades representing supergroups A and B, but with higher bootstrap support for most branches.

Observed low infection rates
While Wolbachia is estimated to infect between 20-76% of arthropod species (Werren, Windsor & Guo, 1995;Jeyaprakash & Hoy, 2000), in this set of data only 11.8% of species tested positive. Given the source, this low rate of infection can be hypothesized to be the result of five possible scenarios: (1) Underrepresentation in the amount of data available per host species. For example, only 43 out of the 288 (14.9%) species and subspecies tested had ≥10 samples available in the SRA that met the criteria of this study (Table S2). When >100 individuals are tested for Wolbachia, results are skewed towards finding a positive sample (Hilgenboecker et al., 2008).
(2) Bias in the source of the samples. Sources vary between wild-caught individuals, lab stocks, and unreported sources. Since the phenotypic consequences of Wolbachia are well established, if uninfected individuals are needed for a study they may be selectively chosen (see Ðordević et al., 2017;Becking et al., 2017), or the researchers may even actively treat infections with antibiotics (Dobson & Rattanadechakul, 2001;Casiraghi et al., 2002;Koukou et al., 2006)   cases the sequencing data will consequently test negative for Wolbachia using the methods employed here.
(3) Tissue sampled. In some species infection has only been detectable in the gonads, indicating that infection density in somatic tissue may be variable or low (Dobson et al., 1999). For many samples in the SRA, specific tissue has not been indicated.
(4) Bioinformatic removal of bacterial contaminants. Even if Wolbachia is sequenced with the host's DNA, the researcher may have eliminated these reads bioinformatically before depositing the reads as relatively standard practice in sequence processing (Kunin et al., 2008;Schmieder & Edwards, 2011;Derks et al., 2015). (5) False negatives. It is possible that some infections may have been missed due to the limited set of available reference genes; more divergent strains might not have been detected in these analyses.

Strain supergroup affiliation
For 13 of the species that tested positive, previous information was available about supergroup affiliations of Wolbachia strains that have been found to infect them (Table 3).
Our results are mostly consistent with previously reported phylogenetic relationships. Previously, C. chinensis, D. coccus, and D. simulans have been found to be infected with A and/or B strains (Kondo, Shimada & Fukatsu, 1999;Kondo et al., 2002;Riegler et al., 2004;Ellegaard et al., 2013;Ramirez-Puebla et al., 2016). Here, evidence of both A and B supergroup strains was found in D. simulans (Fig. 1), though the whole-genome phylogeny was somewhat inconsistent here, suggesting possible recombination for some genes. Moreover, while the single-gene phylogenies suggested that the endosymbionts of these C. chinensis and D. coccus samples were members of supergroups A and B, respectively, the whole-genome assemblies for both of these endosymbionts contained strong evidence of dual infections, so we cannot rule out the presence of both A and B supergroup strains in these samples. Wolbachia infection has also been documented prior to this study in B. pallida, P. aegeria, and P. sp. PSW-54 but the supergroup relationships were not reported (Subandiyah et al., 2000;Rokas et al., 2001;Russell et al., 2012;Kautz, Rubin & Moreau, 2013;Baldini et al., 2014). Our concatenated results suggest that a supergroup A strain infects B. pallida and P. sp. PSW-54, while a B strain infects P. aegeria.
Two species showed a different supergroup strain than what has been previously reported-D. spinosa and A. gambiae. D. spinosa has previously been identified to harbor a supergroup A strain, but here we discovered an infection that clusters within supergroup B. It may be possible for D. spinosa to harbor both A and B strains since other species in the genus have been shown to have supergroup B infections (Plantard et al., 1999).
Particularly notable is our identification of a supergroup B strain in A. gambiae. Anopheline mosquitoes were once thought to lack infection by Wolbachia in nature (Kittayapong et al., 2000;Ricci et al., 2002;Rasgon & Scott, 2004), though they are capable of experimental infection in the lab (Hughes et al., 2011). However, there have been recent reports of natural infections in wild populations (Baldini et al., 2014;Gomes et al., 2017). In particular, a supergroup A strain was found to infect A. gambiae mosquitoes in Mali that reduces the transmission of the malaria parasite (Gomes et al., 2017). The strain identified here clearly belongs to supergroup B, and is related to strains infecting Hymenoptera and Lepidoptera, rather than fleas (Siphonaptera) like the previously identified supergroup A strain. In addition, other recent surveys have found evidence of diverse Wolbachia strains, including supergroup B strains, within the A. gambiae species complex (Ayala et al., 2018;Jeffries et al., 2018). Combined, these results suggest that the diversity of Wolbachia infections in Anopheles may be currently underappreciated. Importantly, we have good evidence that the sample here is an actual Wolbachia infection rather than an integrated piece of Wolbachia DNA in the host genome. First, we assembled a nearly complete Wolbachia genome from this dataset; more importantly, given that the A. gambiae genome is roughly 280 Mb (Holt, 2002), and this dataset contained roughly 9.1 Gb of raw sequence reads, we would expect to have roughly ∼32×coverage of the host genome, but the Wolbachia genome had only ∼9×coverage, suggesting that Wolbachia DNA was present at lower densities in this sample than the host DNA.
The assembled ftsZ and groE sequences from C. calcarata, E. simulans. M. fici, and M. proxima assemblies were too short to be included in our individual gene-based phylogenetic reconstruction; the infection density, and thus the sequencing coverage, for these species may have been too low to yield reliable assemblies for these genes. We were able to assign the E. simulans infection to supergroup B based on its draft genome sequence, but the supergroup relationships for the others are still unknown. Previously, Wolbachia sequence information has been isolated in M. fici, and M. proxima (Fromont et al., unpublished data; Table 3) but supergroup affiliation was not suggested. According to our literature search this is the first detection of Wolbachia in C. calcarata and E. simulans. The other six previously unidentified species infections were included in the phylogeny. D. alloeum clustered with known supergroup A infections while B. lapponicum, D. oraia, D. nevermanni, G. buenoi, I. centaureae isolate clustered within the supergroup B clade.
Finally, our phylogeny also offers some hints into possible mechanisms of horizontal transmission of Wolbachia infections. In particular, the strain identified here infecting Diachasma alloeum is closely related to the strain found in Rhagoletis pomonella (as well as D. melanogaster and D. yakuba). This is intriguing because D. alloeum is a parasitoid wasp that uses R. pomonella and R. mendax as its host (Maier, 1981;Stelinski, Pelz & Liburd, 2004), suggesting that this may represent a natural horizontal transfer of Wolbachia from one lineage to another; previous studies have found evidence of horizontal transmission between predators and prey or hosts and parasites (Heath et al., 1999;Le Clec'h et al., 2013). However, contamination by host material in parasitoid samples, or vice versa, could also explain this outcome, so this result should be interpreted cautiously until this path of transmission can be experimentally confirmed. Double (Perrot-Minnot, Guo & Werren, 1996;Narita, Nomura & Kageyama, 2007) and even triple Wolbachia infections (Rousset, Braig & O'Neill, 1999;Kondo et al., 2002) have been reported in arthropod populations and individuals, both naturally and through experimental injection. The initial screening methods presented here are not capable of identifying multiple infections because we only looked for a positive or negative test result and then used only the single longest contig for phylogenetic construction. In conventional PCR there is a tradeoff between specificity and sensitivity of primers; additionally no one primer is capable of identifying Wolbachia in all samples (Simões et al., 2011). PCR is useful in initial infection confirmation but sequencing is usually necessary to confirm group relationships. Techniques used to identify multiple infections currently include quantitative PCR with highly specific primers (Kondo et al., 2002;Narita, Nomura & Kageyama, 2007), cloning and sequencing (Jamnongluk et al., 2002), and Southern hybridization (Perrot-Minnot, Guo & Werren, 1996).

Multiple infections and integration of wolbachia into the host genome
We were able to identify evidence of possible multiple infections through genome assembly. In some cases, the assembly was approximately double the expected size, contained a large number of duplicated genes, or showed evidence of multiple peaks in a coverage histogram, all of which are signs of infection by multiple, independent strains. Again, these results should be interpreted cautiously pending experimental validation. For instance, some of the multiply infected samples consisted of pooled DNA from multiple individuals (e.g., Drosophila yakuba and Diabrotica vinifera), so the ''multiple'' infection might simply result from different individuals in the sample harboring different endosymbiont strains. Nevertheless, these results show that high-throughput sequencing can be a powerful way to detect multiple infections, especially when a priori sequence information for designing strain-specific primers is unavailable.
A related issue is that Wolbachia DNA is frequently integrated into host genomes (Vavre et al., 1999;Leclercq et al., 2016); in some cases, these insertions even consist of nearly whole Wolbachia genome sequences (Dunning Hotopp et al., 2007). This complicates our analyses because some of the identified ''infections'' could actually be Wolbachia DNA integrated into the host genome; in fact, horizontally transferred Wolbachia DNA has already been identified in four orders which are all represented by the positive results in this study, Coleoptera, Diptera, Hemiptera, and Hymenoptera (Hotopp, 2011). We were able to rule out horizontally transferred DNA in some, but not all, cases of positive samples, using sequencing depth information; if the sequencing depth of the assembled Wolbachia contigs differs from the sequencing depth of the host's nuclear DNA, that suggests a true, active infection. True infections could also be validated experimentally when necessary, for example, using fluorescence in situ hybridization (Hughes et al., 2011). Either way, horizontally transferred Wolbachia DNA would still indicate that a species at least had a history of infection at some point in the past.
This work shows that it is often possible to assemble draft genomes of endosymbionts from host DNA, similar to previous studies in which Wolbachia genomes were assembled from sequencing host organisms (Ghedin et al., 2004;Salzberg et al., 2005;Richardson et al., 2012;Saha et al., 2012;Campana, Robles García & Tuross, 2015;Derks et al., 2015;Lindsey et al., 2016b), even when the endosymbiont was not the focus or original reason for performing the sequencing in the first place. Although they may be fragmented, these draft genomes can still provide valuable information about the phylogenetics and evolution of the endosymbiont. While Wolbachia is relatively well studied, there are many other endosymbionts that have received less attention, such as some Spiroplasma, Cardinium, Arsenophonus, and Flavobacetrium species (Duron et al., 2008), and others await discovery. This study shows that extensive field sampling may not even be necessary to get a better understanding of the diversity of these endosymbionts; the sequencing data are probably already available in public databases. With the right reference databases and metagenomics software, there is a lot of potential to learn more about these endosymbionts just from already existing resources.

CONCLUSIONS AND RECOMMENDATIONS
Wolbachia is a well-known endosymbiont of many arthropod species and while standard Wolbachia diagnostic techniques utilize various Wolbachia primers to confirm infection via PCR (Simões et al., 2011) there are trade-offs that limit large scale surveys. Here, we present a method to identify Wolbachia bioinformatically using publicly accessible host raw sequencing data. In eight arthropod species, Wolbachia was identified where infection has not previously been reported, and in 27 other arthropod species infection was confirmed. Isolates of Wolbachia from positive samples all clustered within either supergroups A or B, and for seven of the newly identified hosts we identified the supergroup of the strain. From these isolates we assembled draft Wolbachia genomes, which provided robustly supported phylogenetic information as well as information about potential HGT events or signs of multiple infection.
These results highlight the importance of depositing raw sequencing datasets to public archives like the NCBI SRA and the value that they have in studying endosymbionts. At the same time, we offer some suggestions for best practices when depositing sequence data into public archives to maximize its usefulness for future researchers (Wilkinson et al., 2016;Griffin et al., 2017). First, we encourage everyone performing high-throughput sequencing to deposit their data into public databases like the NCBI SRA, where it can easily be searched and accessed, as opposed to depositing only in smaller, taxon-specific databases or personal/lab web sites. Second, data should be minimally filtered; while ''contaminant'' sequences like endosymbiont DNA may be a nuisance to those who generated the data, they may be of interest to others. Finally, all sequence data should be accompanied by as much metadata as possible. Without this information, interpreting results can be difficult. For example, many of the sequences we used in this study lacked detailed information about the source of the DNA in the associated BioSample entries (e.g., whether it came from a lab strain or wild-caught specimens, its geographic origin if field collected, whether it was from a single individual or a pooled sample, whether the specimen was male or female, whether it was a whole body or specific tissues, etc.). Including this information would have helped us better understand possible biases in the dataset, such as how well the results may reflect the frequency of infection in natural populations, or whether a sample might give a false negative result because Wolbachia is not present at high densities in the tissues sampled for DNA.