Sex-based de novo transcriptome assemblies of the parasitoid wasp Encarsia suzannae, a host of the manipulative heritable symbiont Cardinium hertigii

Parasitoid wasps in the genus Encarsia are commonly used as biological pest control agents of whiteflies and armored scale insects in greenhouses or the field. They are also hosts of the bacterial endosymbiont Cardinium hertigii, which can cause reproductive manipulation phenotypes, including parthenogenesis, feminization, and cytoplasmic incompatibility (the last is mainly studied in Encarsia suzannae). Despite their biological and economic importance, there are no published Encarsia genomes and only one public transcriptome. Here, we applied a mapping-and-removal approach to eliminate known contaminants from previously-obtained Illumina sequencing data. We generated de novo transcriptome assemblies for both female and male E. suzannae which contain 45,986 and 54,762 final coding sequences, respectively. Benchmarking Single-Copy Orthologs results indicate both assemblies are highly complete. Preliminary analyses revealed the presence of homologs of sex-determination genes characterized in other insects and putative venom proteins. Our male and female transcriptomes will be valuable tools to better understand the biology of Encarsia and their evolutionary relatives, particularly in studies involving insects of only one sex.


Read preparation and assembly
Raw read files were processed with BBDuk (RRID:SCR_016969) from the BBTools software suite (v37.36, RRID:SCR_016968) [32] to remove the Illumina adapter sequences, trim and/or filter out whole reads with a quality score less than 15, and remove reads shorter than 36 bp after trimming using the following options: "ref=adapters.fa ktrim=r ordered k=23 hdist=1 mink=11 tpe tbo maq=15 qtrim=rl trimq=15 minlen=36". We utilized FastQC (v0.11.9, RRID:SCR_014583) to visualize the sequence quality of each sample before and after trimming and to confirm the successful removal of adapter sequences [33]. Due to the complex biology of this species and its host insects, sequence contamination from a variety of organisms throughout the rearing system is inevitable, including Cardinium cEper1, the different insect hosts of male and female E. suzannae, and the endosymbionts of those insect hosts. Thus, we employed a mapping-and-removal approach to enrich for E. suzannae reads prior to assembly and limit the generation of contaminating transcripts. For this approach, BBMap (RRID:SCR_016965) from BBTools was used to initially map the quality-trimmed reads to the genomes of Cardinium hertigii cEper1 and the endosymbionts of Bemisia tabaci MEAM1, with which E. suzannae females and males have direct or indirect contact (i.e., Hamiltonella defensa, Portiera aleyrodidarum, and Rickettsia sp. MEAM1 [34,35]). It was also determined that the E. sp. nr. emiratus hosts of E. suzannae males contain Wolbachia [36]; thus, the Wolbachia wPip genome was added and mapped to the male samples. Reads that did not map to any of these bacterial genomes with a greater than 94% identity were retained (to allow for a difference of 3 nucleotides between sequenced transcripts and reference endosymbiont genomes). These reads were then subsequently mapped to the B. tabaci MEAM1 genome with a more stringent 97% identity List of the organisms whose reads were removed prior to assembly with Trinity. Quality-controlled reads were mapped to the genomes of the listed organisms, and the reads mapped to any of the references were removed. threshold using BBMap to avoid mapping E. suzannae reads from genes highly conserved in both Encarsia and Bemisia (see Table 1 for mapping and removal details). Again, only unmapped reads were retained for assembly, as these final reads were expected to be mainly attributed to E. suzannae.
We assembled separate transcriptomes for male and female E. suzannae whole adult wasps with the remaining unmapped reads using Trinity (v2.6.6, RRID:SCR_013048) and its default settings [37]. Transcript abundance was then estimated for each transcriptome with kallisto (RRID:SCR_016582) using the "align_and_estimate_abundance.pl" command bundled with Trinity [38]. Transcripts with an estimated abundance below 0.5 transcripts per million were removed from both assemblies as these may be lowly expressed isoforms of other transcripts, poorly assembled or chimeric transcripts, or simply contaminants and, thus, not from Encarsia [39,40]. Next, TransDecoder (v5.5.0, RRID:SCR_017647) [41] was used to predict coding sequences within the remaining transcripts in each assembly and translate those coding sequences into predicted protein sequences with a minimum amino acid length of 67. Similar protein-coding sequences were then clustered using CD-HIT (v4.6.8, RRID:SCR_007105) [42,43] with a 95% amino acid identity threshold, and the longest protein isoform of each cluster was selected as the representative sequence for that cluster.
The final assemblies are presented as the nucleotide sequences of the representative proteins of each cluster. For a comprehensive list of the number of reads or transcripts at each step in the pipeline, see Table 2.

Quality control and data validation
Along with our mapping-and-removal approach to limit contaminations while enriching for Encarsia reads prior to assembly, we also utilized additional methods to improve the quality of our assemblies. First, to comply with the National Center for Biotechnology   Assembly and annotation statistics at each step in the pipeline for our E. suzannae transcriptomes compared to the previously-published E. formosa transcriptome assembly. The * highlight the number and type of final sequences in the public version of each assembly. Assessment of assembly completeness using BUSCO v5.3.2 to search the assembled proteins against a database of proteins identified as Hymenopteran BUSCOs. All BUSCO groups searched were determined to be present in a single copy in >90% of the Hymenopteran species tested; therefore, a high number of complete single-copy BUSCOs indicates a comprehensive and non-redundant assembly [45].
or contaminants were also removed. In total, 71 and 109 contaminating sequences were removed from the female and male assemblies, respectively.
The final assemblies were then assessed for completeness using Benchmarking Universal Single-Copy Orthologs (BUSCO) (v5.3.2, RRID:SCR_015008) in protein mode against the hymenoptera_odb10 reference lineage (v2020-08-05) [45,46]. The female and male assemblies were found to possess, respectively, 82.1% and 82.6% of the 5,991 complete orthologs identified as single-copy and nearly universal within the order Hymenoptera (present in >90% of the tested species). This indicates a high level of completeness for both E. suzannae transcriptomes, although with varying degrees of duplication (shown in

Annotation
The male and female E. suzannae assemblies are available as unannotated coding sequences in the NCBI's TSA database under the accession numbers GJLB00000000 and GJLI00000000, respectively. Here, we also provide the annotation information for both assemblies from multiple sources. courtship behavior of Drosophila [54,55]. We also searched the assemblies for homologs of wasp overruler of masculinization (wom) [56] but found none. This gene was identified in N. vitripennis as the instructor of sex determination via the activation of tra expression and autoregulation, which results in female development. However, we cannot rule out the presence of wom in E. suzannae as this gene in N. vitripennis is mainly transcribed in diploid (female) embryos prior to 7 h post oviposition and is not expressed in adults, which we sampled for our transcriptome assemblies. We also did not find homologs of complementary sex determiner (csd), the instructor of sex determination in Apis mellifera.
Sex determination in the Chalcidoidea has been a matter of some speculation [57]. Regardless, detecting putative venom proteins in E. suzannae provides more insight into how these wasps effectively parasitize their hosts. However, it should be noted that reliable identifications of venom proteins require additional experimental verifications.

Transcriptome comparisons
As stated above, the only other publicly available transcriptome of an Encarsia species belongs to E. formosa [26]; thus, limited comparisons can be made within this genus. An overview of all currently known Encarsia transcriptomes is shown in Table 2. Compared to the E. formosa transcriptome assembly, the male and female E. suzannae assemblies were generated from more initial reads and produced more pre-filtering transcripts, meaning they could be subject to more stringent transcript filtering than the E. formosa assembly.
While the E. formosa assembly underwent limited post-assembly contaminant filtering, the E. suzannae assemblies utilized additional measures to (1) limit potential nonsense, low-abundance, and redundant transcripts through post-assembly filtering and processing, and (2)   Finally, OrthoVenn2 (RRID:SCR_022504) was used to determine the orthologous groups between the predicted proteins in both the E. suzannae assemblies presented in this paper and the E. formosa assembly published elsewhere [26,60]. Using the default settings and an e-value cutoff of 1 × 10 −5 , 8816 orthologs were found to be shared across all three transcriptomes, and a total of 22,015 orthologous groups were shared between male and female E. suzannae out of the total of 23,265 and 23,346 clusters, respectively (see Figure 1). These results indicate a high degree of similarity between the different sex assemblies while showing the presence of over one thousand sex-specific protein clusters. It is also striking that the female and male E. suzannae transcriptomes are equally similar to the E. formosa transcriptome, although E. formosa exists as an asexual species consisting of nearly all females (due to the presence of parthenogenesis-inducing Wolbachia), and its transcriptome therefore only reflects female individuals [61].

CONCLUSION AND RE-USE POTENTIAL
We are confident that our assemblies are among the purest possible transcriptome representations of E. suzannae that can be obtained with the currently available data and tools for assembly and filtering (for a list of all software names and versions utilized in this study, see Table 4). This study is also one of the first to present sex-specific transcriptome assemblies of a single insect species. In an organism such as E. suzannae -where males and females develop within different hosts, are impacted differently by endosymbiotic bacteria, and exhibit distinct behaviors -it is highly valuable to have the availability of a reference database for both sexes to ensure more accurate studies when wasps of only one sex are used. Furthermore, these assemblies greatly expand our host knowledge of the Cardinium cEper1 CI system and pave the way for future studies exploring how this endosymbiont interacts with its E. suzannae host in causing CI. We also believe that these data will be a valuable reference when studying the diverse members of the ecologically important genus Encarsia and other chalcidoid parasitic wasps, many of which have interesting biology and potential as pest biological control agents. The above figure shows an OrthoVenn2 diagram of the orthologous groups between the E. formosa females and the male and female E. suzannae (e-value = 1 × 10 −5 ) [60]. TransDecoder, using a minimum amino acid length of 50, was run on the E. formosa assemblies to obtain the coding sequences. The resulting peptide sequence output (27,161 sequences) was tested against the predicted proteins from the male and female E. suzannae transcriptomes. The top Venn diagram depicts the number of orthologous protein clusters shared between the three transcriptomes. The middle bar graph depicts the total number of orthologous clusters present for each transcriptome. Lastly, the bottom graph shows (left to right) the number of clusters that were shared by all three transcriptomes, by any two transcriptomes, or were unique to one of the three assemblies.