Endoparasitoid lifestyle promotes endogenization and domestication of dsDNA viruses

The accidental endogenization of viral elements within eukaryotic genomes can occasionally provide significant evolutionary benefits, giving rise to their long-term retention, that is, to viral domestication. For instance, in some endoparasitoid wasps (whose immature stages develop inside their hosts), the membrane-fusion property of double-stranded DNA viruses have been repeatedly domesticated following ancestral endogenizations. The endogenized genes provide female wasps with a delivery tool to inject virulence factors that are essential to the developmental success of their offspring. Because all known cases of viral domestication involve endoparasitic wasps, we hypothesized that this lifestyle, relying on a close interaction between individuals, may have promoted the endogenization and domestication of viruses. By analyzing the composition of 124 Hymenoptera genomes, spread over the diversity of this clade and including free-living, ecto, and endoparasitoid species, we tested this hypothesis. Our analysis first revealed that double-stranded DNA viruses, in comparison with other viral genomic structures (ssDNA, dsRNA, ssRNA), are more often endogenized and domesticated (that is, retained by selection) than expected from their estimated abundance in insect viral communities. Second, our analysis indicates that the rate at which dsDNA viruses are endogenized is higher in endoparasitoids than in ectoparasitoids or free-living hymenopterans, which also translates into more frequent events of domestication. Hence, these results are consistent with the hypothesis that the endoparasitoid lifestyle has facilitated the endogenization of dsDNA viruses, in turn, increasing the opportunities of domestications that now play a central role in the biology of many endoparasitoid lineages.


30
The recent boom of genome sequencing programs has revealed the abundance of DNA fragments of viral origin within 31 eukaryotic genomes. These so-called Endogenous Viral Elements (EVEs) stem from endogenization events that not only 32 involve retroviruses as donors (as could be expected from their natural lifecycle) but also viruses that do not typically Lifestyles are displayed next to species names (blue: free-living, green: endoparasitoid, yellow: ectoparasitoid, grey: unknown). The number of EVEs and domesticated EVEs (dEVEs) found in each species are represented respectively by the first and second facet of the horizontal histograms. Colors along these histograms indicate the potential donor viral families (where blue tones correspond to viral dsDNA viruses, red tones to ssDNA viruses, orange/brown tones to dsRNA viruses and green tones to ssRNA genomes). Endogenous Viral elements (EVEs) shared by multiple species and classified within the same event are represented by circles whose size is proportional to their number; those that are considered as domesticated (dEVEs) are surrounded by a black border. Numbers in the white boxes correspond to the number of endogenization events inferred. As an example, Megastigmus dorsalis and Megastigmus stigmatizans are ectoparasitoids (yellow) sharing a common endogenization event (within the Cluster21304) that likely originated from an unclassified dsRNA virus (grey color in circle), and shows no sign of domestication (no black border around the grey part of the circle). The figure was inspired from the work of [28]. Details on the phylogenetic inference and time calibration can be found in the MM section; bootstrap information can be found in TableS2; details on lifestyle assignation can be found on the github repository under the name : Assembly_genome_informations.csv In all three panels, a yellow color indicates endogenization events that have not been followed by domestication, while orange indicates domestication events. A : Distribution of the number of events inferred, according to the four categories of viral genomic structures. The crosses refer to the expected number of endogenization events for each category based on its estimated relative abundance in insects (see details in Materials and methods and virus-infecting data in Excel tab:All_virus_infecting_insects_informations). B : Distribution of the various viral families involved in endogenization events. The polarity of ssRNA viruses is displayed next to the family name. Events involving multiple putative families (i.e. where several viral families are present in the same scaffold) have been excluded from the count. The star next to the family name indicates that the viral family is known to infect insects. C : Distribution of the number of EVEs per event across viral categories.

Double-stranded DNA viruses are over-represented in endogenization events
Most of the endogenization events recorded involve ssRNA and dsDNA viruses. But do these proportions simply mirror 137 the diversity and respective abundances of the different kinds of viruses encountered by insects? The analysis summa-138 rized in (Figure2-A) (see details in Materials and methods) indicates this is not the case. More specifically, it shows that 139 dsDNA viruses are more frequently endogenized than expected on the basis of their representation in the databases, 140 while ssRNA viruses are under-represented ( 2 = 213.36 and 221.38, respectively, for endogenization events and domes-141 tication events, d.f. = 3, both p-value < 2.2e-16). Notably, this result is not purely driven by the presence in our data set 142 of the four positive controls (previously described cases of viral domestication, that all involve dsDNA viruses as donors). 143 Finally, among endogenization events involving ssRNA viruses, we found an over-representation of negative stranded 144 ssRNA compared to their relative abundance in public databases (  Endogenizations of dsDNA viruses are more frequent in endoparasitoid species 148 Next, we sought to characterize the factors that could explain the patterns of endogenization events inferred (Figure1).

149
To this end, for each viral genomic structure, we assessed whether endogenization events were evenly distributed among  To further test the apparent correlation between Hymenoptera lifestyle and the rate of endogenization events, we 156 inferred ancestral lifestyles along the phylogeny using a Bayesian model (see details in Materials and methods). We then  Violin plots represent the posterior distribution of the coefficients obtained under the different GLM models (after exponential transformation to obtain a rate relative to free-living species). The coefficients are derived from 1000 independent GLM models, where 1000 probable scenarios of ancestral states at nodes were sampled randomly among the MCMCM iterations (see details in Materials and methods). Branches from nodes older than 160 million years were removed from the dataset. The %pd is the probability of direction and indicates the proportion of the posterior distribution where the coefficients have the same sign as the median coefficient.
indicates the proportion of MCMC iterations where the coefficient obtained for endoparasitoid species is higher than for ectoparasitoid species. All statistical summaries of the Bayesian GLM models can be found on the GitHub repository under the name : Lifestyle_statistical_analysis_results.xlsx. 8 of 45 the same endogenization event [30]. In our unknown Campopleginae species, we identified homologs of 35 out of the 212 40 ichnovirus genes present in the genome of H. didymator [16]). Those genes show conserved 213 synteny in the two species (Figure S11), strongly suggesting that they derive from the same endogenization event. How-214 ever, our analysis did not identify viral homologs in the two Ophioninae and Cremastinae sufamilies, that are internal 215 to the clade including Campopleginae and Banchinae wasps. This result argues against the view of a single event at the 216 root of Ophioniformes, and thus supports the alternative view [31] that the so-called IVSPER genes in the Campopleginae 217 and Banchinae subfamilies stem from independent events, despite their striking structural similarities (see FigureS12   218 for illustration). We found no trace of the previously suggested remnants of ichnoviruses in the related species Venturia 219 canescens, whereas the presence of nudiviral genes in this species was confirmed [17].   Figure 5. Phylogenetic relationships among endogenized and "free-living" dsDNA viruses. Specifically, this figure shows the relationships between Naldaviricetes double-stranded DNA viruses and EVEs from hymenopteran species, where at least 3 endogenization events were found. This tree was computed using maximum likelihood in Iqtree (v2) from a 38,293 long protein alignment based on the concatenation of 142 viral genes. Confidence scores (aLRT%/ultra-bootstrap support%) are shown at each node. The scale bar indicates the average number of amino acid substitutions per site. Previously known EVEs are in white, those from the present study in orange, and leaves inferred as free-living viruses are in red. All the best partitioned models can be found on the Github repository under the name : dsDNA_phylogeny_best_ML_partitions.nxs. All free-living dsDNA viruses used in this phylogeny were obtained from published complete viral genomes. More details on the phylogenetic inference can be found in methods.
We found 5 new cases of endogenization involving multiple EVEs from dsDNA viruses belonging to Nudiviridae, LbFV-222 like and AmFV-like families.
Two of them involve parasitoid species, i.e. Platygaster orseoliae and an Aprostocetus species. For Aprostocetus, we detected 6 EVEs related to nudiviruses branching between the Chalcidoidea and the Diaprioidea superfamilies (Fig. S4).

255
In addition, all scaffolds do include transposable elements and eukaryotic genes making them undoubtedly endogenized.

256
Accordingly, our pipeline attributed the highest confidence index A for 104 of them (out of 135). The P.gracilis genome 257 revealed 9 EVEs, including homologs of pif RNA polymerase,ac81,. Notably, one of 258 the 9 EVEs (AmFV_059, of unknown function) shows both a dN/dS < 1 (mean= 0.1747, p-value 5.877e-02), and a very high 259 TPM value (362836 TPM from whole body tissues). Finally, in A. picea, 7 EVEs were detected, including homologs of pif -1, 260 pif-3, integrase, odv-e56 and p74 (Figure S7). No raw reads data were available for this species, precluding coverage-based 261 inferences. Since there were neither orthologs or paralogs for these genes to compute dN/dS analyses, nor transcrip-262 tomic data, it was not possible to infer their domestication status. At this stage it is thus not possible to conclude as to 263 the functions of these genes in H. saltator, P. gracilis and A. picea, but this surely deserves further attention.

265
All kinds of viruses can integrate arthropod genomes, although the mechanisms underlying these phenomena remain 266 unclear [1,4]. Prior to the present analysis, 28 viral families had been described as involved in endogenization in arthro-

272
In the following section, we will first discuss why double-stranded DNA viruses, in comparisons with other viral genomic structures (ssDNA, dsRNA, ssRNA), are more often endogenized than expected. We will then discuss hypotheses 274 that could explain why we found a higher rate of endogenization of dsDNA viruses among endoparasitoids compared to 275 ectoparasitoids or free-living hymenopterans, which also translates into more frequent events of domestications. 276 277 278 dsDNA viruses are more frequently involved in endogenization than expected by chance 279 Despite the observations that all viral genomic structures can be involved in endogenization, we clearly identified dif-280 ferences in their propensity to do so. Based on a comparison between the respective proportions of the various viral 281 categories in the inferred endogenization events and in public databases, we found that dsDNA viruses are much more 282 represented than expected, while ssRNA viruses are under-represented (Figure 2-A). We acknowledge that current knowl-283 edge on the actual diversity of free-living viruses (as approximated through the NCBI taxonomy database) remains incom-284 plete, but the strength of the effect reported here makes this conclusion rather robust to variations in the null distribution.

285
On the basis of current knowledge, RNA viruses, and in particular ssRNA viruses, appear to be much more diversified and 286 prevalent than DNA viruses in insects. We note that viral-metagenomic studies often focus either on DNA or RNA viruses, 287 and as such do not provide an accurate and unbiased picture of the extent viral diversity. were infected by a single Mononegales-like virus) and extremely diversified [40]. Actually, this view appears to hold at the 298 larger scale of eukaryotes [41]. Taking into account this patent abundance of ssRNA viruses in insects, our study indicates 299 they are by far less frequently endogenized than their dsDNA counterparts in hymenopterans. Notably, a similar trend 300 was recently reported in a study including a diverse set of eukaryotes [24].

301
Most of the major endogenization events characterized so far in hymenopterans involve dsDNA viruses from the 302 Nudiviridae family [21,18,17,42,43,44,4]. Our study further confirms that this viral family represents a major source 303 of exogenous and sometimes adaptive genes for Hymenoptera. Indeed, 28 new independent endogenization events in-304 volve this family, among which 9 are shared by at least two related species Figure1). The major contribution 305 of nudiviruses to endogenization may be explained by their wide host range in arthropods [45]. Their nuclear replica-306 tion constitutes another plausible explanatory factor [46], since it may facilitate contact with host DNA. In addition, their between species. For instance, in their analysis based on 48 arthropod genomes, [20] found that the number of EVEs ranged between 0 and 502. Although insect genome size and assembly quality may partly explain this variation [4], 325 the underlying biological factors are generally unknown. In this study, we tested the hypothesis that the insect lifestyle 326 may influence both the endogenization and domestication rates. We used a Bayesian approach to reconstruct ancestral 327 states throughout the phylogeny of Hymenoptera, thus accounting for uncertainty, and found that endoparasitoidism, 328 in comparison with other lifestyles, tends to promote dsDNA viral endogenization. Notably, this conclusion was not 329 the artefactual consequence of differences in genome assembly quality. In fact, the quality of genome assemblies was 330 correlated with the lifestyle in our data set, but the genomes of endoparasitoid species were generally less well assembled 331 than those of free-living species. If anything, this difference should reduce the power for detecting endogenization events 332 in endoparasitoids, where our analysis detected an excess of such events. Our estimate of the effect sizes (with 2.47 times 333 more endogenization events in endoparasitoids than in free-living species) should thus be seen as conservative. Why do 334 endoparasitoid wasps tend to undergo more endogenization than others? We initially had in mind two non-exclusive 335 hypotheses that remain plausible explanations for the observed pattern. First, endoparasitoids may be more intensively 336 exposed to viruses. In addition, or alternatively, endoparasitoids may have a higher propensity to endogenize and retain 337 viral genes.

338
Several factors come in support of the first hypothesis. Endoparasitoid larvae grow by definition inside their host's 339 body, and such a close interaction implies that any endoparasitoid individual will also be interacting with its host's viruses.

340
Accordingly, the best studied cases of viral domestication in wasps involve nudiviruses, that are known to replicate in their 341 caterpillar hosts [57,17,42]. Another putatively important factor is the presence of virus in the venoms that parasitoids 342 inject into hosts together with their eggs. These are known to protect the offspring against the host's immune response, Leptopilina boulardi [27]. Although this effect may also be at play for some ecto-parasitoid species [60], we expect it to 350 be more pronounced for endo-parasitoid species since they have a closer interaction from the inside of their hosts. Gen-351 erally, endoparasitoids may thus carry a higher load of non-integrated viruses than other hymenopterans. However, if 352 this effect is at play, we expect to have an "endoparasitoid" effect for all viruses, whatever their genomic structure. For 353 instance, we would expect such an effect to be detected for ssRNA viruses, which are involved in the greatest number of 354 endogenization events (Figure2-A). This was not the case, since only dsDNA viruses were more frequently endogenized 355 in endoparasitoids. Thus, we argue that this hypothesis is unlikely to explain the observed pattern.

356
The second hypothesis posits that endoparasitoids are more frequently selected for retaining virally-derived genes 357 than ectoparasitoid or free-living hymenopterans. In our analysis, domestication events are most frequently observed 358 in endoparasitoids (over 3 times more frequently than in other hymenopterans). Obviously, this may be at least partly 359 explained by the higher input discussed above (the higher endogenization rate). Yet, once this effect is controlled for, 360 a trend towards a higher rate of domestication remains. More specifically, the likelihood of domestication following 361 endogenization was significantly higher in endoparasitoids than in ectoparasitoids, but was not significantly higher than 362 in free-living species. This latter lack of significant difference may be biologically explained if a single domestication event 363 precludes the domestication of additional EVEs, while not affecting the rate of non-adaptive endogenization. This would 364 "dilute" the signal along branches involved in domestication. If this effect is at play, then it reduces considerably the 365 power of our analysis to detect any difference on the rate of domestication between lifestyles. Indeed, in all known cases, 366 only one domesticated virus has been documented, suggesting that further domestications are not beneficial once a viral 367 machinery has been recruited by a wasp lineage.

368
Whether or not the rate of domestication per se is higher in endoparasitoids than in other hymenopterans, the se- ]. Yet, to our knowledge, such an effect has only been demonstrated against RNA viruses, so that it would not explain 372 the excess of DNA viruses documented here. Furthermore, the sequence identities with known viral sequences, which is 373 needed for this mechanism to work, is low in our dataset. Accordingly, previous work revealed that EVE-derived piRNAs 374 studied in 48 arthropod species were also probably too divergent to induce an efficient antiviral response [20]. At that stage, the ability of EVEs to generate PIWI-interacting RNAs that play a functional role in antiviral immunity seems ques-376 tionable. Further studies involving small RNA sequencing in hymenopterans would be required to shed light on this issue.

377
Protection of the eggs and larvae against the host immune system is recognized as an important trait, where EVEs play 378 a critical role. Because of their peculiar lifestyle, endoparasitoids are all targeted by the host immune system, a matter 379 of life or death to which other hymenopterans are not exposed to. Several cases of endogenization and domestication 380 in endoparasitoids, all involving dsDNA viruses, are thought to be related to this particular selective pressure [42, 16, 381 17, 18, 19]). The parasitoids appear to have co-opted the viral fusogenic property to address their own proteins (VLPs) 382 or DNA fragments (polydnaviruses) to host immune cells, thereby canceling the host cellular immune response. The 383 above-hypothesized high exposure of endoparasitoids to viruses, together with this unique selective pressure, may act 384 in concert to produce the pattern documented here: a strong input, that is, a diverse set of putative genetic novelties, 385 combined with a strong selective pressure for retaining some of them. The observed excess of dsDNA viruses may be 386 an indication that these viruses display a better potential for providing adaptive material in this context. In the cases 387 of polydnaviruses (found in some Braconidae and some Campopleginae), it appears that one way to efficiently deliver 388 virulence factors to the host cell is by addressing DNA circles that ultimately integrate into the host immune cells and 389 get expressed [62,63]. The DNA which is packed into the mature particles typically encodes virulence proteins deriving 390 from the wasp [64]. This means that, at least for these cases, the viral system should be able to pack DNA, which is most  Our analysis has revealed a large set of new virally-derived genes in Hymenoptera genomes. Those genes were de-399 riving from viruses with any genomic structures, although dsDNA viruses were disproportionately involved in endoge-400 nization and domestication. Importantly, our analysis revealed that endogenization rate and the abolute number of do-401 mestication events involving dsDNA viruses was increased for endoparasitoids compared to other lifestyles. Among the 402 new cases of endogenization and domestication, we uncovered new events revealing common features with previously 403 known cases of viral domestication by endoparasitoids, such as in the Platygastroidea Platygaster orseoliae. This is to our 404 knowledge the first case reported in the superfamily Platygastroideae, thus extending the diversity of Hymenoptera con-405 cerned by viral domestication. We propose that the higher rate of endogenization and higher number of domestication 406 events in endoparasitoids is a consequence of the extreme selective pressure exerted by the host immune system on 407 endoparasitoids. This extreme selective pressure may select endoparasitoids for retaining a viral machinery that could 408 help them address virulence factors to their hosts. We expect this process to be widespread among insect species sharing 409 the same lifestyle.

411
Genome sampling, assembly correction and assembly quality 412 A bioinformatic pipeline mixing sequence homology search, phylogeny, genomic environment, and selective pressure 413 analysis was built to search for viral endogenization and domestication events in Hymenoptera genomes. We used 133 414 genome assemblies in total, of which 101 were available on public repositories (NCBI and BIPPA databases) and 32 were 415 produced by our laboratory (all SRA reads and assemblies available under the NCBI submission ID : SUB11373855). Con-416 cerning the last 32 samples, DNA was extracted on single individuals (usually one female) or a mix of individuals when 417 the specimens were too small using Macherey-Nagel extraction kit, the DNA was then used to construct a true seq nano 418 Illumina library at Genotoul platform (Toulouse, France). The sequences were generated from HiSeq 2500 or HiSeq 3000 419 machines (15Gb/sample). The paired-end reads were then quality trimmed using fastqmcf (-q15 -qual-mean 30 -D150, 420 GitHub) and assembled using IDBA-UD [66]. All sample information can be found on the GitHub repository under the 421 name : All_sample_informations.txt and is available under the NCBI Biosample number : SUB11338872.

422
The size of the 133 assemblies ranged from 106.14mb to 2102.30mb. We kept only genome assemblies containing 423 13 of 45 at least 70% non-missing BUSCO genes (124/133 genomes, [67]) (all genome information can be found on the GitHub repository under the name : Assembly_genome_informations.txt). In addition, when the raw reads were available, we 425 used the MEC pipeline [68] to correct possible assembly errors. Although some genomes were highly fragmented (such 426 as the 32 genomes we generated since they were obtained using short reads only), the N50 values (min: 3542bp) were 427 equal to or larger than the expected sizes of genes known to be endogenized and domesticated (min known domesticated 428 EVE : 165bp) indicating that most of the putative EVEs should be detected entirely.

429
Out of the 32 samples sequenced by our laboratory for this study, one (corresponding to Platygaster orseoliae) gave 430 unexpected results. After assembly and BUSCO analysis, two sets of contigs were identified: one with only 4X coverage on 431 average, and one with 33X on average. The phylogeny of these different BUSCOs gene sets showed that the low-coverage 432 scaffolds likely belong to an early diverging lineage of Chalcidoidea (Figure1), whereas the 33x scaffolds belong to the 433 target species P. orseoliae. This result suggests that the pool of 10 individuals used for sequencing was likely a mix of two 434 species. A phylogenetic study based on Ultra Conserved Elements (UCEs) obtained from several species of Chalcidoidea 435 [69, 70] allowed to identify the unknown species as sister to Aprostocetus sp (Eulophidae) (see details in the supporting 436 information and Figure S2 ). In the figures and tables, the name putative_ Aprostocetus_sp was consequently assigned 437 to the unknown sample. However, since the lifestyle and identity of this species are uncertain, we did not include the 438 corresponding scaffolds in the main analysis. The scaffolds belonging to this putative_Aprostocetus_sp. (i.e : all scaffolds 439 with a mean coverage < 10X) were removed from the P. orseoliae assembly file hosted in NCBI.

440
Pipeline outline 441 EVEs were identified from the 124 Hymenoptera assemblies using a sequence-homology approach against a comprehen-442 sive viral protein database (including all categories of viruses : ssDNA, dsDNA, dsRNA and ssRNA). In order to validate 443 viral endogenization within Hymenoptera genomes, we developed an "endogenization confidence index" ranging from 444 A to X (FigureS13-7). This index takes into consideration the presence of eukaryotic genes and/or transposable elements 445 around candidate loci, and scaffolds coverage information (coverage for a valid candidate should be similar to that found 446 in BUSCO containing scaffolds). Finally, the pipeline also included an assessment of the evolutionary history and of the 447 selective regime shaping the candidates (based on dN/dS and/or expression data). with a percentage coverage of the viral protein >= 30%, an identity score >= 20% and an E-value score < 5e-04 ( Figure S13-462 1). The threshold parameters were optimized to maximize the detection of the 13 endogenous viral sequences within 463 the genus Leptopilina [19]. Once all the viral hits were recovered, we formed putative EVEs loci (n=238,108) correspond-464 ing to the overlap of several viral hits on the same scaffold using the GenomicRanges R package [76] ( Figure S13-2). To 465 remove false positives corresponding to eukaryotic genes rather than viral genes, we then performed another generalist 466 sequence homology search against the Nr database (downloaded the 09/11/20) using mmseqs2 search (-s 7.5, E-value 467 max = 0.0001) ( Figure S13-3). We did not select our candidate based on the best hit, since it does not necessarily reflect 468 the true phylogenetic proximity. Instead, candidates with more than 25 hits with either eukaryotic non-hymenoptera 469 species or prokaryotic species were removed, except if they also had hits with at least 10 different virus species (bits >= 470 50). We chose to eliminate Hymenoptera hits from the database because if a real endogenization event concerns both 471 one of the 124 species of our dataset and some species in the NCBI database, then an apparent "Hymenoptera" hit will be detected, possibly leading to its (unfair) elimination. Since viral diversity is poorly known, we also kept sequences 473 with even one single viral hit, as long as it did not have more than 5 eukaryotic or prokaryotic hits. Using these filtering 474 criteria we removed a total of 234,036/238,108 (98,3%) candidate loci leaving 4,072 candidates with convincing homology 475 to viral proteins. Note that among these loci a certain proportion actually corresponded to non-endogenized "free-living" 476 viruses. To study the evolutionary history of these candidate EVEs, we then performed a general protein clustering of all 477 the candidates and the NCBI viral proteins ( Figure S13-4, Mmseqs cluster; thresholds : E-value 0.0001, cov% 30, options :

478
-cluster-mode 1 -cov-mode 0 -cluster-reassign -single-step-clustering[77]). 479 We eliminated from the dataset the chuviral glycoproteins that have been captured by LTR retrotransposons [78], as 480 these loci have complex histories mostly linked to the transposition activity after endogenization. For this purpose, we 481 systematically searched among the candidates for the presence of TEs within or overlapping with the EVE (see the file 482 All_TEs_overlapping_with_EVEs under the github repository). Only one cluster (Cluster4185) was concerned by such a 483 situation (chuviral glycoproteins overlapping to Gypsy/LTRs). It was detected in 89/124 species (1074 total copies, median 484 = 7 copies/species, max = 244, min =1), and was probably similar as the one described in [79]. accidentally added to the samples). One way to filter these cases was to study (i) the genomic environment (are there 490 other eukaryotic genes or transposable element on the same scaffolds?) and (ii) metrics such as G+C% (used only for 491 read coverage/GC plots) and scaffold coverage depth around candidate loci (are they the same as scaffolds containing 492 housekeeping genes?). All of these data were used to establish confidence in the endogenization hypothesis, scaled from 493 A to X (Figure S13-7).

(i) Scaffolds sequencing depth (Figure S13-5) :
In order to support the hypothesis that a scaffold containing candi-495 date EVEs was part of the Hymenoptera genome, we studied the sequencing depth of the scaffolds. If the sequencing 496 depth of a candidate scaffold was not different from the depth observed in scaffolds containing BUSCO genes, then this 497 scaffold was likely endogenized into the Hymenoptera genome. Hence, when DNA reads were available (FigureS1), we 498 measured this metric by mapping the reads on the assemblies using hisat2 v 2.2.0 [80]. An empirical p-value was then 499 calculated for each scaffold containing a candidate EVE. To calculate this empirical p-value, we sampled 500 loci of the 500 size of the scaffold of interest within BUSCO scaffolds. These 500 samples represented a null distribution for a scaffold 501 belonging to the Hymenoptera genome. The p-value then corresponded to the proportion of BUSCO depth values that 502 were more extreme than the one observed in the candidate scaffold (two-sided test). We used a threshold of 5% and a 503 5% FDR (multipy python package [81]).

504
(i) Genomic environment and scaffold size  Another way to rule out contaminating scaffolds was to 505 look for the presence of eukaryotic genes and transposable elements in the scaffolds containing candidate EVEs, assum-506 ing that their presence in a viral scaffold is unlikely. Indeed, so far, very few viral genomes have been shown to contain 507 transportable elements [82,83,84,85,86] because TE insertions are mostly deleterious and are therefore quickly elimi-508 nated by negative selection [84,85]. We searched for transposable elements by a BlastX-like approach (implemented in 509 Mmseqs2 search -s 7.5), taking as query the scaffolds of interest and as database the protein sequences of the transpos-510 able element (TE) available in RepeatModeler database (RepeatPeps, v2.0.2) [87]. We only kept hits with an E-value < 1e-10 511 and with a query alignment greater than 100aa. We then merged all overlapping hits and counted the number of TEs 512 for each scaffold. To find eukaryotic genes within genomes we used Augustus 3.3.3 [88] with BUSCO training and then 513 assigned a taxonomy to these genes via sequence homology with Uniprot/Swissprot database using mmseqs2 search

516
• A: scaffolds with a corrected coverage p-value > 0.05 and at least one eukaryotic gene and/or one repeat element,

517
• B: scaffolds with at least one eukaryotic gene and/or one repeat element but no coverage data available,

518
• C: scaffolds with a corrected coverage p-value > 0.05 and neither eukaryotic gene nor transposable element,

519
• D: scaffolds with a corrected coverage p-value < 0. 05 and whose coverage value was higher than the average of 520 the scaffolds containing BUSCOs (as it is difficult to imagine that an endogenized scaffold present a lower coverage 521 15 of 45 than expected, whereas a higher coverage could correspond to the presence of repeated elements that inflate the 522 coverage of the scaffold for example) but with at least 5 eukaryotic genes and/or a repeated element (in total),

523
• E: scaffolds presenting no DNA seq coverage data available and no eukaryotic gene nor transposable element de-524 tected,

525
• F: scaffolds presenting a corrected p-value of coverage < 0.05 and less than 5 eukaryotic genes without any trans-526 posable elements; this category may rather correspond to free-living viruses.

527
• X: scaffolds with a corrected p-value < 0.05 and neither eukaryotic gene nor transposable element; This category 528 may rather correspond to free-living viruses.

530
Because several EVEs may derive from the same endogenization event, we sought to aggregate EVEs into unique events. 531 We aggregated into a single event, firstly (i) all the EVEs present on the same scaffolds, and secondly (ii) all the EVEs that 532 presented the same taxonomic assignment at the level of the viral family. These two steps were sufficient to aggregate 533 EVEs in the simplest case of events involving only one species (but possibly several EVEs).

534
To further characterize the endogenization events including more than one Hymenoptera species, we also relied 535 on phylogenetic inference. To this end, the protein sequences belonging to each of the clusters (containing both viral 536 proteins and candidate EVEs) were first aligned with clustalo v1.2.4 [89] in order to merge possible candidate loci (which 537 may in fact correspond to various HSPs). All loci (=HSPs) within the same scaffold presenting no overlap in the alignment 538 were thus merged, as they probably correspond to multiple HSPs and not duplications. We then performed a new codon 539 alignment from the augmented sequences in the clusters using the MACSE v2 alignsequence program [90] .

540
This alignment allowed us to obtain a protein and nucleotide codon alignment. We used the protein alignment to infer  For events shared by several species, we were also able to analyze gene synteny around putative EVEs. To do this, 552 we conducted the equivalent of an all vs all TblastX (Mmseqs2 search -search-type 4, max E-value =1e-07) between all 553 the candidate loci within a putative event (deduced from the phylogenetic inference), and then looked for hits (HSPs) 554 between homologous EVEs around the insertions. Because it is possible to find homology between two genomic regions 555 that does not correspond to orthology, for example because of the presence of conserved domains, we had to define 556 a threshold to identify with confidence the orthology signal. We therefore conducted simulations to define this value, 557 based on the well assembled genome of Cotesia congregata (GCA_905319865.3) by simply performing the same all vs all 558 blast analysis against itself (as if the two species considered had the same genome). Based on this, we defined two types 559 of simulated EVEs, (i) independently endogenized EVEs in the genomes of the two "species". This is simply simulated 560 by randomly selecting two different regions in the genomes, and (ii) a shared simulated EVE that was acquired by their 561 common "ancestor". This is simulated by selecting the same random genomic location in both "genomes". We then 562 counted the total length of the HSPs found around the simulated insertions all along the corresponding scaffold (i and 563 ii). As the result will obviously depend on scaffold length, we performed these simulations on several scaffold lengths 564 (100000000bp, 10000000bp, 1000000bp, 100000bp and 10000bp). We conducted 500 simulations in each scenario, and 565 we measured the cumulative length of homologous sequences by counting the sum of HSPs (bit score > 50). We then One way to test for the domestication of an EVEs (dEVEs) was to estimate the ratio (omega) of the number of nonsynony-573 mous substitutions per non-synonymous site (dN), to the number of synonymous substitutions per synonymous site (dS).

574
If EVEs are evolving neutrally, then the ratio is expected to be equal to 1, whereas if the EVE is under purifying selection, 575 dN/dS is expected to be lower than 1. We conducted this analysis on trimmed codon alignments from   to the monophyletic EVE clade evolved under a neutral scenario ( 2 test). The dN/dS estimated for the whole clade is then 579 the average of each branch of the clade. The p-values were then adjusted by selecting an FDR of 0.05 [81], and we esti-580 mated the standard errors of dN/dS that maximized the likelihood (option getSE = 1). dN/dS with dS greater than 10 were 581 removed, since this indicates substitution saturation .

582
The other way we choose to study the domesticated nature of a viral gene was to study their expression profile (Fig-583 ure S13-13). We reasoned that domesticated genes are likely to be significantly expressed. To test this, when RNAseq 584 reads were available on NCBI (SRA), we mapped them on the assembled genomes (until reaching 300x coverage as far   (table S1). However, in addition to the expected unique shared event concerning the M. demolitor and C. vestalis species, our pipeline inferred two additional events, each specific to one lineage. This was due to the fact that two genes were not detected by our pipeline as shared 624 by M. demolitor and C. vestalis, either because they are effectively not shared (for 3 of them: HzNVorf118, like-pif-4 (19kda), 625 fen-1), or because of some false negative in one of the two lineage (for one of them:p33 (ac92)).

626
Assessing the distribution of virus infecting insects 627 We estimated the number of viral species infecting insect species based on the virushostdb database (downloaded the 628 23/05/2022 on https://www.genome.jp/virushostdb/) which lists a wide diversity of viral species associated with their 629 putative hosts. We also added two important exploratory RNA virus search studies [94,100]. We kept only viruses found 630 in interaction with insect in at least one of these databases. Genomic structures were retrieved through the ICTV report  is obtained for the fixed unrooted topology of using a standard MCMC analysis (100,000 iterations, 3 chains, 5000 burn-646 in, tuningInterval=200). The obtained posterior distribution is then used to calculate the posterior mean and posterior 647 variance of the branch length for each branch of the unrooted topology. In the second step, we date the phylogeny using 648 a relaxed clock model and calibrations (500,000 iterations, 4 chains, 5000 burn-in, tuningInterval=200). Calibration of the 649 root was done using a uniform law between 300 and 412 Mya. To verify that MCMC analyses converged to the same 650 posterior distribution, for both steps we computed the effective sample size and applied the Kolmogorov-Smirnov test 651 using the package convenience v1.0.0 with a minimum ESS threshold of 100 (however, due to an excessive demand for 652 resources, we were unable to achieve the sampling value of an ESS of minimum 100 for 46/389 parameters (min=44.25)).

653
Ancestral state reconstruction 654 To explore the dynamics of EVEs gain in relation to lifestyle, we first had to reconstruct the ancestral lifestyle states of 655 the Hymenoptera used in this study. This was achieved using a Bayesian approach implemented in RevBayes 1.1.1 [101].

656
The lifestyles of the Hymenoptera species used in this study were deduced from various sources (details and sources in 657 the table named Assembly_genome_informations.csv from the GitHub repository). Since lifestyle characters are probably 658 not equally likely to change from any one state to any other state, we choose the Mk model with relaxed settings allowing 659 unequal transition rates. Thus, we assumed 6 different rates with an exponential prior distribution. Before running the 660 MCMC chains, we made a preliminary MCMC simulation used to auto-tune the moves to improve mixing of the MCMC 661 analysis with 1000 generations and a tuning interval of 300. We then ran two independent MCMC analyses, each set 662 to run for 200 000 cycles, sampling every 200 cycles, and discarding the first 50 000 cycles as burn-in. To verify that 663 MCMC analyses converged to the same posterior distribution, we computed the effective sample size and applied the KolmogoRov-Smirnov test using the package convenience v1.0.0 with a minimum ESS threshold of 100. The MCMC chain 665 was subsampled to provide 1000 samples. At each sample, ancestral states were reconstructed for all nodes of the 666 phylogeny. We assumed that the state assigned to a node was constant throughout the branch leading to that node.

Test of the lifestyle effect on viral endogenization and domestication
In order to test the lifestyle effect on the propensity to integrate and domesticate viral element, we first randomly sampled 669 1000 probable ancestral state scenarios to take into account the uncertainty in the estimates of the ancestral states of the 670 nodes. Because a lot of branches had no EVE endogenization inferred, we ran zero-inflated negative-binomial GLM model, 671 for each of these 1000 scenarios such that (GLM(NbEVEs~free-living + endoparasitoid + ectoparasitoid * Branch_length, 672 family = zero inflated neg binomial). We eliminated all branches older than 160 million years because they are too old for 673 our analysis to detect events (the oldest event detected by our analysis is around 140 mya) that could artificially inflate the 674 zero count. The model was implemented in stan language using the R package brms (seed = 12345, thin=5, nchains =4,  To calculate the rate of domestication independent of the rate of endogenization, we built a binomial logistic regres-683 sion model in a Bayesian framework, specifying the number of domesticated EVEs (or Events) (the numerator) relative to 684 the total number of EVEs or Events inferred by our pipeline (the denominator). These binomial models allowed us to test 685 whether the probability of domestication after endogenization correlated with lifestyle by controlling for the endogeniza-686 tion input (the denominator). Thus, for each of the 1000 lifestyle scenarios, we ran a binomial brms model with a logit 687 link such that brms(Nb dEVEs/dEvents | trials(Nb EVEs/Events)~lifestyle + Branch length).

688
Before analyzing the data, we checked that the inferences did not depend on the quality of the genomes selected 689 for analysis. We found a significant effect of the lifecycle on the N50 and percentage of complete+partial BUSCO in the 690 assemblies (Kruskal-Wallis rank sum test p-values respectively = 3.192e-10 and 1.26e-14). Furthermore, a pairwise Wilcox 691 test with p-values adjusted with Bonferroni method revealed a significantly higher values of N50 and %complete+partial 692 BUSCO in genome assemblies from free-living species compared to endo and ectoparasitoids species (p-value <0.05).

693
The same test using the total assembly length in bp did not reveal any difference between the three lifestyles (p-value

717
The authors declare no competing interests.

Supporting Information 1-ssRNA endogenization
Although our results show an under-representation of EVEs deriving from ssRNA viruses (relatively to their high abundance in insect virome), they were involved in a high absolute number of endogenization events: 21 viral families/clades were involved in 174 independent endogenization events in the 114 Hymenoptera genomes analyzed, in particular involving Chuviridae, Artoviridae and Nyamiviridae (Figure2-B). In a recent meta-analysis [4], more than 1876 EVEs involving ssRNA viruses were identified in 37 species distributed in 8 insect orders. Interestingly, the authors noticed that the contribution of negative-stranded RNA viruses was overall high (67%), but was also highly species-specific. In our dataset, the great majority of ssRNA viruses donors were negative-stranded (14/21 viral family/clades) accounting for 78.8% of ssRNA events. The pattern thus seems even more pronounced in Hymenoptera compared to insects in general, and resemble the pattern observed in ticks [56]. The reasons for the asymmetry observed between negative and positive strand RNA viruses in endogenization are unclear. One explanation proposed by [107] posits that the ssRNA(-) have a higher probability of endogenization compared to ssRNA(+) because non-segmented ssRNA(-) usually produce abundant short mRNAs compared to ssRNA(+) which conversely produce lower amount of long mRNAs encoding a single polyprotein [108]. Then, all else being equal, an RNA(-) virus would produce more RNA molecules, which will increase the likelihood that some of them get reverse-transcribed and ultimately endogenized into the host genome. In support of that view, [107] noticed that the NP gene was more often endogenized compared to the other genes encoded by most ssRNA(-), which fits its prediction. This is because for most Mononegavirales species the 3' nucleoprotein (NP) gene is the most abundant RNA [109], due to the polar 3'-5' stepwise attenuation of transcription [109]. This pattern was since observed on some systems (i.e. mosquitoes and few mammals genomes [1]) but opposite results were also obtained on others (ticks, [56]). In our dataset, we found two ssRNA(-) non-segmented families showing the expected pattern where the genes closest to the 3' regions were the most endogenized : the Nucleoprotein (N) which is the first transcribed protein in Nyamiviridae was endogenized in 26 cases out of 28, and the 3'-unknown protein in Lispiviridae (first transcribed) was endogenized in 12 cases out of 15 endogenization events (FigureS3). On the contrary, all the other putative ssRNA families donor presented more EVEs deriving from the middle or the 5' genomic regions: the most endogenized gene from Artoviridae was the U2 protein (19/39) which is in the middle of the genome (2nd/3); in Bornaviridae and Rhabdoviridae the most endogenized gene was the RdRp (L) protein, which is the last ORF in the first genome (out of 8 genes) and the one just before the last gene in Rhabdoviridae. Finally, out of 36 EVEs deriving from Chuviridae, 26 corresponded to the Nucleoprotein (N) which is the last transcribed protein in the closest viral genomes. These unexpected results under Holmes model may thus lead one to reject the hypothesis, unless peculiar mechanisms of regulation of the transcription are at play for these viruses. Another explanation could come from a strong selective pressure for retaining particular proteins (i.e. Nucleoprotein) in the genome, independently of their level of transcription.

2-A new case of virus domestication in Platygaster orseoliae
In the assembly of P. orseoliae, 12 scaffolds were annotated as free-living viruses (F-X scaffolds). They had a different sequencing depth compared to BUSCO containing scaffolds and encoded 136 complete ORFs for which 21 presented homology with LbFV ORFs (min bit score = 50, min ORF size = 150pb, max overlaps = 23pb). ORF density was 82.7% which is in the range of expected values for related free-living viruses [110]. In order to identify additional scaffolds possibly belonging to this free-living virus, we searched for homology between the 136 putative viral proteins, and the scaffolds obtained from the assembly of P. orseoliae. These results allowed us to identify two additional scaffolds (scaffold_117128 scaffold_18896). Because the total size of the 14 putative "free-living" scaffolds was within the expected range for a dsDNA virus genome (136,801 bp) and because the average coverage was much higher than BUSCO-containing scaffolds (mean cov = 95.6X [sd=5.05X] compared to 33X in BUSCOs) and homogeneous ( Figure S4), we believe that this set of scaffolds corresponds to the whole genome of a new virus, related to LbFV, which we propose to call Platygaster orseoliae filamentous virus (PoFV). In order to characterize the relationship of this new virus within dsDNA viruses diversity, we inferred a phylogenetic tree including ORFs of known dsDNA viruses along with the EVEs newly identified here. The phylogenetic reconstruction revealed that PoFV was the closest relative of the endogenized virus found in the same species (PoEFV, Figure 5).
In order to detect possible new viral endogenization from the same "donor virus", we queried the genome of P. orseoliae with the 136 predicted proteins of PoFV. This way, we found a total of 139 convincing hits (89 PoFV ORFs), including the hits to the 15 ORFs with LbFV-homology. All ORFs were encoded by scaffolds with BUSCO-like coverage depth (p-value cov >= 0.05) and/or containing eukaryotic genes and/or transposable elements (Blastx E-value max = 7.060e-07, bits min =50, with an average percentage of identity of 69.16%). Furthermore, a large proportion of the EVEs (22.7%) presented premature stop codons within the sequences, further suggesting that these virally-derived genes are indeed endogenized since abundant pseudogenization is not expected in free-living virus genomes ( Figure S8-A).
Among the 81/139 apparently intact EVEs (with ORF length >= 50% of the PoFV ORF), some are likely implicated in DNA replication (integrase), in transcription (lef-8,lef-9,lef-5,lef-4), in packaging and envelopment (ac81, 38k) and in infectivity (pif-1, pif-2, pif-3). Among the 139 PoFV-related EVEs found in P. orseoliae, 104 corresponded to putative paralogs. Conversely, none of these 104 ORFs were present in two copies within the PoFV genome, suggesting that a major postendogenization duplication event occurred or that multiple endogenization events did occur. Among these 104 duplicated EVEs, 78 presented topologies allowing us to calculate dN/dS ratios using Bayesian pairwise estimates (runmode -3 in codeml) or foreground/background tests (codeml) when topologies presented more than 4 leafs. Before running the foreground/background tests, we constrained all paralogs to form a monophyletic group including the PoFV loci as the closest taxa in the phylogenies (all LRT tests did not significantly present differences between constrained and unconstrained trees). Among these 78 paralogs EVEs, 44 presented a complete and intact open reading frame and a dN/dS ratio significantly lower than 1 suggesting that they are under stabilizing selection ( Figure S8-A).
Although functional studies are needed to confirm that these virus-derived genes are involved in the formation of VLPs as observed in Leptopilina [19], we believe that P. orseoliae filamentous virus (PoEFV) is a good candidate for viral domestication, possibly involved in counteracting its Diptera host immune system (from the family Cecidomyiidae [33]).

4-Assignation of the unknown Hymenoptera to species
UCEs along with 400 bp of flanking regions on either side were extracted from the low coverage scaffolds with a custom script. We used a two-step process to assign the unknown sample to species. First, UCEs + flanking regions were analyzed with a set of UCEs + flanking regions obtained from early diverging families of Chalcidoidea by [70,69] to assign the unknown sample to family. Then, unknown sequences were analysed with a larger set of species belonging to the identified family (Eulophidae; loci taken from [69]). In both cases, only loci that had a sequence for at least 75% of the samples included in the analysis were retained. Loci were aligned with MAFFT (-linsi option; [111]). Positions with > 90% gaps and sequences with > 25% gaps were removed from the alignments using SEQTOOLS (package PASTA; [112]). The concatenation of all loci was analysed with IQ-TREE v 2.0.6 [71] without partitioning. Best fit models were selected with the Bayesian Information Criterion (BIC) as implemented in ModelFinder ([72]). FreeRate models with up to ten categories of rates were included in tests. The candidate tree set for all tree searches was composed of 98 parsimony trees + 1 BIONJ tree and only the 20 best initial trees were retained for NNI search. Statistical support of nodes was assessed with ultrafast bootstrap (UFBoot) ([113]) with a minimum correlation coefficient set to 0.99 and 1,000 replicates of SH-aLRT tests ([114]). Results of the phylogenetic analyses are presented in Figure S2. Placement of the unknown species in trees shows that samples of P. orseoliae were likely mixed up with a species belonging to the genus Aprostocetus (Eulophidae, Tetrastichinae). Given its small size, color and abundance (265 species described just in Europe), it seems plausible that one specimen of Aprostocetus sp. remained unnoticed in the pool of P. orseoliae.   Figure S1. Source of the datasets and availability of the reads. Phylogeny of 124 Hymenoptera species. Two Coleoptera species were used to root the tree. The aLRT bootstrap scores are represented along the nodes. The sources refer to the platform or laboratory in which the assemblages come from (This study, BIPPA: BioInformatics Platform for Agroecosystem Arthropods, NCBI: National Center for Biotechnology Information). The assemblies for which raw DNAseq or RNAseqs reads were available are listed in the column DNA or RNA reads. The G+C% column reflects the average G+C rate for each assembly, and the BUSCO% column reflects the rate of complete or partial BUSCOs found via the analysis with BUSCO V3. Posterior Bayesian lifestyle inference distribution for each node and tips are represented by colored pie charts.  Figure S2. UCE trees built to assign to species the unknown Chalcidoidea sequenced with the pool of P. orseoliae. A: Phylogeny of early diverging families of Chalcidoidea (423 UCES and 127,979 bp were analysed to get the tree, best fit model = GTR+F+R10). B : Phylogeny of the family Eulophidae to which the unknown sample was inferred to belong to (408 UCES and 77,514 bp were analysed to get the tree, best fit model = GTR+F+R10). For both trees, SH-aLRT/UFBoot are shown at nodes; the number of UCEs analyzed for each sample is indicated between bracket and the unknown sample is highlighted in red. Table S1. Summary statistics for control cases. The numerator indicates the numbers of EVEs or dEVEs inferred by our pipeline, and the denominator indicates the number of known EVEs for each case. Analysis on dN/dS was only possible when orthologs or paralogs were available. Controls Endogenous viral elements present in scaffolds probably belonging to the Hymenoptera genome are scored from A to D, and scaffolds probably belonging to free viruses are scored as F or X: (see details in Materials and methods). TPM (Transcripts per kilobase million) values were calculated via RNAseq read mapping when available in the databases (all RNAseq data sources can be found on the github repository under the name : RNA_seq_reads_mapped.txt). * In addition to the expected unique shared event concerning the M. demolitor and C. vestalis species, our pipeline inferred two additional events, each specific to one lineage. This was due to the fact that two genes were not detected by our pipeline as shared by M. demolitor and C. vestalis, either because they are effectively not shared (for 3 of them: HzNVorf118, like-pif-4 (19kda), fen-1), or because of some false negative in one of the two lineage (for one of them:p33 (ac92) 1 When paralogs were detected on different scaffolds, the best scaffold score was used. 2 Analysis on dN/dS was only possible when orthologs or paralogs were detected (for example, in V. canescens this calculation was only possible for 3 genes having paralogs)  Figure S3. Example of endogenization events. The phylogeny of cluster21304 corresponds to the clustering of a set of viral and candidate viral insertion genes sharing a homology. In red are represented the loci of viral origin, and in blue are represented the loci probably endogenized (EVEs). The letter at the end of the taxon label represents the endogenization score assigned to the candidate (see details in Materials and methods). In this example, we found two singular endogenization events in the species endoparasitoid Encarsia formosa (annotated A and thus presenting a depth of coverage non-significantly different from the distribution of the BUSOs of the genome as well as at least one transposable element and/or one eukaryotic gene) and ectoparasitoid Torymus auratus (annotated C and thus presenting only a depth of coverage non-significantly different from the distribution of the BUSOs of the genome). Since these two species do not share a close common ancestor in the phylogeny and come from two different families, the algorithm therefore assigned them to two independent viral endogenization events. The viral locus found in the assembly of the endoparasitoid species Platygaster orseoliae was annotated F, meaning that the depth of coverage deviated significantly from the BUSCO distribution of the genome and that no TEs and less than 5 eukaryotic genes were found in the scaffold containing the candidate insertion. Finally, the two loci belonging to the ectoparasitoid species Megastigmus stigmantizans and Megastigmus dorsalis both show a score supporting viral endogenization. Furthermore, these species exhibit a doubly monophyletic clade (high bootstrap score) within the gene phylogeny and within the species phylogeny, suggesting that they acquired this viral gene from their closest common ancestor about 20 million years ago. All newick phylogenies are available on the GitHub repository under the name : All ℎ .  Figure S5. Violin plots of the posterior distribution of GLM coefficients after exponential transformation in relation to wasp lifestyle. The ectoparasitoid lifestyle is in yellow, the endoparasitoid lifestyle is in green, and the free-living lifestyle is in blue. Coefficients have been transformed into exponential and correspond to the posterior distribution of the coefficients of a binomial negative zero-inflated GLM model, where the lifestyle free-living stand for the intercept. The Y-axis corresponds to the multiplicative factor of the number of endogenization and/or domestication of EVES and/or events relative to free-living species. The coefficients are derived from 1000 GLM models adjusted on 1000 randomly selected probable scenarios (>90 CI) of ancestral states at nodes. Branches from nodes older than 160 million years have been removed from the dataset. The ROPE% is the percentage of the  Figure S7. Candidate EVEs in two ant species. In grey are displayed the eukaryotic genes predicted by Augustus, with a dark color for exons, and light for introns. In black are displayed the transposable elements predicted by sequence homologies with the RepeatPeps protein database. The size of the scaffold is displayed below the name of each scaffold. The letter followed by the scaffold name refers to the scoring given to the scaffold based on coverage and gene/ET presence information (see details in Materials and methods). For the sake of representation, all scaffolds are represented at the different scale, the exact coordinates of the elements are referred in the abscissa which corresponds to the coordinates in base pairs. Annotation is indicated above the arrows.  Figure S8. Genomic environment for the EVEs detected in Platygaster orseoliae . The plot show regions homologous to viral ORFs in the Platygaster orseoliae filamentous virus (PoFV) genome (A). The colored regions correspond to the predicted ORFs in the PoFV genome (gray ORFS in PoFV scaffolds mean that no homologous EVEs were found in the genome of P.orseoliae.) (B), so the closer the colours, the closer the ORFs were initially in the PoFV genome. In grey are displayed the eukaryotic genes predicted by Augustus, with a dark color for exons, and light for introns. In black are displayed the transposable elements predicted by sequence homologies with the RepeatPeps protein database (max E-value =5.866e-18). The letter followed by the scaffold name refers to the scoring given to the scaffold based on coverage and gene/ET presence information (see details in Materials and methods). The exact coordinates of the elements are referred to the x abscissa, which corresponds to the coordinates in base pairs and can be found on the Github repository under the name : All_PoFV_PoEFV_loci_informations.txt. Annotation is indicated in or next to the arrows. The number below each EVE correspond to the homologous ORF number in the PoFV genome. Numbers are colored in red if the EVE has at least one paralog and if the computed dN/dS is below 1, suggesting purifying selection in the P.orseoliae genome. Arrow with black borders correspond to EVEs showing a complete ORF (>50% of the size of the best PoFV ORF). Hatched arrows correspond to pseudogenized EVEs (with premature stop codons). The colour difference between black and white for the names of the proteins is for visual purposes only.  V .c a n e s c e n s V .c a n e s c e n s * F .a ri s a n u s F .a ri s a n u s * C .i n a n it u s C .c o n g re g a ta C

of 45
.  Figure S9. Heatmap representing the viral genes known to be domesticated by Hymenoptera. The panel (A) refers to the four known cases (Venturia canescens, Fopius arisanus, Cotesia congregata and Microplitis demolitor) involving Nudivirus donors while the panel (B) refers to the known case involving LbFV donors in three Leptopilina species. Complete parasitoid wasp genomes information were available for Microplitis demolitor, Venturia canescens, Fopius arisanus and Cotesia congregata, while only partial genomic data were available for Chelonus inanitus. Each row indicates a gene which has been identified previously as being endogenized in at least one species. In (A), the first four columns indicate whether the gene is a core gene for baculoviruses (Bv), Nudiviruses (Nd), alpha-nudiviruses (alpha-Nv) or beta-nudivirus (beta-nv). The following columns indicate the presence of each gene based on the literature (in blue) and based on our pipeline (columns with a star symbol). The colors indicate the inferred selection pressure acting on each gene (dN/dS) and the letters A, B, C, D, E, and X represent the degree of confidence in the endogenization. Capital letters indicate that this gene is present in a scaffold that contains other candidate genes. When the box is framed in black, it means that the gene is expressed (TPM>1000). . The ectoparasitoid lifestyle is in yellow, the endoparasitoid lifestyle is in green, and the free-living lifestyle is in blue (the intercept). Coefficients have been transformed into exponential and correspond to the posterior distribution of the coefficients of a binomial logistic regression GLM model, where the lifestyle free-living stand for the intercept. The Y-axis corresponds to the multiplicative factor of the number of dEvents (corrected for Events rates) correlative to free-living species. The coefficients are derived from 1000 GLM models adjusted on 1000 randomly selected probable scenarios (>90 CI) of ancestral states at nodes. Branches from nodes older than 160 million years have been removed from the dataset. The ROPE% is the percentage of the posterior distribution of coefficients below the intercept. The posterior distribution of the interaction coefficients between lifestyles and branch size were not informative, and the branch size factor was therefore added as an additive effect to the model. A-The corrected coefficient within all dEvents, B-The corrected coefficient within all dEvents without the control genomes, C-The corrected coefficient within all dEvents present in a scaffold annotated with a score A, D-The corrected coefficient within all dEvents present in a scaffold annotated with a score A and without the control genomes. Figure S12. Cladogram of the Ophioniformes group, illustrating the two independent endogenization events of two unknown viruses in Banchinae and Campopleginae lineages. The phylogeny includes 12 subfamilies of the Ophioniformes group within the superfamily Ichneumonoidea. Several species of these subfamilies have been examined for the presence of ichnovirus-like polydnaviruses: by negative staining of calyx fluid (N), TEM of ovarian sections (S), visual examination of the calyx fluid (CF), probes from ichnovirus replication or structural proteins (probes) or by IVSPER sequence homology on whole genome assemblies (WG). The subfamilies and species in blue correspond to those positive for a dsDNA virus endogenization from unknown origin (ichnovirus-like). The others (in black) were negative for endogenized ichnovirus-like elements. The phylogeny is inspired from [115], and the data reported comes from [115,116,117,118].

Viral Homology search
Data products