Extensive hybridization between pig and human Ascaris identifies a highly interbred species complex infecting humans

Human ascariasis is a major neglected tropical disease caused by the nematode Ascaris lumbricoides. We report a 296 megabase (Mb) reference quality genome comprised of 17902 protein-coding genes derived from a single, representative Ascaris worm collected from 60 human hosts in Kenyan villages where pig husbandry is rare. Notably, the majority of human isolates (63/68) possessed mitochondrial genomes that clustered closer to the pig parasite Ascaris suum than to A. lumbricoides. Comparative phylogenomic analyses identified over 11 million nuclear-encoded SNPs but just two distinct genetic types that had recombined across the genomes analysed. The nuclear genomes had extensive heterozygosity and all samples existed as genetic mosaics with either A. suum-like or A. lumbricoides-like inheritance patterns supporting a highly interbred Ascaris species genetic complex. As no barriers appear to exist for anthroponotic transmission of these “hybrid” worms, a one-health approach to control the spread of human ascariasis will be necessary.


13
To estimate the number of supported ancestries (K) that could be resolved in the Ascaris 295 genomes sequenced, we calculated the Dunn index, which supported 3-6 ancestral populations 296 ( Figure 4D). A gradual increase in the Dunn Index after K = 6 was observed for an ancestral 297 population size between 2 and 15 ( Figure 4D and Figure S6). We next used POPSICLE to calculate 298 the number of clades present within each 10kb sliding window. Local clades were represented with 299 a different color and painted across the genome to resolve ancestry. The SNP diversity plots across 300 the 68 isolates identified 3 major "parentage blocks" that were resolved as belonging to ALV5 or 301 were genetically distinct with either both haplotypes sharing the alternate parent (homozygous 302 alternate), or were heterozygous between the two parental haplotypes for the majority of the 303 isolates ( Figure 4E, middle Circos plot. Color hues cyan, orange, aqua). 304 To visualize such shared ancestry across the different Ascaris strains at chromosome 305 resolution, a color hue representing a local genetic "type" present was assigned and integrated to 306 construct haplotype blocks across each chromosome for the ancestries present. Chromosome 307 painting based on shared ancestry revealed a striking mosaic of large haplotype blocks of different 308 admixed color hues, consistent with limited genetic recombination between a low number of 309 parentage haplotypes. These admixture patterns were readily visualized by shared color blocks 310 between different strains across entire scaffolds including AlgV5R019X ( Figure 5A) and 311 AlgV5R027X ( Figure 5B). In low complexity regions such as the left portion of contig 312 ALgV5R019X, only three major haplotypes were resolved ( Figure 5A). Strikingly, within each of 313 the 6 clades resolved, all strains showed a limited, mosaic fingerprint of introgressed sequence 314 blocks indicating that recombination has shaped the population genetic structure among the Ascaris 315 isolates sequenced. Evidence for both segregation and recombination were evident. For example, 316 isolates 1107E_1 and 2110F_2 shared the same chromosome at ALgV5R019X, but entirely 317 different chromosomes at ALgV5R027X, whereas isolates 107_1, 108_1 and 2110F_2 were 318 identical except at the subtelomeric end of ALgV5R19X. In this region two admixture blocks were 319 resolved; 107_1 and 2110F_2 remained similar to each other but 108_1 now possessed a sequence 320 14 block that was shared with isolate 119_3. This extensive chimeric pattern in chromosome painting 321 also closely resembled the genome-wide hierarchy tree ( Figure 5A). The data support a model in 322 which the isolates are genetic recombinants between A. suum and A. lumbricoides that are 323 predominantly inbreeding. 324 Geographic and demographic correlates of genetic similarity 325 To examine clustering of worms in similar human hosts, we statistically compared genetic 326 variation within groups (such as within a village) versus between groups (such as between 327 villages). We found significant genetic separation between worms in different villages (using 328 Adonis vegan in R), but not between worms from different countries ( Table 2). This suggests 329 genetic diversity is present in the population of Ascaris in these Kenyan villages, which is similar 330 to the diversity of populations of Ascaris around the world. It also suggests that a high proportion 331 of Ascaris transmission may occur within villages in this Kenyan setting. There was no evidence 332 from this analysis that the 13 worms collected three months after albendazole treatment were any 333 different than the worms collected prior to albendazole treatment (Table 2). 334 To expand on our observations, that genetically similar worms are found around the world, 335 but that similar worms cluster within a village based on our nuclear SNPs data, we plotted genetic 336 distances against geographic distances. Surprisingly, we found no significant correlations between 337 genetic and geographic distance, neither across all five studied villages nor within the two most 338 heavily parasitized villages ( Figure S7 and  and humans in endemic regions) and clade B (worms from humans and pigs from endemic and 350 non-endemic regions) described elsewhere 22 . It is likely that worms in both of these clades are 351 being transmitted from human to human, as pig husbandry is rare in this area of Kenya. Patterns 352 may differ by locality, and it is possible that some of the pig-associated (A. suum-like) worms 353 circulating in this human population in Kenya were acquired, perhaps generations ago, by humans 354 who lived in closer proximity to pigs. It is also possible that these worms were acquired from non-355 human primates 42 , or some other Ascaris host, rather than from pigs. 356 However, the SNPs across the whole nuclear Ascaris genome provide significantly greater 357 power in understanding Ascaris speciation Importantly, our nuclear genome SNP analysis suggest 358 that the 68 Kenyan Ascaris are distributed across multiple clades in a phylogeny based on the 359 nuclear genomes. Overall, data from our study and other studies are consistent with a pattern 360 where hybrid genotypes in Ascaris populations were observed 11,22,43 Our study represents one of 361 the most detailed accounts of mito-nuclear discordance in nematodes echoing patterns seen in 362 another human nematode Onchocerca volvulus 44 . The data in our current study shows the 363 occurrence of distinct mitochondrial lineages that could be evidence of early stages of species 364 differentiation. The admixture seen within the nuclear genome, however, appears to disrupt the 365 establishment of defined molecular speciation barriers between the different Ascaris lineages. Although the current genome is, by far, the most continuous assembly for Ascaris, it is not a 381 full chromosome assembly due largely to repetitive sequences, in particular 120 bp tandem repeat 382 clusters and long stretches of subtelomeric repeats. Thus, it is possible that mis-assembly in some 383 scaffolds has increased the frequency of mosaicism detected. It is for this reason that the 384 comparative analyses on the nuclear genome was restricted to the largest 50 scaffolds, most of 385 which are at chromosomal resolution, with only minor localized variation due to the repeat clusters. 386 In these high confident scaffolds, large haplotype blocks possessing either A. suum, A. lumbricoides 387 or both parental haplotypes (heterozygous) were readily resolved indicating that the genetic 388 mosaicism observed could not be solely attributed to genome mis-assembly. Ultimately, future 389 studies using ultralong PacBio and/or Nanopore sequencing combined with chromosome 390 conformation capture (Hi-C) techniques will improve the genome to full chromosome assembly to 391 more accurately resolve the true extent to which recombination has impacted the population genetic 392 structure of the Ascaris species genetic complex. 393 The finding that A. suum and A. lumbricoides form a genetic complex has important public 394 health implications. Reduced treatment efficacy is not currently a common issue in Ascaris 395 infections among humans or pigs 52-54 , though low efficacy of benzimidazoles is an issue for 396

Assembly and annotation of A. lumbricoides reference genome 436
The A. lumbricoides germline genome assembly was constructed using the A. suum genome as a 437 reference. Briefly, sequencing reads from a single A. lumbricoides worm (libraries #8457, #8458 438 and #8778) were mapped to the A. suum germline genome assembly 28 using BWA 63 to generate 439 BAM and MPILEUP alignment files. The MPILEUP files were processed with a PERL script that 440 replaced all variation sites in the reference genome with the highest allele frequencies in the A. 441 lumbricoides sample. A. suum genomic regions that represent < 5X of A. lumbricoides reads 442 coverage were excluded from the assembly. We further polished the genome with additional 443 Illumina sequencing reads using Pilon and its default parameters 64 . The A. lumbricoides genome 444 was annotated using the gene models built for A. suum, using the annotation transfer tool RATT 65 . 445 The protein coding regions were defined using TransDecoder 446 (https://github.com/TransDecoder/TransDecoder/wiki). To evaluate the gene expression across all 447 stages, we utilized previous RNAseq data from the developmental stages 27,28 , re-mapped the SRA 448 from adult males, females, L3 and L4 stages 25 to the current gene models, and quantified the 449 expression using tophat and cufflinks. The re-mapped reads, analyzed by JMP Genomics (SAS) 450 across all the stages and based on the principal component analyses ( Figure 1B), were grouped as 451 adult male, adult female, L1, L2, L3 (egg L3, liver L3 and lung L3), L4, carcass, muscle, intestine, 452 embryonic (zygote1, zygote2, zygote3, zygote4, 24h, 46h, 64h, 96h, 5d, 7d), ovaries (female 453 mitotic region, female early pachytene, female late pachytene, female diplotene and oocyte) and 454 testis (male mitotic region, spermatogenesis, post meiotic region, seminal vesicles and spermatids). 455 Proteome and comparative genomics analyses were done using an in-house pipeline 66   Co-ancestry heatmap 504 The SNP data (VCF file) was first phased accurately to estimate the haplotypes using SHAPEIT 102 505 after keeping only biallelic SNPs and loci with less than 80% missing data. Co-ancestry heatmaps 506 were generated using the linkage model of ChromoPainter 103 and fineSTRUCTURE 507 where each sample had at least 20x coverage for each locus. Using this list, the base calls for each 522 sample were pooled together to generate a single multi sequence fasta file. 523

22
Next, both maximum likelihood (ML) trees and bootstrap (BS) trees were generated with a final 524 "best" tree generated from the best scoring ML and BS trees using RAxML v8. Similarity within and between worms from different villages, households, people and time-points 528 was analyzed based on the distance matrix of the patristic distances from the phylogenetic tree 529 described above, using permutational multivariate analysis of variance (Adonis Vegan in R). The 530 distance matrix underlying the phylogenetic tree was analyzed in order to measure the significance 531 and contribution of different factors to variance between samples. Each factor (village, household, 532 host, time-point, human age group and worm age/size group) was analyzed both separately and 533 sequentially. The sequence chosen was ordered based on significance of each factor when tested 534 individually. Since multiple groupings were considered using the same dataset, multiple 535 comparison corrections were applied. Sample sizes and descriptions of each group are shown in 536 Table 2. Worms with recorded lengths were split into three groups of equal number. This was in 537 order to test whether larger worms (using worm length as a very approximate proxy for age, where 538 older worms might have survived at least one round of ALB treatment) were any different from the 539 newly acquired younger worms. Similar methods were used to analyze the mitochondrial 540 phylogeny along the same groupings. 541

Mitochondrial genome assembly 542
We assembled mitochondrial genomes using a de novo approach from 68 individual Ascaris 543 genomes. For each individual, the Ascaris mitochondrial reads in the total DNA sequencing were 544 identified by mapping the Ascaris reads to the A. suum reference mitochondrial genome (GenBank 545 accession: NC_001327). Adaptor sequences were trimmed prior to de novo assembly. To reduce 546 the complexity of the de novo assembly, we randomly sampled 1,000x reads from each individual 547 (the use of higher read coverage often resulted in fragmented scaffolds) and assembled these reads 548 using the SPAdes assembler 106 with continuous k-mer extension from K=21 to the maximum k-549 mer allowed (average extended k-mer size = 91). The assembled scaffolds were corrected with the 550 built-in tool in SPAdes to reduce potential assembly artifacts. Next, the assembled scaffolds were 551 aligned to the A. suum mitochondrial reference genome using BLAST, the order of the scaffolds 552 was adjusted, and they were joined into a single scaffold. Finally, the gaps in the scaffold were 553 filled using GapFiller 107 using mitochondrial reads from the same individual to generate a 554 complete mitochondrial genome. Using the same method, we also de novo assembled another five 555 A. suum or A. lumbricoides mitochondrion genomes from previous studies (see Table S7). 556

Analysis of mitochondrial genomes 557
In order to assess overall evolutionary relationships across the complete mitochondrial genomes, 558 we aligned the genomes using Clustal W and phylogenetic trees constructed using RaxML under 559 the conditions of the general time reversible model (GTR) as described above for the whole 560 genome SNP alignment. Subsequent tree files were formatted in FigTree and MEGA v7. The 561 variation in nucleotide diversity across the mitochondrial genome was measured using sliding 562 window analyses, with a window of 300 bp and a step of 50bp, using DNAsp v6 108 . In order to 563 assess the validity of potential species groupings in the ML phylogenetic tree the Birky 41 X4 ratio 564 was applied to the alignment of the complete mitochondrial genomes including both samples from 565 Kenya and published mitochondrial reference genomes from Tanzania, Uganda, China, USA, 566 Denmark and the UK. The X4 ratio method of species delimitation compares the ratio of mean 567 pairwise differences between two distinct clades (K) and the mean pairwise differences within each 568 of the clades being compared (ϴ). It is considered that if K/ϴ>4 this is indicative of the two clades 569 representing two distinct species. Owing to the fact that two clades are being compared there will 570 be two separate values of ϴ, as per recommendations of Birky 41 , the larger ϴ value is used to 571 perform the final ratio calculation as this will provide a more conservative result which ultimately 572 will be less likely to provide a false positive result. 573 Due to the extensive use of mitochondrial genome data in population genetic analyses of Ascaris, 574 several analyses were performed to identify the effect of any population level processes that may be 575 affecting the diversity of the parasites within Kenya. Initially, diversity indices were calculated for 576 each of the genes within the mitochondrial genome across the entire Kenyan data set as well as 577 considering the mitochondrial genome as a whole. In order to account for the diversity within the 578 genic regions, we removed non-coding and tRNA sequences for these analyses.  (Table S8). All described analyses were performed using DNAsp6 108 . As both CO1 and ND4 have 588 been used in the past for epidemiological studies, single gene phylogenies were also constructed as 589 described previously for comparison against the whole mitochondrial genome phylogeny ( Figure  590 S3 and supplementary methods). 591 Owing to the extensive use of the CO1 for epidemiological studies the gene was extracted from the 592 complete mitochondrial genomes of Kenya and compared to all other available Ascaris 593 lumbricoides and Ascaris suum CO1 sequences housed by NCBI representing populations from 594 across the globe. Haplotype network analyses was performed to produce the parsimonious network 595 using TCS as implemented through PopArt 109 . This provided a genealogical perspective of 596 population structure and allowed genetic connectivity between the Kenyan samples and other 597 geographical isolates to be assessed. 598 We would like to thank the school children, schoolteachers, and Bungoma administrators for their 601 support. We would like to extend special thanks to all the members of the study team: Bungoma 602

a) b)
Optical mapping, long sequencing reads, and three-dimension chromosomes organization 26 data (Hi-C) will be necessary to further improve the assembly. 27 28

Genome annotation and Ascaris proteome 29
In addition to transferring A. suum annotations to the reference-based germline A. 30 lumbricoides genome, we also transferred the A. suum gene models to the de novo and semi-31 de novo A. lumbricoides genome versions. While ~94.6% of the genes can be transferred to 32 both genomes, over 20% of the transferred genes are only partial matches and are fragmented 33 supporting the view that the de novo and semi de novo A. lumbricoides assemblies are highly 34 fragmented. Although the updated Ascaris suum proteome was reported in our previous study 35 Table S3. Description of worm from which each sample was sequenced. The sex of the 126 worm (based on morphological identification) and the part of the worm (germline vs somatic) 127 is listed. Some hosts donated multiple worms. 128