Candidates for Balancing Selection in Leishmania donovani Complex Parasites

Abstract The Leishmania donovani species complex is the causative agent of visceral leishmaniasis, which cause 20–40,000 fatalities a year. Here, we conduct a screen for balancing selection in this species complex. We used 384 publicly available L. donovani and L. infantum genomes, and sequence 93 isolates of L. infantum from Brazil to describe the global diversity of this species complex. We identify five genetically distinct populations that are sufficiently represented by genomic data to search for signatures of selection. We find that signals of balancing selection are generally not shared between populations, consistent with transient adaptive events, rather than long-term balancing selection. We then apply multiple diversity metrics to identify candidate genes with robust signatures of balancing selection, identifying a curated set of 24 genes with robust signatures. These include zeta toxin, nodulin-like, and flagellum attachment proteins. This study highlights the extent of genetic divergence between L. donovani complex parasites and provides genes for further study.


Introduction
Intracellular Leishmania parasites cause the neglected infectious disease leishmaniasis in over 80 countries. Visceral leishmaniasis (VL) is the most severe form of the disease, caused by Leishmania donovani and Leishmania infantum. Annual cases of VL are estimated at a minimum of 50,000, with a fatality of !95% if untreated, and occur primarily in the Indian subcontinent (ISC), Bangladesh, Sudan, South Sudan, Ethiopia, and Brazil (World Health Organisation 2020). After transmission by sand flies, Leishmania promastigotes are taken up by macrophages and develop into amastigotes which proliferate. These processes require specific adaptations to different environments, such as evasion and active modulation of mammalian host or sand fly vector immune responses (Atayde et al. 2016;Dong et al. 2019). Leishmania species contain genomes that are primarily diploid and sexually recombining. Amongst their unusual features are the use of constitutively transcribed polycistronic genes and supernumerary chromosomes with unstable ploidy (Dumetz et al. 2017).
Balancing selection (BS) has been studied in Plasmodium parasites extensively. In this genus, proteins that interact directly with host cells maintain high genetic diversity (Mobegi et al. 2014;Ochola-Oyier et al. 2019;Hocking et al. 2020), as do proteins that are exported to the surface of erythrocytes (Jeffares et al. 2007). BS signals are also enriched in solventexposed regions of proteins consistent with selection for increased diversity via a rare allele advantage (Guy et al. 2018). Given the competitive interaction between Leishmania cells and host immune cells (Atayde et al. 2016;Dong et al. 2019), BS may also operate in this parasite if rare (parasite) alleles provide an advantage to host-parasite interactions. Other mechanisms of BS, such as heterozygote advantage (overdominance) or alleles that confer fitness differentially in the sand fly vector and the mammal host are also possible. These processes are expected to generate similar genetic signals (Charlesworth 2006). In all these scenarios, genomic signatures of BS can highlight genes that are important for transmission, host immune evasion, or ecological adaptation.
Thus far, there have been no published studies of BS in Leishmania species. Here, we use genome data from 477 clinical isolates from the L. donovani species complex (L. infantum or L. donovani) from East Africa, the ISC, and Brazil to identify five populations that are well-represented by genome data. Using a variety of metrics, we search for signals of BS within these populations. We identify multiple strong signatures of BS. Signatures are generally unique to a single population consistent with adaptive divergence between populations.
from all isolates to the L. donovani BPK282A1 reference genome, and applied variant calling methods and filtering to identify single-nucleotide polymorphisms (SNPs) and insertion/deletion polymorphisms (indels). In this data set of 477 isolates, we identify 339,367 SNPs and 14,383 indels.
We used the ADMIXTURE clustering tool (Alexander and Lange 2011) to assign isolates to populations. This analysis indicated that this collection can be clustered into between 8 and 11 populations ( fig. 1; supplementary figs. 1 and 2, Supplementary Material online). The majority of these isolates could be assigned with !99% confidence to one of five relatively well-sampled populations ( fig. 1). Principal component and phylogenetic analysis showed consistent results. These five populations included two from the Indian subcontinent (ISC1, ISC2), two from East Africa (EA2 from North Ethiopia/ Sudan, and EA1 which corresponds to a population from South Ethiopia/Kenya; Gelanew 2010) and a Brazil-Mediterranean population (BM; also contains isolates from Honduras and Panama but we refer to this population as BM from here onwards for brevity). The remaining isolates were assigned to populations of <6 isolates (n ¼ 44).
Our results are generally consistent with previous analysis (Franssen et al. 2020), indicating that these five populations have largely independent ancestries. Fixation index (F ST ) values range from 0.27 to 0.90 (supplementary table 2, Supplementary Material online). As has been observed previously (Gelanew 2010), the two East African populations and the older ISC population (ISC2) are approximately equidistant from one another, with F ST in the range of $0.3. Larger F ST values appear to be due to genetic divergence from the newly emerged ISC1 and BM populations. Only 7% of polymorphic sites are shared between two or more populations. We note that rare hybrids have been shown to occur between L. donovani complex populations in both East Africa and Turkey (Rogers et al. 2014;Cotton et al. 2020). We do not include L. donovani complex hybrids from Turkey (Rogers et al. 2014) in our analysis, because hybrid populations may contain balanced alleles from the parental populations that give an appearance of BS. This, and the under-sampling of VLendemic regions between Europe and India, render our data unsuited to studying the true extent of global gene flow in this species, so we do not analyze this further here.
Phylogenetic analysis provides some qualitative insight to the history of these species ( fig. 1C and D). The long-branched positions of EA1 and EA2 support the relative age of these populations in East Africa, as does the high genetic diversity in this region and genetic distance between these populations, consistent with previous studies (Gelanew et al. 2010(Gelanew et al. , 2014Ferreira et al. 2012;Teixeira et al. 2017;Zackay et al. 2018;Cotton et al. 2020;Franssen et al. 2020). The high nucleotide diversity of EA1 (table 1) is reflected in the branch lengths in this clade of the phylogeny. In contrast, the smaller ISC population, identified as ISC1 here (equivalent to the ISC5 group identified by Imamura et al. [2016]), produces short terminal . Isolates in gray were not confidently assigned to one of the five major populations (BM, EA1, EA2, ISC1, and ISC2) by ADMIXTURE. (C) Unrooted ML phylogeny, based upon an SNP alignment of 477 sequences with 283,378 variable sites. All visible branches are maximally supported (100% mlBP) unless indicated. The scale bar represents the number of nucleotide changes per site. Country names in gray indicate origins of isolates that were not confidently assigned to one of the five major populations. (D) BM ML tree of L. infantum strains, based upon an SNP alignment of 158 sequences with 81,018 variable sites, midpoint rooted. Ninety-three of these isolates were sequenced in the current work. Scale bar and support are as in (C). A version of this tree with all isolate origins is available in supplementary figure 3, Supplementary Material online. A single sample isolated in China (Franssen et al. 2020), is the only demographic exception in the BM sample collection, and is indicative of movement of parasites. Data and tree files are available in supplementary documents, Supplementary Material online. (E) Locations of samples used in this study. Pie charts show the number of samples from each location that are confidently assigned to one of the five major populations, with a radius proportional to the number of samples from each location. Gray indicates isolates that were not confidently assigned to one of the five major populations. branches in the phylogeny and lower genetic diversity (table 1  and fig. 2), consistent with previous genomic analyses indicating that is an emergent population (Imamura et al. 2016). Epidemiological evidence indicates that this population arose in the 1970s after the malaria elimination program (Dye and Wolpert 1988;Bhattacharya et al. 2006;Thakur 2007;Muniaraj 2014;Dhiman and Yadav 2016). The 93 Brazilian isolates we examined, which are mostly from Piau ı state in north west Brazil, cluster within isolates originating from Mediterranean countries ( fig. 1D), consistent with a relatively recent European introduction of L. infantum into Brazil (<400 years ago; Kuhls et al. 2011). Short branches in the Brazilian clade ( fig. 1C and D), low-genetic diversity and an abundance of rare alleles ( fig. 2) are all consistent with a previously occurring population bottleneck and an expanding population (i.e., a founder effect caused by the transportation of L. infantum to Brazil). In contrast, both East African populations and the older population from the ISC2 show higher genetic diversity, and are likely to have been maintained as larger populations for longer periods of time.
As well as the very low diversity, the BM and smaller East African (EA2) populations contain more indel polymorphisms (table 1), with an SNP:indel ratio of 6:1 and 5:1, respectively, compared with EA1 (11:1), ISC1 (10:1), and ISC2 (9:1), as expected for eukaryote polymorphisms, where SNPs typically outnumber indels (Mullaney et al. 2010;Jeffares et al. 2015). Extensive variant-calling quality control showed that this excess of indels is unlikely to be artefactual (supplementary fig.  4, Supplementary Material online). It is possible that this is due to the accumulation of weakly deleterious indel alleles when the Brazilian population was established from European

Analysis of L. donovani Species Complex Populations
Genetic diversity summary statistics vary considerably between populations ( fig. 2 and table 1), consistent with these varied demographic histories in different locations. For example, the initial population within the ISC (ISC1) has a normally distributed Tajimas's D, whereas the Tajimas's D is strongly skewed to negative values in the emerging ISC2 population. To select groups of strains that will approximate panmictic populations, we used the ADMIXTURE analysis to identify isolates that were confidently assigned to one population (n ¼ 433), rather than being interpopulation hybrids. This selection resulted in 59 isolates from East Africa (two populations of 41 and 18), 226 from the ISC (two populations of 211 and 15), and 127 from BM, of which 93 were newly sequenced here (table 1). Our assignment of isolates into five major populations largely agrees with the previous global analysis of Franssen et al (2020). To characterize BS in these populations we applied the NCD2 test (Bitarello et al. 2018) and Betascan1* test (Siewert and Voight 2017) in 10 kb windows to each of the five populations. Genomic windows that were outliers for the Betscan* test, were enriched for low NCD2 scores and high Tajima's D values, indicating that these three metrics were largely complementary (supplementary fig. 12, Supplementary Material online).

Are Targets of BS Shared Between Populations?
In some circumstances BS can be maintained as populations or species diverge (Siewert and Voight 2017;Bitarello et al. 2018;M erot et al. 2020;Wang et al. 2020;Ding et al. 2021).
Given these examples, we examined whether any genes had maintained BS between populations of the L. donovani species complex, indicating long-term BS. To assess this without relying on shared polymorphisms, we used the Betascan1* maximum and NCD2 minimum scores for each gene, for each population as a summary statistic (see Materials and Methods). We find scant evidence for shared BS from Betascan1* scores. We define Betascan1* outlier genes as those in the upper 5% of Betascan1* scores for their population. There was little overlap in these outliers; 701 genes are outliers in at least one population, only 42 of these (6%) were outliers in two or more populations (supplementary table 3, Supplementary Material online), and only 9 are outliers in three or more populations (1%). The NCD2 metric identified more overlap between populations, but long-term BS is still the exception; 1,627 genes were 5% outliers in at least one population and only 195 (11%) were outliers in more than one. However, because the NCD2 metric measures the similarity of allele frequencies to a target frequency (0.5 in our case; Bitarello et al. 2018), genes that are merely subject to weaker purifying selection will have elevated NCD2 scores. Another possibility is that weak polygenic BS operates on a number of genes, perhaps transiently. This may be the case for frequency-dependent BS, for example, in exported and cell surface-located erythrocyte membranes and exported proteins in Plasmodium falciparum (Volkman et al. 2002;Jeffares et al. 2007;Claessens et al. 2014). In this scenario, we might expect BS targets in one population to predict genes with higher metrics in other populations, due to a history of weak BS. For example, if multiple genes are weakly influenced by BS, they may have elevated signals as a group, even if any one particular gene is not significant alone. We examined this using only Betascan1*, because we suspect that the measure of correlated allele frequency that Betascan utilizes will be less confounded by weak purifying selection. To examine this, we examined whether the 10 kb regions with the 5% highest Betascan1* scores in the East African EA1 population were enriched for high Betascan* scores (compared with the remainder of the genome). We found that outliers from EA1 do not show significantly elevated Betascan1* scores in any other population. We repeated this analysis for the 5% highest outliers from ISC1 and from BM (again comparing to each other population; supplementary  populations, where ISC1 outliers predict higher scores in ISC2 (P ¼ 2 Â 10 À4 ; supplementary fig. 18, Supplementary Material online). Since the ISC1 population has been derived from ISC2 population relatively recently (Imamura et al. 2016), we can expect some aspects of the genetic diversity to be maintained. In summary, signals of BS are generally not shared between populations of L. donovani.

Identifying Genes That are Subject to BS
To advance research in Leishmania it would be useful to identify the most likely targets of BS. To achieve the most comprehensive detection of BS signatures, we performed both Betascan* and NCD2 tests with 1, 5, and 10 kb windows. We observed complete overlap between tests using 1 and 10 kb windows, with four additional ORF-containing regions identified with 5 kb windows (supplementary table 4 and fig. 16, Supplementary Material online). As a pragmatic approach, we sought to identify genes with robust and strong signatures from multiple metrics. To achieve this, we selected 10 kb genomic windows that were in the first or 99th percentile of either the NCD2 test or Betascan tests, respectively. To identify the genes within NCD2/Betascan windows that are likely targets, we calculated nucleotide diversity (p) and Tajima's D (Tajima 1989) for each gene, and selected genes in the 90th percentile of either statistic as well-supported plausible targets. We then selected genes that were outliers in both categories (NCD2 or Betascan* and p or Tajima's D). This intersection identified 38 genes (supplementary table 4, Supplementary Material online). We manually vetted these to remove "hitchhikers," genes whose high diversity was likely due to their proximity to a BS "driver" gene, resulting in 24 vetted candidates. We also removed genes with suspicious read coverage, because gene duplications produce strong artefactual signals of BS (supplementary fig. 13, Supplementary Material online). Due to the stringent process of filtering, this method is not guaranteed to have equal power to detect BS in the five populations we examine, which are represented by different numbers of isolates, have different levels of nucleotide diversity and different allele frequency distributions ( fig. 2 and table 1). This screen identified 24 candidate genes (table 2;  justification for vetting genes in supplementary table 4, Supplementary Material online). Candidate genes in the EA1 population, where the most were discovered, have nucleotide diversity that is 34-fold higher than the genome-wide median ( fig. 3). Diversity is elevated in genome regions surrounding these target genes and remains significantly elevated up to 250 kb from the targets. Because the mean size for chromosomes in L. donovani is 900 kb, this increase in diversity influences a large proportion of the genome. Furthermore, BS candidates are enriched for high minor allele frequency (MAF) cosegregating sites and show higher levels of statistical linkage than genome-wide distributions, consistent with expectations for genes that are subject to BS (Charlesworth 2006;supplementary fig. 14, Supplementary Material online).
Consistent with the lack of evidence for shared BS between populations, the genes that are BS candidates in the EA1 population do not show significantly elevated Tajma's D values in any other population (supplementary fig. 17, Supplementary Material online).
To identify these candidates NCD2 and Betscan* tests were applied in 10 kb genomic windows for each population. Candidate genes that were both 1) outliers for at least one test (>99.5 percentile) and 2) outliers (95 percentile) for either Tajima's D and/or nucleotide diversity (p, calculated within the gene start-end window). Details of the method are described in supplementary fig. 6 and text 2, Supplementary Material online. All comparisons of tests for each population are available in supplementary figures 7-11, Supplementary Material online, respectively. Where protein function is "unknown" on TriTrypDB, we subjected each protein to BLASTp searches to obtain homology to other known proteins and ascertain conservation across trypanosomes.
Candidate Genes for BS in the L. donovani Complex Our manual vetting of BS candidates retained 24 genes (table 2), of which 20 were discovered in the EA1 population. We did not discover any reliable candidates in the two populations that appear to be expanding following a bottleneck (BM and ISC1). Figure 4 illustrates the variety of robust genetic signatures that implicate four of these genes. All vetted genes contain similarly robust signatures (supplementary fig.  16, Supplementary Material online).
Several genes caught our attention as interesting targets of selection. LdBPK_291600.1 encodes a transmembrane protein containing a nodulin-like domain. Such proteins have been implicated in membrane transport and iron homeostasis (Laranjeira-Silva et al. 2018) in Leishmania. The zeta toxin domain protein (LdBPK_341740.1; fig. 4) is indicated as a BS target by both Betascan* and NCD2 metrics in the East African population EA2. The gene also has high nucleotide diversity in EA1 (supplementary table 5, Supplementary Material online). Because the phylogeny of this zeta toxin gene does not separate EA1 and EA2 isolates, as we would expect from the genome-scale divergence in figure 1, this gene may be subject to BS in both species, or may be a recent instance of between-population introgression. The zeta domain is positioned at 744-861aa, with two nonsynonymous variants resulting in the changes Leu747Thr and Ser752Phe, respectively, from the reference genome. The zeta toxin is part of the Type-II toxin-antitoxin (TA) module identified in prokaryotes, with homologues only recently discovered in Leishmania (Srivastava et al. 2019). The toxin component of the TA module acts against cellular processes such as translation and is neutralized by the antitoxin component in

Discussion
Here, we sequenced 93 strains of L. infantum from Brazil, contributing to a worldwide collection of L. donovani complex isolates along with previous analyses (Imamura et al. 2016;Carnielli et al. 2018;Zackay et al. 2018;Franssen et al. 2020). Our analysis of this population is consistent with these previous studies, showing that the L. infantum population in Brazil contains very little genetic diversity (Carvalho et al. 2020;Schwabl et al. 2021). A consistent observation in analysis of this species complex is that populations from East Africa, India, and Brazil are substantially genetically differentiated, a result that we reiterate here ( fig. 1). In this study of BS, we also show that signals of selection largely differ between populations.
Relatively few studies have attempted BS screenings in parasites (reviewed by Weedall and Conway 2010). Our study uses the Betascan* (Siewert and Voight 2017) and NCD2 (Bitarello et al. 2018) metrics that have been developed recently and have been shown to outperform classic metrics under models of BS, such as Tajima's D (Tajima 1989). These analytic tools, and the use of multiple populations should produce an analysis at least as sensitive as previous screens for BS searches within parasites such as Plasmodium, producing up to 25 candidate genes (Tetteh et al. The 24 candidate genes uncovered here possess varying functions within the L. donovani species complex. The flagellum attachment gene FLAM3 is a striking candidate considering the importance of the protein in parasite cellular structure, proliferation, and differentiation (Hayes et al. 2014;Sunter et al. 2019;Halliday et al. 2020). Furthermore, candidate genes LdBPK_311120.1 and LdBPK_361900.1, which encode a member of the emp24/gp25L/p24/GOLD family and a ras-like GTPase, respectively, may influence trafficking of virulence factors, and subsequently interaction with their host. ras-like GTPases may contribute to attenuation of VL via the TOR pathway in L. donovani (Zhang et al. 2014). Given the lack of detailed studies of the majority of these candidates in the L. donovani species complex, studies of the cell biology of these proteins will be useful next steps.
Our analysis suggests that the performance of Betascan* and NCD2 BS tests are dependent on the changes of population demography within this species complex. This is partly due to our pragmatic screening criteria that required >5 variants per genomic window (Betascan*, NCD2) and >5 variants per gene (Tajima's D and p metrics). The low-diversity Brazil and ISC1 populations contain far fewer regions that satisfy these criteria. However, strong population bottlenecks would enhance the loss of polymorphic sites by drift, ablating strong BS signatures. For example, 99% of 1 kb genomic windows in our BM population contain <5 segregating sites, which will reduce the scale of Betascan* and NCD2 metrics. This could result in a loss of power to detect long-term BS. Only two populations (EA1 and ISC2) appear to have relatively stable population sizes and sufficient nucleotide diversity to identify BS candidates using our pragmatic methods (table 1). More complex methods that employ population models to detect BS are available (DeGiorgio et al. 2014;DeGiorgio 2019, 2020). We chose not to employ these because we do not believe that L. donovani complex populations are sufficiently understood to be meaningfully modeled at this stage. For example, there is no accurate estimate of the recombination rates or mutation rates in these populations, nor estimates of divergence between L. donovani and L. infantum in terms of generations since it is not clear how many generations per year Leishmania species undergo in natural conditions. Our approach was to divide the samples into multiple populations and exclude potential between-population hybrids, with the expectation that this would alleviate some of the issues resulting from population structure.
We conclude that BS targets are generally not conserved between populations of the L. donovani complex. rather than a lack of power. The differentiation of BS targets is most likely due to differentiated processes (or history) of selection, rather than a lack of power to detect targets in some populations. Several observations led us to this view. Tajima's D is not at all correlated between populations (data not shown), but this might be expected if most genes are dominated by drift. The high F ST between populations, and relatively few shared polymorphisms are consistent with this. The initial 24 candidate BS genes that were discovered in the EA1 population, did not have statistically higher Tajima's D (as a group) in any other population (supplementary fig. 17, Supplementary Material online), so this set is not enriched for BS elsewhere. We also examined whether Betascan* outlier genes in EA1, ISC1, or BM were enriched in any other population (supplementary fig. 18, Supplementary Material online). In general, outliers in one population were not enriched for Betascan* scores in any other population, the only exception was ISC1 outliers were enriched in ISC2. Given the relatively recent derivation of ISC1 from ISC2 in the 1970s (Imamura et al. 2016), some maintenance of diversity is to be expected. It is possible that other evolutionary changes caused some of the signals we observe, including introgression events, partial selective sweeps or transient heterozygosity excess that can occur as a consequence of adaptation (Sellis et al. 2011). Local adaptation can also lead to an appearance of within-population diversity and/or excess heterozygosity (Wood et al. 2008;Eizaguirre and Lenz 2010;Ellison et al. 2011;Keller et al. 2011), particularly when the distribution of local "niches" is not well understood, which is generally the case with Leishmania. It is possible, for example, that multiple variants exist within these genes as adaptations to regional differences in host or sandfly cellular/extracellular environments. Parasite genes may vary with regional variations in HLA loci that affect susceptibility to VL (Blackwell et al. 2020). In any case, the candidate genes we identify warrant further study.
It is possible that BS of single-copy genes is not the most important mechanism that maintains diversity within the L. donovani complex, or protozoan parasites generally. The effects of multicopy gene families encoding RIFIN, STEVOR, and PfEMP1 variant surface antigens in pathogenesis of Plasmodium parasites are well-described (e.g., Claessens et al. 2014;Wahlgren et al. 2017). These genes are typically removed from BS screens, because multicopy genes will produce artefactual signals of BS when analyzed with current bioinformatics methods (supplementary fig. 15, Supplementary Material online). Although variants in duplicated regions or multiple copies of genes may allow the parasite to maintain diversity, it is an open question whether this diversity is maintained by neutral processes or BS.
At present we regard our candidate genes as "likely suspects" for BS, rather than experimentally proved examples. FIG. 3.-Diversity is significantly elevated in BS target regions. On the left we show the distribution of nucleotide diversity (p) genome-wide for the EA1 population (GW) and the distribution for the 500 kb around all the 20 vetted BS targets discovered in the EA1 population. On the right, the filled circles show the median p (for all BS targets) every 10 kb up and downstream to 500 kb from the targets. Circles are red where the diversity at this distance is significantly higher than the genome-wide distribution and black otherwise (Wilcoxon signed rank tests <1.5 Â 10 À4 , using both up-and downstream p values). The distribution of nucleotide diversity values for target genes is shown using box and whisker plots at 50 kb intervals. There are various biological scenarios that could produce signals of BS. Perhaps the simplest is frequency-dependent rare allele advantage or overdominance within human/mammalian hosts. In this case, experimental support for these targets would require demonstration that host cell populations produced different responses to different alleles of the proteins. Another possibility is that overdominance is caused by alleles whose fitness differs in sandfly and human hosts. Technically, this is more challenging to test, but could be achieved by tracking genotype frequencies of segregating F 2 populations within laboratory passages between sandfly and mammalian models. Finally, signals of BS can be caused by fine-scale clines of alleles with differential fitness across different environments (Westram et al. 2021). In our case, these could be sand fly or human host genotypes. Evidence for this scenario would require fine-scale localized genetic data.

OTHERS
In summary, our description of diversity in the L. donovani species complex provides insight into the global populations of this parasite. We show that these populations are genetically divergent, with independent signals of BS. Our discovery of a handful of genes with robust signatures of BS provides candidate genes for the study of host-parasite and host-vector interactions.

Ethics
Samples from Brazil were obtained as part of a broad study for genomic studies in the Laboratory of Leishmaniasis at the Institute of Tropical Medicine Natan Portella, approved by the Research Ethics Committee of the Federal University of Piau ı (approval ID number 0116/2005). All methods were performed according to the approved guidelines and regulations. A written informed consent was obtained from all study participants or their legal guardians.

Strain Culture and Genome Sequencing
Bone marrow aspirates were obtained from the routine diagnosis of patients admitted to the Natan Portella Tropical Diseases Institute in Teresina-PI, Brazil. Aspirates were inoculated into a mixed culture medium NNN (Neal, Novy, Nicolle) containing 2 ml of Schneider's medium supplemented with 10% fetal bovine serum, 2% urine and penicillin 10,000 U/ ml, and streptomycin 10 mg/ml. The positive isolates in mixed media were expanded in Schneider's liquid medium under the same conditions mentioned above. Extraction of DNA from the parasites was performed after washing to remove culture medium, using Qiagen Blood and Tissue kit was used according to the manufacturer's recommendation.
Genome sequencing was performed on Illumina HiSeq 2500 machines (or similar) to produce paired end 150 nt reads. The majority (95%) of the samples were sequenced to provide mapped read coverage of !30Â (mean 97Â,  (Li and Durbin 2009), converted to bam, sorted, indexed, and duplicates removed with SAMtools v.1.9 ).
With this variant filtering we observed a correlation between MAF and read coverage at SNP sites (supplementary fig. 13, Supplementary Material online). Modeling showed that duplications resulted in a systematic bias against calling rare alleles. We therefore removed any SNP/indel sites where the mean variant coverage within the ADMIXTURE-defined population was !1.5Â larger than the median coverage (corresponding to triploid sites in a generally diploid chromosome), or !1.25Â larger than the median coverage for chromosome 31 (corresponding to tetraploid sites in a generally triploid chromosome). We also removed sites where coverage was highly variable, by excluding sites in the upper 5th percentile of the coverage standard deviation (SD). In each population this filtered $5-7% of sites. Mapping coverage was ascertained by SAMtools bedcov for each gene in the multipopulation Variant Call Format (VCF) file. After this filtering, the correlation between MAF and read coverage was either far less significant or removed completely. This filtering retained 10,377 out of a possible 10,778 sites in population ISC1; 9,781 out of a possible 10,227 sites in population ISC2; 40,127 out of a possible 41,957 sites in EA1; 11,757 out of a possible 12,365 sites in EA2, and 26,884 out of a possible 28,281 sites in BM.
To validate the variant filtering we produced a de novo assembly of the MHOM/BR/06/MA01A L. infantum isolate from Brazil (Carnielli et al. 2018), mapped Illumina reads from the same isolate to the assembly, and called SNPs and indels as above. All calls should be heterozygous sites, or errors. Initial variant calling identified 4 SNPs and 23 indels, after filtering no SNPs or indels remained, consistent with a very low false positive call per strain. The MHOM/BR/06/ MA01A de novo assembly will be described elsewhere. Briefly, the assembly was produced using Oxford Nanopore Technology (ONT) reads to 110Â coverage, assembled with Canu v.1.9 (Koren et al. 2017), polished once using ONT reads using Nanopolish v.0.9.2 (Loman et al. 2015), and thrice with Illumina reads using Pilon v.1.22 (Walker et al. 2014).
Prior to BS tests performed on the five populations (EA1, EA2, ISC1, ISC2, BM), mixed ancestry strains were removed from population VCFs. Population-specific VCFs were filtered with VCFtools v.0.1.15 (Danecek et al. 2011) to remove sites that were fixed within a population (option -mac 1). Repeat regions (see below) were also filtered out of VCFs at this stage. Tajima's D, p, and MAF were calculated on unpruned variants using VCFtools. Tests for BS used biallelic SNPs and indels from each population. Copy-number variant and duplicated genome regions were removed from this analysis, as these regions will produce biases in allele frequencies toward common alleles, producing artifactual signals of BS (supplementary text 1 and fig. 5, Supplementary Material online). Variant calling for multicopy regions was beyond the scope of this study.
Repeat regions were determined as follows. Intergenic coordinates in L. donovani were extracted from the annotation .gff, downloaded from TriTrypDB (version November 2019) with BEDtools v.2.27.1 (Quinlan and Hall 2010) complement with default parameters. Intergenic regions were then extracted from the genome using BEDtools getfasta. Repeat regions in L. donovani were identified by nBLASTing v.2.7.1 (Altschul et al. 1990) intergenic regions against the rest of the genome, disregarding redundant hits and those <200nt in length. Resulting coordinates were converted to bed format for filtering out of the VCF. This filtering removed 401 sites in the ISC1 population; 446 in ISC2; 1,830 in EA1; 608 in EA2, and 1,387 in BM.

BS Tests
The NCD2 test used Equation 1 provided by Bitarello et al. (2018), using windows of 1, 5, and 10 kb with step sizes of 0.5, 2.5, and 5 kb, respectively. Ten kilobase pairs of window sizes were used in this study as sizes of 1 and 5 kb largely returned windows without scores (but see Identifying Genes that are Subject to BS). A list of fixed differences between L. donovani populations (total 285 isolates) and the nonadmixed BM L. infantum population (127 isolates) was used in the analysis for NCD2: Fixed difference sites were encoded as MAF ¼ 0. Fixed differences were determined using bcftools isec called on VCFs of all L. donovani populations and L. infantum containing fixed variants, resulting in 76,284 fixed difference sites. Our target frequency of 0.5 (tf) and Equation 2 of Bitarello et al. (2018) were used to generate Ztf-IS scores, with the exception of using the SD for each number of informative sites (IS) rather than simulated SD. P values for each window were calculated by assigning a rank based on Ztf-IS score and dividing by the total number of scanned windows.
The Betascan1* test (Siewert and Voight 2017), using default parameters, using the file format generated from the variants using glactools (Renaud 2018; available at: https:// github.com/grenaud/glactools) We performed the test on each individual population.
Betascan1* and NCD2 scores were calculated in windows around each variant site. To obtain values for each gene, we used the maximum Betascan1* score for all variants within the gene and the minimum NCD2 score within each gene (because low NCD2 scores are indicators of BS). Following the recommendations of Siewert and Voight (2020), we only considered windows containing !5 variants.

Gene Ontology Analysis
Gene Ontology (GO) descriptions and gene details for the L. donovani BPK282A1 reference genome were downloaded from TriTrypDB. GO enrichment analysis was performed using the PANTHER service on tritrypdb.org. Proteins that were classed as "hypothetical" or of "unknown function" were BLASTed against the nonredundant protein sequences (nr) database of NCBI to obtain possible identity by shared homology, and to determine conservation across trypanosomes.

Supplementary Material
Supplementary data are available at Genome Biology and Evolution online.