Introduction

For many organisms, in particular, very small ones such as parasites, the analysis of genetic variation at different hierarchical levels is often the only way to investigate population parameters such as gene flow, size of reproductive units or breeding strategies (for example, Criscione et al., 2005). Interpreting the distribution of genetic variation in these terms is nevertheless often not a simple process. This is especially so, as the sampling design may generate artificial patterns that may lead to erroneous conclusions regarding the biology of the species concerned. A famous example is the Wahlund effect (Wahlund, 1928). When individuals belonging to two or more genetically differentiated populations are gathered into the same sample (considering that they belong to one unique population), this may artificially increase the observed homozygosity compared with the Hardy–Weinberg expectations (H–WE), even if the subpopulations themselves were in H–W equilibrium. This may thus lead to false biological interpretations, for example, the existence of self-fertilization in the studied species. The quality of the sampling design is therefore of crucial importance for biological inferences.

For parasitic organisms, including micropathogens, a rather frequent way of sampling from local populations is to collect and genotype only one randomly chosen parasite per host individual, even though each host harbors an infrapopulation. An infrapopulation refers to all individuals of a parasite species in or on an individual host at a particular time (Criscione et al., 2005). All infrapopulations at a given place and time constitute the component population (Bush et al., 1997). Within the component population, infrapopulations may be considered as demes or subpopulations. Indeed, most of the time, they constitute cohesive genetic units (in which genetic exchanges may potentially occur) that may persist, more or less structured (depending on the migration rate among infrapopulations at each generation and the rate of infrapopulation extinction), over time (Criscione and Blouin, 2005, 2006). Plasmodium falciparum oocysts present within each infected anopheline mosquito (Razakandrainibe et al., 2005) as well as, the different bacterial strains of the same species that coexist within a host are examples of infrapopulations. In this latter case, the infrapopulation evolves for many generations within an individual host in which it is regulated by the immune system, competition for resources and/or space. Migration events occur from other infrapopulations. Genetic diversity inside the infrapopulation may also evolve under the effect of repeated mutations and recombinations occurring between strains. Infrapopulations thus behave just as true subpopulations. Variations of life cycles and transmission patterns from one generation to another among parasite species determine the level and patterns of genetic structure observed among infrapopulations (Prugnolle et al., 2005a, 2005b; Criscione and Blouin, 2006; Criscione, 2008). A component population can therefore be considered and modelized as a classical subdivided population (the subpopulations being the infrapopulations).

The one individual per host (infrapopulation) sampling strategy has frequently been used, particularly, to investigate the biology and especially the reproductive mode of some parasitic fungi (Boerlin et al., 1996; Mzilahowa et al., 2007; Pujol et al., 1993; reviewed in Nebavi et al., 2006), some protozoa (MacLeod et al., 2000; de Freitas et al., 2006; Koffi et al., 2007; Kuhls et al., 2007), some helminths (Mulvey et al., 1991; Nejsum et al., 2005) or some bacteria (Maynard Smith et al., 1993). Using simulations, we here investigate the effect of such sampling design on the observed distribution of genetic variability within populations of clonal and partially clonal organisms. For the sake of simplicity, we have limited our investigations to clonal parasites with an acyclic life cycle (de Meeus et al., 2007). This life cycle is characterized by the fact that individuals in a population originate either from clonal or sexual reproduction. The frequency of sexually or asexually produced individuals can vary. This life cycle is found in a variety of parasites including some imperfect fungi (for example, Candida albicans), some Parabasalia (Trichomonas vaginalis), some Metamonadina (Giardia lamblia), some parasitic amoebas or some kinetoplastid parasites (Leishmania, Trypanozoma).

For the distribution of genetic variability, we have considered two aspects that are often used to make inferences on the nature of pathogen reproduction modes (Tibayrenc and Ayala, 2002; Balloux et al., 2003): (i) linkage disequilibrium between loci and (ii) departure from H–WE. We show that collecting and genotyping only one individual pathogen per host individual (or per subpopulation) and pooling them, considering that they belong to one unique subpopulation, may generate strongly misleading patterns of genetic variations that may lead to false conclusions regarding their mode of reproduction. In particular, we show that (i) the level of linkage disequilibrium is significantly reduced compared with the one observed within each subpopulation (or infrapopulation) and that (ii) the departure from the H–WE may be strongly modified, sometimes giving the false picture of an apparent strongly recombining species, although highly clonal.

Materials and methods

Simulations

We used an individual-based simulation approach as implemented in the software EASYPOP (version 2.0.1) (updated from Balloux (2001)) to generate all data sets. For every data set, we simulated 50 subpopulations (this term is used here as a synonym of demes or infrapopulations) comprising a fixed number of 50, 100 or 1000 individuals. Three migration rates m (0.001, 0.01 and 0.1) and five rates of asexual reproduction c (1, 0.999, 0.990, 0.800 and 0 (panmixia)) were simulated. A total of 20 loci with a mutation rate of 10−5 were used. Mutations had an equivalent probability of generating any of 99 possible allelic states. At the start of the simulation, genetic diversity was set to the maximum possible value. This means that at the first generation, all the 99 possible alleles were randomly assigned in all individuals of all demes. The simulations were then run for 10 000 generations. For each parameter set, five replicates were run. At the end of each simulation, only 20 subpopulations were randomly sampled within which only 20 individuals were kept.

Parameter estimates and sampling design

The multilocus linkage disequilibrium D was estimated using the program Multilocus 1.2 (Agapow and Burt, 2001). D is a variant of the association index IA (Maynard Smith et al., 1993), independent of the number of loci analyzed. It has a form similar to a correlation coefficient and has a maximum value of 1. The departure from the H–WE was estimated using the Weir and Cockerham's, (1984) f unbiased estimator for FIS with the program FSTAT V. 2.9.3 (Goudet 2007, updated from Goudet, 1995).

For each set of simulations, both parameters were estimated using two sampling designs: the ‘Subpopulation (S)’ design and the ‘One-Individual per Subpopulation (1I-S)’ design. For the first one, parameters were estimated from 10 randomly chosen subpopulations (and the 20 individuals present in each of the 10 subpopulations). For the second design, they were estimated from an artificially created subpopulation. This artificial subpopulation included 20 individuals, each originating from a different subpopulation (1 individual taken randomly in each of the 20 simulated subpopulations). This sampling was repeated 10 times to get 10 estimates of D and f as obtained in the case of the ‘S’ design (cf. Figure 1).

Figure 1
figure 1

Diagram showing the two different sampling strategies (the ‘Subpopulation’ (S) and the ‘One-Individual per Subpopulation’ strategy (1I-S)) used to analyse the data. For the ‘S’ sampling strategy, linkage disequilibrium (D) and FIS are computed independently for 10 randomly chosen subpopulations out of the 20 sampled. For the ‘1I-S’ sampling strategy, linkage disequilibrium (D) and FIS are computed from an artificially created subpopulation. This subpopulation includes 20 individuals, each originating from a different simulated subpopulation (from the 20 subpopulations sampled). The creation of the artificial subpopulation is repeated 10 times to get 10 estimates of D and FIS as obtained in the case of the ‘S’ strategy.

Statistical analyses

The effect of sampling design (S and 1I-S designs) on LD and FIS was investigated using a Wilcoxon rank test for paired data using the R program (R Development-core-team, 2008). A linear model was then used to determine what explanatory variable affected the difference of LD or FIS between the two designs. The model analyzed was ΔD=D(S)D(1I-S) or ΔFIS=FIS(S)−FIS(1I-S)Nm+poly(c,2)+Nm:poly(c,2). Nm referred to the effective number of migrants, c to the clonal rate, poly(c,2) to a quadratic function because it gave a better fit than simple linear relationships (data not shown), and ‘:’ to the interaction between terms. We used a stepwise procedure to select the best model using the AIC (Akaike Information Criterion) (Akaike, 1974).

Application to real data set

We applied similar sampling designs and then computed the linkage disequilibrium and the FIS for the two different parasite organisms, C. albicans and P. falciparum. C. albicans is a diploid fungus that may be responsible for severe mucosal or systematic infections in immunocompromised persons (Beck-Sague and Jarvis, 1993; Boerlin et al., 1996; Hull et al., 2000; Berman and Sudbery, 2002; Bougnoux et al., 2004). P. falciparum is a protozoan parasite that is responsible for the deadliest form of malaria in humans. Both pathogens reproduce clonally, at least at one stage of their life cycle.

For C. albicans, data come from Nebavi et al. (2006). In that study, 55 C. albicans infrapopulations were obtained from 42 HIV-positive patients (some patients were collected several times) with oral candidiasis from Abidjan and the suburbs (see Nebavi et al., 2006 for further details). Each sample consisted of five isolates per patient, obtained by buccal swabbing. Each isolate was genotyped at 14 enzyme-coding loci.

For P. falciparum, the data come from Razakandrainibe et al (2005). In that study, Plasmodium oocysts were obtained from a total of 145 infected anopheline mosquitoes. From these, we kept only the oocyst infrapopulations that displayed at least five genotyped oocysts (the total number of oocyst infrapopulations analyzed=30).

For both parasites, linkage disequilibrium and FIS were obtained as described in the section ‘Parameter estimates and sampling design’.

Results

Linkage disequilibrium

Results are presented in Figure 2. Over all sets of simulations, the level of linkage disequilibrium (D) is significantly different between the two kinds of sampling designs (Wilcoxon signed rank test P-value <0.0001). We show, in particular, that it is significantly lower in the case of the 1I-S design than under the S design (Figure 1). The difference between the two estimates of LD (ΔD) is significantly dependent on the parameters of the simulations. The difference decreases as Nm increases (coefficient estimate=−0.016, P-value=0.0168), whereas it increases as does the clonal rate (coefficient estimates=0.48 and 0.27, P-value=0.0052). Interestingly though, the strongest differences are observed for clonal rates very close to 1 (0.99 or 0.999) but not 1.

Figure 2
figure 2

Variation of Agapow and Burt's, (2001) estimates of LD (D) measured within samples using the S (hatched bars) and the 1I-S (white bars) sampling design under different simulation scenarios. c represents the rate of clonal reproduction within subpopulations, N the subpopulation size and m the migration rate.

Departure from the H–WE

Results are presented in Figure 3. Over all simulations, the average FIS estimated within subpopulations using the S design is significantly lower than the FIS estimated using the 1I-S design (Wilcoxon signed rank test P-value <0.0001). The difference observed between estimates is explained by Nm (coefficient estimate=0.055, P-value <0.0001) but not by the clonal rates (coefficient estimate=−0.029 and −0.021, P-value=0.9886).

Figure 3
figure 3

Variation of Weir and Cockerham's, (1984) estimates of FIS (f) measured within subpopulations using the S (hatched bars) and the 1I-S sampling (white bars) design under different simulation scenarios. c represents the rate of clonal reproduction within subpopulations, N the subpopulation size and m the migration rate.

Application to C. albicans and P. falciparum data set

The same patterns were observed for C. albicans and P. falciparum real data sets. For both, we observed a decline in the amount of LD when sampling individuals using the 1I-S design (C. albicans: ‘S’ design D=0.83; ‘1I-S’ design: D=0.12; P. falciparum: ‘S’ design D=0.27; ‘1I-S’ design: D=0.14); and an increase in FIS (C. albicans: ‘S’ design f (95% confidence interval obtained by bootstrap over loci)=−0.85 (−0.90; −0.73); ‘1I-S’ design f (95% confidence interval obtained by bootstrap over loci)=0.07 (−0.11; 0.34); P. falciparum : ‘S’ design ‘f (95% confidence interval obtained by bootstrap over loci)=0.10 (0.049; 0.146); ‘1I-S’ design : f (95% confidence interval obtained by bootstrap over loci)=0.49 (0.41, 0.58). Note that for C. albicans, the confidence interval obtained by bootstrap over loci of the FIS comprised 0, which is characteristic of panmixia.

Discussion

Pathogens are frequently studied using a particular sampling strategy. Although each individual host may harbor an infrapopulation of the parasite, it is frequent that only one individual parasite per host is sampled (Pujol et al., 1993; Boerlin et al., 1996; Mzilahowa et al., 2007; reviewed in Nebavi et al., 2006). In this paper, we have investigated the effect of this sampling strategy on two aspects of genetic variability that are commonly analyzed to detect clonal reproduction: linkage disequilibrium and deviations from the H–WE (Tibayrenc and Ayala, 2002). For this study, we have specifically focused on diploid acyclic life cycle parasites.

Our results clearly show a strong effect of such design on the distribution of genetic variability at both levels. We show that sampling only one individual per subpopulation (or infrapopulation) (1I-S design) artificially decreases the level of linkage disequilibrium observed between loci and increases the FIS. These modifications of LD and FIS can be rather drastic as shown in Figures 2 and 3. For example, for the case where N=50, m=0.001 and c=0.99 (99% of individuals are clonally produced within each subpopulation), the average LD measured using the S design is about 10 times higher (D=0.68) than the one measured using the 1I-S design (D=0.086). Similarly, FIS estimates are strongly increased under the 1I-S design (f=0.85) compared with that under S design (f=−0.0756). This means, therefore, that depending on the sampling design, the analysis of genetic variability may give a very different picture of the factors that shaped population structure in the species under study. Thus, in the latter example, although under the S design all elements (LD and FIS) are congruent with a species being highly clonal (Balloux et al., 2003; De Meeus et al., 2006), the results obtained under the 1I-S design would be more consistent with a sexual species displaying high rates of self-fertilization.

One may obviously argue that LD and FIS are not the only elements that may allow one to conclude that a species (or a population) is highly clonal. Tibayrenc and Ayala (2002) argue that other genetic criteria may constitute good indices of clonal reproduction. The occurrence of identical multilocus genotypes within subpopulations is one of them. Unfortunately, this criterion is also strongly affected by the sampling design. For N=50, m=0.01 and c=0.8, we determined the number of clones per subpopulation and the number of identical multilocus genotypes per clone under the S and 1I-S designs using the program CLONALITY V 0.4 (Prugnolle et al., 2008). As expected, the average number of clones (multilocus genotypes or haplotypes presenting replicates) observed within subpopulations and the average number of replicates per clone was higher under the S design (average number of clones=2.9 and average number of replicates per clone=2.3) than under the 1I-S design (no clone was found).

Why does LD decrease (FIS increase) under the 1I-S design compared with the LD (FIS) observed under the S design? When pooling individuals coming from genetically differentiated subpopulations of sexual organisms, it is traditional to observe an increase in linkage disequilibrium (Nei and Li, 1973). In the case of clonal organisms, the effect is opposite to that shown here. One obvious explanation for this opposite trend lies in the diversity of haplotypes (multilocus genotypes) occurring within subpopulations. When this diversity is low (for example, when the clonal rate is high), the amount of LD is very high within each subpopulation under the S design as shown in Figure 2. Pooling individuals coming from several genetically differentiated subpopulations (the 1I-S design) artificially increases the diversity of haplotypes occurring in the sample, which in turn reduces the amount of LD observed between loci. This phenomenon is congruent with the fact that when the effective migration rate increases (and so does the diversity of haplotypes maintained within each subpopulation), the level of LD decreases under the S design (Figure 2). Regarding the departure from the H–WE, the increase of FIS occurs simple because of a classical Wahlund effect (Wahlund, 1928). A particular aspect of the Wahlund effect observed here for clonal organisms is that it may generate FIS values close to 0 or above 0, although within each subpopulation the FIS values are negative, indicative of an excess of heterozygotes compared with the H–WE.

In conclusion, we show that collecting and genotyping only one individual per host individual (per infrapopulation) in acyclic life cycle pathogens and pooling them to form one ‘artificial’ subpopulation may generate strongly misleading patterns of genetic variations that may lead to erroneous conclusions regarding their reproduction mode. We limited our investigations to clonal organisms. However, the same effect occurs in highly self-fertilizing sexual organisms, namely a reduction of the level of linkage disequilibrium between loci and an increase in FIS (data not shown) when applying the same kind of sampling design. Our study suggests therefore that, for clonal pathogens, sampling has to be done with lots of precautions, especially when local populations are constituted by several highly differentiated subpopulations (for example, infrapopulations). As shown here, a sampling design that does not take into account the smallest possible level of substructure may lead to erroneous interpretations sometimes giving the false picture of an organism highly recombining although highly clonal. Obviously, our study goes further beyond the clonal parasites alone and applies to all clonal organisms that could present highly structured populations. In particular, estimation of recombination from sequence data, often employed to study prokaryotic organisms (McVean et al., 2002), may also be concerned by this important caveat. For example, the intestinal habitat of mammals has been shown to offer multiple niche for parasites that colonize it (Haukisalmi and Henttonen, 1993) and also for bacterial species (Swidsinski et al., 2005). If different clonal lineages segregating at different intestine sections are sampled on a one isolate per host basis it is probable that the recombination will be strongly overestimated in the bacterial group under study.