VARIANCE ADDITIVITY OF GENETIC POPULATIONAL PARAMETER ESTIMATES OBTAINED THROUGH BOOTSTRAPPING

Studying the genetic structure of natural populations is important for the conservation and use of the genetic variability available in nature. This research concerns the analysis of genetic population structure using real and simulated molecular data. To obtain variance estimates of the pertinent parameters, the bootstrap resampling procedure was applied over different sampling units, namely: individuals within populations (I), populations (P), and individuals and populations simultaneously (I, P). The parameters considered were the total fixation index (F or FIT), the fixation index within populations (f or FIS) and the divergence among populations or intrapopulation coancestry (θ or FST). The aim of this research was to verify whether the variance estimates of F̂, f̂ and θ̂ found through resampling over individuals and populations simultaneously (I, P) correspond to the sum of the respective variance estimates obtained from separate resampling over individuals and over populations (I+P). This equivalence was verified in all cases, showing that the total variance estimate of F̂, f̂ and θ̂ can be obtained by summing the variances estimated for each source of variation separately. This result facilitates the use of the bootstrap method on data with hierarchical structure and opens the possibility of obtaining the relative contribution of each source of variation to the total variation of the estimated parameters.


INTRODUCTION
In studies of genetic population structure based on genetic markers, resampling methods are commonly used to estimate genetic population parameters and their respective standard deviations. Some authors have resampled over a single source of variation, such as Van Dongen (1995) and Vencovsky et al. (1997), who resampled only over individuals. Others resampled over several sources of variation, such as Petit & Pons (1998) and Carlini-Garcia et al. (2001). Petit & Pons (1998) applied the bootstrap method over individuals, over populations, and over individuals and populations concomitantly, to estimate population parameters and their variances based on these sources of variation. Their objective was to verify over which source of variation resampling should be applied to obtain estimates of the studied parameters. To do so, they compared the variances obtained from these sources of variation with variance estimates calculated from the explicit expressions of Pons & Petit (1995). The authors concluded that, to estimate the studied parameters and their variances, resampling should be carried out preferentially over populations. Carlini-Garcia et al. (2001) applied bootstrap resampling over populations, individuals within populations, populations and individuals simultaneously, and also over loci. They estimated some genetic population parameters and their standard deviations, and obtained the respective confidence intervals as well as the empirical distributions of the estimates. Among other aspects, they demonstrated the importance of resampling with each source of variation taken into account.
Scientia Agricola, v.60, n.1, p.97-103, Jan./Mar. 2003
The aim of this research was to verify whether, for hierarchically structured data, the total bootstrap variance of an estimate can be obtained by summing the variances obtained by resampling over each source of variation separately. This equivalence would be advantageous because it would give the relative contribution of the population and individual sources of variation to the total variation, and would make a joint resampling of individuals and populations unnecessary. The hierarchical structure considered involved populations and individuals within populations. The evaluated parameters were the total fixation index (F or FIT), the fixation index within populations (f or FIS) and the degree of divergence among populations or coancestry within populations (θ or FST) (Wright, 1951; 1965; Cockerham, 1969; 1973; Weir & Cockerham, 1984; Weir, 1996). Real and simulated data were considered.
Twenty-five simulated data sets were also used, each composed of 30 populations with 100 individuals each. Five loci with three alleles per locus were considered, and the initial allelic frequencies were 1/3 for each allele at all loci. In these simulations, populations in inbreeding equilibrium were considered, with inbreeding rates (s) in the interval 0 ≤ s ≤ 0.08 and the number of generations (g) varying from 100 to 500 (Table 1). Different numbers of generations were used to generate data sets with different degrees of divergence among populations.
The population genetic structure of each data set was studied by means of analysis of variance of gene frequencies (Cockerham, 1969; 1973; Weir & Cockerham, 1984; Weir, 1996). Thus, in each case, the estimate of the total variance of the allelic frequencies (σ̂²_T) was obtained, as well as its components: among populations (σ̂²_P), among individuals within populations (σ̂²_I), and among genes within individuals (σ̂²_G). From these, estimates of the total fixation index (F̂), the fixation index within populations (f̂), and the degree of divergence among populations or coancestry within populations (θ̂), and their respective variance estimates, were calculated.
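For reference, in the analysis-of-variance framework of Cockerham (1969; 1973) and Weir & Cockerham (1984), the three parameters are usually written as ratios of the variance components named above. The expressions below restate that standard form for a single allele, using P, I and G as subscripts for populations, individuals within populations and genes within individuals. They are a reconstruction under that assumption, not the article's own typeset equations, and multi-locus estimators typically combine numerators and denominators across alleles and loci before taking the ratio.

```latex
\hat{\sigma}^2_T = \hat{\sigma}^2_P + \hat{\sigma}^2_I + \hat{\sigma}^2_G, \qquad
\hat{F} = \frac{\hat{\sigma}^2_P + \hat{\sigma}^2_I}{\hat{\sigma}^2_T}, \qquad
\hat{\theta} = \frac{\hat{\sigma}^2_P}{\hat{\sigma}^2_T}, \qquad
\hat{f} = \frac{\hat{\sigma}^2_I}{\hat{\sigma}^2_I + \hat{\sigma}^2_G}
```

Under these definitions, (1 - F̂) = (1 - f̂)(1 - θ̂), and each estimate is undefined whenever its denominator is zero.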
To obtain these estimates, a random model was considered, meaning that, for each data set, it was assumed that the evaluated populations originated by genetic drift from a single reference population. Therefore, no selection was assumed at any of the loci considered, so that loci were taken as neutral. The hierarchical structure considered for the analysis of variance included the following sources of variation: populations (P), individuals within populations (I) and genes within individuals (G) (Weir, 1996).
The method of moments was employed to estimate the variance components mentioned above, as well as the other population parameters, according to Weir (1996).
The bootstrap resampling method (Efron & Tibshirani, 1993; Manly, 1997) was applied to obtain bootstrap estimates of F, f and θ and their respective variance estimates, considering as sources of variation individuals, populations, and individuals and populations simultaneously, in a way similar to that used by Petit & Pons (1998), with loci held fixed. At each resampling level, 100,000 bootstrap samples were obtained for the real data and 10,000 for the simulated data. The analysis of variance was carried out on each bootstrap sample, providing estimates of F, f and θ. The average of these estimates, per parameter, is the bootstrap estimate of the parameter, while their variance is the variance estimate of the bootstrap estimate of the parameter.
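The three resampling schemes described above can be sketched as follows. This is a minimal illustration, not the EG software used in the study: the function names (`resample`, `bootstrap`, `naive_theta`), the genotype coding and the toy data are all hypothetical, and `naive_theta` is a deliberately simple stand-in for the analysis-of-variance estimators of Weir (1996). The sketch also discards bootstrap samples whose ratio has a zero denominator.

```python
import random
from statistics import mean, pvariance, variance

def naive_theta(pops):
    """Stand-in statistic (NOT the Weir & Cockerham estimator).

    pops: list of populations; each individual is coded as its number of
    copies (0, 1 or 2) of a reference allele at one locus.
    Returns Var(p_i) / (p_bar * (1 - p_bar)), or None if the denominator is 0.
    """
    freqs = [sum(pop) / (2 * len(pop)) for pop in pops]
    p_bar = mean(freqs)
    denom = p_bar * (1 - p_bar)
    if denom == 0:
        return None
    return pvariance(freqs) / denom

def resample(pops, rng, over_pops, over_inds):
    """Draw one bootstrap sample, with replacement, at the requested level(s)."""
    if over_pops:
        pops = [rng.choice(pops) for _ in pops]
    if over_inds:
        pops = [[rng.choice(pop) for _ in pop] for pop in pops]
    return pops

def bootstrap(pops, statistic, over_pops, over_inds, n_boot=1000, seed=0):
    """Return (bootstrap mean, bootstrap variance, number of discarded samples)."""
    rng = random.Random(seed)
    estimates, discarded = [], 0
    for _ in range(n_boot):
        value = statistic(resample(pops, rng, over_pops, over_inds))
        if value is None:  # ratio undefined: discard this bootstrap sample
            discarded += 1
        else:
            estimates.append(value)
    return mean(estimates), variance(estimates), discarded

# Hypothetical toy data: 6 populations of 20 individuals each,
# given as counts of genotypes carrying 0, 1 and 2 copies of the allele.
counts = [(10, 8, 2), (6, 10, 4), (4, 10, 6), (2, 8, 10), (8, 8, 4), (3, 9, 8)]
toy = [[0] * a + [1] * b + [2] * c for a, b, c in counts]

_, var_i, _ = bootstrap(toy, naive_theta, over_pops=False, over_inds=True)
_, var_p, _ = bootstrap(toy, naive_theta, over_pops=True, over_inds=False)
_, var_ip, _ = bootstrap(toy, naive_theta, over_pops=True, over_inds=True)
print(f"I: {var_i:.3e}  P: {var_p:.3e}  I+P: {var_i + var_p:.3e}  (I, P): {var_ip:.3e}")
```

Comparing the printed I+P and (I, P) values is the comparison examined in this study. With a stand-in statistic and so few populations, close agreement should not be expected from this toy run.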
For each of the parameters F, f and θ, it was verified whether the variances are additive when the bootstrap approach is used. This property was investigated by verifying whether the sum of the variance estimates of the parameter estimates, obtained from the independent resampling of individuals and of populations, corresponds to the variance estimate of such parameter estimates obtained from the concomitant resampling of individuals and populations. In addition to the practical convenience of not having to carry out the simultaneous resampling of populations and individuals, the outlined procedure allows the relative contribution of the sources of variation to be investigated and indicates where the major deficiencies of field sampling occur.
Table 1 - Description of the 25 simulated data sets according to the selfing rate (s) and the number of generations (g). Values of g: 100, 200, 300, 400 and 500.
To verify this additivity, a simple linear regression model was adopted, i.e., Y = α + βX + ε, with Y being the sum of the variances obtained from the resampling of individuals and of populations separately, and X the respective variance estimated from the simultaneous resampling of these two factors. This was carried out for each parameter (F, f and θ), with the simulated and real data sets. Student's t tests were applied to verify whether the estimates of the regression coefficient (β) and of the intercept (α) differed from zero. Confidence intervals for β were constructed to verify whether the corresponding parameter differed from 1 (Sokal & Rohlf, 1995). If α does not differ from zero, β differs from zero and β does not differ from 1, the regression equation reduces to E(Y) = X in terms of mathematical expectation, which confirms the additivity of the variance estimates derived from the separate resampling of individuals and populations. The degree of deviation from regression was assessed through the coefficient of determination R², obtained for each regression analysis (Sokal & Rohlf, 1995).
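The regression check just described can be sketched with ordinary least squares. This is an illustration only: the function name `additivity_regression`, the returned keys and the example numbers are hypothetical, and the critical t value (two-sided, n - 2 degrees of freedom) is passed in by the caller, for instance from a table, rather than computed.

```python
from math import sqrt

def additivity_regression(x, y, t_crit):
    """Fit Y = alpha + beta * X and apply the three checks used for additivity.

    x: variances from simultaneous resampling (I, P)
    y: sums of variances from separate resampling (I+P)
    t_crit: two-sided critical t value for n - 2 degrees of freedom
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    beta = sxy / sxx
    alpha = my - beta * mx
    sse = sum((yi - alpha - beta * xi) ** 2 for xi, yi in zip(x, y))
    s2 = sse / (n - 2)                       # residual mean square
    se_beta = sqrt(s2 / sxx)
    se_alpha = sqrt(s2 * (1 / n + mx ** 2 / sxx))
    ci_beta = (beta - t_crit * se_beta, beta + t_crit * se_beta)
    t_alpha, t_beta = alpha / se_alpha, beta / se_beta
    additive = (abs(t_alpha) < t_crit        # alpha does not differ from 0
                and abs(t_beta) > t_crit     # beta differs from 0
                and ci_beta[0] <= 1 <= ci_beta[1])  # beta does not differ from 1
    return {"alpha": alpha, "beta": beta, "t_alpha": t_alpha, "t_beta": t_beta,
            "ci_beta": ci_beta, "r2": 1 - sse / syy, "additive": additive}

# Hypothetical variance pairs (arbitrary units); 3.182 is t(0.975, 3 df).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.1, 1.9, 3.05, 3.95, 5.0]
result = additivity_regression(x, y, t_crit=3.182)
print(result["beta"], result["ci_beta"], result["r2"], result["additive"])
```

The sketch assumes normally distributed residuals, which is why a normality check on the residuals matters before the t tests and the confidence interval are interpreted.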
The Shapiro-Wilk test was applied to verify whether the residuals of the regression analysis follow a normal distribution (Sokal & Rohlf, 1995). When this assumption was not fulfilled, an appropriate data transformation was sought to attain normality. This was necessary to guarantee the validity of the hypothesis tests, as well as of the confidence intervals calculated for the intercept and for the regression coefficient, which are based on normality.
As the two variables involved in the regression are random, a regression analysis appropriate for the random model could have been used (geometric mean regression; Sokal & Rohlf, 1995). Nevertheless, these authors mention the existence of controversies regarding the use of this methodology. Thus, the usual regression analysis was used here, as described in the previous paragraphs, in agreement with Neter et al. (1990).
All resampling and all calculations of the F, f and θ bootstrap estimates and of their bootstrap variance estimates were carried out using a version of the EG software (Coelho, 2000) specially developed for this purpose.

RESULTS AND DISCUSSION
For the real data sets, the observed values (total variance estimates, due to individuals and populations together: I, P) and the expected values of these variances (sum of the variance estimates due to individuals and to populations: I+P) were very close (Table 2). In the case of the data of Seoane et al. (2000), the bootstrap discards did not contribute significantly to the difference between the estimates, as they were very few: 0.011% and 0.014% for the resampling of individuals, and of individuals and populations simultaneously, respectively. However, these discards must have altered the precision of the estimates and of their variances, in comparison to those that would be obtained with no discards. These discards are due to the estimation method used, since the estimates were obtained as variance ratios and, in certain combinations, these ratios may have zero values in the denominators. In these cases, the software automatically discarded the bootstrap sample. This procedure was applied at all resampling levels.
The Shapiro-Wilk normality test was non-significant (P ≥ 0.05) in all analyses when the real data sets were considered. This, however, was not observed with the simulated data in the case of F and θ. Nevertheless, the regression residuals presented a normal distribution when the logarithmic transformation was applied to the simulated data sets for these two parameters.
In all situations, the linear regression model fitted the data well. For all real data sets, the estimates of β were significant, and the intercept estimates (α̂) did not differ from zero. Furthermore, in all regressions, the hypothesis β = 1 was not rejected, since all confidence intervals obtained for β included the value 1. Deviations from regression were small, since all R² values were greater than 99% (Table 3, Figure 1 i to iii).
Results obtained for the three parameters indicated that the corresponding variance estimates, taken from the individual and population resamplings, can be summed to obtain the total variance due to these two sources of variation jointly, confirming the additivity of the variances. Therefore, the regression model reduces to Y = X + ε. The same behavior was observed when the simulated data sets were analyzed. In this case, the observed and expected values of the total variance estimates of F̂, f̂ and θ̂ were even more similar (Table 4). Such an outcome is probably due to the large number of populations and individuals used in each data set. No bootstrap discards took place.
Results of the simulated data confirmed those obtained with the real data. In all cases the null hypotheses H0: α = 0 and H0: β = 1 were not rejected. All R² values were greater than 99%, so that deviations from regression were small (Table 5, Figure 1 iv to vi). This additivity is advantageous, as it is much simpler to work with additive models. Another practical advantage is that simultaneous resampling of individuals and populations is not needed to obtain variance estimates due to these two levels jointly. Summing the bootstrap variance estimates of the different sources of variation is an adequate procedure for obtaining the total variance. Nevertheless, if there is interest in obtaining the total confidence interval of the parameter, due to individuals and populations simultaneously, the concomitant resampling of these two sources of variation becomes necessary whenever the distribution of the estimates F̂, f̂ and θ̂ is unknown. However, to investigate whether the parameter differs from zero, an alternative approach is to analyze jointly the confidence intervals obtained for each resampled level. Carlini-Garcia et al. (2001) proposed that, if at least one of the confidence intervals for a given parameter includes zero, the parameter should be considered null. Under this criterion, the hypothesis that the parameter is null is rejected only when none of the confidence intervals contains zero. The reference value zero is adequate for F, f and θ, but can be different for other parameters. As mentioned in the methodology, different combinations of selfing rates and numbers of generations of divergence were considered (Table 1). The variances of θ̂ and F̂ tend to increase with divergence, as expected (Table 4). Even so, the property of additivity was maintained.
Table 3 - (a) α and (b) β estimates and respective confidence intervals for the regression of observed (I, P) and expected (I+P) values of the (c) F̂, (d) f̂ and (e) θ̂ variance estimates. Estimates of the coefficient of determination R². Data from several authors.
(a, b) intercept and coefficient of regression, respectively; (c, d, e) estimates of the total fixation index, the fixation index within populations, and the divergence among populations, respectively; ns: non-significant; **: P ≤ 0.01.
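The joint-interval criterion attributed above to Carlini-Garcia et al. (2001) amounts to a one-line rule. The sketch below assumes each interval is given as a (lower, upper) pair; the function name `parameter_is_null` and the example intervals are hypothetical.

```python
def parameter_is_null(intervals, reference=0.0):
    """Return True if at least one confidence interval contains the reference value.

    intervals: iterable of (lower, upper) bounds, one per resampled level.
    reference: value under the null hypothesis (zero for F, f and theta).
    """
    return any(lower <= reference <= upper for lower, upper in intervals)

# Hypothetical intervals from resampling over individuals and over populations.
print(parameter_is_null([(0.01, 0.20), (-0.05, 0.30)]))  # one interval includes 0
print(parameter_is_null([(0.01, 0.20), (0.02, 0.30)]))   # no interval includes 0
```

The `reference` argument reflects the remark that zero is adequate for F, f and θ but may differ for other parameters.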
Probably the main advantage of this additivity is the possibility of obtaining the relative contribution of the different sources of variation to the total variation.This fact has implications in sample planning, as the source that most contributes to the total variance should receive greater attention in the elaboration of future sample strategies.By knowing these relative contributions and verifying trends in several similar types of research, it is possible to organize sampling strategies (number of populations and of individuals per population) to minimize the error in the population parameter estimates.
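Given additivity, the relative contribution of each source follows directly from the separately estimated variances. The sketch below is illustrative only: the function name `relative_contributions` and the variance values are hypothetical and are not taken from the tables of this study.

```python
def relative_contributions(var_by_source):
    """Share of the total bootstrap variance attributable to each source.

    var_by_source: dict mapping a source label (e.g. "I", "P") to the
    variance estimated by resampling over that source alone.
    Assumes the variances are additive, so the total is their sum.
    """
    total = sum(var_by_source.values())
    if total <= 0:
        raise ValueError("total variance must be positive")
    return {source: value / total for source, value in var_by_source.items()}

# Hypothetical variances of one parameter estimate.
shares = relative_contributions({"I": 2.0e-5, "P": 6.0e-5})
print(shares)  # roughly one quarter from individuals, three quarters from populations
```

In a case like this hypothetical one, sampling more populations would be expected to reduce the total variance more than sampling more individuals per population.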
Including the source of variation due to loci in this study would require knowing not only the total bootstrap variance, based on the hierarchical resampled levels (populations and individuals), but also the component due to loci. As the source of variation of loci leads to a crossed data structure, variance components due to interactions between loci and the other resampled levels are expected. Therefore, mean squares are not expected to be additive when loci are resampled together with individuals and populations. Determining the bootstrap variance components based on the crossed structure due to loci, in addition to those due to the hierarchical structure, is necessary. This is required for verifying the property of additivity when loci are resampled together with individuals and populations, or even with any other possible hierarchical levels.
Table 4 - Variance estimates of (a) F̂, (b) f̂ and (c) θ̂ due to resampling over individuals (I), populations (P), and I and P jointly (I, P). Sums of the I and P variances are shown as I+P. Simulated data.
(a, b, c) estimates of the total fixation index, the fixation index within populations, and the divergence among populations, respectively; (d) number of discarded bootstraps; (e, f, g) F̂, f̂ and θ̂ variance estimates, respectively, multiplied by 10⁵.

Figure 1 - Regressions between observed and expected variance estimates of F̂, f̂ and θ̂. Data from various authors (i to iii), simulated and transformed data (iv and vi), and simulated and non-transformed data (v).

Table 2 - Variance estimates of (a) F̂, (b) f̂ and (c) θ̂ due to resampling over individuals (I), populations (P), and I and P jointly (I, P). Sums of the I and P variances are shown as I+P. Data from several authors.