A hierarchical statistical model for estimating population properties of quantitative genes

Background Earlier methods for detecting major genes responsible for a quantitative trait rely critically upon a well-structured pedigree in which the segregation pattern of genes exactly follow Mendelian inheritance laws. However, for many outcrossing species, such pedigrees are not available and genes also display population properties. Results In this paper, a hierarchical statistical model is proposed to monitor the existence of a major gene based on its segregation and transmission across two successive generations. The model is implemented with an EM algorithm to provide maximum likelihood estimates for genetic parameters of the major locus. This new method is successfully applied to identify an additive gene having a large effect on stem height growth of aspen trees. The estimates of population genetic parameters for this major gene can be generalized to the original breeding population from which the parents were sampled. A simulation study is presented to evaluate finite sample properties of the model. Conclusions A hierarchical model was derived for detecting major genes affecting a quantitative trait based on progeny tests of outcrossing species. The new model takes into account the population genetic properties of genes and is expected to enhance the accuracy, precision and power of gene detection.


Background
The identification of individual genes governing phenotypic variation is crucial to understand the mechanistic basis of quantitative inheritance and ultimately provide information about the design of optimal breeding strategies in both plants and animals. Quantitative genetics based on new statistical and computational technologies has advanced to the point at which individual genes can be detected and mapped on chromosomes [1,2]. The un-derpinning for the detection and mapping of genes is founded on their particular segregation pattern. For a pedigree derived from inbred lines, the segregation pattern of a gene can be exactly predicted. In this situation, quantitative genetic theory can lay a sufficient foundation for gene detection and, thus, the properties of a detected gene can be adequately described by its effect and genomic location. However, for outcrossing populations, such as forest trees, in which inbred lines are not available, genes also display population genetic properties [3]. Hence, the foundation for gene detection in these populations should be established on a combination of quantitative genetics and population genetics. The estimation of the population genetic properties of genes, such as allele frequencies and the degree of disequilibrium, can not only enhance the precision and power of gene detection, but also is fundamentally important to our understanding of population differentiation and evolution [4,5].
We present a strategy for investigating the existence of genes in an arbitrarily complex progeny test based on phenotype data and estimating the effects of these genes on a quantitative trait and their population properties under the maximum likelihood framework. Only a limited number of statistical studies have been performed to detect genes based on the phenotypic data [6][7][8][9][10][11][12][13], and none of these studies have considered the population and evolutionary genetic properties of genes. Our study population can be derived from a natural population instead of inbred lines as used for general agronomic crops. A finite number of unrelated parents are randomly selected from a natural population and are mated to produce progeny tests under a mating design such as factorial or diallel. In a recent paper, we used such progeny tests to derive a model for detecting major genes affecting a quantitative trait [14]. This model has power to estimate the effect of drift errors on genetic variation during hybridization, thus providing information about dynamic changes of the genetic architecture of a population under artificial hybridization. However, the estimates of population genetic parameters from the progeny test using the above model cannot be generalized to the original population from which the mating design is derived. In this paper, we developed a hierarchical model for analyzing the pattern of gene segregation at both population and family levels, which thus can provide estimates of genetic parameters in two successive generations of the population. Estimates of population genetic parameters using this hierarchical model can reasonably infer the genetic structure of the original population. A forest tree example is used to demonstrate the application of the new statistical model for gene detection.

Population genetic structure
Suppose there is a segregating major gene responsible for a quantitative trait in a natural or experimental diploid population. This gene has s different codominant alleles, A 1 ,...,A s , whose frequencies in the population are denoted by p 1 ,...,p s with p κ = 1. These s alleles randomly unite to form s(s + 1)/2 distinguishable genotypes including s homozygotes and s(s -1)/2 heterozygotes. The population frequencies of the genotypes are denoted by p κι (κ, ι = 1,...,s). If the population is at Hardy-Weinberg disequilibrium (HWD), the genotypic frequency is the product of the two corresponding allelic frequencies, plus the coefficient of HWD (d κι ), i.e., p κι = p κ p ι + d κι , where d κι has the restrictions, max{- [15].

Mating design
A finite number of unrelated individuals are randomly selected from the population as female and male parents to generate a I × J factorial mating design. Based on sampling theory, these selected individuals have genotypes at the major locus whose counts follow the same multinomial distribution as those in the original population, with probability vector p = (p κι , κ, ι = 1,...,n). The segregation of progeny genotypes in the mating design depends on how the major-locus genotypes segregate (1) among the selected female or male parents and (2) within each of the fullsib families ( Table 1). The segregation among the I female or J male parents having the same pattern as in the original population is viewed as the population-level segregation. The segregation within a full-sib family of size n ij derived from the ith female and jth male parent follows the Mendelian segregation ratio, viewed as the family-level segregation. To understand the genetic structure of the original population, different genetic information derived from this two-level segregation will not be mixed, but rather combined within a maximum likelihood framework.

Estimation method
Let y ijk denote the phenotypic measurement for the kth offspring from ith female parent and jth male parent and g ijk denote its genotype at the major gene, i = 1,2, ..., I, j = 1, 2, ..., J, k = 1,2, ..., n ij . The statistical model describing the phenotype-genotype relationship can be expressed as where µ κι is the genotypic value of genotype A κ A ι ; and e ijk is the error term with the normal distribution N(0,σ 2 ).
Let be the genotype for the ith female parent and for the jth male parent. We denote by y the collection of the observed data and by z = (g,g f ,g m ) the collection of the major-locus genotypes for offspring, female parents and male parents. We assume the following hierarchical model: where f denotes the normal distribution of the phenotype N(µ κι , σ 2 ), h is distributed according to the Mendelian segregation pattern and ψ is multinomially distributed with probability vector p. Note that genotypes for siblings are dependent and, given genotypes, the traits are independent of each other. Under the above model (y, z) has a joint distribution as follows: Statistically, equation (2) is a mixture model in which each component is represented by a genotype within a full-sib family [16][17][18]. The maximum likelihood estimates for θ = (µ, σ 2 , p) can be obtained using the EM algorithm [19,20]. At the t th step of the EM sequence, the expected log-likelihood is proportional to where Σ G sums over all possible genotypes, are the posterior distributions of offspring genotypes and iG, jG are the posterior distributions of parent genotypes. Conditional on , these posterior probabilities can be evaluated as follows: The EM sequence is defined by The segregation patterns of genotypes are indicated at the parental and progeny generation levels. Genotype frequencies in each generation are given in parentheses.
Suppose the major gene has two different codominant alleles, the posterior probablities in equation (4) can be exactly calculated for a usual mating design. However, for large designs with many parents or for the case where the major gene has many different alleles, we have to rely on a Monte Carlo version of the EM algorithm. For example, we can sample from the conditional distribution of z = (g,g f , g m ) given using the following Gibbs sampler [20]:

Example
We use data from aspen trees to illustrate the appication of our statistical model for detecting a major gene. The example is derived from a 6 × 6 factorial mating design of Populus tremuloides and P. tremula. The parents for the crosses were randomly selected from a mixed breeding population of two different species established at the University of Minnesota. Originally, the P. tremuloides trees used as the female parents from the Upper Peninsula of Michigan and northern Wisconsin, and the P. tremula male parents came from northeastern Poland and Germany [21]. The seedlings from the progeny population were planted in two different locations, one on a cutover forest site near Grand Rapids, Minnesota, and the other on farmland near Rhinelander, Wisconsin. Both trials were laid out in a randomized complete block design with 10 replicates and six seedlings per family in each replicate at 2.5 × 2.5 spacing. At the end of the second year in the field, all trees were measured for stem height. Figure 1 presents density estimates of the second year height from each family pooled over the two locations. Trees from eight families did not have growth data due to mortality. The phenotypic distributions suggest the existence of a major gene in the progeny test. For example, in the family derived from Clone-5 × TA-1-68, the phenotypic distribution is a mixture of at least two components, showing the existence of a segregating gene. Figure 1 also indicates different segregation patterns among the families, with some more smooth and others more waved, thus suggesting that some parents are heterozygous with their alleles segregating in the progeny.
The effect and genotypic frequencies of the major gene are estimated using the hierarchical model assuming two alleles. Since no significant difference is detected between two locations, we only report the result based on pooled data. With probability of almost one, parents TA-1-91, TA-2-68, TA-4-91, XTA-3-6 have genotype A 1 A 1 , parents TA-1-68, TA-1-75, CLONE-5, TA-5-61 have genotype A 1 A 2 and the rest parents have genotype A 2 A 2 . The estimates of the genotypic values of A 1 A 1 , A 1 A 2 and A 2 A 2 are 11 = 56.7 ± 0.80, 12 = 40.6 ± 0.32 and 22 = 30.0 ± 1.02, respectively. The estimate of residual variance is = 218.676 ± 6.09. Here, the estimates of standard errors for the parameter estimators are obtained based on the Fisher information matrix [22]. We also fitted the model under the null hypothesis that there is no major gene with a single normal distribution. The likelihood ratio test has (p < 0.0001), suggesting the existence of a major gene for height growth. The ratio of the dominant to additive effect for this major gene is less than 0.2, which implies that it affects the stem height growth in an additive manner. ,

Figure 1
Density estimates of second year stem height (in cm) for all families. The measurement is the stem height at the end of second year. The randomly selected parents are labelled at the top (male) and on the right of plots (female). Trees from eight families did not have growth data due to mortality. The population genetic parameters of this major gene are estimated based on the genotypes of the parents detected. The genotype frequencies estimated are identical for the three genotypes, i.e. 11 = 12 = 22 = 1/3. In consequence, the allele frequencies are also same ( 1 = 2 = 1/ 2). Thus, the mixed breeding population of P. tremulodes and P. tremula is not at Hardy-Weinberg equilibrium, because the genotype frequencies are not products of the corresponding allele frequencies. The coefficient of Hardy-Weinberg disequilibrium is estimated as . The estimate of allele frequencies and Hardy-Weinberg disequilibrium using our hierarchical model can be well generalized to the original breeding population from which the parents were sampled. However, one should be cautious about the accuracy of the estimator due to the small sample size.

Simulation
The behavior of the EM estimators is evaluated by two simulation experiments. In the first simulation, a 6 by 6 mating design same as our real example is assumed. The major gene is assumed to have two codominant alleles A 1 and A 2 whose frequencies in the population are p 1 = 0.6 and p 2 = 0.4. The genotypic values are set to be 65, 40 and 25 with standard deviation 15. Based on the hierarchical model defined in equation (1), we first simulated a set of the parental genotypes ( for males). Then we randomly generated 50 offspring genotypes and observations for each family.
Our EM estimating procedure converged to the correct parental genotypes after only a few iterations (normally 2 or 3). A loglikelihood test is highly significant with log-likelihood ratio equals 125 (df = 4). The estimates of genotypic values from a single simulation replicate are 64.1 ± 0.74 for A 1 A 1 , 39.5 ± 0.59 for A 1 A 2 and and 26.8 ± 1.64 for A 2 A 2 . The estimate of residual variance is 241.6 ± 10.30.
In our second simulation we studied a 5 by 5 mating design. We used 25 replications to investigate the empirical accuracy of the estimators. We assume the major gene has 2 alleles and randomly generated parental genotypes. The parental genotypes were randomly generated based on allele frequence p 1 = 0.5. That resulted in female genotypes . For each family we simulated the 50 genotypes for its offsprings and then simulated its trait measurement using a normal and log-normal distribution.
The results from the second simulation study are summarized in Figure 2. The top panel of the plots is for the simulation where trait measurements are simulated from a normal distribution. The bottom panel corresponds to lognormal data. In each plot we graphed the ideal estimates of the parameter (assuming that we know the genotypes of each offspring) versus our estimates based on the EM algorithm. Mean squared errors are also indicated in the plots. It is seen from Figure 2 that all four parameters can be estimated quite accurately in the normal case while they are slightly biased for the lognormal data.

Discussion
We have derived a hierarchical model for detecting major genes affecting a quantitative trait based on progeny tests of outcrossing species. The new model takes into account the population genetic properties of genes and is expected to enhance the accuracy, precision and power of gene detection. The information about the population behavior of genes estimated from this model is fundamentally important to our understanding of the genetic architecture of a natural or experimental population and is also useful for the design of sound breeding strategies for agricultural crops and forest trees.
We used an example from interspecific aspen hybrids to illustrate the application of our gene-detecting model. The model has successfully identified a major gene with additive effects on second year stem growth in the hybrid aspens. The additive effect of genes on height growth in young poplars was also observed in a molecular marker experiment [23]. In this marker experiment, an additive quantitative trait locus was found to explain almost a quarter of the total phenotypic variation in second year height growth. The result from the aspen hybrids was consistent with that from a simulation experiment based on a mating scheme analogous to that used in our real example. The consistency of the results from the simulation study suggests that the model proposed here can be adequately used to detect genes of large effect on the phenotypes.
The model will have immediate applications for many species in which progeny tests have been established for a number of years [24]. Most of these tests are maintained in different locations and measured annually, thus offering desirable opportunities to address two important questions: (1) How does the genes interact with environment? (2) What is the developmental plasticity of gene expression? In our real example, no evidence has been obtained for the change of gene expression at the major locus detected over the two field trials, despite their dramatic differences in climate, soil properties and silvicultural measures [21]. If the stability of the expression of this gene over a range of environments is confirmed by more accurate genotype determination methods, e.g. based on genetic markers. It will have a tremendous application in breeding for stable cultivars of forest tree species.
Our analysis and simulation is based on diallelic inheritance. But multiallelic inheritance can be similarly considered, although more parameters should be introduced. Also, our model assumes the segregation of a single gene in progeny tests. The principle behind the model can be extended to consider two or more genes. Modelling multiple genes may be closer to biological reality, because the linkage, epistatic interaction and genetic association among genes are considered. However, the consideration of these relationships needs to estimate an increased number of unknown genetic parameters. Finally, our model is based on a balanced factorial design. For some populations, like mammals, mating designs are frequently unbalanced because of limited resources or reproductive difficulties. It is very important to extend our analysis and model to an unbalanced mating design of any complexity. For all of these extensions mentioned above, maximum likelihood approaches may not be sufficient owing to an increasing dimension of parameter space. A more powerful computational tool, e.g., Markov chain Monte Carol methods, may be needed [20] to make our model more tractable in real applications.

Figure 2
Summary of a simulation study based on a 5 × 5 mating design. In each plot we graphed the ideal estimates of the parameter (assuming that we know the genotypes of each offspring) versus the estimates based on the EM algorithm.