Introduction

Quantitative trait loci (QTL) mapping enables a better understanding of the genetic architecture of quantitative traits. One of the most important applications of this type of study is the possibility to incorporate architecture-derived information into breeding programs to make them more effective. It also allows a better understanding of the genetic correlation among traits (Jiang and Zeng 1995; Mackay 2001), the interaction between genotypes and environments (Malosetti et al 2004, 2008; van Eeuwijk et al 2005, 2007, 2009; Boer et al 2007; Mathews et al 2008; Messmer et al 2009; Pastina et al 2012), and the determination of the breeding value of individuals for marker-assisted selection (Kao et al 1999; Zeng et al 1999; Dekkers and Hospital 2002; Hospital 2009).

Several statistical models are available for QTL mapping in populations based on inbred lines (e.g., F2, backcross, and recombinant inbred lines), including interval mapping (IM) (Lander and Botstein 1989), composite interval mapping (CIM) (Zeng 1993, 1994; Jansen and Stam 1994), and multiple interval mapping (Kao and Zeng 1997; Kao et al 1999). IM models QTL genotypes as latent variables, using mixture models for the analysis. In CIM, cofactors are included in the model to remove the effect of QTL located outside the mapping region, resulting in a significant increase in statistical power. Models based on CIM have been used for QTL mapping in several economically important plant species, e.g., maize and soybeans (Sabadin et al 2008; Li et al 2008; García-Lara et al 2009, 2010; Tang et al 2010; Tucker et al 2010; Warburton et al 2011; Xu et al 2011).

For perennial species (citrus, eucalyptus, loblolly pine, rubber tree, and others), inbred lines are unavailable, so mapping populations can be generated by a biparental cross between non-inbred individuals, resulting in full-sib progeny. In diploids, either molecular markers or QTL may have 1:1:1:1, 1:2:1, 3:1, or 1:1 segregation patterns, depending on the number and configuration of the alleles of the parents. In this situation, statistical analyses are frequently carried out using an approach named double pseudo-testcross (Grattapaglia and Sederoff 1994). For this analysis, only markers with 1:1 segregation patterns are considered, which makes it possible to obtain separate linkage maps for each parent and to use QTL models developed for backcross populations on each individual map. However, this approach cannot be directly employed in integrated maps built from markers with distinct segregation patterns (1:1:1:1, 1:2:1, and 3:1), which have become common recently (e.g., single nucleotide polymorphisms and microsatellites).

Several authors proposed the construction of integrated linkage maps using markers exhibiting different segregation patterns. Ritter et al (1990), Ritter and Salamini (1996), and Maliepaard et al (1997) have developed methods to determine recombination fractions using two-point estimates. Ridout et al (1998) have proposed the estimation of recombination fractions based on three-point tests. Wu et al (2002a) and Lu et al (2004) have developed approaches based on maximum likelihood to simultaneously estimate the recombination fraction and linkage phase between markers. Ling (2000), Wu et al (2002b), and Tong et al (2010) proposed methods based on multipoint maximum likelihood using hidden Markov models (HMM). HMMs have been incorporated into software, such as OneMap (Margarido et al 2007, 2011). The main advantage of these methods is the ability to obtain linkage maps with higher saturation and good representation of the genetic polymorphism generated by the cross, because markers with all segregation patterns can be used in the statistical analysis.

Several alternatives are available for QTL mapping in outbred populations, in which two common situations can be considered: a complex pedigree or a large progeny. In the former, QTL mapping is based on multiple families, and either fixed or random models can be used (Knott and Haley 1992; Kruglyak and Lander 1995; Xu and Atchley 1995; Xu and Gessler 1998; Gessler and Xu 1999; Yi and Xu 1999). The latter considers a single biparental cross to obtain tens or hundreds of offspring, usually modeling QTL genotypes as fixed effects (Haley et al 1994; Knott et al 1997; Schäfer-Pregl et al 1996; Sillanpää and Arjas 1999; Lin et al 2003; Wu et al 2007; Hu and Xu 2009; Xiong 2010; Payne et al 2010). Among the approaches developed for large progenies, Haley et al (1994) proposed a model for an F2 population (two segregating alleles) and then applied it to a full-sib progeny under some assumptions. Knott et al (1997) extended this approach to more than two alleles, but it still requires pedigree information. Schäfer-Pregl et al (1996) also presented models for a full-sib progeny considering either one scorable allele common to both parents or four alleles per marker locus under a non-linear approach. Sillanpää and Arjas (1999) proposed a Bayesian QTL mapping method for outbred species, extending a method initially proposed for inbred-based populations (Sillanpää and Arjas 1998). Lin et al (2003) developed an IM model using a maximum likelihood approach, considering QTL effects and linkage phases between markers with different patterns of segregation. However, the conditional probabilities for QTL genotypes are not based on multipoint estimates and, in this approach, it is difficult to estimate the linkage phase between QTL and markers using the expectation–maximization (EM) algorithm.
As observed, none of these models incorporated the advantages of CIM, which is widely used for inbred-based populations, including the incorporation of cofactors and a high statistical power. Other software-based approaches were also suggested (Xiong 2010; Payne et al 2010; Hu and Xu 2009), but the segregation patterns of QTL and their linkage phases with markers were not fully addressed. Moreover, the estimation of QTL probabilities based on HMMs (multipoint) is of core importance in this context, because with outcrossing, it is common to have genomic regions with different marker types.

In this work, we developed a QTL mapping model based on the CIM approach considering a full-sib progeny and multipoint genetic mapping using molecular markers with several segregation patterns. The proposed method enables the determination of QTL, the estimation of their position, effects and segregation patterns, and the inference of their linkage phase with markers. A simulation study showed the advantages of the proposed model.

Methodology

Statistical model

We considered a full-sib progeny from the cross between two non-inbred, diploid parents P and Q with a known genetic map (Fig. 1). For an interval defined by two adjacent markers m and m + 1, with alleles 1 or 2 for each parent, the genotypes of these loci may be generically represented by \( P_m^{\{1,2\}} \), \( P_{m+1}^{\{1,2\}} \), \( Q_m^{\{1,2\}} \), and \( Q_{m+1}^{\{1,2\}} \), where {1,2} indicates the allele possibilities for each locus in each parent. Assuming that there is a QTL in the interval, its alleles are represented as \( P^1 \) and \( P^2 \) for parent P and as \( Q^1 \) and \( Q^2 \) for parent Q. It is also assumed that the alleles \( P^1 \) and \( Q^1 \) have a positive effect on the phenotype. The cross is then represented as \( P_m^1 P^1 P_{m+1}^1 / P_m^2 P^2 P_{m+1}^2 \times Q_m^1 Q^1 Q_{m+1}^1 / Q_m^2 Q^2 Q_{m+1}^2 \) for the three loci considered.

Fig. 1 Schematic representation of the type of cross considered in the model. The parents are represented by P and Q, and the alleles for the markers at the m and m + 1 loci are represented by \( P_m^{\{1,2\}} \), \( Q_m^{\{1,2\}} \), \( P_{m+1}^{\{1,2\}} \), and \( Q_{m+1}^{\{1,2\}} \). The QTL alleles are represented as \( P^1 \), \( P^2 \), \( Q^1 \), and \( Q^2 \)

The segregation of the QTL in the progeny results in four genotypic classes (\( P^1Q^1 \), \( P^1Q^2 \), \( P^2Q^1 \), and \( P^2Q^2 \)) in 1:1:1:1 proportion. Therefore, it is possible to define three orthogonal contrasts between the means of these classes, similar to those suggested by Knott et al (1997), Lin et al (2003) and Payne et al (2010):

$$ \begin{array}{c} P^1Q^1 + P^1Q^2 - P^2Q^1 - P^2Q^2 \\ P^1Q^1 - P^1Q^2 + P^2Q^1 - P^2Q^2 \\ P^1Q^1 - P^1Q^2 - P^2Q^1 + P^2Q^2 \end{array} $$

The first two contrasts represent the additive effects of the QTL alleles in parents P and Q, respectively, and the third contrast is the intra-locus interaction (dominance) between additive effects on each parent. The contrast coefficients can be represented in the columns of the matrix D (genetic design matrix), similar to the notation of Kao and Zeng (1997):

$$ \mathbf{D}=\left[\begin{array}{rrr} 1 & 1 & 1 \\ 1 & -1 & -1 \\ -1 & 1 & -1 \\ -1 & -1 & 1 \end{array}\right] \tag{1} $$
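The properties of D can be checked numerically. The following is a minimal sketch in plain Python (the helper names `column` and `dot` are illustrative, not part of any package): it confirms that each contrast sums to zero, that the three contrasts are mutually orthogonal, and that the dominance column is the element-wise product of the two additive columns.

```python
# Rows: genotype classes P1Q1, P1Q2, P2Q1, P2Q2.
# Columns: additive effect in P, additive effect in Q, dominance.
D = [
    [1, 1, 1],
    [1, -1, -1],
    [-1, 1, -1],
    [-1, -1, 1],
]


def column(matrix, k):
    """Return column k of a matrix stored as a list of rows."""
    return [row[k] for row in matrix]


def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


cols = [column(D, k) for k in range(3)]

# Each contrast sums to zero over the four genotype classes.
sums = [sum(c) for c in cols]

# Every pair of distinct contrasts has a zero inner product.
cross = [dot(cols[a], cols[b]) for a in range(3) for b in range(a + 1, 3)]

# The dominance contrast equals the product of the two additive contrasts.
dominance_is_product = cols[2] == [p * q for p, q in zip(cols[0], cols[1])]
```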

It is important to note that QTL genotypes are not directly observed, so they need to be inferred based on the genotype of their flanking markers. The conditional probabilities for QTL genotypes can be obtained by either two-point (Lynch and Walsh 1998) or a multipoint approach (using hidden Markov models) (Jiang and Zeng 1997; Wu et al 2002b). Although two-point analysis could be used, multipoint methods are strongly recommended because they allow the inclusion of all individuals, including the ones with missing markers, and also because of the partial information provided by non-fully informative markers on the genetic map (Jiang and Zeng 1997; Wu et al 2002b). For this reason, in the present work, conditional probabilities for QTL genotypes were obtained using OneMap software, which implements a multipoint approach using hidden Markov models (Margarido et al 2007, 2011).

From the contrasts, it is possible to define a statistical model for QTL mapping:

$$ y_j = \mathbf{Z}_j \boldsymbol{\gamma} + \alpha_p^{*} x_{pj}^{*} + \alpha_q^{*} x_{qj}^{*} + \delta_{pq}^{*} x_{pj}^{*} x_{qj}^{*} + \varepsilon_j \tag{2} $$

where \( y_j \) is the phenotype of the jth individual (j = 1,…, n); \( \mathbf{Z}_j \) is the jth row of the indicator matrix Z, with dimensions n × (1 + 3c), containing a column of 1's and the variables related to the genotypes of c cofactors, coded according to the contrasts in the D matrix (Eq. 1) in the same way as \( x_{pj}^{*} \) and \( x_{qj}^{*} \); γ is the (1 + 3c) × 1 parameter vector containing the intercept (μ) and the multiple linear regression coefficients (\( \alpha_{pc} \), \( \alpha_{qc} \), and \( \delta_{pqc} \)) for each cofactor; \( \alpha_p^{*} \) and \( \alpha_q^{*} \) are the additive effects of the QTL for parents P and Q, respectively; \( \delta_{pq}^{*} \) is the effect of the intra-locus interaction (dominance) between the additive effects; and \( \varepsilon_j \) is the error, with \( \varepsilon_j \sim N(0, \sigma^2) \). Cofactors are selected in a previous step using, for example, stepwise regression (Basten et al 1999), and they are fixed for each genomic position, given the window size. The variables \( x_{pj}^{*} \) and \( x_{qj}^{*} \) indicate the contrasts for QTL genotypes:

$$ {x}_{pj}^{\ast }=\left\{\begin{array}{rr}\hfill 1& \hfill \mathrm{if}\ {P}^1{Q}^1\\ {}\hfill 1& \hfill \mathrm{if}\ {P}^1{Q}^2\\ {}\hfill -1& \hfill \mathrm{if}\ {P}^2{Q}^1\\ {}\hfill -1& \hfill \mathrm{if}\ {P}^2{Q}^2\end{array}\right.;\kern0.5em {x}_{qj}^{\ast }=\left\{\begin{array}{rr}\hfill 1& \hfill \mathrm{if}\ {P}^1{Q}^1\\ {}\hfill -1& \hfill \mathrm{if}\ {P}^1{Q}^2\\ {}\hfill 1& \hfill \mathrm{if}\ {P}^2{Q}^1\\ {}\hfill -1& \hfill \mathrm{if}\ {P}^2{Q}^2\end{array}\right. $$

The elements of Z are defined in a similar way; however, they refer to the markers used as cofactors, as proposed for CIM models (Zeng 1993, 1994). To select cofactors, procedures described for CIM in inbred-based populations were used. We found satisfactory results using multiple regression of phenotypes on markers, with the Bayesian information criterion, or BIC (Schwarz 1978), used to select the final model with a maximum of \( 2\sqrt{n} \) parameters to avoid overparameterization (Sakamoto and Kitagawa 1987; Wang et al 2007). Because three effects (\( \alpha_{pc} \), \( \alpha_{qc} \), and \( \delta_{pqc} \)) may be included for each marker that is added as a cofactor, the non-significant effects may be removed to reduce the number of parameters to be estimated.

Likelihood and estimation

Because QTL genotypes cannot be observed, model (2) was treated as a mixture model, with QTL genotypes as latent variables. The likelihood function for the model is:

$$ L\left(\boldsymbol{\theta}, \boldsymbol{\gamma}, \sigma \right) = \prod_{j=1}^{n} \left[ \sum_{u=1}^{2} \sum_{v=1}^{2} p_{uvj}\, \phi \left( \frac{y_j - \mu_{uvj}}{\sigma} \right) \right] \tag{3} $$

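where θ and γ are the vectors of QTL and cofactor parameters, respectively; \( p_{uvj} \) is the conditional multipoint probability of genotype \( P^uQ^v \) for the jth individual at a given position on the genome (the procedure to obtain these probabilities is detailed by Wu et al 2002b); and ϕ(·) is the standard normal density function, evaluated at the residual standardized by the genotype mean \( \mu_{uvj} \) and the standard deviation σ, with \( \mu_{11j} = \mathbf{Z}_j\boldsymbol{\gamma} + \alpha_p^{*} + \alpha_q^{*} + \delta_{pq}^{*} \); \( \mu_{12j} = \mathbf{Z}_j\boldsymbol{\gamma} + \alpha_p^{*} - \alpha_q^{*} - \delta_{pq}^{*} \); \( \mu_{21j} = \mathbf{Z}_j\boldsymbol{\gamma} - \alpha_p^{*} + \alpha_q^{*} - \delta_{pq}^{*} \); \( \mu_{22j} = \mathbf{Z}_j\boldsymbol{\gamma} - \alpha_p^{*} - \alpha_q^{*} + \delta_{pq}^{*} \), and variance \( \sigma^2 \).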
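As an illustration of how the mixture likelihood can be evaluated at one genome position, the sketch below computes its logarithm in plain Python. It is not the authors' implementation: the function names and the argument `baselines` (standing for the cofactor part Z_j γ of each individual) are hypothetical, and each mixture component is written as a full normal density, which includes the 1/σ normalizing factor.

```python
import math

# Sign pattern (x_p, x_q) for genotypes P1Q1, P1Q2, P2Q1, P2Q2.
CONTRASTS = [(1, 1), (1, -1), (-1, 1), (-1, -1)]


def normal_pdf(y, mean, sigma):
    """Normal density with the given mean and standard deviation."""
    z = (y - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def genotype_means(baseline, alpha_p, alpha_q, delta_pq):
    """Means of the four genotype classes for one individual."""
    return [
        baseline + alpha_p * xp + alpha_q * xq + delta_pq * xp * xq
        for xp, xq in CONTRASTS
    ]


def log_likelihood(y, probs, baselines, alpha_p, alpha_q, delta_pq, sigma):
    """Log of the mixture likelihood.

    y         : phenotypes, one per individual
    probs     : conditional genotype probabilities, four per individual
    baselines : cofactor contribution (Z_j gamma), one per individual
    """
    total = 0.0
    for y_j, p_j, base_j in zip(y, probs, baselines):
        means = genotype_means(base_j, alpha_p, alpha_q, delta_pq)
        mixture = sum(p * normal_pdf(y_j, m, sigma) for p, m in zip(p_j, means))
        total += math.log(mixture)
    return total
```

When an individual's genotype is known with certainty (one probability equal to 1), its contribution reduces to the log density of a single normal distribution.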

Using the notation presented by Kao and Zeng (1997), and expanding the ideas of Zeng (1994) for a full-sib cross, the maximum likelihood estimates are obtained using the EM algorithm (Dempster et al 1977), in two steps:

  1. Step E:

    the a posteriori probabilities \( \pi_{uvj}^{(t+1)} \) of the QTL genotypes are obtained by applying Bayes' theorem:

    $$ {\pi}_{uvj}^{\left(t+1\right)}=\frac{p_{uvj}\phi \left(\frac{y_j-{\mu}_{uvj}^{(t)}}{\sigma^{(t)}}\right)}{{\displaystyle \sum_{u=1}^2{\displaystyle \sum_{v=1}^2{p}_{uvj}\phi}\left(\frac{y_j-{\mu}_{uvj}^{(t)}}{\sigma^{(t)}}\right)}} $$
  2. Step M:

    maximum likelihood estimates:

    where:

    $$ \mathbf{V}^{(t+1)} = \left[ \begin{array}{lll} \mathbf{1}^{\prime} \boldsymbol{\Pi}^{(t+1)} \left( \mathbf{D}_1 \circ \mathbf{D}_1 \right) & \mathbf{1}^{\prime} \boldsymbol{\Pi}^{(t+1)} \left( \mathbf{D}_1 \circ \mathbf{D}_2 \right) & \mathbf{1}^{\prime} \boldsymbol{\Pi}^{(t+1)} \left( \mathbf{D}_1 \circ \mathbf{D}_3 \right) \\ \mathbf{1}^{\prime} \boldsymbol{\Pi}^{(t+1)} \left( \mathbf{D}_2 \circ \mathbf{D}_1 \right) & \mathbf{1}^{\prime} \boldsymbol{\Pi}^{(t+1)} \left( \mathbf{D}_2 \circ \mathbf{D}_2 \right) & \mathbf{1}^{\prime} \boldsymbol{\Pi}^{(t+1)} \left( \mathbf{D}_2 \circ \mathbf{D}_3 \right) \\ \mathbf{1}^{\prime} \boldsymbol{\Pi}^{(t+1)} \left( \mathbf{D}_3 \circ \mathbf{D}_1 \right) & \mathbf{1}^{\prime} \boldsymbol{\Pi}^{(t+1)} \left( \mathbf{D}_3 \circ \mathbf{D}_2 \right) & \mathbf{1}^{\prime} \boldsymbol{\Pi}^{(t+1)} \left( \mathbf{D}_3 \circ \mathbf{D}_3 \right) \end{array} \right] $$

    \( \mathbf{D}_1 \), \( \mathbf{D}_2 \), and \( \mathbf{D}_3 \) are the columns of the matrix D, \( \boldsymbol{\Pi}^{(t+1)} = \{ \pi_{uvj} \}_{n \times 4} \) is the a posteriori probability matrix of QTL genotypes, ∘ represents the Hadamard product, and primes indicate matrix (vector) transposition.

    The algorithm is initiated by arbitrarily attributing values in iteration t to the parameters contained in θ, which allows the calculation of the a posteriori probabilities in step E (t + 1); the new probability estimates are then employed to update the model parameters according to the estimators obtained in step M. The procedure is repeated until convergence is obtained.
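The alternation between the two steps can be sketched for a deliberately simplified case. The code below is an assumption-laden illustration, not the estimators of the M step above: it omits cofactors, so the four genotype means are a saturated reparameterization of μ and the three QTL effects, and the M step reduces to posterior-weighted means and a pooled variance. The function name `em_no_cofactors` and the starting values are hypothetical choices.

```python
import math

# Rows of D: genotypes P1Q1, P1Q2, P2Q1, P2Q2 (additive P, additive Q, dominance).
D = [(1, 1, 1), (1, -1, -1), (-1, 1, -1), (-1, -1, 1)]


def normal_pdf(y, mean, sigma):
    z = (y - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def em_no_cofactors(y, probs, n_iter=200, tol=1e-10):
    """EM for the four-class mixture with fixed mixing probabilities."""
    n = len(y)
    mean_y = sum(y) / n
    sigma = math.sqrt(sum((v - mean_y) ** 2 for v in y) / n)
    # Arbitrary starting values: class means slightly spread around the mean.
    means = [mean_y + 0.1 * sigma * k for k in (1.5, 0.5, -0.5, -1.5)]
    previous = -math.inf
    loglik = previous
    for _ in range(n_iter):
        # E step: posterior genotype probabilities via Bayes' theorem.
        posterior = []
        loglik = 0.0
        for y_j, p_j in zip(y, probs):
            w = [p * normal_pdf(y_j, m, sigma) for p, m in zip(p_j, means)]
            s = sum(w)
            loglik += math.log(s)
            posterior.append([v / s for v in w])
        # M step: weighted class means and pooled residual variance.
        for g in range(4):
            weight = sum(row[g] for row in posterior)
            if weight > 0:
                means[g] = sum(r[g] * y_j for r, y_j in zip(posterior, y)) / weight
        ss = sum(
            r[g] * (y_j - means[g]) ** 2
            for r, y_j in zip(posterior, y)
            for g in range(4)
        )
        sigma = max(math.sqrt(ss / n), 1e-12)  # guard against a zero variance
        if loglik - previous < tol:
            break
        previous = loglik
    # Map class means back to the intercept and the three contrasts of D.
    mu = sum(means) / 4.0
    effects = [sum(D[g][k] * means[g] for g in range(4)) / 4.0 for k in range(3)]
    return mu, effects, sigma, loglik
```

The returned `effects` list holds the estimates of the two additive effects and the dominance effect, in the column order of D. With cofactors in the model, the update of γ would require a regression step that this sketch does not attempt.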

QTL mapping

The procedure to test for QTL evidence is carried out by comparing the likelihood of the models, considering the presence of a QTL (H a ) versus a model without QTL (H 0):

$$ H_0: \alpha_p^{*} = \alpha_q^{*} = \delta_{pq}^{*} = 0 $$

$$ H_a: \text{at least one of these effects is different from zero} $$

These hypotheses can be tested at all genome positions using the LOD Score or the likelihood ratio test (LRT), in a way similar to that presented by Zeng (1994). It is necessary to account for the multiple-testing problem, which can be done using strategies already available for inbred-based populations, such as permutation tests (Churchill and Doerge 1994). In short, this is a nonparametric resampling method that yields the empirical distribution, under the null hypothesis, of the test statistic used for QTL mapping. The method shuffles the phenotypic values a number of times to break any association between QTL and phenotypes and then performs the QTL mapping on each new data set. The maximum test value obtained along the genome is recorded for each analysis, and the 95th percentile of these maxima gives the genome-wide threshold value.
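The permutation procedure can be sketched generically as follows. This is a minimal illustration under stated assumptions: `scan` is a hypothetical user-supplied function that takes a phenotype vector and returns the test statistics along the genome, and the percentile is taken as an order statistic of the recorded maxima.

```python
import math
import random


def permutation_threshold(phenotypes, scan, n_perm=1000, percentile=95, seed=0):
    """Genome-wide threshold from the permutation distribution of the maximum.

    phenotypes : observed phenotypic values
    scan       : hypothetical callable; maps a phenotype list to the list of
                 test statistics (LRT or LOD) along the genome
    """
    rng = random.Random(seed)
    shuffled = list(phenotypes)
    maxima = []
    for _ in range(n_perm):
        # Shuffling breaks any association between genotypes and phenotypes.
        rng.shuffle(shuffled)
        maxima.append(max(scan(shuffled)))
    maxima.sort()
    # Order statistic corresponding to the requested percentile.
    index = max(0, math.ceil(percentile * n_perm / 100) - 1)
    return maxima[index]
```

Fixing the seed makes the threshold reproducible; in practice the scan is the expensive part, since the whole genome is reanalyzed for every shuffled data set.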

Linkage phase and QTL segregation pattern

After finding evidence for a QTL, it is possible to infer its linkage phase with the flanking markers simply from the signs of the estimates of \( \alpha_p^{*} \) and \( \alpha_q^{*} \). Because the configuration \( P_m^1 P^1 P_{m+1}^1 / P_m^2 P^2 P_{m+1}^2 \times Q_m^1 Q^1 Q_{m+1}^1 / Q_m^2 Q^2 Q_{m+1}^2 \) was used, with alleles \( P^1 \) and \( Q^1 \) having a positive effect on the phenotype, the estimates of both \( \alpha_p^{*} \) and \( \alpha_q^{*} \) are positive (Fig. 1). If a different configuration occurs, the sign of the corresponding estimate will be negative. Therefore, the linkage phase can be inferred simply by identifying the alleles that have a positive or negative effect on the phenotype, and it is not necessary to include the linkage phases in the model, as was done by Lin et al (2003). This approach considerably reduces the complexity and numerical problems of the EM algorithm.

The QTL segregation pattern depends on the relations between alleles. To infer these relations, several statistical hypotheses have been defined, and they need to be tested at the positions with evidence of a QTL, in one or two steps (Table 1). In step 1, \( H_{01} \), \( H_{02} \), and \( H_{03} \) are tested, one at a time. Depending on which hypotheses are rejected and on the signs of the significant estimates of the QTL effects, another hypothesis test may be necessary (step 2).

Table 1 Statistical hypotheses for testing segregation patterns and linkage phase of QTL

If only one of these three hypotheses is rejected, examining the sign of the significant effect estimate is enough to determine the segregation pattern and linkage phase (no step 2 required). For example, if only \( H_{01} \) is rejected and the sign of \( \alpha_p^{*} \) is positive, the inferred segregation is 1:1 and the linkage phase is \( P_m^1 P^1 P_{m+1}^1 / P_m^2 P^2 P_{m+1}^2 \). If \( \alpha_p^{*} \) is negative, the linkage phase is \( P_m^1 P^2 P_{m+1}^1 / P_m^2 P^1 P_{m+1}^2 \).

If more than one hypothesis is rejected at step 1, new tests are performed (step 2) that also consider the signs of the estimates. If two hypotheses are rejected, they need to be identified and the signs of the estimates of the significant effects need to be examined to determine which hypothesis needs to be tested in step 2. If both additive effects are significant, \( H_{04} \) needs to be tested; if \( \alpha_p^{*} \) and \( \delta_{pq}^{*} \) are not 0, \( H_{05} \) needs to be tested; otherwise, \( H_{06} \) needs to be tested. Depending on the results of the tests of \( H_{04} \), \( H_{05} \), and \( H_{06} \) (conditional on the signs of the estimates), the segregation patterns and linkage phases are inferred. For example, if \( H_{01} \) and \( H_{02} \) are both rejected and \( \alpha_p^{*} \) and \( \alpha_q^{*} \) are positive and negative, respectively, \( H_{04}: \alpha_p^{*} = -\alpha_q^{*} \) will be tested (two-sided); if it is rejected, the segregation is 1:1:1:1 and the linkage phases are \( P_m^1 P^1 P_{m+1}^1 / P_m^2 P^2 P_{m+1}^2 \) and \( Q_m^1 Q^2 Q_{m+1}^1 / Q_m^2 Q^1 Q_{m+1}^2 \) for parents P and Q, respectively. When \( H_{01} \), \( H_{02} \), and \( H_{03} \) are all rejected, it is necessary to test \( H_{04} \), \( H_{05} \), and \( H_{06} \) in step 2. In a similar way, conclusions are reached based on the signs of the estimates and on which hypotheses were rejected in step 2.
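The branching just described can be written as a small decision function. This sketch assumes, from the examples in the text, that the three step-1 hypotheses test the P additive effect, the Q additive effect, and the dominance effect, respectively; the function names are illustrative, and the case in which no step-1 hypothesis is rejected (not discussed in the text) simply returns an empty list.

```python
def step2_hypotheses(reject_h01, reject_h02, reject_h03):
    """Return the step-2 hypotheses to test, given the step-1 rejections."""
    n_rejected = sum((reject_h01, reject_h02, reject_h03))
    if n_rejected <= 1:
        # A single rejection is resolved from the sign of the estimate alone.
        return []
    if n_rejected == 3:
        return ["H04", "H05", "H06"]
    if reject_h01 and reject_h02:
        return ["H04"]  # both additive effects significant
    if reject_h01 and reject_h03:
        return ["H05"]  # P additive effect and dominance significant
    return ["H06"]      # Q additive effect and dominance significant


def increasing_allele(additive_estimate):
    """QTL allele (1 or 2) of a parent that increases the phenotype.

    A positive estimate corresponds to the reference phase, in which
    allele 1 has the positive effect; a negative one to the swapped phase.
    """
    return 1 if additive_estimate > 0 else 2
```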

The hypothesis tests of step 2 are implemented by obtaining new estimates for the parameters under the constraints imposed by each hypothesis. To this end, a T matrix is defined that imposes the constraints when multiplied by the D matrix. The DT matrix replaces D in the steps of the EM algorithm, and new maximum likelihood estimates are then easily obtained (Appendix A).

Simulation

To exemplify and validate the proposed model, a simulation study was conducted in a similar way to those presented by Zeng (1994), Kao and Zeng (1997) and Lin et al (2003). A full-sib progeny consisting of 300 individuals was considered, with a genetic map composed of four chromosomes. Each chromosome had 15 molecular markers equally spaced at 10 centiMorgans (cM), employing the Kosambi function (Kosambi 1944). Markers exhibiting distinct segregation patterns were considered using the notation proposed by Wu et al (2002a). Briefly, the markers are classified into four types according to their segregation pattern as follows: A (1:1:1:1), B (1:2:1, separated into B1, B2, or B3 depending on the presence of the null allele in parent P, in parent Q, or in neither of them, respectively), C (3:1), and D (1:1, labeled D1 when the heterozygous parent is P and D2 when it is Q). Of the 60 simulated markers, 15 were fully informative (type A); 15 were of the B type (equally distributed among the B1, B2, and B3 types); 10 were of the C type; and 20 belonged to the D type, with half of them being D1 and the other half being D2. The markers were randomly distributed along the chromosomes, resulting in the following distribution: chromosome one, 5 A, 1 B1, 0 B2, 2 B3, 4 C, 2 D1, and 1 D2; chromosome two, 3 A, 2 B1, 1 B2, 1 B3, 1 C, 2 D1, and 5 D2; chromosome three, 4 A, 0 B1, 1 B2, 2 B3, 2 C, 4 D1, and 2 D2; and chromosome four, 3 A, 2 B1, 3 B2, 0 B3, 3 C, 2 D1, and 2 D2. The order of the markers is indicated in Fig. 2.
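For reference, the Kosambi function relates map distance to recombination fraction, so the 10 cM spacing used here corresponds to a recombination fraction just under 0.10 between adjacent markers. A minimal sketch (function names are illustrative; marker positions are assumed to start at 0 cM):

```python
import math


def kosambi_recombination(distance_cm):
    """Recombination fraction for a map distance in cM (Kosambi 1944)."""
    return 0.5 * math.tanh(2.0 * distance_cm / 100.0)


def kosambi_distance(r):
    """Map distance in cM for a recombination fraction r < 0.5."""
    return 25.0 * math.log((1.0 + 2.0 * r) / (1.0 - 2.0 * r))


# Fifteen equally spaced markers span 140 cM per chromosome.
marker_positions = [10 * i for i in range(15)]
r_adjacent = kosambi_recombination(10.0)
```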

Fig. 2 QTL mapping for the simulated population without inclusion of cofactors (IM) and with inclusion of cofactors (CIM). Genetic markers are indicated by green triangles; the corresponding types (A, B1, B2, B3, C, D1, and D2) follow the notation of Wu et al (2002a)

The simulated trait had a heritability of 0.70 and was controlled by eight QTL located along the four chromosomes whose genetic effects were distributed such that they represented distinct linkage phases and segregation patterns. The effects were simulated as deviations of the mean, which was zero.
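The stated heritability fixes the ratio between the genetic and residual variances. As a worked relation, assuming (this is not stated in the text) that heritability is defined in the broad sense as \( h^2 = \sigma_g^2 / (\sigma_g^2 + \sigma_e^2) \), the residual variance implied by \( h^2 = 0.70 \) is:

$$ \sigma_e^2 = \sigma_g^2\, \frac{1 - h^2}{h^2} = \sigma_g^2\, \frac{0.30}{0.70} \approx 0.43\, \sigma_g^2 $$

Here \( \sigma_g^2 \) denotes the total genetic variance generated by the eight simulated QTL and \( \sigma_e^2 \) the variance of the normally distributed error.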

The conditional probabilities \( p_{uvj} \) were calculated at every 1 cM along each chromosome using the multipoint approach implemented in the OneMap software (Margarido et al 2007, 2011). Subsequently, composite interval mapping was performed using the new model. Additionally, QTL mapping was carried out in the absence of cofactors to determine whether the properties of the proposed model were similar to those described by Zeng (1994). Cofactor selection was performed by stepwise multiple linear regression using the BIC. For each included marker, three parameters (\( \alpha_p \), \( \alpha_q \), and \( \delta_{pq} \)) were added and their significance was tested. The parameters exhibiting non-significant (5 %) effects were removed from the model. As proposed by Zeng (1994), from all the selected cofactors, only markers located at a distance greater than 10 cM (window size) from the markers flanking the interval to be mapped were considered.

To search for QTL along the genome, we used the likelihood ratio test with three degrees of freedom. To declare a QTL, the threshold value used was LRT = 16.89 (LOD Score 3.66), obtained from 1,000 permutations at the 5 % significance level (Churchill and Doerge 1994). The remaining tests carried out for step 1 (\( H_{01} \), \( H_{02} \), and \( H_{03} \)) and step 2 (\( H_{04} \), \( H_{05} \), and \( H_{06} \)) were performed using one degree of freedom. These tests are performed only at positions with a putative QTL; thus, the problems derived from multiple testing do not arise (Jiang and Zeng 1995).
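The two scales quoted above are linked by LOD = LRT / (2 ln 10), which gives approximately 3.67 for LRT = 16.89, close to the reported LOD Score (the small difference presumably reflects rounding). The sketch below shows this conversion and, assuming a chi-square reference distribution for the one-degree-of-freedom tests (an assumption; the text states only the degrees of freedom), the corresponding p-value. Function names are illustrative.

```python
import math


def lrt_to_lod(lrt):
    """Convert a likelihood ratio statistic to the LOD scale."""
    return lrt / (2.0 * math.log(10.0))


def lod_to_lrt(lod):
    """Convert a LOD score to the likelihood ratio scale."""
    return lod * 2.0 * math.log(10.0)


def p_value_1df(lrt):
    """Upper-tail probability of a chi-square with one degree of freedom."""
    return math.erfc(math.sqrt(lrt / 2.0))


threshold_lod = lrt_to_lod(16.89)
```

Under this assumption, a one-degree-of-freedom test is significant at the 5 % level when its LRT exceeds about 3.84.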

Results

Interval mapping

As expected, interval mapping did not perform well for QTL detection (Fig. 2). For chromosome 1, two QTL were simulated (15 and 115 cM), but only one was mapped, at 5 cM. On chromosome 2, two of the three simulated QTL were detected. For chromosome 3, a large region of approximately 80 cM displayed an LOD Score above the threshold. However, the mapping results did not show conclusively whether there are two QTL located at 25 and 65 cM, as simulated, or only a single QTL with a residual effect on the adjacent intervals. On chromosome 4, a QTL located at 60 cM was detected, but there was also a possible false positive at 12 cM.

Cofactor selection

Eight cofactors were selected, and all of them flanked the regions spanning the simulated QTL. No overparameterization occurred in the CIM model because the sample size would accommodate a total of \( 2\sqrt{300}\approx 34 \) parameters. Fifteen genetic effects were included in the model, with eight markers used as cofactors (Table 2). Although some of the selected cofactors are informative in only one parent (D1 or D2), dominance effects were retained in some cases because, when the multipoint approach is employed to obtain the probabilities, genotype information is recovered even for markers that are not fully informative.

Table 2 Location of the cofactors in relation to the simulated QTL

Composite interval mapping

The results from CIM (Fig. 2) were more consistent than those obtained from IM, because all simulated QTL were mapped. False positives detected using interval mapping were eliminated. It is also noteworthy that virtually all QTL mapped using CIM exhibited higher LOD Scores than those mapped using IM, which indicates a higher statistical power of CIM. Therefore, our results are in agreement with those of Zeng (1994), who presented the advantages of CIM.

For chromosome 1, both simulated QTL were detected by CIM at 15 and 111 cM. The first QTL was mapped to the exact location of the simulated one, and the second one was located within the same interval.

The QTL at 15 cM, which was detected by both methods, had a higher LOD Score for CIM analysis, indicating the greater statistical power of this model.

The CIM model displayed good results for chromosome 2 as well because it could identify all three simulated QTL, which were not obtained using IM. The QTL positioned at 44 cM was also mapped with higher resolution by the CIM approach. In the case of chromosome 3, the results were also substantially improved using CIM because IM showed one large region spanning 80 cM, providing an imprecise location of the QTL, while CIM analysis correctly pointed out two distinct peaks at 20 and 68 cM.

The use of CIM for chromosome 4 removed the false QTL at 10 cM detected by IM and also detected a QTL at 61 cM with a higher LOD Score. The simulated QTL was located at 55 cM, so the mapped position lies outside, but adjacent to, the marker interval containing the simulated QTL. Visual inspection of Fig. 2 suggests that the confidence interval for the QTL spans the simulated position, so in an actual mapping situation this would not compromise practical applications of the results.

QTL segregation pattern and linkage phases

The inference of QTL segregation patterns was satisfactory for all tested situations (Table 3). All QTL that segregated in a 1:1 fashion (i, ii, vii) were correctly characterized, with only one hypothesis rejected at step 1. The estimated QTL were very close to the simulated ones. QTL with a 1:2:1 segregation pattern (iii, v, viii) were also correctly inferred. For the QTL with a 3:1 segregation pattern (iv), the segregation was well estimated, with three hypotheses rejected at step 1 and three not rejected at step 2.

Table 3 QTL mapping and characterization using the CIM model

QTL vi, which was simulated as 1:1:1:1 with additive effects larger than dominance effects, was mapped with a 1:2:1 segregation pattern, with two significant effects (\( \alpha_p^{*} = -\alpha_q^{*} \)) and no dominance effect \( \delta_{pq}^{*} \). In this case, the inferred segregation was distinct from the simulated one, most likely due to the small magnitude of the effects, which may have impaired the correct identification.

In general, the CIM model was very effective at estimating the linkage phases between QTL and markers. In all situations where QTL effects were significant, the linkage phases were always correctly estimated.

Discussion

In this work, we have presented a model for QTL mapping using full-sib progeny obtained from two diploid, non-inbred individuals. The model takes into account the distinct segregation patterns that the molecular markers and QTL may assume in the investigated context. The approach is based on the composite interval mapping model (Zeng 1993, 1994) that was first developed for inbred-based populations (BC, F2, RILs). To validate the model, we have simulated a quantitative trait with a 0.70 heritability controlled by eight QTL exhibiting distinct effects, segregation, and linkage phases.

In general, the model allowed us to map the simulated QTL with their correct characterizations. The model also provided correct estimates of the linkage phases for all QTL with significant effects, meaning that it was possible to identify the origin of the alleles that increased or reduced the phenotype. The main advantage of this feature is that the mapping results may be useful for marker-assisted selection in plant breeding programs, even if the inferences for segregation and/or the estimates of QTL effects are eventually incorrect.

The proposed model exhibits advantages over the approach devised by Lin et al (2003). In contrast to that previous work, we have not considered the linkage phases as parameters to be estimated by the model, but instead obtained them by interpreting the signs of the estimates of \( \alpha_p^{*} \) and \( \alpha_q^{*} \). This change considerably reduced the complexity of the EM algorithm, which allowed the model to be easily expanded to the CIM context. More complex situations found in multiple interval mapping (Kao and Zeng 1997; Kao et al 1999) and multiple trait and environmental mapping (Jiang and Zeng 1997) may also be easily investigated using the proposed model. Future studies may include investigations on epistatic interactions, interactions between QTL and environments, and correlation between traits.

Lin et al (2003) noted that the 1:1 segregation is tested by the assumption that one of the additive effects is zero and that the 1:2:1 pattern, similar to that found in F 2, occurs when marginal effects are statistically equal. However, the present model allows for the identification of the possible segregation patterns via a procedure to identify these situations and a bypass to avoid multiple testing problems.

A great advantage of the proposed model is that it is based on multipoint conditional probabilities, and therefore, the presence of informative markers along the linkage groups allows the detection of QTL exhibiting distinct segregation patterns (1:1:1:1, 1:2:1, 3:1), even for regions with less informative markers. As an example, in the work by Lin et al (2003), the authors did not present effective means to estimate the conditional probabilities among less informative markers, such as those with 1:1 or 3:1 segregation patterns. In the current model, the information on adjacent markers is recovered and the probabilities are estimated in a more precise way. Jiang and Zeng (1997) have proven the effectiveness of the model for inbred lines, but to our knowledge, our work is the first to use multipoint conditional probabilities through HMM for QTL genotypes with a full-sib progeny.

The successful use of the strategy suggested for identifying QTL segregation and linkage phases depends on correctly estimating their location. Thus, models allowing higher control of the residual variance are advantageous. Zeng (1994) notes that the use of multiple linear regression combined with interval mapping (Lander and Botstein 1989) provides more reliable estimates for QTL effects. In the present work, the CIM model positioned the QTL more precisely than the IM approach and also displayed higher statistical power for QTL mapping. In the model proposed by Lin et al (2003), the inclusion of cofactors is complicated because the extension of the EM algorithm is not straightforward. Moreover, by modeling QTL effects based on multipoint QTL probabilities, we were able to easily expand from IM to CIM. This method is also valid for more sophisticated models, such as multiple interval mapping (Kao and Zeng 1997; Kao et al 1999). Thus, the suggested model provides a sound basis for future research. An R package named fullsibQTL, which implements the models presented here, is under development and will be released soon.