Introduction

Much research has been done on linkage mapping of qualitative or quantitative trait loci (QTL). Schork investigated multipoint identity-by-descent analysis of human quantitative traits.1 Fulker and Cardon developed a sib-pair approach to two-point interval mapping of QTL.2 Researchers have since extended the available methods in several directions: (1) Fulker et al.3 extended the method of Fulker and Cardon2 for use in multipoint interval mapping; (2) Almasy and Blangero worked on multipoint mapping for general pedigrees;4 (3) Liang et al. proposed a unified sampling method for both qualitative and quantitative traits;5 (4) Pratt et al. used an exact multipoint algorithm to analyse family data by variance component models.6 The focus of the above studies was on linkage mapping, which is based on family data. Linkage analysis is appropriate for low resolution genetic mapping, localising trait loci to broad chromosome regions of a few cM (<10 cM).

Linkage disequilibrium (LD) mapping, or association study, on the other hand, is based on both family and population data, and is useful for high resolution genetic mapping, ie, fine disease gene mapping. The resolution is high because the allelic association due to linkage disequilibrium usually operates only over short genetic distances. Linkage analysis and linkage disequilibrium mapping are therefore complementary in disease gene mapping. To localise genetic traits, one may first carry out linkage analysis on a sparse map to obtain suggestive linkage between genetic traits and markers. Linkage disequilibrium mapping can then be used as a follow-up for high resolution mapping of the genetic traits on a denser map.

Abecasis et al.,7 Fulker et al.8 and Sham et al.9 have explored linkage and association studies of quantitative traits by variance–component procedures, allowing a simultaneous test of allelic association for family data. Zhao et al.10 applied a regression approach to linkage disequilibrium mapping to localise QTL in humans. In these studies the investigators used only one marker in their analysis. However, very dense maps, such as the map of single nucleotide polymorphisms (SNPs) in the human genome (The International SNP Map Work Group), are now available.11 These developments allow us to explore models and methodologies that use two or more markers simultaneously in high resolution linkage disequilibrium mapping of QTL.

In this article, we propose a linear regression method for high resolution mapping of QTL by linkage disequilibrium analysis based on population data. Assuming that two marker loci flank one trait locus, we introduce a linear regression model based on an intuitive rationale. We then derive analytical formulas for the parameter estimates and for the non-centrality parameters of appropriate tests of genetic effects and linkage disequilibrium coefficients. The merit of the regression method is shown by power calculation and comparison.

Models

Consider a quantitative trait which is influenced by a quantitative trait locus Q, flanked by two markers A and B in the order AQB. Suppose that there are two alleles $Q_1$ and $Q_2$ at the trait locus with frequencies $q_1$ and $q_2$. At the marker locus A, assume there are two alleles A and a with frequencies $P_A$ and $P_a$, respectively. For the marker B, assume that there are two alleles B and b with frequencies $P_B$ and $P_b$, respectively. Suppose that markers A and B are in Hardy–Weinberg equilibrium, ie,

$$P(AA)=P_A^2,\quad P(Aa)=2P_AP_a,\quad P(aa)=P_a^2,\qquad P(BB)=P_B^2,\quad P(Bb)=2P_BP_b,\quad P(bb)=P_b^2.$$

However, they may be in linkage disequilibrium. Let us denote the measure of linkage disequilibrium between trait locus Q and marker A by $D_{AQ}=P(AQ_1)-q_1P_A$, the measure of linkage disequilibrium between trait locus Q and marker B by $D_{QB}=P(BQ_1)-q_1P_B$, and the measure of linkage disequilibrium between marker A and marker B by $D_{AB}=P(AB)-P_AP_B$.12,13,14 In addition to the major QTL Q, assume that there is an error effect that influences the trait. Then the total trait variance can be decomposed into the variance explained by the putative QTL Q and the error variance, and the genetic variance is in turn decomposed into additive and dominant components. Assume that there are n independent individuals from a population with trait values $y_i$, genotype $A_i$ at marker A and genotype $B_i$ at marker B. Consider the following regression equation

$$y_i=\beta+w_i\gamma+x_{Ai}\alpha_A+x_{Bi}\alpha_B+z_{Ai}\delta_A+z_{Bi}\delta_B+e_i, \qquad (1)$$

where $\beta$ is the overall mean, $w_i$ is a row vector of covariates such as sex and age, $\gamma$ is a column vector of regression coefficients for the covariates $w_i$, and $e_i$ is an error term. Assume that $e_i$ is normal, $N(0,\sigma^2)$. In addition, $x_{Ai}$, $x_{Bi}$, $z_{Ai}$ and $z_{Bi}$ are dummy random variables that are independent of $e_i$, and are defined by

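$\alpha_A$, $\alpha_B$, $\delta_A$ and $\delta_B$ are regression coefficients of the dummy variables $x_{Ai}$, $x_{Bi}$, $z_{Ai}$ and $z_{Bi}$. Let us denote an experimental design matrix X by

$$X=\begin{pmatrix}1 & w_1 & x_{A1} & x_{B1} & z_{A1} & z_{B1}\\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots\\ 1 & w_n & x_{An} & x_{Bn} & z_{An} & z_{Bn}\end{pmatrix},$$

a vector of regression coefficients by $\mu=(\beta,\gamma^\tau,\alpha_A,\alpha_B,\delta_A,\delta_B)^\tau$, the quantitative traits by a vector $Y=(y_1,y_2,\ldots,y_n)^\tau$, and the error terms by $e=(e_1,e_2,\ldots,e_n)^\tau$. Then we may write the model (1) as $Y=X\mu+e$. By standard regression theory, we may estimate the coefficients by $\hat\mu=(X^\tau X)^{-1}X^\tau Y$.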
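A minimal sketch of how one row of the design matrix could be built when there are no covariates. The genotype coding used here is an assumption made for illustration, not a definition taken from the text: x is the centred count of the first allele, and z is a dominance contrast with values -q², pq, -p² (its sign is a convention). The function names `code_genotype`, `design_row` and `hwe_moments` are hypothetical.

```python
def code_genotype(copies, p):
    """Return (x, z) for a genotype with `copies` (0, 1 or 2) of the first allele.

    p is the population frequency of that allele. Assumed coding:
    x = copies - 2p (centred allele count), z = -q^2, pq, -p^2.
    """
    if copies not in (0, 1, 2):
        raise ValueError("copies must be 0, 1 or 2")
    q = 1.0 - p
    x = copies - 2.0 * p
    z = {2: -q * q, 1: p * q, 0: -p * p}[copies]
    return x, z


def design_row(copies_a, copies_b, p_a, p_b):
    """One row (1, x_A, x_B, z_A, z_B) of the design matrix, without covariates."""
    x_a, z_a = code_genotype(copies_a, p_a)
    x_b, z_b = code_genotype(copies_b, p_b)
    return [1.0, x_a, x_b, z_a, z_b]


def hwe_moments(p):
    """Mean and variance of x and z when the marker is in Hardy-Weinberg equilibrium."""
    q = 1.0 - p
    freq = {2: p * p, 1: 2.0 * p * q, 0: q * q}
    coded = {g: code_genotype(g, p) for g in freq}
    mean_x = sum(freq[g] * coded[g][0] for g in freq)
    mean_z = sum(freq[g] * coded[g][1] for g in freq)
    var_x = sum(freq[g] * (coded[g][0] - mean_x) ** 2 for g in freq)
    var_z = sum(freq[g] * (coded[g][1] - mean_z) ** 2 for g in freq)
    return mean_x, mean_z, var_x, var_z
```

With this assumed coding both variables have mean zero under Hardy–Weinberg equilibrium, with variances 2pq for x and (pq)² for z, which keeps the intercept separate from the genetic terms.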

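To give an intuitive rationale of model (1), let $\mu_{ij}$ be the effect of genotype $Q_iQ_j$, $i,j=1,2$, with $\mu_{12}=\mu_{21}$. Let the genic effect of allele $Q_i$ be $\alpha_i$, $i=1,2$. Then the genotypic effects can be expressed as $\mu_{11}=\mu_0+2\alpha_1+d_1$, $\mu_{12}=\mu_0+\alpha_1+\alpha_2+d_2$ and $\mu_{22}=\mu_0+2\alpha_2+d_3$, where $\mu_0$ is the overall population mean and $d_i$ is the deviation of the related genotypic value from that of an additive effect model. Minimising $q_1^2d_1^2+2q_1q_2d_2^2+q_2^2d_3^2$, the estimates of $\mu_0$, $\alpha_1$ and $\alpha_2$ are $\mu_0=q_1^2\mu_{11}+2q_1q_2\mu_{12}+q_2^2\mu_{22}$, $\alpha_1=q_1\mu_{11}+q_2\mu_{12}-\mu_0$, and $\alpha_2=q_1\mu_{12}+q_2\mu_{22}-\mu_0$ (Jacquard,15 Chapter 5). Plugging these estimates into $\mu_{ij}$, we obtain $\mu_{11}=\mu_0+2q_2\alpha_Q-q_2^2\delta_Q$, $\mu_{12}=\mu_0+(q_2-q_1)\alpha_Q+q_1q_2\delta_Q$ and $\mu_{22}=\mu_0-2q_1\alpha_Q-q_1^2\delta_Q$. Here $\alpha_Q=q_1\mu_{11}+(q_2-q_1)\mu_{12}-q_2\mu_{22}$ is the average effect of gene substitution, and $\delta_Q=2\mu_{12}-\mu_{11}-\mu_{22}$ is the dominant deviation. Assume that marker A coincides with the trait locus Q, and marker allele A is trait allele $Q_1$ and marker allele a is trait allele $Q_2$. Then the trait value can be expressed as $y_i=\mu+x_{Qi}\alpha_Q+z_{Qi}\delta_Q+e_i$. In practice, the genotype at the trait locus Q is unknown, but the information at marker loci is available. This prompts us to propose regression model (1) to map QTLs.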
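The decomposition of genotypic effects into a mean, an average effect of gene substitution and a dominant deviation can be checked numerically. The sketch below uses the definitions of αQ and δQ given above together with the standard least-squares identities of quantitative genetics; the function names `decompose` and `rebuild` are hypothetical.

```python
def decompose(mu11, mu12, mu22, q1):
    """Return (mu0, alpha_Q, delta_Q) for genotypic effects and allele frequency q1."""
    q2 = 1.0 - q1
    mu0 = q1 * q1 * mu11 + 2.0 * q1 * q2 * mu12 + q2 * q2 * mu22  # population mean
    alpha_q = q1 * mu11 + (q2 - q1) * mu12 - q2 * mu22           # average effect of substitution
    delta_q = 2.0 * mu12 - mu11 - mu22                            # dominant deviation
    return mu0, alpha_q, delta_q


def rebuild(mu0, alpha_q, delta_q, q1):
    """Recover (mu11, mu12, mu22) from the decomposition."""
    q2 = 1.0 - q1
    mu11 = mu0 + 2.0 * q2 * alpha_q - q2 * q2 * delta_q
    mu12 = mu0 + (q2 - q1) * alpha_q + q1 * q2 * delta_q
    mu22 = mu0 - 2.0 * q1 * alpha_q - q1 * q1 * delta_q
    return mu11, mu12, mu22
```

For example, `decompose(1.0, 1.0, -1.0, 0.5)` corresponds to a dominant trait with a = d = 1 at equal allele frequencies, and `rebuild` applied to its output returns the three genotypic effects unchanged.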

 Assume that there are no covariates. Suppose that the markers A and B are in Hardy–Weinberg equilibrium. Then ExAi=ExBi=EzAi=EzBi=0. When the sample size n is large enough, we show in Appendix A that the coefficients are approximately given by

If the markers A and B are in linkage equilibrium, ie, DAB=0, then the above equations simplify to the following (Appendix A)

Property of regression coefficients

As in the previous section, let $\mu_{ij}$ be the effect of genotype $Q_iQ_j$, $i,j=1,2$. If $\mu_{11}=a$, $\mu_{12}=d$ and $\mu_{22}=-a$ as in traditional quantitative genetics (Falconer and Mackay16), then $\alpha_Q=a+(q_2-q_1)d$ and $\delta_Q=2d$. For the general case, one may obtain the above relations by letting $a=\mu_{11}-(\mu_{11}+\mu_{22})/2$ and $d=\mu_{12}-(\mu_{11}+\mu_{22})/2$. It is well known that the additive variance is $2q_1q_2\alpha_Q^2$ and the dominant variance is $(q_1q_2\delta_Q)^2$. A true random effect model describing the trait value is

where

Let us denote three ratios . In Appendix B, we will show that the coefficients of regression equation (1) are given by

Assume that the two markers A and B are not in linkage disequilibrium, ie, DAB=0. Then , , and . Hence, marker A and marker B independently contribute to the analysis of the trait values. Furthermore, assume the trait locus Q is in linkage disequilibrium with marker A but not with marker B. Then DQB=0 and so . Hence, only marker A contributes to the analysis and marker B has no effect on the result. This is equivalent to using one marker for the analysis.

 If one marker coincides with the trait locus, for instance locus Q is marker A, we can show that the other marker B does not contribute to estimations of the substitution and dominant effects of the trait locus. Actually, assume that allele A=Q1 and allele a=Q2. Then DAB=DQB and DAQ=q1q2. This leads to . Hence, marker A can fully estimate the substitution and dominant effects of the trait locus Q.
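The special cases above can be checked numerically. The sketch below is not the derivation of the appendices: it assumes a centred allele-count coding for x and a dominance contrast (-q², pq, -p²) for z, random union of gametes, and three-locus haplotype frequencies built from the pairwise disequilibrium coefficients with no third-order disequilibrium. It enumerates all haplotype pairs, forms the population normal equations of model (1) without covariates, and solves them. All function names are hypothetical.

```python
from itertools import product


def haplotype_freqs(p_a, q1, p_b, d_aq, d_qb, d_ab):
    """Frequencies of the 8 haplotypes (A, Q, B); 1 codes allele A, Q1 or B."""
    marg = [p_a, q1, p_b]
    freqs = {}
    for hap in product((1, 0), repeat=3):
        pr = [m if h else 1.0 - m for m, h in zip(marg, hap)]
        s = [1.0 if h else -1.0 for h in hap]
        f = (pr[0] * pr[1] * pr[2]
             + s[0] * s[1] * d_aq * pr[2]
             + s[1] * s[2] * d_qb * pr[0]
             + s[0] * s[2] * d_ab * pr[1])
        if f < -1e-12:
            raise ValueError("parameters give a negative haplotype frequency")
        freqs[hap] = f
    return freqs


def coding(copies, p):
    """Assumed coding: x = copies - 2p, z = -q^2, pq, -p^2."""
    q = 1.0 - p
    return copies - 2.0 * p, {2: -q * q, 1: p * q, 0: -p * p}[copies]


def solve(m, b):
    """Gauss-Jordan elimination with partial pivoting for a small dense system."""
    n = len(b)
    a = [row[:] + [rhs] for row, rhs in zip(m, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[piv] = a[piv], a[col]
        for r in range(n):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [u - factor * v for u, v in zip(a[r], a[col])]
    return [a[i][n] / a[i][i] for i in range(n)]


def population_coefficients(p_a, q1, p_b, d_aq, d_qb, d_ab, mu11, mu12, mu22):
    """Population values of (beta, alpha_A, alpha_B, delta_A, delta_B)."""
    freqs = haplotype_freqs(p_a, q1, p_b, d_aq, d_qb, d_ab)
    value = {2: mu11, 1: mu12, 0: mu22}
    m = [[0.0] * 5 for _ in range(5)]
    b = [0.0] * 5
    for (h1, f1), (h2, f2) in product(freqs.items(), repeat=2):
        w = f1 * f2  # random union of gametes
        x_a, z_a = coding(h1[0] + h2[0], p_a)
        x_b, z_b = coding(h1[2] + h2[2], p_b)
        v = [1.0, x_a, x_b, z_a, z_b]
        y = value[h1[1] + h2[1]]  # genotypic effect at the trait locus
        for i in range(5):
            b[i] += w * v[i] * y
            for j in range(5):
                m[i][j] += w * v[i] * v[j]
    return solve(m, b)
```

Under these assumptions, with DAB=0 the solved coefficients for marker A equal DAQ·αQ/(PAPa) and DAQ²·δQ/(PAPa)², and when marker A is the trait locus itself the solution returns αQ and δQ for marker A and zero for marker B, in line with the statements above.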

In general, assume that marker A and marker B are in linkage disequilibrium. Then model (1) simultaneously takes care of the linkage disequilibrium and the effects of the putative trait locus Q. The parameters of linkage disequilibrium (ie, $D_{AQ}$ and $D_{QB}$) and gene effect (ie, $\alpha_Q$ and $\delta_Q$) are contained in the mean coefficients. We may simultaneously test linkage disequilibrium of marker A and marker B with trait locus Q, and the gene substitution and dominant effects, by testing $\alpha_A=\alpha_B=\delta_A=\delta_B=0$. From equation (4), we may test the linkage disequilibrium of markers A and B with the trait locus Q and the gene substitution effect $\alpha_Q$ by testing $\alpha_A=\alpha_B=0$. From equation (5), we may test the linkage disequilibrium of markers A and B with the trait locus Q and the dominant effect by testing $\delta_A=\delta_B=0$.

Non-centrality parameters

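Assume that there are no covariates. Then $\mu=(\beta,\alpha_A,\alpha_B,\delta_A,\delta_B)^\tau$. Let H be a $q\times5$ matrix of rank q. By Graybill,17 Chapter 6, the test statistic of a hypothesis $H\mu=0$ is non-central $F(q,n-5)$ defined by

$$F=\frac{(H\hat\mu)^\tau[H(X^\tau X)^{-1}H^\tau]^{-1}H\hat\mu/q}{Y^\tau[I_n-X(X^\tau X)^{-1}X^\tau]Y/(n-5)},$$

where $\hat\mu$ is the least squares estimate of $\mu$ and $I_n$ is the $n\times n$ identity matrix. The non-centrality parameter of the test statistic F can be calculated by $\lambda=[1/(2\sigma^2)](H\mu)^\tau[H(X^\tau X)^{-1}H^\tau]^{-1}H\mu$. To test if there are additive and dominant effects, we may test the hypothesis $H_{AB,ad}:\alpha_A=\alpha_B=\delta_A=\delta_B=0$. Then the test matrix H is defined by

$$H=\begin{pmatrix}0&1&0&0&0\\0&0&1&0&0\\0&0&0&1&0\\0&0&0&0&1\end{pmatrix}.$$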

Let us denote the corresponding F-test statistic by FAB,ad. In Appendix C, we show

If one assumes that (a) the two markers A and B are not in linkage disequilibrium, then DAB=0; (b) the trait locus Q is in linkage disequilibrium with marker A but not with marker B, then DQB=0 and . Then , which only involves marker A and can be written as λA,ad. Correspondingly, we denote the related F-test statistic by FA,ad. Furthermore, assume (c) there is no dominant effect, ie, . Then is the non-centrality parameter of the related F-test statistic FA,a.

To test other hypotheses, we may get the non-centrality parameters in a similar way by taking appropriate test matrices H. To test if there is a dominant effect, we may test the hypothesis $H_{AB,d}:\delta_A=\delta_B=0$. The non-centrality parameter is . The related F-test statistic is denoted by $F_{AB,d}$. To test if there is an additive or substitution effect, we may test the hypothesis $H_{AB,a}:\alpha_A=\alpha_B=0$. The non-centrality parameter is . The related F-test statistic is denoted by $F_{AB,a}$. To test if there are additive and dominant effects at marker locus A given that there are effects at marker locus B, we may test the hypothesis $H_{A|B,ad}:\alpha_A=\delta_A=0$. The non-centrality parameter is

To test if there is dominant effect at marker locus A given that there are effects at marker locus B, we may test the hypothesis HA|B,d : δA=0. The non-centrality parameter is .
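A rough sketch of how the non-centrality parameter for the joint hypothesis αA=αB=δA=δB=0 could be evaluated for large n. It is an approximation under stated assumptions rather than the formula of Appendix C: the coded variables are taken to have mean zero, x to be a centred allele count and z a dominance contrast (so that the second moments are 2PAPa, 2DAB, (PAPa)² and DAB²), and XτX is replaced by n times the population moment matrix. The function name `noncentrality_ab_ad` is hypothetical.

```python
def noncentrality_ab_ad(n, sigma2, p_a, p_b, d_ab,
                        alpha_a, alpha_b, delta_a, delta_b):
    """Large-sample lambda = n * c' V c / (2 * sigma2), c = (alpha_A, alpha_B, delta_A, delta_B).

    V is the assumed covariance matrix of (x_A, x_B, z_A, z_B):
    Var(x_A) = 2 P_A P_a, Cov(x_A, x_B) = 2 D_AB,
    Var(z_A) = (P_A P_a)^2, Cov(z_A, z_B) = D_AB^2, and x is uncorrelated with z.
    """
    v_a = p_a * (1.0 - p_a)
    v_b = p_b * (1.0 - p_b)
    additive = (2.0 * v_a * alpha_a ** 2
                + 4.0 * d_ab * alpha_a * alpha_b
                + 2.0 * v_b * alpha_b ** 2)
    dominant = (v_a ** 2 * delta_a ** 2
                + 2.0 * d_ab ** 2 * delta_a * delta_b
                + v_b ** 2 * delta_b ** 2)
    return n * (additive + dominant) / (2.0 * sigma2)
```

Setting the coefficients of marker B to zero leaves only the terms of marker A, which mirrors the reduction to a one-marker test described above.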

Power calculation and comparison

To investigate the usefulness of the methods proposed in this article, we performed power and sample size calculations. As usual, we denote the heritability by $h^2$, defined as the proportion of the total trait variance that is explained by the QTL. In the power calculations, we first take equal allele frequencies $P_A=q_1=P_B=0.5$ at the two markers A and B and the trait locus Q. Moreover, suppose that $\mu_{11}=a$, $\mu_{12}=\mu_{21}=d$ and $\mu_{22}=-a$. Assume that marker A and marker B are in linkage equilibrium, ie, $D_{AB}=0$, the heritability $h^2=0.25$, and a sample size $n=120$. Figures 1 and 2 show the power curves of the test statistics $F_{AB,ad}$, $F_{A,ad}$ and $F_{A,a}$ against the disequilibrium coefficient $D_{AQ}$ when $D_{QB}=0.15$, for a mode of dominant inheritance with $a=d=1.0$ and a mode of recessive inheritance with $a=1.0$, $d=-0.5$, respectively. The statistic $F_{AB,ad}$ has the highest power, and $F_{A,ad}$ has higher power than $F_{A,a}$. Hence, the regression approach that uses the two markers A and B has an advantage over one-marker mapping that uses only marker A or marker B.
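A minimal sketch of how power could be computed from a non-centrality parameter using only the Python standard library. It is an approximation, not the calculation behind the figures: the non-central F(q, n−5) statistic is replaced by a non-central chi-square with q degrees of freedom divided by q, which is reasonable only when n is large. The non-centrality parameter is taken in the 1/(2σ²) convention used above, so it is assumed to be the mean of the Poisson mixing distribution. All function names are hypothetical.

```python
import math


def gamma_p(a, x):
    """Regularised lower incomplete gamma function P(a, x), by its power series."""
    if x <= 0.0:
        return 0.0
    term = 1.0 / a
    total = term
    for k in range(1, 100000):
        term *= x / (a + k)
        total += term
        if term < 1e-17 * total:
            break
    return total * math.exp(a * math.log(x) - x - math.lgamma(a))


def chi2_cdf(x, df):
    """Central chi-square distribution function."""
    return gamma_p(df / 2.0, x / 2.0)


def chi2_critical(df, level):
    """Upper critical value of the central chi-square, found by bisection."""
    lo, hi = 0.0, 200.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if chi2_cdf(mid, df) < 1.0 - level:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def power_chi2_approx(lam, df, level=0.05):
    """Approximate power; lam is the Poisson mean of the non-central mixture."""
    crit = chi2_critical(df, level)
    cdf = 0.0
    weight = math.exp(-lam)  # Poisson probability of j = 0
    cum = 0.0
    j = 0
    while cum < 1.0 - 1e-12 and j < 10000:
        cdf += weight * chi2_cdf(crit, df + 2 * j)
        cum += weight
        j += 1
        weight *= lam / j
    return 1.0 - cdf
```

With `lam = 0` the function returns the significance level, and the power increases with the non-centrality parameter. For small samples, an exact non-central F distribution from a statistical package would be the more appropriate tool.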

Figure 1

Power curves of the test statistics FAB,ad, FA,ad, and FA,a against the disequilibrium coefficient DAQ when q1=PA=PB=0.50, DAB=0.0, DQB=0.15, h2=0.25, n=120 for a dominant trait a=d=1.0.

Figure 2

Power curves of the test statistics FAB,ad, FA,ad, and FA,a against the disequilibrium coefficient DAQ when q1=PA=PB=0.50, DAB=0.0, DQB=0.15, h2=0.25, n=120 for a recessive trait a=1.0 and d=−0.5.

Assume that the markers A and B are in moderate linkage disequilibrium, ie, $D_{AB}=0.1$, and that the linkage disequilibrium coefficients are $D_{AQ}=D_{QB}=0.15$. Figures 3 and 4 show the power curves of the test statistics $F_{AB,ad}$, $F_{A,ad}$ and $F_{A,a}$ against the heritability $h^2$ for a mode of dominant inheritance with $a=d=1.0$ and a mode of recessive inheritance with $a=1.0$, $d=-0.5$, respectively. For a population sample of size $n=250$, the regression approach can achieve high power for a trait with heritability $h^2\geq0.15$. Hence, high resolution linkage disequilibrium mapping is a promising tool for mapping complex traits.

Figure 3

Power curves of the test statistics FAB,ad, FA,ad, and FA,a against the heritability h2 when q1=PA=PB=0.50, DAB=0.10, DAQ=DQB= 0.15, n=250 for a dominant trait a=d=1.0.

Figure 4

Power curves of the test statistics FAB,ad, FA,ad, and FA,a against the heritability h2 when q1=PA=PB=0.50, DAB=0.10, DAQ=DQB=0.15, n=250 for a recessive trait a=1.0 and d=−0.5.

In a population, linkage disequilibrium arises when mutations occur at the trait locus. Once the mutations have occurred, recombination between a marker locus and the trait locus dissipates the disequilibrium from generation to generation. Let us denote the frequency of haplotype AQ at the generation when the mutations occur by $P(AQ)(0)$. Then the linkage disequilibrium coefficient is $D_{AQ}(0)=P(AQ)(0)-q_1P_A$ at the generation when the mutations occur. In the following generations, the disequilibrium coefficient is reduced by a factor $1-\theta_{AQ}$ in each generation,12 where $\theta_{AQ}$ is the recombination fraction between trait locus Q and marker A. Suppose that the mutation is already T generations old. Then the disequilibrium coefficient is $D_{AQ}(T)=D_{AQ}(0)(1-\theta_{AQ})^T$. Similarly, we may calculate the disequilibrium coefficients by $D_{AB}(T)=D_{AB}(0)(1-\theta_{AB})^T$ and $D_{QB}(T)=D_{QB}(0)(1-\theta_{QB})^T$, where $\theta_{QB}$ is the recombination fraction between trait locus Q and marker B, and $\theta_{AB}$ is the recombination fraction between marker A and marker B.

Suppose that we know the map distance $\lambda_{AB}$ between marker A and marker B. Under the assumption of no interference, we may calculate the recombination fraction $\theta_{AB}=[1-\exp(-2\lambda_{AB})]/2$ by Haldane's map function. Similarly, we may calculate the recombination fractions $\theta_{AQ}$ and $\theta_{QB}$ from the map distances $\lambda_{AQ}$ and $\lambda_{QB}$. Assume that the map distance between marker A and marker B is $\lambda_{AB}=5$ cM, and the other parameters are given by $D_{AB}(0)=0.20$, $D_{AQ}(0)=D_{QB}(0)=0.25$, $h^2=0.25$, $n=120$, $T=20$. Figures 5 and 6 show the power curves of the test statistics $F_{AB,ad}$, $F_{A,ad}$ and $F_{A,a}$ against the recombination fraction $\theta_{AQ}$, for a mode of dominant inheritance with $a=d=1.0$ and a mode of recessive inheritance with $a=1.0$, $d=-0.5$, respectively. We can see that the power of $F_{AB,ad}$ remains very high, whereas the power of $F_{A,ad}$ and $F_{A,a}$ decreases very rapidly as the recombination fraction $\theta_{AQ}$ increases. Hence, regressions using two markers are advantageous for fine gene mapping, and are appropriate for dense marker maps such as SNPs in the human genome.
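The two formulas above, Haldane's map function and the geometric decay of disequilibrium, can be sketched as follows. The function names `haldane_theta` and `decayed_d` are hypothetical, and the map distance is assumed to be supplied in centimorgans and converted to Morgans inside the function.

```python
import math


def haldane_theta(map_distance_cm):
    """Recombination fraction from a map distance in cM (Haldane, no interference)."""
    morgans = map_distance_cm / 100.0
    return 0.5 * (1.0 - math.exp(-2.0 * morgans))


def decayed_d(d0, theta, generations):
    """Disequilibrium coefficient after `generations` generations of recombination."""
    return d0 * (1.0 - theta) ** generations


# Parameters quoted in the text: 5 cM between the markers, D_AB(0) = 0.20, T = 20.
theta_ab = haldane_theta(5.0)
d_ab_now = decayed_d(0.20, theta_ab, 20)
```

With these values the recombination fraction is a little under 0.05, and the disequilibrium between the two markers falls to roughly 0.075 after 20 generations.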

Figure 5

Power curves of the test statistics FAB,ad, FA,ad, and FA,a against the recombination fraction θAQ when q1=PA=PB=0.50, DAB(0)=0.20, DAQ(0)=DQB(0)=0.25, h2=0.25, λAB=5 cM, n=120, T=20 for a dominant trait a=d=1.0.

Figure 6

Power curves of the test statistics FAB,ad, FA,ad, and FA,a against the recombination fraction θAQ when q1=PA=PB=0.50, DAB(0)=0.20, DAQ(0)=DQB(0)=0.25, h2=0.25, λAB=5 cM, n=120, T=20 for a recessive trait a=1.0 and d=−0.5.

To investigate a less favourable case, in which the allele frequencies at the trait locus and the marker loci are unequal, Figure 7 shows the power curves of FAB,ad, FA,ad, and FA,a against the linkage disequilibrium coefficient DAQ when q1=0.20, PA=PB=0.80, DAB=0.0, DQB=0.04, h2=0.25, n=120 for a dominant trait a=1.0 and d=0.8. The three power curves are very close. Moreover, the power decreases rapidly when the linkage disequilibrium between trait locus Q and marker A decreases. For a recessive trait a=1.0 and d=−0.5, Figure 8 shows the power curves against the recombination fraction when q1=0.20, PA=PB=0.80, DAB(0)=0.10, DAQ(0)=DQB(0)=−0.15, h2=0.25, λAB=5 cM, n=120, T=20.

Figure 7

Power curves of the test statistics FAB,ad, FA,ad, and FA,a against the disequilibrium coefficient DAQ when q1=0.20, PA=PB=0.80, DAB=0.0, DQB=0.04, h2=0.25, n=120 for a dominant trait a=1.0 and d=0.8.

Figure 8

Power curves of the test statistics FAB,ad, FA,ad, and FA,a against the recombination fraction θAQ when q1=0.20, PA=PB=0.80, DAB(0)=0.10, DAQ(0)=DQB(0)=−0.15, h2=0.25, λAB=5cM, n=120, T=20 for a recessive trait a=1.0 and d=−0.5.

Figures 9 and 10 show two plots of the sample size against the heritability h2 at a significance level of 0.05 for a given power of 0.80. In a favourable case, when q1=PA=PB=0.50, DAB=0.10, DAQ=DQB=0.15 for a dominant trait a=1.0 and d=0.80, the required sample size is lower than 400 if the heritability is not lower than 0.1 (Figure 9). However, for a much less favourable case, when q1=0.20, PA=PB=0.80, DAB=0.0, DAQ=0.03, DQB=0.04 for a recessive trait a=1.0 and d=−0.5, the required sample size is huge (Figure 10). Unfortunately, the true QTL frequency is rarely, if ever, known. Hence, linkage disequilibrium mapping works only when the linkage disequilibria are reasonably high; at least moderate linkage disequilibria are needed.

Figure 9

Sample sizes of the test statistics FAB,ad and FA,ad against the heritability h2 at a significance level of 0.05 for a given power of 0.80, when q1=PA=PB=0.50, DAB=0.10, DAQ=DQB=0.15 for a dominant trait a=1.0 and d=0.80.

Figure 10

Sample sizes of the test statistics FAB,ad and FA,ad against the heritability h2 at a significance level of 0.05 for a given power of 0.80, when q1=0.20, PA=PB=0.80, DAB=0.0, DAQ=0.03, DQB=0.04 for a recessive trait a=1.0 and d=−0.5.

Discussion

With the development of dense marker maps, such as SNPs in the human genome (The International SNP Map Work Group11), fine disease gene mapping is becoming more and more important for the study of complex diseases. Association study is a simple and useful method in fine disease gene mapping (Cardon and Bell18; Risch and Merikangas19). In this article, we proposed a linear regression method to perform high resolution linkage disequilibrium mapping of QTLs. In the regression, we used information from two flanking markers to model the additive and dominant effects of a QTL, and also the linkage disequilibria between the markers and the trait locus. In addition to the additive and dominant effects, covariates may be added to model their effects. Owing to its simplicity, the method can be easily carried out with routine statistical software such as SAS and Splus.

After studying the merits of the two-marker method proposed in this article, we concluded that it is well suited for mapping complex diseases. It provides higher power than the approach that uses only one marker. The advantages of high resolution mapping have been explored by many authors using linkage analysis of family data or plant/animal data.20,21,22,23,24,25 However, there has not been sufficient statistical analysis of high resolution mapping by linkage disequilibrium methods. Using population data, Zhao et al.10 applied an approach of linkage disequilibrium mapping based on regression to map QTL in humans. Abecasis et al.,7 Allison et al.,26 Fulker et al.,3 Göring and Terwilliger,27 and Sham et al.9 have explored linkage and association studies of quantitative traits by variance–component procedures allowing a simultaneous test of allelic association for family data. One interesting approach is to combine both family and population data, and perform combined linkage analysis and linkage disequilibrium high resolution mapping.

The power of linkage disequilibrium mapping depends on the existence of disequilibrium between a trait locus and a marker. In a population, linkage disequilibrium exists if mutations at the trait locus occur. In the absence of tight linkage, the degree of linkage disequilibrium decreases very rapidly after a few generations due to the recombination between the trait locus and the markers. Hence, linkage disequilibrium mapping is appropriate for the analysis of dense marker maps to do high resolution fine gene mapping. In practice, one can perform linkage disequilibrium mapping following prior evidence of linkage. Linkage analysis is less sensitive to population stratification, population history, or environmental effects. Moreover, linkage mapping is appropriate for low resolution mapping to localise trait loci to broad chromosome regions (<10 cM). The two methods, linkage mapping and linkage disequilibrium mapping, are complementary for disease gene mapping.

Potential problems of linkage disequilibrium mapping include population stratification, population history, or environmental effects. It is well understood that for the same number of individuals, family based linkage disequilibrium methods are less powerful than the population based methods. However, utilising family based linkage disequilibrium approaches may avoid false positives due to the sources of linkage disequilibrium such as population admixtures rather than linkage. One research area is to combine the population and pedigree data to do linkage disequilibrium mapping, and use the pedigree data alone to perform linkage mapping (Fulker et al.8).

As in Sham et al.,9 we notice that the non-centrality parameter is reduced by a factor equal to R2AQ for additive variance, and a factor of R4AQ for dominant variance, if we use only one marker A to perform analysis. Hence, the power decreases rapidly when the linkage disequilibrium between the trait locus and the marker is reduced. The degree of linkage disequilibrium depends heavily on the map distance between the trait locus and the marker locus, and most likely maintains high linkage disequilibrium when the two loci are very close. Hence, the high resolution mapping method proposed in this article has a good potential for being used in fine disease gene mapping. As mentioned in Sham et al.,9 the property of the measurements R2AQ, R2QB and R2AB needs more investigation, and their roles in different scenarios should be studied more thoroughly.
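A small worked example of these reduction factors. It assumes the usual squared correlation between two biallelic loci, R² = D²/(PA·Pa·q1·q2), as the definition of R²AQ; that definition and the function names are assumptions made for illustration.

```python
def r_squared(d, p_marker, p_trait):
    """Assumed squared correlation between a marker allele and a trait allele."""
    return d * d / (p_marker * (1.0 - p_marker) * p_trait * (1.0 - p_trait))


def one_marker_reduction(d_aq, p_a, q1):
    """Factors applied to the additive and dominant variance with one marker."""
    r2 = r_squared(d_aq, p_a, q1)
    return r2, r2 * r2


# Favourable setting of Figure 9 and unfavourable setting of Figure 10.
favourable = one_marker_reduction(0.15, 0.5, 0.5)
unfavourable = one_marker_reduction(0.03, 0.8, 0.2)
```

In the favourable setting the factors are 0.36 and about 0.13, whereas in the unfavourable setting they are about 0.035 and 0.0012, which is consistent with the very large sample sizes reported for that case.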