Introduction

In the field of statistical genetics, methods such as linkage disequilibrium (LD) analysis have long been used to fine-map trait genes once linkage analysis has narrowed the gene to a small genomic interval (eg, 1–5 cM).1, 2 By LD analysis, we mean any method of analysis that compares differences in allele, haplotype, or multi-locus genotype frequency distributions among a case population and a control population. A commonly used LD method is a case control study (a population-based strategy), which uses allele, genotype, haplotype, or multilocus haplotype data from a group of unrelated cases and from controls, individuals who are matched with cases on factors such as ethnicity, gender, and age.1, 3 One limitation of such methods is that they are not robust to unbalanced matching of cases and controls.1 As an alternative, methods that use family-based controls have been proposed as a replacement for population-based methods.4, 5, 6, 7, 8 Such methods are robust to population stratification (unbalanced matching). Building on the work of Falk and Rubinstein,4 Ott,9 and Julier et al,10 Spielman et al6 developed the McNemar test as a test for linkage and named it the transmission/disequilibrium test (TDT). The sampling unit for this test is a trio of a father, mother, and an affected child genotyped at a di-allelic marker locus that is hypothesized to be close to a trait locus, and for which at least one of the parents is heterozygous at the marker locus. An important feature about this statistic is that it is valid (ie, does not increase type I error), as a test of linkage, for multiplex families11 (although the TDT is not valid as a test of association in the presence of linkage for multiplex families). As a result, the original TDT is one of the most widely studied statistical genetics tests of the last decade (the original 1993 paper has over 1500 citations in the ISI Web of Science as of this writing.)

Two potential limitations of the TDT statistic regard its robustness. Curtis and Sham12 showed that computation of the TDT statistic on trios in which one parent is missing marker genotype data increases the type I error rate of the statistic. Also, Gordon et al,13 using simulated data, demonstrated that random genotyping errors that result in Mendelian consistent genotype data for trios also cause an increase in type I error when these data are analyzed with the TDT. Mitchell et al14 proved analytically that undetected genotyping error can cause apparent transmission distortion at markers with alleles of unequal frequency, that this distortion is in the direction of over-transmission for common alleles, and thus undetected genotyping errors may contribute to an inflated false-positive rate among reported TDT-derived associations.

A number of authors have developed extensions of the TDT that address the first robustness issue and that are valid in the presence of missing parental genotype data.15, 16, 17 In particular, Weinberg's reformulation of the TDT in a likelihood framework17 made it possible to address both robustness issues. Regarding the genotype errors robustness issue, Gordon et al13 developed an extension of the TDT, called the TDTae (subscript ‘ae’ means ‘allowing for errors’), which is a valid test for linkage with genotype data from trios in the presence of random genotyping error. To our knowledge, however, no one has developed a TDT that addresses both of these robustness issues jointly for general pedigrees. The purpose of this work, therefore, is the presentation of a new TDT, which we also call the TDTae, that is a valid test for linkage in the presence of LD for pedigrees that have any number of untyped parents and that have random genotyping errors. We note that our ‘new’ TDTae reduces to the original TDTae13 when we assume (Materials and methods – Appendix) that R2=R12, and the genotype error model considered assumes errors in alleles as opposed to genotypes.13, 18

One of the key features of the TDTae is that it assumes a particular error model structure. In their 2001 work, Gordon et al presented an error model assuming that errors are introduced randomly and independently into alleles at a di-allelic locus. This error model was also considered for studies with cases and controls.18 Since that time, a number of new error models, based on differing assumptions about the nature of genotyping errors, have been introduced into the literature.19, 20 We comment that our TDTae is capable of using any of these error models. In this work, for reasons that will be described in the next section (Materials and methods – Error models), we consider only three error models: those introduced by Douglas et al,19 Sobel et al,20 and Mote and Anderson.21

It should be noted that recently other TDT tests that allow for random genotyping errors have been published.22, 23 The work by Bernardinelli et al uses a Bayesian framework. One advantage of this TDT, which in its present formulation is valid only for trios, is that the Bayesian formulation facilitates addressing the issue of a large number of parameters in the likelihood via Markov chain Monte Carlo (MCMC) methods22 (also see Summary and discussion). The TDT developed by Morris and Kaplan has the advantage of being extendable to multi-locus haplotypes.23

Materials and methods

We begin this section by commenting that all notations used from this point forward are defined in the appendix.

Error models

In this subsection on error models, we describe some general assumptions we make regarding errors. We then list the error models we consider for this work. Finally, we describe some features about the individual error models. To begin, a key assumption that we make throughout this work is that genotyping errors occur randomly and independently in any set of genotype data under consideration. In what follows, we consider three possible error models which we name after their respective authors: (1) Douglas Skol Boehnke (DSB);19 (2) Sobel Papp Lange (SPL);20 and (3) Mote Anderson (MA).21 Each error model can be completely described by its error model parameters. For each error model, we list the parameters below (Appendix – Error model parameters). In supplemental Tables 1–3 (see http://linkage.rockefeller.edu/derek/TDTAE2-error-tables.htm), we document how the penetrance functions (see below; Appendix – Likelihood equation terms) are defined in terms of the specific error model parameters.

Here, we present a list of some of the features of the different models. Through the remainder of this work, we assume that all marker loci have two alleles, labeled 1 and 2. The DSB model introduces errors into genotypes, and is the only model for which it is not possible for a homozygous 11 genotype to be incorrectly recoded as a homozygous 22 genotype, or vice versa. It is described by two parameters. The SPL model is, for di-allelic loci, described by three parameters. It is the most general error model possible for di-allelic loci, under the constraint that the errors for each genotype are allele-independent.20

The MA model, which is the most general error model possible in the sense that it can describe all other error models, is described by six parameters. The SPL and MA error models allow for the possibility that one homozygote is incorrectly coded as another homozygote. Finally, the models are nested in the following sense: The SPL model reduces to the DSB model by setting v2=0; the MA model reduces to the SPL model by setting e31=e13, e21=e23, e12=e32 (see below; Appendix – Error models). This nesting property is useful in our likelihood framework. It allows us to perform a generalized likelihood ratio test24 to determine which error model best fits the data when genotyping errors leading to Mendelian inconsistencies are observed.

Terminology

Consistent

This term is used in reference to a set of genotypes for a given pedigree structure. It means that there are no observed Mendelian inconsistencies in the pedigree for the given set of genotypes.

Valid

This term refers to a property of a test statistic. If a test is valid under some condition, it means that the test statistic maintains the correct type I error rate when that condition is true.

Likelihood function for consistent pedigrees

Assuming that we know the affection status of each individual in a pedigree , we can classify each individual in the pedigree into one of the following mutually disjoint categories:

  1. 1)

    The individual is a founder.

  2. 2)

    The individual is an affected child.

  3. 3)

    The individual is an unaffected child who is a parent.

  4. 4)

    The individual is an unaffected child who is not a parent.

Without loss of generality, we can reorder the pedigree so that the first i1 individuals are in category (1) (ie, founders), the next i2 individuals are in category 2, and so on. Note that

For our likelihood calculations, we do not consider people in category (4). For a further discussion about individuals in this category, see the Summary and discussion section. Consider now a consistent set of genotypes for the pedigree . Using the notation listed in the appendix and discarding the genotypes of those individuals in category (4), we can write the set as

where the first i1 individuals are founders, the next i2 individuals are affected children, and the remaining i3 individuals are unaffected children who are also parents. Note that the number of elements in this set is ni4. The likelihood of these data, , is given by the formula:

We compute the conditional probabilities for all possible consistent trios as a function of the genotypic relative risks R1 and R2 in Table 1.

Table 1 Conditional probabilities for trios with genotypes ga, gf(a), gm(a)

Our likelihood equation (1) bears a resemblance to the pedigree likelihood equation of Elston and Stewart25 and therefore some further comments on the relation of the two methods are warranted. While both of these likelihood methods are applied to phenotype and genotype data for general pedigrees, there is a major difference between the two. The Elston–Stewart method computes likelihoods as a function of the recombination fraction between a disease and marker locus, whereas our method computes likelihoods as a function of genotypic relative risks. While the Elston–Stewart algorithm is designed to test for linkage whether or not there is any association, our method tests for linkage only in the presence of association.

Note that, in our likelihood equation (1), we assume that founder mating type frequencies are the product of the genotype frequencies for each of the founder parents (terms . While this assumption may be violated when there is moderate to severe population stratification (the condition for which the original TDT was developed), we make this assumption to reduce the computational complexity of the problem. Please see the Summary and discussion section for more details on this issue.

Likelihood function for inconsistent pedigrees

Having provided the likelihood function for consistent pedigrees (Equation (1)), we now consider the case where the set of observed genotypes is inconsistent. To compute the likelihood of this data set requires that we assume a particular error model . Using the definition of conditional probability and the law of total probabilities, we obtain the likelihood as

where , is the set of all sets of consistent genotypes for the pedigree structure , is the probability of observing the set of genotypes conditional on the true set of consistent genotypes being and the error model being , and is the likelihood for the set (Equation (1)).

We now provide a more explicit formulation of the conditional probability in Equation (2). Recall from the introduction where we commented that a key assumption of our error models is that errors occur randomly and independently in a set of genotypes. It follows from this assumption that we can write the conditional probability as:

In Equation (3), the penetrance is set to 1 for those individuals ai who are missing genotypes for the di-allelic marker locus being tested.

There are several nice features about the likelihood expressed in formulas (1)–(3). First, we note that the likelihood can be computed on a set of inconsistent genotypes from an arbitrary pedigree, not just trios or nuclear families. Second, with the weighting specified in Equation (3) for untyped individuals, we see that the likelihood will be accurate regardless of the number of untyped parents in the pedigree (untyped affected children who are not parents are not used in our calculations). This property is not true for the original TDT when the genotype data are consistent.12 The validity of this likelihood and of the consequent likelihood ratio statistic in the presence of untyped parents extends work previously done by other authors who modified the original TDT to a TDT that is valid in the presence of missing parental data.15, 16, 17

Test statistic

Given (Equations 1, 2 and 3), we are now ready to present our test statistic, the generalized TDTae. As mentioned in the appendix (Likelihood ratio test), under the null hypothesis H0, we assume that the genotypic relative risks are both 1; that is, R1=R2=1.0. The generalized TDTae is a likelihood ratio test, and for a given set of observed genotypes (consistent or inconsistent), it is defined as:

It is important to note that, when applying this likelihood ratio test to genotype data for a di-allelic locus, we compute the corresponding statistic treating the 1 allele as the wild-type allele, and the 2 allele as the disease allele.

Maximization

When applying our test statistic, we perform a two-stage maximization procedure. We first compute the log-likelihood under the null and alternative hypotheses using a lattice of points from a multi-dimensional rectangle. We ‘cut’ the cube into a pre-specified number of intervals, and compute the log-likelihoods for the end points of each of the intervals. For example, if we consider the SPL error model, and specify four cuts, then the parameters p11, p12, and v1v3, all of which have values in the interval [0,1], will be tested at 4+1=5 values: 0.0, 0.25, 0.5, 0.75, and 1.0. For the relative risk parameters R1 and R2, we initially consider the interval [0, 20]. Thus, in the first stage of our maximization, the log-likelihood is computed for (c+1)4+e values under the alternative hypothesis, and for (c+1)2+e values under the null, where c is the number of cuts specified, and e is the number of error model parameters for a given error model. The parameter e=2, 3, and 6 for the DSB, SPL, and MA error models, respectively.

Once the log-likelihoods are computed in the first stage, the parameter values that provide the top five log-likelihoods under each hypothesis are then used as starting values for the Powell maximization procedure.26, 27, 28 We use the Powell procedure as implemented in the ‘Numerical Recipes in C’ text.29 The largest log-likelihood from each set of five runs is then chosen as the maximum log-likelihood for each hypothesis.

Null simulations

We simulated 1000 null replicates for each of six different scenarios:

[Two settings for percentage of individuals genotyped (100% or 80%)] × [Three error models (DSB, SPL, or MA)]

In each simulation, 200 fixed pedigree structures with a total of 873 individuals were used. The pedigree structures come from an ongoing psoriasis disease study.30 Each pedigree had at least one affected offspring from a total of 443 affected individuals in all 200 pedigrees. The median number of individuals in a pedigree was four, with the largest pedigree having 13 individuals, and the smallest having three (trio). The SIMULATE program31 was used to simulate the null data, and a computer program was written to both randomly insert errors into individuals' genotypes and randomly remove individuals' genotypes from the analysis. The minor marker allele frequency was 0.25. For the error simulations, the following parameter settings were used:

DSB: γ=0.01, η=0.04.

SPL: v1=0.02, v2=0.01, v3=0.05.

MA: eij=0.01(1≤i, j≤3, ij).

In addition, we perform the Kolmogorov–Smirnov (KS) goodness of fit test32, 33, 34 (as implemented in S-PLUS 6.1) to determine whether the empirical distribution of TDTae statistics for each simulation fits well to a central χ2 distribution with two degrees of freedom, the appropriate null distribution for the TDTae according to likelihood ratio theory.24 A large value for the KS test, or equivalently, a small P-value (less than 0.05) indicates that we reject the null hypothesis that the data come from such a central χ2 distribution.

We used five cuts for the initial search when considering the DSB and SPL simulations, and two cuts when considering the MA simulations. We chose two cutpoints for the MA simulations because each additional cutpoint resulted in a large computational cost.

Real data applications

Psoriasis

We applied our TDTae method to a study of psoriasis. In all, 75 single nucleotide polymorphisms (SNPs) and 32 polymorphic microsatellites from chromosome 17q25 at an average resolution of 80 kb were genotyped in 242 multiply-affected psoriasis families. Significant linkage was found for multiple SNPs in the 17q25 region on Chromosome 1730. We present here TDTae analyses for 16 of the SNP markers. We chose the SPL error model when applying our test statistic for the reason that we consider it to be a reasonable compromise between a general error model (MA) and one with a small number of parameters (DSB), which is more computationally tractable. Also, we used five cuts for the initial search.

Sitosterolemia

We also applied our TDTae method to a study of sitosterolemia, a rare recessive disorder.35 In all, 28 polymorphic microsatellites from chromosome 2p21 at an average resolution of 1 cM were genotyped in 30 nuclear and extended pedigrees. Results of those analyses were published previously,2 and subsequently two genes, ABCG5 and ABCG8, were cloned.36, 37 TDT methods were applied because evidence for linkage disequilibrium was detected with several of the loci.2 The P-values reported for the TDTae method are corrected for multiple testing. Also, we note that no observed genotyping errors were found in our analyzed data set. We considered the SPL model, and used five cut points for the initial search.

For a multi-allelic locus, the TDTae statistic is computed by downcoding all alleles to two alleles; the allele of interest vs all other alleles. For a marker with i alleles, each of which appears at least 30 times, i tests are performed. We choose the maximum likelihood ratio test statistic among all alleles, and multiply the corresponding uncorrected P-value by the number of alleles tested. The product of this number and the uncorrected P-value is the corrected P-value. This multiple testing procedure has also been applied to other well-known transmission disequilibrium tests.6, 38, 39

Results

Null simulations

We present the results of our null simulations in Table 2. These results indicate that, for our simulated data sets, the TDTae maintains correct type I error (with the exception of the MA model at the 10% level of significance when 80% of the individuals are genotyped, which is slightly conservative) and that the empirical distribution is well described by a central χ2 distribution with two degrees of freedom, as indicated by the KS goodness of fit test results (all P-values are >0.05). Again, the only exception to the KS test findings is for the MA model when 80% of the individuals are genotyped. For that simulation, the P-value from the KS goodness of fit test is 0.037. We hypothesize that this lack of goodness of fit stems from the fact that only two cuts were used for MA maximization (Materials and methods – Null simulations) and that therefore the maximum may not have been found in each replicate. Having said that, we do note that the distribution was ‘well-behaved’ in the tails, in the sense that the TDTae test statistic rejected the null hypothesis in the appropriate proportions for the 5 and 1% levels of significance. Also, given the behavior of the statistic for all other simulations, we hypothesize that if we increased the number of cutpoints, the TDTae test statistic would have the appropriate central χ2 distribution for this simulation as well.

Table 2 Empirical significance levels and 95% confidence intervals for TDTae null simulations (2500 replicates in each simulation)

Psoriasis data set

In Figure 1, we present P-values (−log transformed) for 16 of the SNP markers. This figure indicates that two of the markers (SNPs #5 and #15) displayed significant evidence for linkage at the 5% level with the TDTae after correction for multiple testing via the false discovery rate (FDR) method.40, 41 The threshold for 5% significance [−log(P)=2.34] is indicated in Figure 1 by a horizontal dotted line.

Figure 1
figure 1

Plot of –log(P-value) for TDTae statistic applied to 16 SNP markers genotyped in Psoriasis study.30 In this figure, the dotted horizontal line (x=2.34) is the threshold for significance at the 5% level of the value –log(P-value), after correcting for multiple testing using the FDR method.40, 41

We also present the maximum likelihood estimates (MLEs) of each of the parameters in the TDTae using the SPL error model for SNP markers #5 and #15 (Figure 1) in Table 3. It is important to note that there were observed genotyping errors for each of these markers. In fact, for marker #15 (the marker with the largest TDTae statistic value), under the alternative hypothesis, the SPL error model MLEs were v1=0.033, and v3=0.016.

Table 3 Maximum likelihood estimates of TDTae parameters for two SNP markers that show significant evidence for linkage in psoriasis study30

Sitosterolemia data set

In Figure 2, we present corrected P-values (−log transformed) for the 28 microsatellite markers using both our TDTae method (solid lines) and the original TDT method6 (dotted lines). We observe that, in almost all instances, our method provides more significant results than the original TDT method. We note that the most significant P-value for TDTae, 1.57 × 10−9, occurs for marker D2S2298, which is approximately 20 000 base pairs from the genes ABCG5 and ABCG836. Furthermore, the marker D2S414, with the second most significant P-value for TDTae, 1.00 × 10−6, is approximately a half million base pairs from D2S2298. These two markers are the only two that remain significant at the 0.0001 level after correction for multiple testing with the FDR method.40, 41

Figure 2
figure 2

Plot of –log(P-value) for TDTae statistic (solid line) and original TDT (dotted line) applied to 28 microsatellite markers genotyped in Sitosterolemia study. In this figure, we compute P-values for the TDTae applied to microsatellite markers using the following formula: the TDTae statistic is computed by downcoding all alleles at a locus to two alleles; the allele of interest vs all other alleles. For a marker with i alleles, each of which appears at least 30 times, i tests are performed. We choose the maximum likelihood ratio test statistic among all alleles, and multiply the corresponding uncorrected P-value by the number of alleles tested. The product of this number and the uncorrected P-value is the corrected P-value, and this is value that we report (−log transformed).

We comment that our TDTae method is more powerful than the classic TDT method for these data, even though no genotyping errors were observed. The reason for the increase in power is due to the fact that the genotypic relative risks do not satisfy the assumption R2 = Rfrac12;, based on the MLEs. For example, the MLEs for marker D2S2298 are R1=0.03003 and R2=0.077182, clearly not satisfying the multiplicative model condition assumed in the original TDT model. Having said that, we comment additionally that even without genotyping error, our method still provides correct type I error rates when there is missing parental data. In the sitosterolemia data set, 16% of all parents were missing all genotype data.

Summary and discussion

Since the development of the first TDT statistic, there has been much methodological research focusing on this test. Two key robustness issues are missing parental genotype information and genotyping errors. While there have been methods developed to address both issues separately, there has been no such method that addresses these issues jointly for general pedigrees. Our work seeks to fill that research void. We have developed the statistic using a likelihood framework, allowing for a more general disease model through the use of the relative risk framework of Weinberg.17 Also, our simulation results suggest that our test statistic maintains proper type I error rate in the presence of genotyping error and missing parental data. Finally, our real data analyses suggest that our test statistic may be powerful for both complex traits (eg, psoriasis) and Mendelian traits (eg, sitosterolemia).

We note that we assumed that the mating type frequencies of founders are given by the product of the individual genotype frequencies, unlike Weinberg.17 We make this simplification to reduce the number of parameters that we must maximize in finding the maximum log-likelihood of the data. While it may be more powerful to use the six mating types, we comment that our simplification reduces the number of parameters to be estimated by three. However, our assumption does make our statistic potentially non-robust to population stratification, the original condition for which the TDT and other statistics were developed.4, 6 We plan to extend our method to handle the more general mating-type frequencies proposed in Weinberg's work.17 However, we conjecture that, to get appropriate maximum likelihood estimates under the null and alternative hypotheses in a reasonable amount of time, we need approximate likelihood solutions like those implemented with MCMC methods (eg, see Sobel et al20). We also make this point in the next paragraph regarding larger numbers of individuals in a general pedigree. We note that the authors Bernardinelli et al22 have already implemented MCMC methods in their version of the TDT allowing for errors.

While we have laid the preliminary groundwork here for our generalized TDTae statistic, we note that this statistic requires further development. The likelihood ratio test is based on large sample theory, and may not be valid for small samples or situations where there is a significant amount of missing data. Also, our method's computation increases as the number of individuals in a pedigree increases. Thus, at present, our test statistic may only be useful for nuclear families and some smaller extended pedigrees. We plan to further develop this test statistic so that it will be valid for small samples and also that it is computationally feasible for general pedigrees. This may require the use of approximate likelihood approaches.42, 43, 44 This work is in progress.

Another important point to mention is that we do not consider unaffected siblings of affected siblings who are not parents in our likelihood formulation (Materials and methods – Individuals in category (4)). We note that, when there are no genotyping errors, such individuals can provide linkage information, particularly when parents are not typed.45 However, it is an interesting and open research question as to whether such individuals provide sufficient additional information when genotyping errors are present to balance the increase in computational cost that results from their inclusion.

Finally, we note that we have developed software to perform analyses using our TDTae method. The software may be downloaded from our website (ftp://linkage.rockefeller.edu/software/tdtae2/). The software will take LINKAGE-format31 files.