Optimal Trend Tests for Genetic Association Studies of Heterogeneous Diseases

The Cochran-Armitage trend test is a standard procedure in genetic association studies. It is a directed test with high power to detect genetic effects that follow the gene-dosage model. In this paper, the author proposes optimal trend tests for genetic association studies of heterogeneous diseases. Monte-Carlo simulations show that the power gain of the optimal trend tests over the conventional Cochran-Armitage trend test is striking when the genetic effects are heterogeneous. Easy-to-use code for the R software, version 3.1.2 (R Foundation for Statistical Computing, Vienna, Austria), is provided. The optimal trend tests are recommended for routine use.

Scientific Reports | 6:27821 | DOI: 10.1038/srep27821

Methods

Notation. For a marker with two alleles a and A, each individual in a case-control study is genotyped as one of three genotypes, aa, Aa and AA (indexed by i = 0, 1, 2, respectively). Assume that the case-control study consists of a total of n = r + s subjects (r cases and s controls). These n subjects can be classified into a 2 × 3 table based on each subject's genotype and disease status, as shown in Table 1.
Let (x_0, x_1, x_2) = (0, c, 1), where the coefficient c can assume any value. Under the null hypothesis of no genetic association, the following test statistic is distributed asymptotically as a chi-square distribution with one degree of freedom:

Z(c)^2 = \frac{n \left[ \sum_{i=0}^{2} x_i (s r_i - r s_i) \right]^2}{r s \left[ n \sum_{i=0}^{2} x_i^2 n_i - \left( \sum_{i=0}^{2} x_i n_i \right)^2 \right]},  (1)

where r_i and s_i are the numbers of cases and controls, respectively, with genotype i, and n_i = r_i + s_i. The test with a coefficient of 0.5, Z(0.5), is the familiar Cochran-Armitage trend test 11-15.

Optimal Trend Test. Assume that the non-diseased population is in Hardy-Weinberg equilibrium with an allele frequency (for the A allele) of q. The expected genotype frequencies for the controls are then, respectively,

(q_0, q_1, q_2) = \left( (1-q)^2, \; 2q(1-q), \; q^2 \right).  (2)

Further assume that the genetic effect is heterogeneous: the allele relative risk (relative risk per A allele) is not a constant value but may vary from person to person. Let the expected value of this relative risk be denoted as RR, and its coefficient of variation (standard deviation divided by mean; a measure of heterogeneity) as CV_RR. The expected allele frequency for the cases is then

p = \frac{q \times RR}{1 - q + q \times RR},  (3)

and its variance, calculated by a Taylor approximation (S1 Exhibit), is then

\mathrm{Var}(p) = \frac{q^2 (1-q)^2 \, RR^2 \, CV_{RR}^2}{(1 - q + q \times RR)^4}.  (4)

This variance is also the Hardy-Weinberg disequilibrium coefficient in the diseased population, and therefore the expected genotype frequencies for the cases are, respectively,

(p_0, p_1, p_2) = \left( (1-p)^2 + \delta, \; 2p(1-p) - 2\delta, \; p^2 + \delta \right),  (5)

where δ = Var(p).
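The statistic in Equation (1) is straightforward to compute from a 2 × 3 table. The paper's own code is in R (S5 Exhibit); the following is merely a Python sketch, with the function name and the example counts our own illustration:

```python
import numpy as np
from scipy.stats import chi2

def trend_stat(cases, controls, c):
    """Trend test statistic with genotype scores (0, c, 1).

    cases, controls: genotype counts (aa, Aa, AA) for cases and controls.
    Returns the chi-square statistic (1 df under the null) and its p-value.
    """
    x = np.array([0.0, c, 1.0])
    r_i = np.asarray(cases, dtype=float)
    s_i = np.asarray(controls, dtype=float)
    n_i = r_i + s_i
    r, s, n = r_i.sum(), s_i.sum(), n_i.sum()
    num = n * (x * (s * r_i - r * s_i)).sum() ** 2
    den = r * s * (n * (x ** 2 * n_i).sum() - (x * n_i).sum() ** 2)
    z2 = num / den
    return z2, chi2.sf(z2, df=1)

# c = 0.5 gives the familiar Cochran-Armitage trend test
# (the counts below are illustrative, not from the paper):
z2, p = trend_stat(cases=(10, 40, 50), controls=(50, 40, 10), c=0.5)
```

Setting c = 0 or c = 1 instead scores the heterozygote with either homozygote, which corresponds to the recessive and dominant codings discussed later in the paper.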
In the above calculations, we assumed Hardy-Weinberg equilibrium for the non-diseased population and a gene-dosage genetic model (a constant increase or decrease in risk per A allele). We now relax these assumptions. In general, the expected genotype frequencies for the controls are, respectively,

(q_0, q_1, q_2) = \left( (1-q)^2 + \Delta, \; 2q(1-q) - 2\Delta, \; q^2 + \Delta \right),  (6)

where Δ is the Hardy-Weinberg disequilibrium coefficient in the non-diseased population. The expected genotype relative risks are, respectively,

\left( 1, \; RR^{2\gamma}, \; RR^2 \right),  (7)

where γ is a genetic model parameter: γ = 0 corresponds to an autosomal recessive model; γ = 0.5, a gene-dosage model; and γ = 1, an autosomal dominant model. As before, we allow the parameter RR to have a coefficient of variation CV_RR, and the parameter p (though here it may not be interpreted as the expected allele frequency for the cases) to have a variance as prescribed in Equation (4). Under these conditions, the expected genotype frequencies for the cases (p_0, p_1 and p_2) can be derived from a Taylor expansion. The formulas are rather cumbersome and are relegated to S2 Exhibit. With the p_i and q_i calculated for i = 0, 1 and 2, simple algebra shows that the following optimal coefficient maximizes the test statistic in Equation (1):

c_{\text{optimal}} = \frac{(p_1 - q_1)/t_1 - (p_0 - q_0)/t_0}{(p_2 - q_2)/t_2 - (p_0 - q_0)/t_0},  (8)

where t_i = (r p_i + s q_i)/n, for i = 0, 1 and 2, respectively, are the expected genotype frequencies in the total case-control sample. Z(c_optimal) is our proposed optimal trend test.
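For the simple setting of the previous section (controls in Hardy-Weinberg equilibrium, gene-dosage model), all quantities entering Equation (8) have closed forms, so the optimal coefficient can be computed directly. A Python sketch of that calculation follows; the function name and inputs are our own illustration, and the general case (arbitrary γ and Δ) would instead require the S2 Exhibit formulas:

```python
import numpy as np

def optimal_coefficient(q, RR, CV_RR, r, s):
    """Optimal score c for the simple case: HWE controls, gene-dosage model.

    q: A-allele frequency in controls; RR: mean allele relative risk;
    CV_RR: its coefficient of variation; r, s: numbers of cases and controls.
    """
    # Control genotype frequencies under Hardy-Weinberg equilibrium
    q_i = np.array([(1 - q) ** 2, 2 * q * (1 - q), q ** 2])
    # Expected case allele frequency, and its Taylor-approximated variance
    # (Equation (4) in the text), which also serves as the Hardy-Weinberg
    # disequilibrium coefficient delta in the diseased population
    p = q * RR / (1 - q + q * RR)
    delta = (q * (1 - q) * RR * CV_RR) ** 2 / (1 - q + q * RR) ** 4
    # Case genotype frequencies with disequilibrium coefficient delta
    p_i = np.array([(1 - p) ** 2 + delta,
                    2 * p * (1 - p) - 2 * delta,
                    p ** 2 + delta])
    # Expected genotype frequencies in the total case-control sample
    t_i = (r * p_i + s * q_i) / (r + s)
    # Scores proportional to (p_i - q_i) / t_i, rescaled so x0 = 0, x2 = 1
    # (our reading of Equation (8))
    u = (p_i - q_i) / t_i
    return (u[1] - u[0]) / (u[2] - u[0])

# With no heterogeneity (CV_RR = 0) the coefficient should sit near the
# Cochran-Armitage value of 0.5:
c0 = optimal_coefficient(q=0.4, RR=2.0, CV_RR=0.0, r=1000, s=1000)
```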
An Example. We use published case-control data to demonstrate our method. Zhang et al.23 examined the association between the adenosine diphosphate ribosyltransferase (ADPRT) gene (Val762Ala polymorphism) and lung cancer risk. The data (1000 cases and 1018 controls) are shown in Table 2. From these data we calculate the expected genotype frequencies in the total case-control sample, and from Equation (8), the optimal coefficient. Using Equation (1), we then calculate the optimal trend test statistic, from which we obtain a very small p-value of 0.00095. By comparison, the conventional Cochran-Armitage trend test for this example results in a higher p-value of 0.00164. Zhang et al.23 used a chi-square test with two degrees of freedom, which resulted in an even higher p-value of 0.00420. Such differences in p-values should not be taken lightly, considering that a severe multiple-testing penalty often has to be paid before declaring significance in a genetic association study.

Simulation Study. We perform a simulation study to examine the statistical properties of the optimal trend test. The non-diseased population is assumed to be in Hardy-Weinberg equilibrium (Δ = 0), with an allele frequency of q = 0.4. We assume a gene-dosage genetic model (γ = 0.5), and we consider situations where the A allele is a risk allele (RR = 2, 1.5, and 1.25) and a protective allele (RR = 0.5, 0.67, and 0.8), in turn. For each scenario, we use a sample-size formula for the Cochran-Armitage trend test13 to calculate the sample size needed for a case-control study (assuming equal numbers of cases and controls) to achieve a power of 0.8 at a significance level of 0.05.
We consider various values of CV_RR: 0.0 (no heterogeneity), 0.1, 0.2, …, 1.0 (profound heterogeneity). For each value of q, RR and CV_RR, we use Equation (8) to calculate the optimal coefficient. We then perform Monte-Carlo simulations (a total of 1,000,000 simulations for each scenario) to calculate the empirical power of the optimal trend test at the sample sizes described above. For comparison, we also calculate the empirical power of the Cochran-Armitage trend test.

Figure 1 presents the results when the A allele is a risk allele (panels A, C, and E for the coefficients; panels B, D, and F for the empirical powers). When the genetic effect is homogeneous (CV_RR = 0), the optimal coefficients as calculated from Equation (8) are very close to the coefficient of the Cochran-Armitage trend test, namely, 0.5. As a result, the powers of the optimal trend test and the Cochran-Armitage trend test are very similar. As the genetic effect becomes more heterogeneous (larger CV_RR), the optimal coefficient decreases (down to below zero), and the power of the optimal trend test increases (up to ~100%). The coefficient decrease and power increase are more striking for a weaker genetic effect (RR = 1.25; panels E and F) than for a stronger genetic effect (RR = 2; panels A and B). By comparison, the Cochran-Armitage trend test uses a constant coefficient of 0.5, and its power decreases gradually with greater heterogeneity. Figure 2 presents the results when the A allele is a protective allele. The findings are similar to those in Fig. 1 for a risk allele, except that as the genetic effect becomes more heterogeneous, the optimal coefficient deviates from 0.5 in the other direction, increasing to beyond 1.0 rather than decreasing. We also consider different values of q, Δ and γ, and the results (S3 Exhibit) all show the superiority of the optimal trend test over the conventional Cochran-Armitage trend test.

Table 2. Association between the adenosine diphosphate ribosyltransferase (ADPRT) gene (Val762Ala polymorphism) and lung cancer risk (data taken from ref. 23).
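A scaled-down version of such a power comparison can be sketched as follows. This is Python rather than the paper's R; the scenario values, sample sizes and replicate count are illustrative (far smaller than the paper's 1,000,000 runs), and case genotypes are sampled directly from the expected frequencies p_i, a simplification of the paper's simulation design:

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(1)

def trend_stat(cases, controls, c):
    # Trend statistic with scores (0, c, 1); chi-square with 1 df under the null
    x = np.array([0.0, c, 1.0])
    r_i = np.asarray(cases, dtype=float)
    s_i = np.asarray(controls, dtype=float)
    n_i = r_i + s_i
    r, s, n = r_i.sum(), s_i.sum(), n_i.sum()
    num = n * (x * (s * r_i - r * s_i)).sum() ** 2
    den = r * s * (n * (x ** 2 * n_i).sum() - (x * n_i).sum() ** 2)
    return num / den

# Illustrative scenario (not the paper's): q = 0.4, RR = 2, CV_RR = 1,
# 100 cases and 100 controls, HWE controls, gene-dosage model
q, RR, CV, r, s = 0.4, 2.0, 1.0, 100, 100
q_i = np.array([(1 - q) ** 2, 2 * q * (1 - q), q ** 2])
p = q * RR / (1 - q + q * RR)
delta = (q * (1 - q) * RR * CV) ** 2 / (1 - q + q * RR) ** 4  # Var(p), Taylor
p_i = np.array([(1 - p) ** 2 + delta,
                2 * p * (1 - p) - 2 * delta,
                p ** 2 + delta])
t_i = (r * p_i + s * q_i) / (r + s)        # expected pooled frequencies
u = (p_i - q_i) / t_i
c_opt = (u[1] - u[0]) / (u[2] - u[0])      # optimal coefficient

crit = chi2.isf(0.05, df=1)                # 5% critical value, ~3.84
n_rep = 2000                               # modest number of replicates
hits_opt = hits_ca = 0
for _ in range(n_rep):
    case_counts = rng.multinomial(r, p_i)
    control_counts = rng.multinomial(s, q_i)
    hits_opt += trend_stat(case_counts, control_counts, c_opt) > crit
    hits_ca += trend_stat(case_counts, control_counts, 0.5) > crit
print(f"c_opt = {c_opt:.3f}, power(optimal) ~ {hits_opt / n_rep:.2f}, "
      f"power(CA) ~ {hits_ca / n_rep:.2f}")
```

In this heterogeneous scenario the optimal coefficient falls well below 0.5 and the optimal trend test rejects more often than the Cochran-Armitage test, in line with the pattern reported for Fig. 1.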

Discussion
The optimal trend test as proposed in this paper is a directed test that is most sensitive for a particular specified alternative. The optimal coefficient depends on the effect of the study gene (mean RR, variability CV RR and genetic model γ) and on the underlying population (allele frequency q, and Hardy-Weinberg disequilibrium coefficient Δ ). This a priori information is to be supplied by researchers, either by a literature search or an educated guess. As shown in this study, the power gain over the conventional Cochran-Armitage trend test is striking when the genetic effects are very heterogeneous.
Sometimes it can be difficult to pinpoint exactly one set of RR, CV_RR, γ, q and Δ for calculating the optimal coefficient, but suggesting a list of possible sets of parameter values may be easier. Assuming that a researcher comes up with a total of m sets of parameter values, he/she can input these into our Equation (8) to calculate a total of m optimal coefficients, c_1^optimal, …, c_m^optimal, and then input these into our Equation (1) for a total of m optimal trend tests. Next, a summary test can be performed based on a weighted sum of these m test statistics:

Z_{\text{summary}}^2 = \sum_{j=1}^{m} w_j \, Z(c_j^{\text{optimal}})^2,

where w_1, …, w_m are the weights given to reflect the plausibility of each set of parameter values. The multiple-testing problem should not concern us here, because we make one and only one summary test. Under the null hypothesis of no genetic association, Z_summary^2 is distributed asymptotically as a mixture of chi-square variables (detailed in S4 Exhibit). (The test reduces to the optimal trend test in this paper when m = 1.) The proposed optimal trend tests (and the summary test) are easy to calculate. S5 Exhibit presents code for the R 3.1.2 software (R Foundation for Statistical Computing, Vienna, Austria) and a number of worked examples. The R program also allows for the direct input of the coefficients. For example, if one suspects a gene-dosage model with heterogeneous effects, one can input one coefficient slightly above 0.5, say c_1 = 0.8, another coefficient slightly below 0.5, say c_2 = 0.2, and w_1 = w_2 = 1 into the R program to test Z_summary^2 = Z(0.8)^2 + Z(0.2)^2. As another example, if one is uncertain about the genetic model, one can input c_1 = 0.5 (gene dosage), c_2 = 1 (autosomal dominant), c_3 = 0 (autosomal recessive), and w_1 = w_2 = w_3 = 1 into the R program to test Z_summary^2 = Z(0.5)^2 + Z(1)^2 + Z(0)^2.
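The summary test can likewise be sketched in code. The mixture-of-chi-squares null law (S4 Exhibit) is not reproduced here; instead, this Python illustration approximates the null distribution by simulating tables under the pooled genotype frequencies. The genotype counts, coefficients and weights below are hypothetical, and this is not the paper's S5 Exhibit program:

```python
import numpy as np

def trend_stat(cases, controls, c):
    # Trend statistic with genotype scores (0, c, 1)
    x = np.array([0.0, c, 1.0])
    r_i = np.asarray(cases, dtype=float)
    s_i = np.asarray(controls, dtype=float)
    n_i = r_i + s_i
    r, s, n = r_i.sum(), s_i.sum(), n_i.sum()
    num = n * (x * (s * r_i - r * s_i)).sum() ** 2
    den = r * s * (n * (x ** 2 * n_i).sum() - (x * n_i).sum() ** 2)
    return num / den

def summary_test(cases, controls, coeffs, weights, n_null=5000, seed=0):
    """Z2_summary = sum_j w_j * Z(c_j)^2, with the p-value approximated by
    simulating tables under the pooled (no-association) genotype frequencies
    rather than via the asymptotic mixture of chi-squares."""
    rng = np.random.default_rng(seed)
    r_i, s_i = np.asarray(cases), np.asarray(controls)
    r, s = int(r_i.sum()), int(s_i.sum())
    t_i = (r_i + s_i) / (r + s)          # pooled genotype frequencies
    stat = lambda rc, sc: sum(w * trend_stat(rc, sc, c)
                              for c, w in zip(coeffs, weights))
    obs = stat(r_i, s_i)
    null = np.array([stat(rng.multinomial(r, t_i), rng.multinomial(s, t_i))
                     for _ in range(n_null)])
    return obs, (null >= obs).mean()

# Gene-dosage-with-heterogeneity example from the text: c1 = 0.8, c2 = 0.2,
# w1 = w2 = 1 (the genotype counts here are hypothetical)
z2, pval = summary_test((30, 45, 25), (50, 40, 10),
                        coeffs=[0.8, 0.2], weights=[1, 1])
```

Because exactly one summary statistic is referred to one null distribution, no multiplicity correction is applied, matching the argument in the text.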