A penalized mixture model approach in genotype/phenotype association analysis for quantitative phenotypes.

A mixture normal model has been developed to partition genotypes in predicting quantitative phenotypes. Its estimation and inference are performed through an EM algorithm. This approach can conduct simultaneous genotype clustering and hypothesis testing. It is a valuable method for predicting the distribution of quantitative phenotypes among multi-locus genotypes across genes or within a gene. This mixture model's performance is evaluated in data analyses for two pharmacogenetics studies. In one example, thirty five CYP2D6 genotypes were partitioned into three groups to predict pharmacokinetics of a breast cancer drug, Tamoxifen, a CYP2D6 substrate (p-value = 0.04). In a second example, seventeen CYP2B6 genotypes were categorized into three clusters to predict CYP2B6 protein expression (p-value = 0.002). The biological validities of both partitions are examined using established function of CYP2D6 and CYP2B6 alleles. In both examples, we observed genotypes clustered in the same group to have high functional similarities. The power and recovery rate of the true partition for the mixture model approach are investigated in statistical simulation studies, where it outperforms another published method.


Introduction
Genetic association studies have been widely used to identify risk factors for complex diseases or to predict drug-treatment outcomes (efficacy or toxicity). One important approach is called candidate gene approaches. 1 It is frequently selected to investigate genes in known signaling and metabolic pathways. This approach typically narrows gene targets to a handful of candidates deemed to have a stronger potential of affecting outcomes. Consequently, it is feasible to investigate a dense SNP set per gene. For example, in our pharmacokinetics study of Tamoxifen in breast cancer patients, 35 CYP2D6 alleles were investigated from more than 70 known CYP2D6 polymorphisms.
In a candidate gene study, multiple SNPs per gene usually lead to many haplotypes or alleles, creating high genotype dimensions for genotype/ phenotype association analysis. A striking example is the CYP2D6 gene, which has greater than 70 alleles, including mutations, deletions, insertions, gene conversions and duplications (www.imm. ki.se/CYPalleles).
To have potential clinical benefit, association studies must address a two-fold question: whether a phenotype is associated with genetic variations, and whether the clinical outcome distribution among genotypes is well-defined (i.e. how many sub-population groups can be predicted by genetic polymorphisms). An ideal statistical approach should have a high power to test genetic effects on the phenotypes. It should also be able to group combinations of genetic variables into clusters, where samples in each cluster share a similarly distributed phenotype. Clustered genotypes that predict phenotypes have high clinical relevance as possible diagnostic markers, which could directly facilitate future clinical decisions.
In traditional statistical theory, many multiple comparison approaches were developed. 2,3 Scheffe, LSD, and Tukey's HSD tests can evaluate the overall phenotype difference among genotype groups, but they can't tell where the difference is. Newman-Kuels and Duncan tests are able to search for phenotype differences sequentially among genotype groups, but may result in overlapped grouping [Christensen,2 page 80, example 5.5.1]. Therefore, all of these approaches are capable of addressing the first part of the prescribed two-folded question: whether there is any genetic effect on the phenotype. However, none of them can provide decisive answer to the second part of the question: how genetic polymorphisms are grouped to predict the phenotype.
A restricted partition method (RPM) has been proposed 4 to address these two aims. The algorithm ranks the genotype groups from the smallest to the largest according to the phenotype means. Then, adjacent genotype groups are merged sequentially based on a Tukey's HSD test until it reaches a prespecified significant level. The overall type I error is controlled by the empirical distribution constructed for the R 2 statistic from a regression of the quantitative trait value on the final genotype grouping. This RPM method is an extension of a proposed multiple comparison approach for quantitative phenotypes. It has two important features that may affect its implementation. At first, it has inherent assumptions of equal phenotypic variance and equal sample sizes among genotype cells in Tukey's HSD test.
In practice, this assumption may or may not hold. Secondly, it uses disparate methods for genotype grouping (Tukey's HSD test) and testing genotype/ phenotype associations (R 2 ). Arbitrary threshold selection for both methods may not guarantee the optimal partition.
In this paper, we propose a parametric mixture model approach to genetic association studies, where the quantitative phenotype is assumed to follow multivariate normal distribution. Differential genotype cells are allowed to have different means and variances. A sequential likelihood ratio test, i.e. one mixture vs. two mixtures, two vs. three, and so on, among subgroups defined by genetic polymorphisms indicates the significance of the genetic effect on the phenotype. The optimal partition among genotypes for phenotype prediction is determined by probability assignments from the mixture model. Therefore, this mixture model approach can simultaneously perform p-value calculation and determine the optimal genotype partition. The innovation of our mixture model includes an added penalty term to avoid nonidentifiable parameters.
The performance of our approach is evaluated with two pharmacogenetics study examples, in which CYP2D6 and CYP2B6 alleles were genotyped to predict the pharmacokinetics of a CYP2D6 substrate and CYP2B6 protein expression respectively. Because the functional relationship between CYP2D6 alleles and metabolic activity and CYP2B6 alleles and protein expression has been extensively studied, their function based partitions will serve as objective standards for assessing our mixture-model-based partitions. In addition, statistical simulations were conducted to compare performance of RPM and the mixture model.

Mixture model specification
The history of the mixture model's application in genetics can be traced back as far as 1800s. 5 Many important contributions of this approach in population genetics are well documented. 6 We emphasize that the traditional mixture model approach has been to infer whether a phenotype from the population (such as blood pressure or drug response) is composed of multiple sub-populations determined by possible underlying, unknown genotypes. The mixture model formulation and its estimation procedure are introduced in great detail by McLachlan. 7 We reformulate the traditional mixture model to estimate if measured genotype groups can predict a number of unknown, underlying normal mixtures in measure phenotypes.
Let us assume that we have G genotype groups, and every genotype group has ng (g = 1, …, G) phenotype samples, y g = ( y g1 , …, y gng ), where y gi is a normal random variable. We write the probability of the measured phenotype y as a function of the observed G genotype groups defining partitions of y and the assignment of phenotype group y g to one of the assumed K clusters where z g = 1, 2, …, K is a multi-nominal random variable, and I{z g = k} indicates genotype group g follows distribution Pr(y g | m k , s k ). For the sake of simplicity, let's define s gk = I{z g = k}. The log-likelihood for mixture model (1) is Under the null hypothesis, the true model has only one distribution. If we fit the data by a mixture of K-components, any ( pk = p 0 = p true , 0  p k  1, k = 1, …, K ) will achieve the maximum in (2). This problem causes not only numerical difficulties in the mixture model estimation process, 8 but also theoretical difficulties in likelihood ratio tests. 9,10 The identification problem was solved in 11 by adding a penalty term into the log-likelihood function (2), by which the penalized likelihood function pl(.), (3), forces p k = 1/K when it reaches the maximum. Our aim is the classification of the genotype cells, while Chen 11 classify individual observations. This difference leads to distinctive estimation algorithms and asymptotic LRT.

e-M algorithm
In the estimation step (E-step), the random variable z gk (un-observed) is estimated by (4): The grouping of genotype g is based on the its highest probability assignment, In the maximization step (M-step), The E-and M-steps are iteratively conducted, and the convergence is monitored based on the relative difference of the penalized likelihood function (3).

sequential log-likelihood ratio test
To test the number of normal distribution mixtures present in the observed genotypes, a likelihood ratio test (LRT) is conducted. The marginal penalized loglikelihood for a mixture model of K-component is listed in (7).
The LRT is calculated by This LRT will be conducted sequentially in data analysis, i.e. (K 1 , K 2 ) = (1, 2), (2, 3), (3,4), etc., for all (K 1 , K 2 ) with K 2  g. The family-wise type I error is calculated as the cumulative p-value along the sequential test. The threshold is pre-specified at the 5% level. For each LRT step, parametric bootstrap (5,000 replications) is implemented to calculate the empirical p-value.

Data Analysis
Pharmacogenetic study of CYP2D6 genetic effect on tamoxifen metabolites in patients with breast cancer N-Desmethyltamoxifen (NDM), a major primary metabolite of tamoxifen, is hydroxylated by CYP2D6 to yield endoxifen. Due to its high antiestrogenic potency, endoxifen may play an important role in the clinical activity of tamoxifen. We conducted a prospective trial in 158 breast cancer patients taking tamoxifen to further understand the effect of CYP2D6 genotype and concomitant medications on endoxifen plasma concentrations. Thirty-five different genotypes (Fig. 1a) were determined from the 17 CYP2D6 alleles assayed. Plasma concentrations of tamoxifen and its metabolites were determined at the fourth month of tamoxifen treatment.
RPM was conducted for the log(NDM/Endoxifen) data. Results are presented in Table 2. The RPM sequential analysis stopped at the first iteration, with p-value = 0.036. The result suggests log(NDM/Endoxifen) is significantly different among all 35 genotype cells.
Pharmacogenetic study of CYP2B6 genetic effect on its protein expression in human liver tissues We conducted a retrospective study, investigating the effect CYP2B6 genetic polymorphisms on CYP2B6 protein expression in 83 human liver tissues. Seventeen genotypes (Fig. 2a) were determined from 9 CYP2B6 alleles assayed (*1, *2, *4, *5, *6, *13, *14, *15, and *22). This data were recently published by our group. 12 Protein expression level was done with western blotting in liver microsome samples. Much detail method description was described in 13 CYP2B6 protein expression data was fitted using the normal

Probability Density
Group-1 Group-2 Group-3  Figure 1. genotype/phenotype association analysis for the Tamoxifen study. A) is a raw data description. The x-axis is the nDM/endoxifen ratio in log-scale, where both nDM and endoxifen are Tamoxifen metabolites. The y-axis denotes the 35 CYP2D6 genotypes. B) Thirty-five genotypes are clustered into three groups by a mixture model, which are characterized by three normal distributions. The x-axis is the nDM/endoxifen ratio in log-scale, and y-axis is the probability density. c) shows genotype cell probability assignments (s gk ) to each of the three predicted normal mixture components, where colored bar lengths (scaled on (0,1)) indicate the value of s gk for each mixture component. In A), B), and c), green, blue, and red colors represent the memberships of three clusters. mixture model. Sample size and variance were clearly unequally distributed among 17 genotypes (Fig. 2a). The sequential LRT (Table 1) suggests CYP2B6 protein expression levels are optimally portioned into three groups based on genotype. The sequential test p-values for testing mixtures (1 vs. 2), (2 vs. 3), (3 vs. 4) were 0.001, 0.001, and 0.153 respectively, with a cumulative p-value = 0.002 for the mixture model of 3 components. The genotype group with smallest mean protein expression contains genotypes *6/*13, *5/*5, *5/*6, *1/*15, *5/*15, and *1/*4 (group 1 in Table 1). It has a mean of 2.81(pmol/mg) and a SD = 1.64, and approximately 31% of samples belong to this group. The second genotype group contains *6/*14, *2/*4, *1/*5, *6/*6, *1/*6, *5/*22, *2/*22, *4/*6 and *1/*2. Its protein expression has a mean of 11.6(pmol/mg) and a SD = 58.1, and 52% of the samples belong to this group. The third group contains genotypes *1/*22 and *1/*1. It has the largest protein expression with mean 28.1(pmol/mg) and SD = 259.7, and 17% of the samples belong to this group. Figure 2b displays the three mixture density distributions. Figure 2c shows genotype cell probability assignments (s gk ) to each of the three predicted normal mixture components.

NDM/Endoxifen
RPM was conducted for the CYP2B6 protein expression data. Results are presented in Table 2. The RPM sequential analysis stopped at the first iteration, with p-value = 0.007. The result suggests mean protein expression is significantly different among all 17 genotypes.

simulation studies
The preceding data analyses show discrepancies between the mixture model and RPM approaches.
In these comparisons, RPM partitions the genotype cells into more subgroups than the mixture model. As both methods emphasize the importance of dimensionality reduction, we look favorably on the mixture model result, though both detected significant genotype/phenotype associations in their respective genotype partitions. In the following simulation studies under two epistatic models, we compare the power of the two approaches to detect genetic effects and model recovery probabilities. Of importance is the ability of both approaches to recover the true model partition.
Data were simulated from two 2-locus, bi-allelic models: a checkerboard model (Fig. 3a) and a diagonal model (Fig. 3b). These two models have been thoroughly described by Culverhouse. 14 For each model, both alleles at each of the contributing loci are equally frequent (minor allele frequencies for a and b are 0.5), and the phenotype in each genotype cell is normally distributed.
Checkerboard models were simulated with 2 distributions among the 9 cells, with equal or unequal variances. One group consists of 4 genotype cells containing exactly one heterozygote (Fig. 3a, shaded cells), with a phenotypic mean of 0. The other five genotype cells have a higher phenotypic mean.   Diagonal models were simulated with 3 distributions among the 9 cells, with equal or unequal variances. All the cells off the main diagonal have a phenotypic mean of 0. The diagonal cells (Fig. 3a, dark shaded cells) have higher phenotypic means, with the double heterozygote (Fig. 3a, light shaded cell) phenotypic mean as half that of the other two cells, but with equal variance. The data were simulated as follows: assuming unrelated individuals, genotype cells are simulated   Figure 2. genotype/phenotype association analysis for the CYP2B6 study. A) is a raw data description. The x-axis is the CYP2B6 protein expression (pmol/mg). The y-axis denotes the 17 CYP2B6 genotypes. B) seventeen genotypes are clustered into three groups by a mixture model, which are characterized by three normal distributions. The x-axis is the protein expression level, and y-axis is the probability density. c) shows genotype cell probability assignments (s gk ) to each of the three predicted normal mixture components, where colored bar lengths (scaled on (0,1)) indicate the value of s gk for each mixture component. In A), B), and c), green, blue, and red colors represent the memberships of three clusters.
independently based on allele frequencies. Given an individual genotype cell, the phenotype was generated from a normal distribution. Phenotypes were simulated under two variance assumptions. In situation 1 In both RPM and mixture model analysis, the p-value threshold is set at 0.1% level in order to make the simulation results comparable to the original PRM simulation studies. 4 Power and model recovery probabilities from the simulations are reported in Table 3. Power was calculated by the proportion of simulated data sets where the null hypothesis was rejected. Recovery probability was estimated by the proportion of simulated data sets in which the true partition was recovered. Highlights of simulation are summarized as following: • For models with equal variance among genotype cells (situation 1), both RPM and the mixture model methods demonstrated comparable power, but the mixture model had much higher recovery probabilities. • For models with unequal variance among genotype cells (situation 2), the mixture model approach was more powerful and had higher recovery probabilities than RPM. • For RPM in the unequal variance situation, both checkerboard and diagonal models had considerable discrepancies between power and recovery probability estimates. This result is due to early rejection of the RPM multiple comparison tests, making it unable to fully recover the true partition. • Comparing the simulations under equal and unequal variance, the mixture model gained power and had increased partition recovery probability for models of unequal variance. • If the data distribution is un-symmetric (i.e. gamma distribution), both mixture model and RPM methods have comparable performance comparing their performance in data following normal distribution, respectively.

Discussion and conclusion
The penalized mixture model approach for quantitative phenotypes is a novel application of the mixture model to genotype clustering in genetic association studies. As demonstrated in pharmacogenetic studies of CYP2D6 and CYP2B6, along with simulations, this mixture model method is capable of clustering local haplotypes and multi-locus genotypes to significantly reduce complexity of high-dimensional genotype space. The approach has power to detect quantitative traits loci when genetic effects on phenotypes are marginal or purely epistatic. As demonstrated in two pharmacogenetic genetic studies and simulations, it can detect both main and interactions effects of genetic polymorphisms on quantitative phenotypes. Investigating the effect of CYP2D6 genotype on CYP2D6 metabolism of N-Desmethyltamoxifen, the mixture model approach generated three CYP2D6 genotype clusters in predicting log(NDM/ Endoxifen). Before we discuss the biological rational for this classification, let us review the functionality of CYP2D6 alleles. CYP2D6*1 is the wild type allele, which codes for a fully functional enzyme. CYP2D6*2, *33 and *35 alleles contain point mutations that do not affect the catalytic  properties of the protein product. CYP2D6*3-8, *11-16, *18-20, *38, *40, *42, *44 are associated with no enzymatic activity and CYP2D6*9, *10, *17, *29, *36, *37, *41 with reduced activity. [15][16][17] The presence of multiple copies of CYP2D6 alleles (i.e. *1, *2, *35, *41) have been reported in subjects with unusually high CYP2D6 catalytic activity. 18,19 Based on this prior functional information, all of the CYP2D6 alleles contained in the first genotype group in Table 1 have either no or reduced enzymatic activity. The majority of alleles in the third genotype group have either normal or high activities. There are only four heterozygous diplotypes that possess low enzymatic activity: *2/*9, *10/*35, *2xn/*4 and *41/*41xn. With the exception of *2/*35 and *1/*35, almost no genotypes in the middle group are homozygous for normal or no-enzymatic activity alleles. If these six genotype groups (*2/*9, *10/*35, *2xn/*4, *2/*35, *1/*35) were misclassified by the mixture model, they are account for only 10 out of 158 samples (6%). Therefore, the mixture model based partition is accurate according to well defined functionality of CYP2D6 alleles.
In exploring the effect of CYP2B6 genotype on expression of its protein product, 9 alleles were genotyped. CYP2B6 *1 represents fully functional expression and activity while *22 is associated with increased CYP2B6 expression. 20 The *5 allele reduces CYP2B6 protein by about 8-fold in isolated human liver microsomes. 13 The *6 allele has been shown to reduce function in vitro as well as the pharmacokinetics of its substrate efavirenz in clinical studies. 21,22 The other alleles (*2, *13, *14, *15) have very low or completely absent function. 13,23,24 CYP2B6*4 appears to increase 25 or decrease (our data) activity depending on the substrate tested.
In both examples, mixture model based partitions on CYP2D6 and CYP2B6 genotypes are supported by their functional information.
Comparing RPM to the mixture model approach, RPM detected genotype/phenotype associations with similar power. However, in the CYP2D6 and CYP2B6 pharmacogenetic studies, the mixture model generates three clusters for each data set, while RPM generated as many clusters as the original genotype cells. The result suggests a tendency of over clustering by the RPM method. This characteristic of RPM is confirmed in the simulation study, where RPM had a lower recovery rate for the true partition compared with the mixture model approach. Improvement in the mixture model's recovery rate was observed when the assumption of equal variance among groups was violated, while RPM's recovery probability was diminished.
In summary, the mixture model approach has adequate power to detect genetic effects on phenotypes and simultaneously cluster multiple genetic variables into homogeneous phenotype groups.