Genetic model selection for a case–control study and a meta-analysis

A case–control study often compares the prevalence of a specific disease among persons with normal alleles and persons with variant alleles, which generates an odds ratio (OR). The most common type of allele variation, the single-nucleotide polymorphism, consists of a major allele (M) and a minor allele (m). Thus, the genotype can be a major allele homozygote (MM), a heterozygote (Mm), or a minor allele homozygote (mm). Odds are given for each genotype, and a pair of odds generates an OR. Summarizing data in a two-by-two contingency table is the simplest method of estimating an OR. Accordingly, dominant, multiplicative, recessive, and over-dominant models are often used. Traditionally, researchers calculated ORs using many models and then selected the best model from among these calculated ORs. This may cause problems due to multiple comparisons. Therefore, we should choose the best model before calculating the OR for each model. In this article, we discuss how to choose the best model among many subject-level models when evaluating the impact of the MM/Mm/mm genotype on disease prevalence.


Introduction
Heredity involves the passing of traits, and occasionally the risk of diseases, from parents to their offspring. This phenomenon was known long before DNA was discovered. Mendelian inheritance is observed for some rare diseases. On the other hand, most common diseases do not show typical Mendelian inheritance. According to the common disease-common variant hypothesis, some common variants lead to susceptibility to complex polygenic diseases. Each variant of each gene that influences a complex disease will have a small effect on the disease phenotype and susceptibility (Marian, 2012; Pritchard and Cox, 2002). Case-control studies, often in the form of genome-wide association studies or meta-analyses, have been conducted to discover causative variants and to evaluate the impact of gene polymorphisms on a specific disease.
A case-control study often compares the prevalence of a specific disease among persons with normal alleles and persons with variant alleles, which generates an odds ratio (OR). The most common type of allele variation, the single-nucleotide polymorphism, consists of a major allele (M) and a minor allele (m). Thus, the genotype can be a major allele homozygote (MM), a heterozygote (Mm), or a minor allele homozygote (mm). Odds are given for each genotype, and a pair of odds generates an OR (Table 1). Summarizing data in a two-by-two contingency table is the simplest method of estimating an OR. Therefore, the three genotypes are often collapsed into two categories. For example, a dominant model compares MM versus Mm + mm, and a recessive model compares MM + Mm versus mm. An over-dominant model assumes that the heterozygote has the strongest impact and compares MM + mm versus Mm. On the other hand, co-dominant models, including the additive and multiplicative models, hypothesize that MM, Mm, and mm are associated with the lowest, intermediate, and highest risk, respectively, or with the highest, intermediate, and lowest risk, respectively (Thakkinstian et al., 2005; Attia et al., 2003). While the models above describe a subject-level phenomenon, the allelic model evaluates the impact of individual alleles on the disease. The allelic model produces an OR similar to that estimated from the multiplicative model (Thakkinstian et al., 2005; Attia et al., 2003).
Traditionally, researchers calculated ORs using many models and then selected the best model from among these calculated ORs (Thakkinstian et al., 2005; Attia et al., 2003). This may increase the probability of type I error due to multiple comparisons (Bagos, 2013). Therefore, we should choose the best model before calculating the OR for each model. Although Bagos (2013) explained model selection for genome-wide studies, a method for model selection in case-control studies has been awaited. In this article, we discuss how to choose the best model among many subject-level models when evaluating the impact of the MM/Mm/mm genotype on disease prevalence.

Methods and examples
In this article, for the additive model, we supposed that the impact of the Mm genotype is the arithmetic (additive) mean of the impacts of the MM and mm genotypes. Similarly, for the multiplicative model, we supposed that the impact of the Mm genotype is the geometric (multiplicative) mean of the impacts of the MM and mm genotypes. We are aware that some researchers use the term "log-additive model", or simply "additive model", for what is defined above as the multiplicative model; we did not adopt that usage in the current article. We defined OR1 and OR2 as follows:
OR1 = odds_Mm / odds_MM = bd / (ae)
OR2 = odds_mm / odds_Mm = ce / (bf)
Therefore,
odds_Mm = b / e = odds_MM × OR1ori
odds_mm = c / f = odds_MM × OR1ori × OR2ori
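The definitions above can be illustrated with a minimal sketch. It assumes, consistent with odds_Mm = b / e, that a, b, and c are the numbers of cases and d, e, and f the numbers of controls with the MM, Mm, and mm genotypes; the function name and the counts are hypothetical and chosen only for illustration.

```python
def genotype_odds_ratios(a, b, c, d, e, f):
    """Return (OR1, OR2) from a two-by-three table.

    Assumed layout (inferred from odds_Mm = b / e):
      cases:    a (MM), b (Mm), c (mm)
      controls: d (MM), e (Mm), f (mm)
    """
    odds_MM = a / d
    odds_Mm = b / e
    odds_mm = c / f
    or1 = odds_Mm / odds_MM  # equals bd / (ae)
    or2 = odds_mm / odds_Mm  # equals ce / (bf)
    return or1, or2


# Hypothetical counts, not taken from any table in this article.
or1, or2 = genotype_odds_ratios(a=100, b=80, c=20, d=120, e=60, f=10)
print(or1, or2)
```

With these illustrative counts, odds_mm can be recovered as odds_MM × OR1 × OR2, which is the identity stated above.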

Commonly used models
The five most commonly used subject-level gene models are the recessive, multiplicative, additive, dominant, and over-dominant models (Thakkinstian et al., 2005; Attia et al., 2003). Each of the five models was originally defined by a relationship among odds_MM, odds_Mm, and odds_mm. Using the formulas in the legend of Table 1, we can obtain the relationship between OR1 and OR2 for each model.
The over-dominant model is defined by odds_mm = odds_MM. Therefore, OR2 = 1 / OR1. In the log-scale OR1-OR2 plane, the recessive, multiplicative, dominant, and over-dominant models are drawn as straight lines, while the additive model is drawn as a curved line (Fig. 1-A). This means that logistic regression analysis is applicable to the first four models by assigning the explanatory variables shown in Table 2 to persons with the MM, Mm, and mm genotypes. However, it is difficult to apply logistic regression analysis to the additive model. Data should be summarized in a two-by-three contingency table for the multiplicative and additive models. Logistic regression assumes that the odds of disease increase exponentially as the number of minor alleles (0, 1, or 2) increases. This assumption exactly fits the multiplicative model but not the additive model.
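For reference, the relationships between OR1 and OR2 can be written out model by model. Only the over-dominant relationship is stated explicitly in this section. The recessive, multiplicative, and dominant lines below are inferred from the OR1mod and OR2mod values given for the fourth step of the four-model strategy, and the additive line is our own derivation from the arithmetic-mean definition in the Methods, so they should be read as a sketch under those assumptions.

```latex
\begin{align*}
\text{recessive:}      \quad & \mathrm{odds}_{Mm} = \mathrm{odds}_{MM}
                       && \Rightarrow\; OR1 = 1 \\
\text{dominant:}       \quad & \mathrm{odds}_{mm} = \mathrm{odds}_{Mm}
                       && \Rightarrow\; OR2 = 1 \\
\text{multiplicative:} \quad & \mathrm{odds}_{Mm} = \sqrt{\mathrm{odds}_{MM}\,\mathrm{odds}_{mm}}
                       && \Rightarrow\; OR2 = OR1 \\
\text{over-dominant:}  \quad & \mathrm{odds}_{mm} = \mathrm{odds}_{MM}
                       && \Rightarrow\; OR2 = 1 / OR1 \\
\text{additive:}       \quad & \mathrm{odds}_{Mm} = \tfrac{1}{2}\left(\mathrm{odds}_{MM} + \mathrm{odds}_{mm}\right)
                       && \Rightarrow\; OR2 = 2 - 1 / OR1
\end{align*}
```

Taking logarithms, the first four relationships become ln OR1 = 0, ln OR2 = 0, ln OR2 = ln OR1, and ln OR2 = -ln OR1, which are straight lines, whereas the additive relationship remains curved.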

Four-model strategy
We proposed principally using four genetic models, i.e., the recessive, multiplicative, dominant, and over-dominant models. We selected these four models because they are the only ones to which logistic regression analysis is easily applicable, and because they are symmetrically arranged in the log-scale OR1-OR2 plane (Fig. 1-A).
Next, we defined new variables. OR1ori and OR2ori are OR1 and OR2 that are estimated from the original number of subjects observed. OR1mod and OR2mod are ORs obtained using one of the genetic models.
The first step in the four-model strategy is to calculate OR1ori and OR2ori. We can obtain OR1ori and OR2ori from two-by-two contingency tables. Alternatively, we can estimate OR1ori and OR2ori from logistic regression analysis. The second step is to choose the best model by plotting (OR1ori, OR2ori) on the log-scale OR1-OR2 plane.
The third step is to calculate ORstep3, the OR for a one-point increase in the explanatory variable. ORstep3 can be obtained from single-variable logistic regression analysis. To apply logistic regression analysis, each subject is assigned the explanatory variable indicated in Table 2 for the gene model, and an objective variable of 0 for a control or 1 for a case. For the recessive, dominant, and over-dominant models, we can use a two-by-two contingency table to estimate ORstep3 instead of logistic regression analysis, whereas the multiplicative model always requires logistic regression. For the third and fourth steps, models that were not selected in the second step are also presented for the purpose of comparison.
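The third step can be sketched as follows. Table 2 is not reproduced here, so the coding of the explanatory variable in `CODING` is an assumption inferred from the text (0, 1, 2 minor alleles for the multiplicative model, and 0/1 indicators for the other three). The regression is fitted with a hand-written Newton-Raphson loop on grouped counts rather than with a statistics library, and the function name and counts are hypothetical.

```python
import math

# Assumed explanatory-variable coding for (MM, Mm, mm); Table 2 is not shown here.
CODING = {
    "recessive": (0, 0, 1),
    "multiplicative": (0, 1, 2),
    "dominant": (0, 1, 1),
    "over-dominant": (0, 1, 0),
}


def or_step3(cases, controls, model, iterations=50):
    """OR per one-point increase of the coded variable, from a
    single-variable logistic regression fitted on grouped counts."""
    xs = CODING[model]
    b0, b1 = 0.0, 0.0  # intercept and slope on the log-odds scale
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y, z in zip(xs, cases, controls):
            n = y + z
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            r = y - n * p          # contribution to the score vector
            w = n * p * (1.0 - p)  # contribution to the information matrix
            g0 += r
            g1 += r * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system (information) * delta = score.
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < 1e-12:
            break
    return math.exp(b1)


# Hypothetical counts for (MM, Mm, mm).
cases, controls = (100, 80, 20), (120, 60, 10)
for name in CODING:
    print(name, or_step3(cases, controls, name))
```

For the three models with a 0/1 variable, the fitted value should agree with the OR from the corresponding collapsed two-by-two table, which is why the table can replace the regression in those cases.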
In the fourth step, OR1mod and OR2mod are calculated from ORstep3. Given the relationship between OR1 and OR2 for each model described in the previous section, OR1mod and OR2mod are as follows: OR1mod = 1, OR2mod = ORstep3 for the recessive model; OR1mod = OR2mod = ORstep3 for the multiplicative model; OR1mod = ORstep3, OR2mod = 1 for the dominant model; OR1mod = ORstep3, OR2mod = 1 / ORstep3 for the over-dominant model. Table 3 and Fig. 1-C present Examples 1 and 2.
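The mapping in the fourth step is simple enough to write as a small lookup. The following is a minimal sketch; the function name is hypothetical, and the model labels are just strings chosen for this example.

```python
def model_odds_ratios(model, or_step3):
    """Return (OR1mod, OR2mod) implied by a model and its ORstep3."""
    if model == "recessive":
        return 1.0, or_step3
    if model == "multiplicative":
        return or_step3, or_step3
    if model == "dominant":
        return or_step3, 1.0
    if model == "over-dominant":
        return or_step3, 1.0 / or_step3
    raise ValueError(f"unknown model: {model}")


# Illustrative ORstep3 value, not taken from the examples in this article.
print(model_odds_ratios("over-dominant", 2.0))
```

In every case the pair (OR1mod, OR2mod) lies exactly on the line of the chosen model in the log-scale OR1-OR2 plane.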

Additive and harmonic models
Usually, a diseased subject is regarded as a case, and a healthy subject is regarded as a control. However, this definition is based only on convention. Therefore, we should be able to provide an optimal model even if the numbers of cases and controls are switched. The recessive, multiplicative, dominant, and over-dominant models are still applicable even if the numbers of cases and controls are switched. Table 4 and Fig. 1-D present Examples 3-6. However, the additive model is no longer applicable once the numbers of cases and controls are switched. Table 4 and Fig. 1-D present Example 7.
We proposed a harmonic model, defined by odds_Mm = 1 / (((1 / odds_MM) + (1 / odds_mm)) / 2). The harmonic mean, like the multiplicative and additive means, is a type of generalized mean. Using this relationship among odds_MM, odds_Mm, and odds_mm and the formulas in the legend of Table 1, we obtain the relationship OR2 = 1 / (2 - OR1). The additive model and the harmonic model are exchanged with each other when the numbers of cases and controls are switched. Table 4 and Fig. 1-D present Example 7.
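The exchange between the additive and harmonic models can be checked numerically. The sketch below assumes the additive relationship OR2 = 2 - 1 / OR1, which is our derivation from the arithmetic-mean definition rather than a formula quoted from the text, and uses the fact that switching cases and controls inverts every odds and hence every OR. All function names and the value of OR1 are hypothetical.

```python
def or2_additive(or1):
    # odds_Mm is the arithmetic mean of odds_MM and odds_mm (assumed derivation).
    return 2.0 - 1.0 / or1


def or2_harmonic(or1):
    # odds_Mm is the harmonic mean of odds_MM and odds_mm.
    return 1.0 / (2.0 - or1)


def swap_case_control(or1, or2):
    # Switching cases and controls inverts each odds, so each OR is inverted.
    return 1.0 / or1, 1.0 / or2


or1 = 1.25                       # illustrative value
or2 = or2_additive(or1)          # a point on the additive curve
swapped_or1, swapped_or2 = swap_case_control(or1, or2)
print(swapped_or2, or2_harmonic(swapped_or1))
```

The two printed values coincide, that is, a point on the additive curve lands on the harmonic curve after the switch.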
Using an additive model without a harmonic model is an unfair procedure, because the result then depends on which group is labeled as cases. Therefore, when considering an additive model, we also have to consider a harmonic model at the same time. In any case, these two models are difficult to use because they do not fit a logistic regression formula.

Four-model strategy for meta-analysis
Meta-analysis is often conducted to estimate the true impact of each allele variant on susceptibility to a specific disease. The four-model strategy is also applicable to meta-analysis. For the third and fourth steps, models that were not selected in the second step are also presented for the purpose of comparison.
First, OR1ori and OR2ori are estimated from the numbers of subjects in each original study. Then the pooled OR1ori and pooled OR2ori are calculated.
Second, the best model is chosen by plotting (pooled OR1ori, pooled OR2ori) on the log-scale OR1-OR2 plane.
Third, the ORstep3 for the selected model is estimated from the numbers of subjects in each original study. Then the pooled ORstep3 is calculated from these study-level ORstep3 values.
Fourth, OR1mod and OR2mod are calculated from the pooled ORstep3. Table 5 and Fig. 1-E present Example 8.
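The first two steps of the meta-analysis can be sketched under two explicit assumptions that are not taken from this article. First, pooling is done by fixed-effect inverse-variance weighting of log ORs with Woolf variances; the pooling method is not specified here, and a random-effects method could be substituted. Second, the "best" model is taken to be the one whose line is nearest to the plotted point in the log-scale plane; the actual selection rule may differ. The studies and all names are hypothetical.

```python
import math


def log_or_and_variance(case_exposed, case_ref, ctrl_exposed, ctrl_ref):
    """Log OR and its Woolf variance from a two-by-two table."""
    log_or = math.log((case_exposed * ctrl_ref) / (case_ref * ctrl_exposed))
    var = 1 / case_exposed + 1 / case_ref + 1 / ctrl_exposed + 1 / ctrl_ref
    return log_or, var


def pool_fixed_effect(estimates):
    """Inverse-variance weighted mean of (log_or, variance) pairs."""
    weights = [1.0 / var for _, var in estimates]
    weighted = sum(w * lo for w, (lo, _) in zip(weights, estimates))
    return weighted / sum(weights)


def nearest_model(log_or1, log_or2):
    """Assumed rule: model whose line is closest to the point (ln OR1, ln OR2)."""
    root2 = math.sqrt(2)
    distances = {
        "recessive": abs(log_or1),                       # line OR1 = 1
        "dominant": abs(log_or2),                        # line OR2 = 1
        "multiplicative": abs(log_or1 - log_or2) / root2,  # line OR2 = OR1
        "over-dominant": abs(log_or1 + log_or2) / root2,   # line OR2 = 1 / OR1
    }
    return min(distances, key=distances.get)


# Hypothetical studies: (cases MM, Mm, mm), (controls MM, Mm, mm).
studies = [
    ((100, 80, 20), (120, 60, 10)),
    ((50, 45, 12), (60, 35, 6)),
]
# OR1 compares Mm with MM; OR2 compares mm with Mm.
or1_parts = [log_or_and_variance(ca[1], ca[0], co[1], co[0]) for ca, co in studies]
or2_parts = [log_or_and_variance(ca[2], ca[1], co[2], co[1]) for ca, co in studies]
pooled_log_or1 = pool_fixed_effect(or1_parts)
pooled_log_or2 = pool_fixed_effect(or2_parts)
best = nearest_model(pooled_log_or1, pooled_log_or2)
print(math.exp(pooled_log_or1), math.exp(pooled_log_or2), best)
```

The third step would repeat the same pooling on the study-level ORstep3 of the chosen model, and the fourth step would convert the pooled ORstep3 into OR1mod and OR2mod.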

Conclusion
The impact of allele variation on non-Mendelian diseases is investigated in many case-control studies and meta-analyses. However, an acceptable strategy for selecting the best model has not been developed. We developed a novel method to select the best model from among the recessive, multiplicative, dominant, and over-dominant models before calculating the OR for each model.

Financial statement
No support in the form of grants, gifts, equipment, and/or drugs was provided for the current study.