Trinculo: Bayesian and frequentist multinomial logistic regression for genome-wide association studies of multi-category phenotypes

Motivation: For many classes of disease, the same genetic risk variants underlie many related phenotypes or disease subtypes. Multinomial logistic regression provides an attractive framework to analyze multi-category phenotypes and explore the genetic relationships between these phenotype categories. We introduce Trinculo, a program that implements a wide range of multinomial analyses in a single fast package that is designed to be easy to use by users of standard genome-wide association study software.
Availability and implementation: An open source C implementation, with code and binaries for Linux and Mac OSX, is available for download at http://sourceforge.net/projects/trinculo
Supplementary information: Supplementary data are available at Bioinformatics online.
Contact: lj4@well.ox.ac.uk

$$\Pr(y_i = 1 \mid g_i, \beta_0, \beta_1) = \frac{\exp(\beta_0 + \beta_1 g_i)}{1 + \exp(\beta_0 + \beta_1 g_i)}, \qquad (1)$$

where β_0 is an intercept and β_1 is an effect size (log odds ratio). We use logistic regression to draw conclusions about β_1: what its value is, what the confidence interval on this estimate is, whether this value is larger than zero, and so on.
Multinomial logistic regression is an extension of this approach, where instead of each individual being either a case or a control, they are either a control (y_i = 0) or one of K different case phenotypes y_i = k, where k ∈ (1, ..., K). The probability of being in any one of these categories is given by the multinomial logit link,

$$\Pr(y_i = k \mid g_i, \boldsymbol{\beta}_0, \boldsymbol{\beta}_1) = \frac{\exp(\beta_{0k} + \beta_{1k} g_i)}{1 + \sum_{s=1}^{K} \exp(\beta_{0s} + \beta_{1s} g_i)}. \qquad (2)$$

In this case there is an intercept β_0k and an effect size β_1k for each phenotype k. We are no longer learning a single intercept and effect size, but instead a vector of intercepts β_0 and a vector of effect sizes β_1. In practice, we generalize this to an arbitrary number of predictors (multiple variants, principal components, other confounders, etc.), each with its own vector of effect sizes. We are usually interested in learning things about the components of β_1: is this vector different from the zero vector (i.e. does this variant affect any phenotypes), which values are significantly larger than zero (i.e. which phenotypes share this genetic variant), how large are they (i.e. how strong an effect does the variant have on each phenotype), and so on.
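As a concrete illustration, the multinomial logit link in equation (2) can be computed directly; a minimal sketch in Python (the function name and array layout are ours, not Trinculo's):

```python
import numpy as np

def multinomial_probs(beta0, beta1, g):
    """Category probabilities under the multinomial logit link.

    beta0, beta1: length-K arrays of intercepts and log odds ratios
    (one per case phenotype); g: genotype dosage for one individual.
    Returns probabilities for categories 0 (control) through K.
    """
    # The control category has linear predictor fixed at 0, giving the
    # "1 +" term in the denominator of the logit link.
    eta = np.concatenate(([0.0], beta0 + beta1 * g))
    p = np.exp(eta)
    return p / p.sum()

# With all intercepts and effects zero, the K+1 categories are equally likely:
p = multinomial_probs(np.zeros(3), np.zeros(3), g=1)
```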
There are two different approaches to learning about β_1: the frequentist approach, in which we calculate a maximum likelihood estimate of the parameter vector and calculate p-values based on the maximum likelihood, and the Bayesian approach, in which we place a prior on β_1, calculate its posterior distribution, and use this to generate Bayes factors based on the marginal likelihood. The frequentist approach is fast and requires few (explicit) assumptions, whereas the Bayesian approach is flexible and allows us to make a broader range of inferences.

Figure S1: Total sample size (split evenly between phenotypes and controls) required for 50% power to detect association at p < 5 × 10^-8, for an allele with OR = 1.2 and minor allele frequency f = 0.5 associated with between 1 and 4 of 4 phenotypes.

Fitting the frequentist model by maximum likelihood
We fit the maximum likelihood model using Newton's method (based on first and second derivatives given in the Technical Details section below). This calculates a point estimate for all parameter values (the maximum likelihood estimate, or MLE), as well as standard errors and confidence intervals around the MLE using the Fisher information (the inverse of the second derivative matrix, given in Technical Details).
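The fitting procedure can be sketched as follows: an illustrative Python implementation of Newton's method for the multinomial MLE, with standard errors taken from the inverse of the negative Hessian (the observed Fisher information). This is our own sketch, not Trinculo's C code:

```python
import numpy as np

def fit_multinomial_mle(X, y, K, iters=50):
    """Newton's method for the multinomial logistic regression MLE.

    X: (N, J+1) predictor matrix including an intercept column;
    y: integer labels in 0..K (0 = control).
    Returns the (J+1, K) effect-size matrix B and its standard errors.
    """
    N, J1 = X.shape
    B = np.zeros((J1, K))
    Y = np.eye(K + 1)[y][:, 1:]              # one-hot indicators, control dropped
    for _ in range(iters):
        eta = X @ B
        P = np.exp(eta) / (1 + np.exp(eta).sum(1, keepdims=True))
        # Gradient: sum_i x_ij (indicator_{y_i=k} - p_ik), flattened column-major
        grad = (X.T @ (Y - P)).ravel(order="F")
        # Hessian blocks: -sum_i x_ij x_ij' p_ik (delta_kk' - p_ik')
        H = np.zeros((J1 * K, J1 * K))
        for k in range(K):
            for s in range(K):
                W = P[:, k] * ((k == s) - P[:, s])
                H[k*J1:(k+1)*J1, s*J1:(s+1)*J1] = -(X * W[:, None]).T @ X
        B = B - np.linalg.solve(H, grad).reshape((J1, K), order="F")
    se = np.sqrt(np.diag(np.linalg.inv(-H))).reshape((J1, K), order="F")
    return B, se
```

With balanced categories and an intercept-only model, the MLE intercepts are log(n_k/n_0) = 0, which makes a convenient sanity check.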

Omnibus testing using likelihood ratio tests (LRTs)
A typical use of multinomial logistic regression is to improve power of a genome-wide association test by combining cases across multiple phenotypes, and testing whether there is any evidence for an effect of each variant on any of the phenotypes. This works by fitting a full model (including intercepts, genotypic effect sizes and effect sizes for confounders for all phenotypes) and a null model (including only intercepts and effect sizes for confounders, with a null effect of the variant on all phenotypes), and carrying out a likelihood ratio test to see if the full model is a significant improvement over the null model. The omnibus test can gain power by including evidence over multiple associated phenotypes, but also retains most of its power (compared to a single-phenotype test) even if the variant is associated with only one phenotype.
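The omnibus LRT reduces to a simple computation once the two models are fitted: the statistic 2(logL_full − logL_null) is compared against a chi-square with K degrees of freedom (one constrained genotype effect per case phenotype). A hedged sketch, using the closed-form chi-square survival function that is valid for even degrees of freedom only:

```python
import math

def lrt_pvalue(loglik_full, loglik_null, df):
    """Omnibus likelihood ratio test: 2*(logL_full - logL_null) ~ chi2(df),
    with df = K, the number of case phenotypes. The survival function used
    here is the closed form for even df, which suffices for this sketch."""
    assert df % 2 == 0, "closed form implemented for even df only"
    x = 2.0 * (loglik_full - loglik_null)
    return math.exp(-x / 2) * sum((x / 2) ** i / math.factorial(i)
                                  for i in range(df // 2))

# e.g. two phenotypes and a log-likelihood improvement of 10 units
# gives a chi-square statistic of 20 on 2 df:
p = lrt_pvalue(-1000.0, -1010.0, df=2)
```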
An alternative method is the so-called "lumping" test, in which all cases are treated as a single case set regardless of their specific disease, and binomial logistic regression is carried out on this "lumped" case set. If the variant under consideration is associated with all the diseases in the "lump" with the same effect size this method can be very powerful. However, if the effect sizes are different, or the variant is associated with some phenotypes but not others, this can rapidly dilute the power of the test. Figure S1 shows an example of the power of the multinomial omnibus test on simulated data. This method improves on the power of the lumping test and the single-disease association test in all cases, except the case of the variant being associated with all phenotypes (where it gives lower but comparable power to the lumping test).

Frequentist single-disease tests using Wald statistics and LRTs
As well as carrying out omnibus tests ("is this variant significantly associated with any of these phenotypes?"), we may also wish to carry out significance tests for each individual phenotype ("is this variant significantly associated with this one phenotype?"). Obviously, these single-phenotype tests can be carried out using ordinary binomial logistic regression, but we can also carry out single-disease tests in the context of frequentist multinomial logistic regression.
Wald tests are a simple way of testing the association for each individual disease, by generating a Z-score statistic for each phenotype at that locus, equal to the log odds ratio divided by its standard error. The per-disease likelihood ratio test is more involved: it fits a full model (in which the variant is associated with all phenotypes) and a reduced model (in which the variant is associated with all phenotypes other than the phenotype under consideration), and carries out a likelihood ratio test to see whether the former is a significant improvement over the latter. The second test is significantly slower and, in practice, gives answers very similar to the Wald test; it is included more for completeness than for any obvious utility.
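The Wald computation itself is a one-liner; a minimal sketch (the function name is ours):

```python
import math

def wald_test(log_or, se):
    """Per-phenotype Wald test: Z = log odds ratio / standard error,
    with a two-sided p-value from the standard normal distribution."""
    z = log_or / se
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return z, p

# e.g. an estimated log OR of 0.18 (OR ~ 1.2) with standard error 0.05:
z, p = wald_test(0.18, 0.05)
```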
Note that these tests do not test for evidence in favour of disease specificity; even if only one phenotype is significant, it doesn't follow that the association is specific to this phenotype, as the tests may just lack power for the other phenotypes. Making these sorts of inferences will be discussed in the Bayesian Model Selection section below.

Fitting the Bayesian model by maximum posterior
For the Bayesian analyses we specify a multivariate normal prior on effect sizes by setting a prior covariance (or correlation) matrix. This prior can encode information about the relationship between the effect sizes of the diseases (as described in more detail below, and in Technical Details), and also allows us to generate marginal likelihoods and posteriors that allow us to make a more flexible range of inferences.
During Bayesian model fitting, we estimate and report parameter values that maximise the posterior density (the maximum a posteriori, or MAP, estimates) using Newton's method, and approximate the posterior distribution as a normal distribution around these MAP estimates, allowing us to report Bayesian standard errors and 95% credible intervals on parameter estimates. We also calculate and report the marginal likelihood for the model (i.e. the likelihood after integrating out uncertainty in the parameter values). Technical details are given in Section 2.
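Given the MAP estimate and the normal approximation to the posterior described above, the log marginal likelihood takes the standard Laplace-approximation form. The sketch below shows that form; it is consistent with the normal approximation described here, though not necessarily Trinculo's exact implementation:

```python
import numpy as np

def log_marginal_laplace(log_post_at_map, neg_hessian_at_map):
    """Laplace approximation to the log marginal likelihood.

    log_post_at_map: log likelihood + log prior evaluated at the MAP;
    neg_hessian_at_map: negative Hessian of the log posterior at the MAP
    (the precision of the normal approximation to the posterior).
    """
    d = neg_hessian_at_map.shape[0]
    sign, logdet = np.linalg.slogdet(neg_hessian_at_map)
    assert sign > 0, "the MAP must be a local maximum"
    return log_post_at_map + 0.5 * d * np.log(2 * np.pi) - 0.5 * logdet
```

As a sanity check, for a 1-D standard normal "posterior" (density φ(0) at its mode, curvature 1) the approximation is exact and the marginal likelihood is 1.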

Omnibus testing using Bayes factors
We can calculate a combined, omnibus Bayes factor in a very similar fashion to the frequentist likelihood ratio test, in order to assess evidence in favour of association between a variant and any of the phenotypes under consideration. This is very similar to the LRT, in that it compares the fit of the full and null models, except that it uses the marginal likelihoods (the likelihood after integrating out uncertainty in the parameter values) rather than maximum likelihoods to compare the models. Unlike the LRT, the omnibus Bayes factor can be greater or less than 1, with values greater than 1 indicating evidence in favour of association (i.e. the full model fits the data better than the null model), and values less than 1 indicating evidence against it (i.e. the null model fits the data better).

Bayesian fine-mapping
The Bayes factors generated from the Bayesian logistic regression can also be used to carry out fine-mapping across multiple phenotypes in order to identify causal variants, using the approach described in Maller et al 1 . Figure S2 shows a simulated example of how the cross-phenotype fine-mapping approach can significantly increase the power to fine-map causal variants, compared to either fine-mapping a single disease or lumping all cases together across diseases.

Figure S2: Total sample size required in simulations to have 50% power to fine-map an association such that the true variant has at least 95% of the posterior. We simulate two variants: the true causal variant, and a tag variant with r^2 = 0.95 with the true causal variant. Both variants have allele frequency 0.5, the true variant has an odds ratio of 1.2, and samples are equally distributed between the 4 phenotypes and controls. The default prior was used.
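Given per-variant Bayes factors, the single-causal-variant posteriors of the Maller et al. approach follow by normalization; a log-space sketch (names are ours):

```python
import numpy as np

def finemap_posteriors(log_bfs):
    """Per-variant posterior probability of being the causal variant,
    assuming exactly one causal variant in the region and equal prior
    probability for each variant: posterior_i = BF_i / sum_j BF_j.
    Computed stably in log space to avoid overflow for large Bayes factors."""
    log_bfs = np.asarray(log_bfs, dtype=float)
    w = np.exp(log_bfs - log_bfs.max())
    return w / w.sum()

# Three variants with log Bayes factors 10, 8 and 2:
post = finemap_posteriors([10.0, 8.0, 2.0])
```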

Incorporating similarity of diseases using the Bayesian prior
The maximum likelihood based method either treats the two diseases as two separate categories, essentially assuming that the effect sizes are independent (and thus not sharing information across diseases when estimating effect sizes), or treats them as a single category (the "lumping test"), assuming that the effect size is identical. The Bayesian multinomial regression tests allow a much greater degree of freedom, by explicitly specifying a prior relationship between the effect sizes of the two diseases, in the form of a covariance matrix (or, equivalently, a correlation matrix and a vector of marginal prior variances). If the correlation is close to 1 between two diseases the effect sizes are assumed to be nearly identical (similar to the lumping test), if the correlation is close to zero the effect sizes are assumed to be unrelated (similar to the two-category likelihood-ratio test), and intermediate levels encode differing degrees of similarity. A value of less than zero implies that the two effects often go in different directions.
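Constructing the prior covariance matrix from a correlation matrix and marginal prior standard deviations is straightforward; an illustrative sketch (the 0.04 prior variance matches the default described in the Choice of priors section, but the function and values here are our own example):

```python
import numpy as np

def prior_covariance(corr, sd):
    """Build the prior covariance matrix on effect sizes from a
    correlation matrix and per-phenotype prior standard deviations:
    Sigma_ab = corr_ab * sd_a * sd_b."""
    corr = np.asarray(corr, dtype=float)
    sd = np.asarray(sd, dtype=float)
    return corr * np.outer(sd, sd)

# Two diseases whose effect sizes are assumed strongly correlated a priori:
Sigma = prior_covariance([[1.0, 0.9], [0.9, 1.0]], [0.2, 0.2])
```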
The correlation relationship between the different diseases can either be user-specified, or if given a large enough set of known associated variants, Trinculo can learn it by maximum likelihood (thus making it an empirical prior). Figure S3 shows an example of how accurately the correlation parameter can be inferred as a function of the number of known loci.

Bayesian Model Selection
Many questions we wish to answer in cross-disease association analysis concern what we may call the sharing model for a given variant, i.e. which set of phenotypes this variant is associated with and, crucially, not associated with. Such questions might include "is this variant specifically associated with a single phenotype?", or "is this variant shared by at least two phenotypes?". These questions are complicated by issues of power, as a variant may be truly associated with a phenotype without this association being detected, due to a lack of power.
The Bayesian framework is well-placed to answer these questions. We can frame them in terms of constraints on the full model, whereby certain parameters are constrained to be zero (as described in Section 2.4), and then integrate out the uncertainty in the remaining parameters to calculate marginal likelihoods, which can in turn be used to calculate Bayes factors (for comparing between two sharing models) or posteriors and credible sets (for a larger number of models). Because uncertainty in parameter values is taken into account, this method is far less sensitive to power issues than frequentist p-value based approaches. Figure S4 shows the power to detect whether an association is specific to one of two phenotypes using this approach.

Figure S4: Power to show specificity of association (BF > 4) to one of two phenotypes, given effect size and true correlation. In each simulation the covariance matrix is learned from 200 known loci.
When more than two phenotypes are analysed, the number of possible sharing models increases rapidly, and the prior placed on these models can have a strong impact on the results. By default, the script included in Trinculo assumes a uniform prior across sharing models (i.e. each sharing model is given equal weight). However, this can have unexpected effects, as most of the prior weight is placed on intermediate sharing models in which around half of the phenotypes are associated (for instance, for 6 phenotypes there are 20 models where the variant is associated with exactly 3 phenotypes, but only 6 where exactly 1 phenotype is associated, and only 1 model where all six are associated). To get around this, Trinculo also supports a prior that is uniform on the number of associated phenotypes (which has the effect of upweighting models that are either widely shared or specific to a single phenotype), as well as user-provided prior files.
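The model counts above can be verified directly; a sketch enumerating the sharing models and the two built-in prior choices (our own illustration, not Trinculo's script):

```python
from itertools import combinations
from math import comb

def sharing_model_priors(K, uniform_on_count=False):
    """Enumerate the 2**K sharing models (subsets of phenotypes the variant
    is associated with) and assign prior weights: either uniform over
    models, or uniform over the *number* of associated phenotypes (which
    upweights widely shared and phenotype-specific models)."""
    models = [frozenset(c)
              for n in range(K + 1)
              for c in combinations(range(K), n)]
    if uniform_on_count:
        # Each count 0..K gets total weight 1/(K+1), split among its models.
        weights = [1.0 / ((K + 1) * comb(K, len(m))) for m in models]
    else:
        weights = [1.0 / len(models)] * len(models)
    return models, weights

models, w = sharing_model_priors(6)
```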
2 Technical details

2.1 Notation, likelihood and prior

We generalize the notation given above to include J + 1 predictors (J variable predictors, plus the intercept), which we will write as a matrix X, with elements x_ij (including the intercept, which we will call predictor 0, and write x_i0 = 1 ∀ i). We will write x_i to refer to the vector of predictors for individual i (i.e. the row of the matrix X corresponding to individual i). We write the number of individuals as N.
We will refer to all the effect sizes across all predictors and phenotypes as the effect size matrix B, where B jk = β jk is the effect size for predictor j on phenotype k. We will write β k as the vector of effect sizes of all predictors on phenotype k, and β j as the vector of effect sizes of predictor j on all phenotypes.
The probability that an individual is in category k, conditional on their predictors x_i and parameters B, is given by

$$\Pr(y_i = k \mid x_i, B) = \frac{\exp(\beta_k^T x_i)}{1 + \sum_{s=1}^{K} \exp(\beta_s^T x_i)}.$$

The log likelihood is thus

$$\ell(B) = \sum_{i=1}^{N} \log \Pr(y_i \mid x_i, B).$$

We set a multivariate normal prior on the effect sizes for each predictor j, such that the log prior is given by

$$\log \Pr(B) = \sum_{j=0}^{J} \log \phi(\beta_j \mid \mu_j, \Sigma_j),$$

where φ(·) is the normal probability density function.

2.2 Choice of priors
In the above section, each predictor j is given its own prior parameters (µ_j, Σ_j). In practice, this would involve setting a very large number of parameters, so we instead set just two different sets of prior parameters, corresponding to the prior on genotypic effect sizes and the prior on nuisance parameters (intercepts and confounders). The values for the genotypic effects can either be left at a default value, set by the user, or learned from the data (described below). The default prior covariance matrix is a generalisation of the default SNPTest prior, and sets Σ_ab = 0.04 when a = b and Σ_ab = 0 when a ≠ b. The sign of the effect size is usually arbitrary, as the risk allele is chosen arbitrarily, so in all three cases we set µ_j = 0.
Priors on the effect sizes of intercepts and confounders must be specified as well. Again, we set µ_j = 0 for these values, and Σ_ab = 0 when a ≠ b (i.e. effect sizes are distributed around zero and are uncorrelated a priori). We also set Σ_aa = σ, where σ is a single user-specified parameter. Finally, we normalize all confounders to have a variance of 1, to ensure that the effect sizes are measured on a common scale. We use σ = 1 by default; investigations in real data suggest that the Bayes factors are largely insensitive to σ, provided it is not set significantly below 1 (data not shown).

2.3 Derivatives of multinomial likelihood and prior
The first and second derivatives of the log multinomial likelihood given in section 2.1 are given by

$$\frac{\partial \ell}{\partial \beta_{jk}} = \sum_{i=1}^{N} x_{ij} \left( \delta_{y_i k} - \Pr(y_i = k \mid x_i, B) \right)$$

and

$$\frac{\partial^2 \ell}{\partial \beta_{jk} \, \partial \beta_{j'k'}} = -\sum_{i=1}^{N} x_{ij} x_{ij'} \Pr(y_i = k \mid x_i, B) \left( \delta_{kk'} - \Pr(y_i = k' \mid x_i, B) \right)$$

respectively. Here δ_ab is the delta function, such that δ_ab is equal to 1 if a = b, and 0 otherwise. For the derivatives of the log-prior density, it is easiest to give these in terms of the first and second derivatives of the log normal probability density,

$$\frac{\partial \log \phi(\beta \mid \mu, \Sigma)}{\partial \beta} = -\Sigma^{-1}(\beta - \mu)$$

and

$$\frac{\partial^2 \log \phi(\beta \mid \mu, \Sigma)}{\partial \beta \, \partial \beta^T} = -\Sigma^{-1}.$$

The first and second derivatives of the log prior density are thus

$$\frac{\partial \log \Pr(B)}{\partial \beta_j} = -\Sigma_j^{-1}(\beta_j - \mu_j)$$

and

$$\frac{\partial^2 \log \Pr(B)}{\partial \beta_j \, \partial \beta_j^T} = -\Sigma_j^{-1}$$

respectively.
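The first derivative of the multinomial log likelihood is easy to check numerically; the sketch below (our own naming, not Trinculo's code) computes the analytic gradient and verifies it against central finite differences:

```python
import numpy as np

def loglik_and_grad(B, X, y, K):
    """Multinomial log likelihood and its gradient with respect to the
    effect-size matrix B; the control category 0 has effects fixed at zero.
    Gradient entry (j, k) is sum_i x_ij (delta_{y_i k} - p_ik)."""
    eta = X @ B
    denom = 1 + np.exp(eta).sum(1, keepdims=True)
    P = np.exp(eta) / denom
    Y = np.eye(K + 1)[y][:, 1:]              # indicators for case categories
    ll = (Y * eta).sum() - np.log(denom).sum()
    grad = X.T @ (Y - P)
    return ll, grad

# Compare the analytic gradient to central finite differences:
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(40), rng.integers(0, 3, 40)])
y = rng.integers(0, 3, 40)
B = rng.normal(0, 0.1, (2, 2))
ll, g = loglik_and_grad(B, X, y, 2)
num = np.zeros_like(B)
for j in range(2):
    for k in range(2):
        E = np.zeros_like(B)
        E[j, k] = 1e-6
        num[j, k] = (loglik_and_grad(B + E, X, y, 2)[0]
                     - loglik_and_grad(B - E, X, y, 2)[0]) / 2e-6
```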

2.4 Calculating marginal likelihoods and Bayes factors
We often wish to fit constrained models where the effects of particular predictors on particular phenotypes (i.e. particular elements of B) are set to zero. We define a particular model M , such that M specifies the set of constrained predictors and phenotypes, i.e. β_ab = 0 ∀ (a, b) ∈ M. We can calculate marginal likelihoods for a given model as

$$\Pr(\mathrm{data} \mid M) = \int \Pr(y \mid X, B) \Pr(B) \, dB,$$

where the integral is taken over the unconstrained elements of B.

2.5 Estimating an empirical covariance matrix prior
If we have a large number of loci, and we assume that their effect sizes are all drawn from the same distribution, we can estimate a single underlying covariance matrix Σ_l = Σ across all loci l. For each locus we fit the maximum likelihood model described above and calculate the estimated genotypic effect size vector β̂_l and the error covariance matrix Ω̂_l. We can then model the estimated effect sizes for each locus as being drawn from the distribution β̂_l ∼ N(0, Σ + Ω̂_l), and define a log-likelihood ℓ(Σ) = Σ_l log(φ(β̂_l | 0, Σ + Ω̂_l)). We calculate the maximum-likelihood estimate, Σ̂, using Newton's method. This estimated covariance matrix can in turn be used as an empirical prior for calculating marginal likelihoods for individual loci. The first and second derivatives of ℓ(Σ) are given by

$$\frac{\partial \ell}{\partial \Sigma} = \frac{1}{2} \, (2 \cdot \mathbf{1} - I) \circ \sum_l \left( W_l \hat{\beta}_l \hat{\beta}_l^T W_l - W_l \right)$$

and

$$\frac{\partial^2 \ell}{\partial \Sigma_{ab} \, \partial \Sigma_{cd}} = \sum_l \left( \frac{1}{2} \mathrm{tr}\!\left[ W_l S_{ab} W_l S_{cd} \right] - \mathrm{tr}\!\left[ W_l S_{ab} W_l \hat{\beta}_l \hat{\beta}_l^T W_l S_{cd} \right] \right)$$

respectively, where W_l = (Σ + Ω̂_l)^{-1}, A ∘ B is the Hadamard (i.e. element-wise) product, 𝟙 is a matrix where all elements are equal to 1, I is the identity matrix, and S_ab is a matrix where all elements are equal to zero except (S_ab)_ab = (S_ab)_ba = 1.
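The objective ℓ(Σ) itself is simple to evaluate; an illustrative Python sketch (Trinculo maximizes this with Newton's method, but for small examples a direct evaluation or grid search over candidate Σ suffices):

```python
import numpy as np

def empirical_sigma_loglik(Sigma, beta_hats, Omegas):
    """Log likelihood of a shared prior covariance Sigma across loci,
    modelling each estimated effect vector as beta_hat_l ~ N(0, Sigma + Omega_l),
    where Omega_l is the error covariance of the MLE at locus l."""
    ll = 0.0
    for b, Om in zip(beta_hats, Omegas):
        V = np.asarray(Sigma, dtype=float) + np.asarray(Om, dtype=float)
        _, logdet = np.linalg.slogdet(2 * np.pi * V)
        ll -= 0.5 * (logdet + b @ np.linalg.solve(V, b))
    return ll
```

With Σ = 0 the model attributes all spread in the estimates to estimation error, so loci with large estimated effects favour a larger Σ, as the second test case below illustrates.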

2.6 Simulation methodology
This simulation scheme is designed to efficiently sample genotype information for multiple phenotypes under confounding and ascertainment bias, even when certain categories are very rare. The aim is to simulate a set of predictors (genotypes and confounders) x_i for each individual, conditional on a phenotype y_i for each individual. We will write x_i = {1, g_i, x^c_i}, where the zeroth predictor is the intercept (x_i0 = 1), the first predictor is the genotype g_i for this individual, and predictors 2 to J are the confounders x^c_i (giving J − 1 confounders). The simulation scheme requires the following parameters:
• the number of phenotypes K,
• the number of individuals to sample in each phenotype category N_k,
• the population prevalence of each phenotype π_k,
• the population allele frequency of the risk variant f,
• the number of confounders J − 1,
• the effect of each confounder on each phenotype β_kj,
• the effect of each confounder on the allele frequency c_j (e.g. from the results of a regression of confounders on dosage in a reference population),
• the effect size of the risk variant on each phenotype, either set directly as β_k1 or drawn as zero-centred multivariate normal values from a provided covariance matrix Σ.
The intercept values β_k0 are set to give the specified phenotype population frequencies π_k, using the approximate equation

$$\beta_{k0} = \log\!\left( \frac{\pi_k}{1 - \sum_{s=1}^{K} \pi_s} \right)$$

(exact when all genotype and confounder effects are zero). We calculate the mean value of each confounder in each phenotype using numerical integration, and sample confounders independently. We can then calculate the genotype probabilities conditional on disease status and confounders using Bayes' rule,

$$\Pr(g_i \mid y_i = k, x_i^c) = \frac{\Pr(y_i = k \mid g_i, x_i^c)\, \Pr(g_i \mid x_i^c)}{\sum_{g'} \Pr(y_i = k \mid g', x_i^c)\, \Pr(g' \mid x_i^c)},$$

where Pr(g | x^c) is the population frequency of genotype g, given an adjustment for confounding between genotype and confounders. We model this confounding as g_i | x^c_i ∼ Binom(2, f + c^T x^c_i).
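The conditional sampling step can be sketched as follows (illustrative Python; the function name and array conventions are ours, not Trinculo's):

```python
import numpy as np

def sample_genotype_given_phenotype(k, xc, B, f, c, rng):
    """Sample one genotype conditional on phenotype k and confounders xc,
    via Bayes' rule: Pr(g | y=k, xc) is proportional to Pr(y=k | g, xc) Pr(g | xc).

    B: (J+1, K) effect matrix (rows: intercept, genotype, confounders);
    f: population allele frequency; c: confounder-on-frequency effects."""
    fa = np.clip(f + c @ xc, 1e-6, 1 - 1e-6)      # confounder-adjusted frequency
    pg = np.array([(1 - fa) ** 2,                  # Pr(g | xc) as Binom(2, fa)
                   2 * fa * (1 - fa),
                   fa ** 2])
    w = np.zeros(3)
    for g in range(3):
        x = np.concatenate(([1.0, g], xc))
        eta = x @ B
        probs = np.exp(eta) / (1 + np.exp(eta).sum())
        p_k = probs[k - 1] if k > 0 else 1 - probs.sum()
        w[g] = p_k * pg[g]                         # Bayes' rule, unnormalized
    return rng.choice(3, p=w / w.sum())
```

As a sanity check, with all effect sizes zero and no confounders, the sampled genotypes should simply follow Binom(2, f), i.e. have mean 2f.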