MCMC Methods for Multi-Response Generalized Linear Mixed Models: The MCMCglmm R Package

Generalized linear mixed models provide a ﬂexible framework for modeling a range of data, although with non-Gaussian response variables the likelihood cannot be obtained in closed form. Markov chain Monte Carlo methods solve this problem by sampling from a series of simpler conditional distributions that can be evaluated. The R package MCMCglmm implements such an algorithm for a range of model ﬁtting problems. More than one response variable can be analyzed simultaneously, and these variables are allowed to follow Gaussian, Poisson, multi(bi)nominal, exponential, zero-inﬂated and censored distributions. A range of variance structures are permitted for the random eﬀects, including interactions with categorical or continuous variables (i.e., random regression), and more complicated variance structures that arise through shared ancestry, either through a pedigree or through a phylogeny. Missing values are permitted in the response variable(s) and data can be known up to some level of measurement error as in meta-analysis. All simu-lation is done in C / C++ using the CSparse library for sparse linear systems.

Due to their flexibility, linear mixed models are now widely used across the sciences (Brown and Prescott 1999;Pinheiro and Bates 2000;Demidenko 2004). However, generalizing these models to non-Gaussian data has proved difficult because integrating over the random effects is intractable (McCulloch and Searle 2001). Although techniques that approximate these integrals (Breslow and Clayton 1993) are now popular, Markov chain Monte Carlo (MCMC) methods provide an alternative strategy for marginalizing the random effects that may be more robust (Zhao, Staudenmayer, Coull, and Wand 2006;Browne and Draper 2006). Developing MCMC methods for generalized linear mixed models (GLMM) is an active area of research (e.g., Zeger and Karim 1991;Damien, Wakefield, and Walker 1999;Sorensen and Gianola 2002;Zhao et al. 2006), and several software packages are now available that implement these techniques, e.g., WinBUGS (Spiegelhalter, Thomas, Best, and Lunn 2003), MLwiN (Rasbash, Steele, Browne, and Prosser 2005), glmmBUGS (Brown 2009), JAGS (Plummer 2003). However, these methods often require a certain level of expertise on behalf of the user and may take a great deal of computing time. The MCMCglmm package for R (R Development Core Team 2009) implements Markov chain Monte Carlo routines for fitting multi-response generalized linear mixed models. A range of distributions are supported and several types of variance structure for the random effects and the residuals can be fitted. The aim is to provide routines that require little expertise on behalf of the user while reducing the amount of computing time required to adequately sample the posterior distribution. The package is available from the Comprehensive R Archive Network at http://CRAN.R-project. org/package=MCMCglmm. In this paper we explain the underlying structure of GLMM's and then briefly describe a general strategy for estimating the parameters. Few new results are presented, and we would like to acknowledge that many of the statistical results can be found in Sorensen and Gianola (2002) and many of the algorithm details that allow the models to be fitted efficiently can be found in Davis (2006). The main body of the paper introduces the software, using a worked example taken from a quantitative genetic experiment. We end by comparing the routines with WinBUGS (Spiegelhalter et al. 2003), and find MCMCglmm to be nearly 40 times faster per iteration, and to have an effective sample size per iteration more than 3 times greater.

Model form
The model has three components: a) probability density functions that relate the data y to latent variables l, on the link scale b) a standard linear mixed model with fixed and random predictors applied to l and c) variance structures that describe the expected (co)variances between the location effects (fixed and random effects). Although we develop these models in a Bayesian context where the distinction between fixed and random effects does not technically exist, we make the distinction throughout the manuscript as the terminology is well entrenched and understood.

Probability of the data y given the latent variable l
The probability of the i-th data point is represented by: where f i is the probability density function associated with y i . For example, if y i was assumed to be Poisson distributed and we used the canonical log link function, then Equation 1 would have the form: where λ is the canonical parameter of the Poisson density function f P .

Linear model for the latent variables l
The vector of latent variables are predicted by the linear model where X is a design matrix relating fixed predictors to the data, and Z is a design matrix relating random predictors to the data. These predictors have associated parameter vectors β and u, and e is a vector of residuals. In the Poisson case these residuals deal with any over-dispersion in the data after accounting for fixed and random sources of variation.

Variance structures for the model parameters
The location effects (β and u), and the residuals (e) are assumed to come from a multivariate normal distribution: where β 0 are the prior means for the fixed effects with prior covariance matrix B, and G and R are the expected (co)variances of the random effects and residuals respectively. The zero off-diagonal matrices imply a priori independence between fixed effects, random effects, and residuals. Generally, G and R are large square matrices with dimensions equal to the number of random effects and residuals. Typically they are unknown, and must be estimated from the data, usually by assuming they are structured in a way that they can be parametrized by few parameters. Below we will focus on the structure of G, but the same logic can be applied to R.
At its most general, MCMCglmm allows variance structures of the form: where the parameter (co)variance matrices (V) are usually low-dimensional and are to be estimated, and the structured matrices (A) are usually high dimensional and treated as known. We will refer to terms separated by a direct sum (⊕) as component terms, and the use of a direct sum explicitly assumes random effects associated with different component terms are independent. Each component term, however, is formed through the Kronecker product (⊗) which allows for possible dependence between random effects within a component term.
Equation 24 can be expanded to give: where the zero off-diagonals represent the independence between component terms.
In the simplest models the structured matrices of each component term are often assumed to be identity matrices and the parameter (co)variance matrices scalar variances: which assumes that random effects within a component term are independent but have a common variance. However, independence between different levels is often too strong an assumption. For example, if we had made two visits to a sample of schools and recorded test scores for the children, we may expect dependence between measurements made in the same school although they were sampled at different times. If the random effects are ordered schools within ages (u = [u 1 u 2 ]) where u 1 are the random effects for the schools at time period one, and u 2 for the same set of schools at time period 2, then an appropriate G component may have the form: Here the diagonal elements model different variances for the two sampling periods, and the covariance captures any persistent differences between schools. The identity matrix in the Kronecker product implies the schools are independent. Although the assumption of independence may be adequate in many applications, there are situations where it is not tenable. For example, when data have been collected on related individuals, or related species, then complicated patterns of dependence can arise if the characteristics are heritable. In these cases A is not an identity matrix but a matrix whose elements are equal to the proportion of genes the two individuals have in common.

Parameter estimation and DIC
For most types of model (non-Gaussian data) the distribution of l is not in a recognizable form and is updated using either Metropolis-Hastings updates or the slice sampling method of (Damien et al. 1999). Latent variables whose residuals are non-independent are sampled in blocks using Metropolis-Hastings updates and an efficient proposal distribution is determined during the burn-in phase using adaptive methods (Haario, Saksman, and Tamminen 2001;Ovaskainen, Rekola, Meyke, and Arjas 2008). The parameters of the mixed model (β and u) follow a multivariate normal distribution and can be Gibbs sampled in a single block using the method of Garcia-Cortes and Sorensen (2001). This method requires solving a large, but often sparse set of linear equations which can be done efficiently using methods provided in the CSparse library (Davis 2006). With conjugate priors the variance structures (R and G) follow an inverse-Wishart distribution which can also be Gibbs sampled in a single block in many instances. By fitting non-identified multiplicative working parameters for the random effects non-central F -distributed priors for the variance components can be fitted (Gelman 2006). This involves updating the working parameters each iteration which again can be achieved using the method of Garcia-Cortes and Sorensen (2001).
The deviance and hence the deviance information criterion (DIC) can be calculated in different ways depending on what is in 'focus' (Spiegelhalter, Best, Carlin, and van der Linde 2002). For non-Gaussian response variables (including censored Gaussian) MCMCglmm calculates the deviance using the probability of the data given the latent variables. For Gaussian data, however, the deviance is calculated using the probability of the data given the location parameters θ = [β u].
In the appendix the conditional distributions, and computational strategies for sampling from them, are described in more detail, together with a more in depth explanation on the computation of deviance and DIC.

Software
To illustrate the software we reanalyze experimental data collected on the Eurasian passerine bird, the Blue tit (Cyanistes caeruleus), see Hadfield, Nutall, Osorio, and Owens (2007). The data consist of measurements taken on 828 chicks distributed across 106 broods: R> library("MCMCglmm") R> data("BTdata") R> BTdata[1,] tarsus back animal dam fosternest hatchdate sex 1 -1.892297 1.146421 R187142 R187557 F2102 -0.6874021 Fem The day after the chicks hatch, approximately half of the brood are reciprocally swapped with chicks from another nest. This results in an unbalanced cross-classified data structure where chicks share a fosternest with both relatives and non-relatives. Using molecular methods (Griffiths, Double, Orr, and Dawson 1998) the sex of the chicks were determined in 94% of cases, and the response variables, tarsus length and back color, were measured in all birds. The response variables are approximately normal and were mean centered and scaled to unit variance. The date on which the chicks hatched was recorded for all nests. The parental generation is assumed to consist of unrelated individuals and all chicks from the same family are assumed to share the same mother and father. Although in this example, family structure can be modeled more efficiently by fitting genetic mother (dam) as a random effect, we will use the more general animal model Henderson (1976) which is parametrized in terms of the relationship matrix, A. The relationship matrix is defined by the pedigree; animal dam sire 1 R187557 <NAR> <NAR> a 3 column data frame with an individual's identifier (animal) in the first column and its parental identifiers in the second and third columns. The pedigree often contains more individuals than are present in the data frame (in this example the pedigree also includes the parental generation) but all animal's in the data frame must have a row in the pedigree.

MCMCglmm arguments
The function MCMCglmm within the R package of the same name is used for model fitting. Hadfield et al. (2007) were interested in estimating the covariance between tarsus and back for different sources of variation and to achieve this we fitted the model: R> m1 <-MCMCglmm( + fixed = cbind(tarsus, back)~trait:sex + trait:hatchdate -1, + random =~us(trait):animal + us(trait):fosternest, + rcov =~us(trait):units, prior = prior, + family = c("gaussian", "gaussian"), nitt = 60000, burnin = 10000, + thin = 25, data = BTdata, pedigree = BTped) In the following sections we work through the four main arguments taken by MCMCglmm: those that specify the response variables and fixed effects (fixed), the distribution of the response variables (family), the random effects and associated G-structure (random), and the Rstructure (rcov). The syntax used to specify the model closely follows that used by asreml (Butler, Cullis, Gilmour, and Gogel 2007), an R interface to ASReml (Gilmour, Gogel, Cullis, Welham, and Thompson 2002) -a program for fitting GLMM using restricted maximum likelihood (REML).

fixed: Response variables and fixed effects
The fixed argument follows the standard R formula language, and although multiple responses can be passed as a single vector, it is perhaps easier in many cases to pass them as a matrix using cbind. For example, By fitting trait as a fixed effect we allow the two responses to have different means, and by fitting interactions such as trait:hatchdate we allow different regression slopes of the traits on hatchdate. Multi-response models models are generally easier to interpret when an overall intercept is suppressed (-1) otherwise the parameter estimates associated with back are interpreted as contrasts with tarsus.

family: Response variable distributions
For the above model, two distributions must be specified in the family argument, and we assume Gaussian distributions with identity link functions for both: family = c("gaussian", "gaussian") Other distributions and link functions can be specified (see Table 1). Some distributions require more data columns than linear predictors. For example, censored data are passed as Distribution #Data #Liability Density function "gaussian" MCMCglmm. The prefix "zi" stands for zero-inflated, and the prefix "cen" standards for censored where y 1 and y 2 are the upper and lower bounds for the unobserved datum y. J stands for the number of categories in the multinomial/categorical distributions and this must be specified in the family argument for the multinomial distribution. The density function is for a single datum in a univariate model with w being a row vector of W. f and F are the density and distribution functions for the subscripted distribution (N =Normal, P =Poisson, E=Exponential). The J − 1 γ's in the ordinal models are the cutpoints, with γ 1 set to zero.
two columns, the first specifying the lowest value the data could take, and the second column specifying the highest value the data could take. However, only a single linear predictor (associated with the uncensored but unobserved data) is fitted for that distribution and it should be remembered that in this case trait is really indexing linear predictors, not data. Another example of this is the binomial distribution (specified as "multinomial2" in the family argument) which is generally specified as a two column response of successes and failures, but is parametrized by a single linear predictor of the log odds ratio. In addition, some distributions actually have more linear predictors than data columns. For example, the zero-inflated Poisson has two linear predictors; one for predicting zero-inflation and one for predicting the Poisson counts. Similarly, categorical data although passed as a single response are treated as a multinomial response with J − 1 linear predictors (where J are the number of categories). Again, it should be remembered that in this case several levels of trait may be associated with different aspects of the same data column.

random: Random effects and G
Simple variance structures, as represented in Equation 7, can also be specified as a standard R formula: although this is often inappropriate, especially for multi-response models where the implicit assumption has been made that fosternest effects are identical for both traits. Table 2 summarizes covariance matrix specifications for the general 3 case, but to illustrate, we will focus on a 2 × 2 (co)variance matrix (V f ) associated with fosternest effects: The diagonal elements are the fosternest variance components for tarsus length and back color, and the off-diagonal elements are the covariance between fosternest effects on the two traits. The specification above, without an interaction, forces the structure: where all components are forced to be the same. It is natural to form interactions with trait as we did with the fixed effects, although there are three possible ways this could be done. The straight forward interaction trait:fosternest although still fitting a single variance component across both traits, assumes that individual effects are independent between traits: More useful interactions can be formed using the idh() and us() functions. For example, idh(trait):fosternest fits heterogeneous variances across traits: although still assumes that the two traits are independent at the fosternest level. The specification us(trait):fosternest fits the completely parametrized matrix that allows for covariance across traits: Since the experiment was designed to measure the covariances between the two response variables, completely parametrized (co)variance matrices are specified: random =~us(trait):fosternest + us(trait):animal For models that have pedigree or phylogenetic effects the vector of random effects needs to be associated with the inverse relationship matrix A −1 . This matrix is formed by passing a pedigree or phylogeny to the pedigree argument of MCMCglmm. The individuals (or taxa) need to be associated with a column in the data frame, and this column must be called animal.
It is also possible to fit random interactions between categorical and continuous variables as in random regression models. For example, a random intercept-slope model with a covariance term fitted could be specified: random =~us(1+age):individual or for higher order polynomials the poly function could be used: random =~us(1 + poly(age, 2)):individual Another form of random effect structure that does not arise in the worked example is that arising in meta analysis. In meta-analysis each data point is measured with some error. If the sampling error around the true value is approximately normal, and the variance of the sampling errors known, then random effect meta-analyses can be fitted by passing the sampling variances to the mev argument of MCMCglmm. In the simplest case, without additional random effects and i.i.d R-structure, the latent variables are assumed to have the multivariate normal distribution: where D is a diagonal matrix with mev along the diagonal.

rcov: Residual variance structure R
The R-structure can be parametrized in the same way as the G-structure although currently direct sums are not possible. However, unlike the G-structure it is important that the residual model is specified in away that allows each linear predictor to have a unique residual. For multi-response models forming an interaction between trait and units satisfies this condition and as with the G-structure various types of interaction could be considered. Again, we will use a fully parametrized covariance matrix: rcov =~us(trait):units

prior: Response variables and fixed effects
If not defined, default priors are used which are not proper and this can lead to both inferential and numerical problems. The prior specification is passed to MCMCglmm via the argument prior which takes a list of three elements specifying the priors for the fixed effects (B), the G-structure (G) and the R-structure (R).
For the fixed effects, a multivariate normal prior distribution can be specified through the mean vector mu (β 0 ) and a (co)variance matrix V (B) passed as list elements of B. The default has a zero mean vector and a diagonal variance matrix with large variances (1e+10).
For non-parameter expanded models, the parameter (co)variance matrices are assumed to have (conditional) inverse-Wishart prior distributions and individual elements for each component of the variance structure take the arguments V, n and fix which specify the expected (co)variance matrix at the limit, the degree of freedom parameter, and the partition to condition on. The variance structure prior specification for the above models was R> prior <-list(R = list(V = diag(2)/3, n = 2), G = list( + G1 = list(V = diag(2)/3, n = 2), G2 = list(V = diag(2)/3, n = 2))) where the expected covariance matrices for all three components of the variance structure are diagonal matrices implying a priori independence between tarsus and back. The traits were scaled to have unit variance prior to analysis and so the specification implies the prior belief that the total variance is evenly split across all three terms. The term fix has been left unspecified and so all variance parameters are estimated. However, for certain types of model it is advantageous to be able to fix sub-matrices at certain values and not estimate them. The fix argument partitions V into (potentially) 4 sub-matrices where the partition occurs on the fix-th diagonal element. For example, if V is an n × n matrix then V is partitioned: V = V 1:(fix-1),1:(fix-1) V 1:(fix-1),fix:n V fix:n,1:(fix-1) V fix:n,fix:n and the lower right sub-matrix (V fix:n,fix:n ) is fixed and not estimated. When fix = 1 the whole matrix is fixed.
Two further arguments that can passed are alpha.mu and alpha.V which specify the prior distribution for the non-identified working parameters. When the matrix alpha.V is non-null parameter expanded models are fitted. When the variance-structure defines a single variance, the prior distribution is a scaled non-central F -distribution (Gelman 2006). Without loss of generality we can have V = 1 in the prior to give: where f F is the density function of the F -distribution defined by three parameters: the numerator and denominator degrees of freedom and the non-centrality parameter, respectively.

MCMC output
The model was ran for 60,000 iterations with a burn-in phase of 10,000 and a thinning interval of 25. MCMCglmm returns a list with elements:  The samples from the posterior distribution are stored as mcmc objects, which can be summarized and visualized using the coda package (Plummer, Best, Cowles, and Vines 2008). The element Sol contains the fixed effects (β), and, if pr = TRUE, also the random effects (u). The element VCV contains the parameter (co)variance matrices stacked column-wise, and, if pl = TRUE, Liab contains the posterior distribution of latent variables l. The element Deviance contains the deviance at each stored iteration and DIC contains the deviance information criterion (Spiegelhalter et al. 2002) calculated over all iterations after burn-in. Traces of the sampled output and density estimates are shown for the effects of gender on trait expression ( Figure 1) and the genetic covariance matrix associated with animal ( Figure 2).  Table 3: Deviance information criteria for several models where the covariance between the response variable for a designated source of variation was either estimated (us) or set to zero (idh). Each model was ran twice in order to asses the level of Monte Carlo error in calculating DIC.
We also fitted alternative variance structures where some or all covariances were set to zero, and Table 3 shows the DIC for each model. The priors on the reduced models were set up so that the marginal prior for the variances was the same as that in the full model. The sampling error of DIC can be large and so we ran all models for an additional 500,000 iterations.

Comparison with WinBUGS
We also fitted an identical model in WinBUGS (code available from the author) using a multivariate extension to the method proposed by Waldmann (2009). On a 2.5Ghz dual core MacBook Pro with 2GB RAM, MCMCglmm took 7.6 minutes and WinBUGS took 4.8 hours to fit the model. Moreover, the number of effective samples was 3.2 times higher in MCMCglmm (averaged over all parameters) indicating that the chain has better mixing properties. Because MCMCglmm samples all location parameters in a single block the gains in efficiency are expected to be even higher when the parameters show stronger posterior correlation.

Concluding remarks
This paper introduces an R package for fitting multi-response generalized linear mixed models using Markov chain Monte Carlo techniques developed in quantitative genetics (Sorensen and Gianola 2002). A key aspect of these techniques is that they update all location effects (fixed and random) as a single block which results in better mixing properties and shorter chain lengths than alternative strategies. This can involve repeatedly solving a very large but sparse set of mixed model equations, and the computational cost of doing this is minimized by using the CSparse C libraries for solving sparse linear systems (Davis 2006). For the example data set analyzed, MCMCglmm collected 120 times more effective samples per unit time than the same model fitted in WinBUGS. A range of distributions for the response variables are permitted, and flexible variance structures for the random effects and residuals included. It is hoped that this package makes the flexibility and simplicity of generalized linear mixed modeling available to a wider range of researchers.

A. Updating the latent variables l
The conditional density of l is given by: where f N indicates a Multivariate normal density with specified mean vector and covariance matrix. Equation 15 is the probability of the data point y i with linear predictor l i on the link scale for distribution f i , multiplied by the probability of the linear predictor residual. The linear predictor residual follows a conditional normal distribution where the conditioning is on the residuals associated with data points other than i. Vectors and matrices with the row and/or column associated with i removed are denoted /i. In practice, this conditional distribution only involves other residuals which are expected to show some form of residual covariation, as defined by the R structure. Because of this we actually update latent variables in blocks, where the block is defined as groups of residuals which are expected to be correlated: where j indexes blocks of latent variables that have non-zero residual covariances. A special case arises for multi-parameter distributions in which each parameter is associated with a linear predictor. For example, in the zero-inflated Poisson two linear predictors are used to model the same data point, one to predict zero-inflation, and one to predict the Poisson variable. In this case the two linear predictors are updated in a single block even when the residual covariance between them is set to zero, because the first probability in Equation 16 cannot be factored: We use adaptive methods during the burn-in phase to determine an efficient multivariate normal proposal distribution entered at the previous value of l j with covariance matrix mM. For computational efficiency we use the same M for each block j, where M is the average posterior (co)variance of l j within blocks and is updated each iteration of the burn-in period Haario et al. (2001). The scalar m is chosen using the method of Ovaskainen et al. (2008) so that the proportion of successful jumps is optimal, with a rate of 0.44 when l j is a scalar declining to 0.23 when l j is high dimensional (Gelman, Carlin, Stern, and Rubin 2004).
For the standard linear mixed model with a Gaussian response and identity link, P(l i = y i |y, θ, R, G) is always unity and so the Metropolis-Hastings steps are always omitted. When the latent variables within a block j are associated with missing data then their conditional distribution is multivariate normal and can be Gibbs sampled directly: where design matrices subscripted by j are the rows of the original design matrices associated with the latent variables in block j.

B. Updating the location vector θ = β u
Garcia-Cortes and Sorensen (2001) provide a method for sampling θ as a complete block that involves solving the sparse linear system: where C is the mixed model coefficient matrix: and W = [X Z], and B is the prior (co)variance matrix for the fixed effects.
θ and e are random draws from the multivariate normal distributions: and e ∼ N (Wθ , R) θ + θ gives a realization from the required probability distribution: Equation 19 is solved using Cholesky factorization. Because C is sparse and the pattern of non-zero elements fixed, an initial symbolic Cholesky factorization of PCP is preformed where P is a fill-reducing permutation matrix (Davis 2006). Numerical factorization must be performed each iteration but the fill-reducing permutation (found via a minimum degree ordering of C + C ) reduces the computational burden dramatically compared to a direct factorization of C (Davis 2006).
Forming the inverse of the variance structures is usually simpler because they can be expressed as a series of direct sums and Kronecker products: and the inverse of such a structure has the form which involves inverting the parameter (co)variance matrices (V), which are usually of low dimension, and inverting A. For many problems A is actually an identity matrix and so inversion is not required. When A is a relationship matrix associated with a pedigree, Henderson (1976); Meuwissen and Luo (1992) give efficient recursive algorithms for obtaining the inverse, and Hadfield and Nakagawa (2010b) derive a similar procedure for phylogenies.

C. Updating the variance structures G and R
Components of the direct sum used to construct the desired variance structures are conditionally independent. The sum of squares matrix associated with each component term has the form: where U is a matrix of random effects where each column is associated with the relevant row/column of V and each row associated with the relevant row/column of A. The parameter (co)variance matrix can then be sampled from the inverse Wishart distribution: where n u is the number of rows in U, and S p and n p are the prior sum of squares and prior degree's of freedom, respectively.
In some models, some elements of a parameter (co)variance matrix cannot be estimated from the data and all the information comes from the prior. In these cases it can be advantageous to fix these elements at some value and Korsgaard, Andersen, and Sorensen (1999) provide a strategy for sampling from a conditional inverse-Wishart distribution which is appropriate when the rows/columns of the parameter matrix can be permuted so that the conditioning occurs on some diagonal sub-matrix. When this is not possible Metropolis-Hastings updates can be made.

D. Ordinal models
For ordinal models it is necessary to update the cutpoints which define the bin boundaries for latent variables associated with each category of the outcome. To achieve good mixing we used the method developed by (Cowles 1996) that allows the latent variables and cutpoints to be updated simultaneously using a Hastings-with-Gibbs update.

E. Parameter expansion
As the covariance matrix approaches a singularity the mixing of the chain becomes notoriously slow. This problem is often encountered in single-response models when a variance component is small and the chain becomes stuck at values close to zero. Similar problems occur for the EM algorithm and (Liu, Rubin, and Wu 1998) introduced parameter expansion to speed up the rate of convergence. The idea was quickly applied to Gibbs sampling problems Liu and Wu (1999) and has now been extensively used to develop more efficient mixed-model samplers (e.g., van Dyk and Meng 2001;Gelman, van Dyk, Huang, and Boscardin 2008;Browne, Steele, Golalizadeh, and Green 2009).
The columns of the design matrix (W) can be multiplied by the non-identified working parameters α = [1, α 1 , α 2 , . . . α k ] : where the indices denote sub-matrices of Z which pertain to effects associated with the same variance component. Replacing W with W α we can sample the new location effects θ α as described above, and rescale them to obtain θ: where the identity matrices are equal in dimension to n x the number of elements in the subscripted parameter vector x.
Likewise, the (co)variance matrices can be rescaled by the set of α's associated with the variances of a particular variance structure component (α V ): The working parameters are not identifiable in the likelihood, but do have a proper conditional distribution. Defining X α as an n × (k + 1) design matrix, with each column equal to the sub-matrices in Equation 28 post-multiplied by the relevant sub-vectors of θ α , we can see that α is a vector of regression coefficients: and so the methods described above can be used to update them.

F. Deviance and DIC
The deviance D is defined as: where Ω is some parameter set of the model. The deviance can be calculated in different ways depending on what is in 'focus', and MCMCglmm calculates this probability for the lowest level of the hierarchy (Spiegelhalter et al. 2002). For Gaussian response variables the likelihood is the density: where Ω = {θ, R} but for other response variables variables it is the product: with Ω = l.
For multivariate models with mixtures of Gaussian and non-Gaussian data (including missing values) the likelihood of the Gaussian data is the density of y g in the conditional density: f N y g |X g β + Z g u + R g,l R −1 l,l (l − X l β − Z l u), R g,g − R g,l R −1 l,l R l,g where the subscripts g and l denote rows of the data vector/design matrices that pertain to Gaussian data, and non-Gaussian data respectively. Subscripts on the R-structure index both rows and columns. The likelihood of the non-Gaussian data are identical to Equation 34 giving the complete parameter set Ω = {θ g , R, l}.
The deviance is calculated at each iteration if DIC = TRUE and stored each thin-th iteration after burn-in. The mean deviance (D) is calculated over all iterations, as is the mean of the latent variables (l) the R-structure and the vector of predictors (Xβ + Zu). The deviance is calculated at the mean estimate of the parameters (D(Ω)) and the deviance information criterion calculated as: