Flexible linear mixed models with improper priors for longitudinal and survival data

We propose a Bayesian approach using improper priors for hierarchical linear mixed models with flexible random effects and residual error distributions. The error distribution is modelled using scale mixtures of normals, which can capture tails heavier than those of the normal distribution. This generalisation is useful for producing models that are robust to the presence of outliers. The case of asymmetric residual errors is also studied. We present general results for the propriety of the posterior that also cover cases with censored observations, allowing for the use of these models in the contexts of popular longitudinal and survival analyses. We consider the use of copulas with flexible marginals for modelling the dependence between the random effects, but our results cover the use of any random effects distribution. Thus, our paper provides a formal justification for Bayesian inference in a very wide class of models (covering virtually all of the literature) under attractive prior structures that limit the amount of required user elicitation.


Introduction
Longitudinal and survival data appear in many application areas, such as medicine, biology, agriculture, epidemiology, economics and small area estimation. Observations of this kind typically present within- and among-subject dependence due to grouping and/or repeated measurements. Hierarchical linear mixed models (LMM) are often used to account for parameter variation across groups of observations as well as for dependence. The general formulation of this type of model is given by

y_ij = x_ij'β + z_ij'u_i + ε_ij,   (1)

where y_ij is the observed response for subject i at time t_ij, j = 1, ..., n_i, with n_i recording the number of repeated measurements for subject i, and i = 1, ..., r, where r denotes the number of subjects; β is a p × 1 vector of fixed effects; u_i are q × 1 mutually independent random vectors (often called random effects); and ε_ij are i.i.d. residual errors. Let y = {y_ij} be the n × 1 vector of response variables, let X and Z denote the known design matrices of dimension n × p and n × qr, respectively, let ε be the n × 1 vector of residual errors, and write u = (u_1', ..., u_r')', so that (1) reads y = Xβ + Zu + ε in vector form. We assume that rank(X) = p and n > p + qr throughout.
These models are used in different fields under different names and notation. In Bayesian analysis of variance, the use of these models goes back to Tiao and Tan (1965). In survival analysis, the n survival times grouped in T are usually transformed to logarithms and modelled through log(T) = Xβ + Zu + ε, (2), which is often referred to as a Mixed Effects Accelerated Failure Time model (MEAFT, Komárek and Lesaffre, 2007). In this literature, the random effects u are typically called frailties. An additional challenge that often appears in this case is the presence of censored observations. In the context of production and cost frontiers used in economics, the logarithms of the outputs (or costs) of r firms are modelled using (1) with certain regularity restrictions on the regression coefficients β and sign restrictions on the random effects u_i, which have a clear interpretation here in terms of inefficiencies. These models are known as stochastic frontier models (see e.g. Fernández et al., 1997). The random vectors ε and u are often assumed to be normally distributed. Given that the normality assumption can be restrictive in practice (and even impossible for stochastic frontier models), alternative distributional assumptions have been explored, such as smooth flexible (semi-nonparametric) distributions for the random effects with normal residual errors (Zhang and Davidian, 2001), skew Student-t distributions for the random effects (Lee and Thompson, 2008), normal random effects with residual errors modelled through a finite mixture of normals (Komárek and Lesaffre, 2007, 2008), and residual errors and random effects jointly distributed as a multivariate scale mixture of skew-normal distributions (Lachos et al., 2010), which makes the residual errors and the random effects dependent. Jara et al. (2008) model both the residual errors and the random effects using multivariate skew-elliptical distributions, and Dunson (2010) and Jara et al. (2009) use Bayesian nonparametric approaches.
Although flexible, Bayesian nonparametric approaches tend to be more computationally expensive than simpler flexible parametric models, while leading to similar conclusions as shown in our examples.
In a Bayesian framework with improper priors, Hobert and Casella (1996) and Sun et al. (2001) present conditions for the propriety of the posterior under a certain improper prior structure for the parameters of model (1) with normal assumptions on both the random effects and the residual errors. When using improper prior structures, it becomes essential to check the propriety of the posterior distribution in order to justify its use and interpretation in a probabilistic framework. Hobert and Casella (1996) characterise cases where certain improper prior structures used in practice lead to proper posteriors. The use of such priors was becoming common at that time since they allow for the construction of a Gibbs sampler. However, Hobert and Casella (1996) show that the posterior may still be improper in some cases and, consequently, its use is not theoretically justified. Rubio (2015) presents extensions of these propriety results to the use of certain classes of flexible random effects distributions. Hobert and Casella (1996) and Rubio (2015) assume independence of the random effects, while Sun et al. (2001) assume a specific structure for the covariance of the random effects. Fernández et al. (1997) proposed an improper prior structure which allows for using arbitrary random effects distributions, as long as they are proper. More specifically, they adopt the following stochastic structure:

ε | σ_ε ~ N_n(0, σ_ε^2 I_n),   u ~ p(u),   π(β, σ_ε) ∝ π(β)/σ_ε,   (3)

where p(u) is proper, and π(β) is proper or bounded. The condition that p(u) is proper is rather mild, as this allows for using any parametric distribution for the random effects u with proper priors on the "deeper" parameters that index the random effects distribution. The propriety of p(u) is essential, since an improper p(u) never leads to a proper posterior (see Theorem 2 of Fernández et al., 1997). The prior structure on (β, σ_ε) is attractive, as it includes priors with useful properties such as invariance with respect to reparameterisation under affine transformations.
In this paper, we extend (3) by relaxing the assumption of normality of the residual errors and derive conditions for the existence of the corresponding posterior distribution that also allow for censored observations, covering virtually all models in the recent literature. Before presenting our results, we clarify our main contributions relative to our previous work on flexible linear regression models. Rubio and Genton (2016) study linear regression models with skew-symmetric errors and improper priors, covering the case with censored observations. Analogous results are presented in Rubio and Yu (2017) in the context of linear regression models with two-piece residual errors. Both papers focus on the effect of using skewed residual error distributions on the prediction of the residual life of censored individuals. They do not study the propriety of the posterior distribution of linear regression models with random effects, which requires special attention, as shown in this paper. The Appendix includes proofs not given in the main text (Appendix A), conditions on the mixing distributions (Appendix B) and a further simulation study with skewed errors and random effects (Appendix C). The R codes used in the real data examples are available at http://www.rpubs.com/FJRubio.

Multivariate scale mixtures of normals: a warning
In the context of linear regression, a common strategy for relaxing the assumption of normality of the residual errors consists in using multivariate scale mixtures of normals (Lachos et al., 2010). Recall that a p-variate scale mixture of normals distribution, SMN_p(μ, Σ, δ; H), with location parameter μ, symmetric positive definite scale matrix Σ, and parametric mixing distribution H(· | δ), is defined through the density

p(x | μ, Σ, δ) = ∫_0^∞ N_p(x | μ, Σ/τ) dH(τ | δ),

where N_p(· | μ, Σ/τ) denotes the p-variate normal density with mean μ and covariance matrix Σ/τ. For example, when H(τ | δ) is a Gamma distribution with parameters (δ/2, δ/2) (and mean one), we obtain the p-variate Student-t distribution with δ degrees of freedom. Osiewalski and Steel (1993) and Breusch et al. (1997) have pointed out inferential peculiarities of such error distributions, in both Bayesian and classical regression settings. Here we explore the implications of using multivariate scale mixtures of normals in the context of LMM. Consider the following hierarchical structure for model (1):

u ~ p(u),   (4)
ε | σ_ε, δ_ε ~ SMN_n(0, σ_ε^2 I_n, δ_ε; H),   (5)

where p(u) is proper. For our theoretical results, it does not matter whether p(u) is assumed to be a fixed distribution without further recourse to deeper parameters or whether it is a marginal distribution obtained from a proper joint distribution. In practice, the latter is typically the case, and p(u) is derived from a proper parametric distribution with a proper prior on its parameters. Throughout Sections 2 and 3, we will only refer to p(u) for the sake of notational simplicity. We adopt the prior structure (6), in which b ≥ 0, π(β) is bounded, and π(δ_ε) is a proper prior with support Δ ⊂ R.
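To make the Student-t example above concrete, the construction can be checked by simulation: a single Gamma mixing variable with mean one scales the whole covariance matrix. A minimal Python sketch (the paper's own code is in R; the function name, seed and settings below are ours):

```python
import numpy as np

rng = np.random.default_rng(0)

def rmvt_smn(n, mu, Sigma, df, rng):
    """p-variate Student-t via a scale mixture of normals: a single
    Gamma(df/2, df/2) mixing variable tau (mean one) scales the whole
    covariance matrix, Sigma / tau."""
    p = len(mu)
    tau = rng.gamma(df / 2.0, 2.0 / df, size=n)          # mean-one Gamma draws
    z = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    return mu + z / np.sqrt(tau)[:, None]

x = rmvt_smn(200_000, np.zeros(3), np.eye(3), df=5.0, rng=rng)
# Each marginal is Student-t with 5 df and unit scale; variance df/(df-2) = 5/3
print(x.var(axis=0))
```

Note that one draw of τ is shared by the whole vector, which is exactly the feature that causes the inferential problems discussed in this section.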
The following result shows that the marginal likelihood of the data for this hierarchical model can be factorised as the product of the marginal likelihood under normality and a quantity that does not depend on the data.

Remark 1. For the model (1), (4)-(6), the marginal likelihood of the data can be written as

m(y) = m_N(y) × c(H, π(δ_ε)),   (7)

where m_N(y) denotes the marginal likelihood under normal residual errors and the second factor c(H, π(δ_ε)) depends only on the mixing distribution and the prior on δ_ε, not on y.
The factorisation of the marginal likelihood of the data can be used to obtain conditions for the propriety of the posterior distribution as follows.
Corollary 1. Consider the model (1), (4)-(6), let (X : Z) denote the entire design matrix, and suppose that p(u) is proper. Consider also the following conditions: (a) rank(X : Z) < n.

Proof. From Remark 1, the marginal density of the data can be written as in (7). The finiteness of m(y) in this expression, under conditions (a)-(b), is proved in Theorem 1 of Fernández et al. (1997). The second factor in (7) is finite by condition (c).
For b = 0, condition (c) is satisfied by any mixing distribution. However, for b > 0 this condition may rule out the use of some mixing distributions and/or priors on δ_ε, as discussed in the Supplementary Material (Appendix B). Remark 1 and Corollary 1 point out some issues with this extension. If we use Bayes factors to compare two models M_1 and M_2 of the type described in Remark 1, with different distributions for the residuals and corresponding marginal likelihoods m_1(y) and m_2(y), the common data-dependent factor in (7) cancels. Therefore, the Bayes factor is determined solely by the mixing distributions and their priors, but not by the data (!). Intuitively, these results indicate that the data do not contain information about the additional shape parameter δ_ε. The factorisation (7) indicates that the use of this kind of multivariate extension is vacuous, in the sense that the additional shape parameters do not really help model selection. Similar issues with the use of certain classes of multivariate distributions and priors have been studied in Maruyama and Strawderman (2014) for linear regression models. The next section presents an alternative extension that overcomes such problems.

The proposed extension: flexible linear mixed models
The inferential issues pointed out in the previous section are due to the use of a single mixing variable in the construction of the joint distribution of the residual errors. In other words, (5) really involves only a single realisation of the mixing variable, irrespective of the sample size n. Thus, we only have a single observation of the mixing variable, and inference on its (precision) parameter is impossible from the sample. One way to avoid these issues consists of using one mixing variable for each individual error. In this case, each observation contributes to the estimation of the parameter of the distribution of the mixing variable (see West, 1984, for a discussion in the context of Bayesian linear regression). Thus, we consider the model in (1) and (4), but replace (5) by

ε_ij | σ_ε, δ_ε ~ i.i.d. SMN_1(0, σ_ε^2, δ_ε; H).   (8)

This is a rather wide class of symmetric and unimodal distributions, with tails that are either normal or fatter than normal, covering e.g. the normal, Student-t, logistic, Laplace, Cauchy and the exponential power family with power 1 ≤ q < 2 (see Fernández and Steel, 2000, for more details). These distributions naturally lead to models that are robust to the presence of outliers (see Rosa et al., 2003, who also adopt SMN_1 residual errors, but restrict their study to proper priors and normal random effects). Using the same prior structure (6), we derive the following conditions for the propriety of the posterior distribution.
Theorem 1. For the model (1), (4), (8) with prior (6), and p(u) a proper distribution, consider the following conditions:

Condition (a) indicates the need for repeated measurements (otherwise n = r and rank(X : Z) = n for sensible choices of Z), while conditions (b) and (c) characterise the priors and mixing distributions that produce a proper posterior. Given that all the distributional assumptions in Theorem 1 are continuous, condition (d) is satisfied with probability one. Condition (d) is equivalent to saying that the system y = Xβ + Zu has no exact solution in (β, u), which can be checked numerically.
The only requirement on the distribution of the random effects u is that the marginal p(u) is proper, so as stated before, we can use an arbitrary parametric distribution u ∼ F (· | θ), with a proper prior on the parameter θ. This includes, for example, the use of finite mixtures of normals, scale mixtures of normals, skewed scale mixtures of normals, truncated distributions, etc. In our examples, we explore the use of distributions that can capture departures from normality in terms of asymmetry and tail behaviour.
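The contrast with Section 2 can be made concrete: with one mixing variable per error, the sample is informative about the tail parameter. A hedged Python sketch (the grid of candidate degrees of freedom, seed and sample size are illustrative choices, not from the paper):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, df_true = 5000, 4.0

# One mixing variable per error: eps_j | tau_j ~ N(0, 1/tau_j), with
# tau_j ~ Gamma(df/2, df/2) i.i.d., so the eps_j are i.i.d. Student-t(df).
tau = rng.gamma(df_true / 2.0, 2.0 / df_true, size=n)
eps = rng.normal(size=n) / np.sqrt(tau)

# Every observation carries its own tau_j, so the likelihood discriminates
# between tail parameters and peaks near the true degrees of freedom.
grid = np.array([2.0, 4.0, 8.0, 16.0, 32.0])
loglik = np.array([stats.t.logpdf(eps, d).sum() for d in grid])
print(grid[np.argmax(loglik)])
```

Under the single-mixing-variable construction of Section 2 the analogous grid search would be flat up to a data-free factor, which is the vacuousness noted there.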

Mixed effects accelerated failure time models with censoring
The propriety result presented in Theorem 1 can be extended to models used in survival analysis. Let T be a sample of n survival times. The MEAFT in (2) is exactly the same as (1) for the log survival times. However, a common difficulty that arises with survival regression models is the presence of censored observations (Komárek and Lesaffre, 2007; Vallejos and Steel, 2015). If the sample of survival times does not contain censored observations, the existence of the posterior is provided by Theorem 1. If the sample contains both censored and uncensored observations, the propriety of the posterior can be based on the uncensored observations as follows.
Theorem 2. Let T be a sample of survival times where n 0 observations are uncensored, with p + qr < n 0 < n. Consider the model (2), (4), (8) with prior (6), and p(u) a proper distribution. Then, the posterior is proper if the marginal likelihood of the uncensored observations is finite.
The following result presents conditions for the case when the sample contains only censored observations.

Theorem 3. Suppose that n_c survival times T_j, j = 1, ..., n_c, are observed as closed intervals I_j, and the rest of the observations exhibit another type of censoring. Thus, let I_1, ..., I_{n_c} be finite-length intervals on the positive real line, n_c ≤ n, and let (X : Z)_{n_c} represent the design matrix associated with the n_c interval-censored observations. Consider the model (2), (4), (8) with prior (6), p(u) assumed to be proper, and the following condition:

(d′) The set E = I_1 × · · · × I_{n_c} and the column space of (X : Z)_{n_c} are disjoint.

Conditions (a)-(c) from Theorem 1 together with (d′) are sufficient for the propriety of the posterior.
Condition (d′) means that there is no (unobserved) true response that can be written as a linear combination of the columns of X and Z, and it can be relaxed in terms of dimensionality as follows.
(d′′) There exists a set Ẽ = I_{j_1} × · · · × I_{j_ñ}, with p + qr < ñ ≤ n_c, for some set of indexes J = {j_1, ..., j_ñ} ⊆ {1, ..., n_c}, such that Ẽ and the column space of (X̃ : Z̃) are disjoint, where X̃ and Z̃ are the design submatrices associated with the indexes in J.
Another way of checking condition (d′) consists of formulating it as a linear programming problem. Denote η ∈ R^{p+qr}, ξ = (ξ_1, ..., ξ_{n_c}) ∈ E, and I_j = [l_j, u_j], j = 1, ..., n_c. Define the problem:

find η such that l_j ≤ [(X : Z)_{n_c} η]_j ≤ u_j for all j = 1, ..., n_c.   (9)

Then, condition (d′) is equivalent to the infeasibility of (9), for which several theoretical and numerical tools are available.
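As a sketch of this feasibility check, the condition can be verified with an off-the-shelf LP solver. The design matrix and intervals below are toy values, not from the paper, and the function name is ours; Python is used here although the paper's code is in R:

```python
import numpy as np
from scipy.optimize import linprog

def condition_d_holds(A, l, u):
    """Check whether the box E = [l_1, u_1] x ... x [l_nc, u_nc] is disjoint
    from the column space of A = (X : Z)_{nc}.  The LP 'find eta with
    l <= A @ eta <= u' is feasible iff the two sets intersect, so the
    condition holds iff the LP is infeasible."""
    k = A.shape[1]
    res = linprog(
        c=np.zeros(k),                   # pure feasibility problem
        A_ub=np.vstack([A, -A]),         # A eta <= u  and  -A eta <= -l
        b_ub=np.concatenate([u, -l]),
        bounds=[(None, None)] * k,       # eta is unrestricted
    )
    return not res.success

# Toy example (hypothetical design): one coefficient, two censored observations.
A = np.array([[1.0], [1.0]])
print(condition_d_holds(A, np.array([0.0, 2.0]), np.array([1.0, 3.0])))  # True: intervals incompatible
print(condition_d_holds(A, np.array([0.0, 0.5]), np.array([1.0, 1.5])))  # False: eta in [0.5, 1] works
```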

Stochastic frontier models
The stochastic frontier model with composed error takes the form of (1) with u ∈ U, where U = {u : Zu ∈ R_+^n}. Several specifications of interest in economics are studied in Fernández et al. (1997). The propriety results in Theorem 1 apply to this model, and they extend Theorem 1 of Fernández et al. (1997).

Asymmetric errors
So far, we have considered symmetric extensions of the standard linear mixed model with normal errors. We now extend to asymmetric errors by using the class of two-piece distributions. Let f be a continuous density with a unique mode at 0, support on R and shape parameter δ. A random variable W is distributed according to a two-piece f distribution (denoted W ∼ TP(μ, σ, γ, δ; f)) if its density function can be written as

s(w | μ, σ, γ, δ) = 2 / {σ[a(γ) + b(γ)]} [ f((w − μ)/(σ b(γ)) | δ) I(w < μ) + f((w − μ)/(σ a(γ)) | δ) I(w ≥ μ) ],   (10)

where I(·) denotes the indicator function, and {a(·), b(·)} are positive differentiable functions. This density is continuous and unimodal, with mode at μ ∈ R, scale parameter σ ∈ R_+, and skewness parameter γ ∈ Γ ⊂ R. It coincides with the density f when a(γ) = b(γ), and is asymmetric for a(γ) ≠ b(γ) while retaining the tails of f in each direction. The most common choices for a(·) and b(·) correspond to the inverse scale factors parameterisation {a(γ), b(γ)} = {γ, 1/γ}, γ ∈ R_+ (Fernández and Steel, 1998), and the epsilon-skew parameterisation (Mudholkar and Hutson, 2000). Rubio and Steel (2014) study other parameterisations as well as some interpretable choices for the prior of the skewness parameter γ.
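The two-piece construction is straightforward to implement. Below is a Python sketch under the epsilon-skew parameterisation, with the sign convention a(γ) = 1 − γ, b(γ) = 1 + γ assumed for illustration (conventions differ across papers); the function name is ours:

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

def twopiece_pdf(w, mu, sigma, gamma, f=stats.norm.pdf):
    """Two-piece density built from a symmetric unimodal base density f,
    under the (assumed) epsilon-skew parameterisation a(gamma) = 1 - gamma,
    b(gamma) = 1 + gamma, gamma in (-1, 1): a different scale is used on
    each side of the mode mu."""
    a, b = 1.0 - gamma, 1.0 + gamma
    scale = b if w < mu else a          # left piece uses b, right piece uses a
    return 2.0 / (sigma * (a + b)) * f((w - mu) / (sigma * scale))

# The density integrates to one for any gamma; mass b/(a+b) lies below the mode.
left, _ = quad(twopiece_pdf, -np.inf, 0.0, args=(0.0, 1.0, 0.4))
right, _ = quad(twopiece_pdf, 0.0, np.inf, args=(0.0, 1.0, 0.4))
print(round(left, 3), round(left + right, 3))  # 0.7 1.0
```

Setting gamma = 0 recovers the symmetric base density f, matching the a(γ) = b(γ) case above.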
Suppose now that f in (10) is a univariate scale mixture of normals with mixing distribution H_ε, and consider model (1) with the following error structure and prior, where b ≥ 0, π(β) is bounded, and π(γ_ε) and π(δ_ε) are proper priors with γ_ε ∈ Γ.
Condition (e) is automatically satisfied for b = 0 and any parameterisation {a(·), b(·)}. For the epsilon-skew parameterisation, condition (e) is satisfied for any b ≥ 0 and any proper prior π(·) given that the functions a(·) and b(·) are upper bounded. Thus, the introduction of skewness does not destroy the existence of the posterior distribution while allowing for more flexibility.

Simulation study I (Student-t errors, skewed random effects)
In this section we conduct a simulation study, inspired by that presented in Zhang and Davidian (2001), in order to assess the effect of different distributional assumptions on the random effects and the residual errors. We study the model:

According to Zhang and Davidian (2001), r = 100 represents a case where the amount of information available to estimate both the fixed effects and the random effects is modest. Four scenarios are considered for the distribution of the residual errors ε_ij and the random effects u_i. In the first scenario we simulate from ε_ij ∼ t(0, 0.5, 2) and u_i ∼ TPN(−1.5, 0.5, 0.5), where t(0, 0.5, 2) denotes a Student-t distribution with location parameter 0, scale parameter 0.5 and 2 degrees of freedom, and TPN(−1.5, 0.5, 0.5) denotes a two-piece normal with location parameter −1.5, scale parameter 0.5 and skewness parameter 0.5 under the epsilon-skew parameterisation. In the second scenario we use ε_ij ∼ N(0, 0.5) and u_i ∼ TPN(−1.5, 0.5, 0.5). The third scenario consists of ε_ij ∼ t(0, 0.5, 2) and u_i ∼ N(−1.5, 0.5). The fourth scenario uses ε_ij ∼ N(0, 0.5) and u_i ∼ N(−1.5, 0.5). We simulate 100 data sets under these configurations. For each simulated sample, the model (13) was fitted assuming that ε_ij ∼ t(0, σ_ε, δ_ε) and u_i ∼ TPN(μ, σ, γ), with the prior structure:

where π(δ_ε) is the weakly informative prior for the degrees of freedom of a Student-t distribution proposed in Rubio and Steel (2015), π(μ) is a uniform prior on [−100, 100], and π(σ) is a half-Cauchy density with mode 0 and unit scale parameter, which has been shown to induce a posterior with good frequentist properties in the context of Bayesian hierarchical models (Polson and Scott, 2012). Finally, π(γ) is a uniform prior on (−1, 1), which represents a weakly informative prior as described in Rubio and Steel (2014). The prior in (14) is consistent with prior (6) and ensures a proper p(u_i). The propriety of the posterior is then guaranteed by Theorem 1.
For each of the 100 simulated data sets we obtain a posterior sample of size 1,000 using an adaptive Metropolis-within-Gibbs algorithm (Roberts and Rosenthal, 2009) implemented in the 'spBayes' R package (Finley et al., 2007). The posterior samples are obtained after a burn-in period of 7,500 iterations and thinned every 10 iterations in order to reduce correlation (17,500 draws in total). Table 1 presents summary statistics of the posterior samples, the median (over the data sets) Bayes factors in favour of γ = 0 (symmetric random effects), calculated using the Savage-Dickey density ratio, and the average odds p/(1 − p), where p = P(δ_ε > 10), which provides information about the appropriateness of the assumption of normal tails for the residual errors. The Savage-Dickey density ratio (Verdinelli and Wasserman, 1995) is calculated as the ratio of the marginal posterior density and the prior density of the parameter under consideration (γ), both evaluated at the point of interest (γ = 0). In order to evaluate the posterior density, we employ a kernel density estimator (with Gaussian kernel) applied to the posterior samples. The use of the (standard) Savage-Dickey density ratio is valid in our setting, as we use independent proper priors for the parameters that differ across models (see Verdinelli and Wasserman, 1995). Posterior medians are presented instead of posterior means, as the existence of the posterior mean is not guaranteed in all scenarios. From this table, we observe that, despite the relatively small sample size, these Bayesian model selection criteria correctly identify the true model in each scenario. In order to assess the impact of the assumption of normality, we also implement the model with normal residual errors and random effects; Table 2 summarises posterior inference for this model.
The misspecification of the distribution of the residual errors leads to an overestimation of σ ε , which is in line with the fact that this scale parameter has a different interpretation in normal and Student-t cases, while the presence of unmodelled skewness affects the estimation of the location parameter μ. Note also that misspecification dramatically increases the uncertainty about the fixed effects.
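The Savage-Dickey computation used in this study can be sketched in a few lines. The 'posterior' draws below are synthetic stand-ins for MCMC output, and the function name is ours:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

def savage_dickey_bf(post_draws, prior_density_at_0):
    """Savage-Dickey density ratio in favour of gamma = 0: the marginal
    posterior density at 0 (a Gaussian kernel estimate from MCMC draws)
    divided by the prior density at 0.  Valid under independent proper
    priors on the parameters that differ across models."""
    kde = stats.gaussian_kde(post_draws)
    return kde(0.0)[0] / prior_density_at_0

# With the uniform prior on (-1, 1) used in the paper, the prior density
# at 0 is 1/2.  Synthetic posteriors: one concentrated at 0, one away from it.
draws_near_zero = rng.normal(0.0, 0.1, size=1000)
draws_away = rng.normal(0.6, 0.1, size=1000)
print(savage_dickey_bf(draws_near_zero, 0.5) > 1.0)  # evidence for symmetry
print(savage_dickey_bf(draws_away, 0.5) < 1.0)       # evidence against
```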

Simulation study II (Student-t errors and random effects with censoring)
In this section, we present a simulation study with censored responses. We consider a model which is used in practice for modelling the evolution of a marker in longitudinal studies (see e.g. Vaida and Liu, 2009):

where ε_ij and u_i are mutually independent, i = 1, ..., 100, and j = 1, ..., 5. We simulate both the residual errors and the random effects from Student-t distributions, with ε_ij ∼ t(0, 0.25, 2). The true values of the regression parameters are (β_1, β_2, β_3, β_4, β_5) = (2.5, 3.0, 3.5, 4.0, 4.5). We simulate 100 data sets under these configurations and truncate the values that are greater than log_10(75000). This truncation strategy produces samples with 7%-18% censored observations. We then fit four models, which combine normal and Student-t distributions for the residual errors and the random effects; the most general one (Model 4) assumes ε_ij ∼ t(0, σ_ε, δ_ε) and u_i ∼ t(0, σ, δ). For Model 4 we adopt the prior structure:

where π(δ_ε) and π(δ) are the weakly informative priors for the degrees of freedom of a Student-t distribution proposed in Rubio and Steel (2015), and π(σ) is a half-Cauchy density with mode 0 and unit scale parameter. For Models 1-3 we use the obvious reduction of this prior. The propriety of the posterior is then guaranteed by Theorem 2. Tables 3 and 4 (with a star indicating the model used to generate the data) give a summary of the posterior results. Table 3 suggests that using a more complex model than necessary does not affect the inference on the parameters of the simpler model (Model 1), which is encouraging, as we do not really seem to lose anything by allowing for flexibility. We merely find out that the assumed Student-t distributions have large degrees-of-freedom parameters, which makes them close to normal.
From Table 4 we conclude that wrongly assuming normality of the distributions tends to make inference less precise, especially if we get the tails of the residual errors wrong. It also seems that the (more latent) tail parameter of the random effects is harder to estimate than that of the ε ij 's.

Framingham heart study
We now illustrate the performance of the proposed models with real data. We use the data set reported in Zhang and Davidian (2001), which consists of measurements of the cholesterol level for 200 randomly selected individuals from the Framingham heart study. The measurements were taken at the beginning of the study and then every 2 years for 10 years. We are interested in the relationship between the cholesterol level and the age (at baseline) and gender of the patients. Zhang and Davidian (2001) model this relationship through the LMM:

where y_ij represents the cholesterol level divided by 100 at the jth time for subject i, and t_ij is (time − 5)/10, with time measured in years from baseline. Zhang and Davidian (2001) assume normal residual errors ε_ij, and the density of the random effects u_i = (u_1i, u_2i)' is represented by a semi-nonparametric truncated series expansion.
The Bayes factors, calculated using the Savage-Dickey density ratio, slightly favour the general model. However, the evidence in favour of this model is not decisive (see Table 5). The Bayes factors associated with Models 4-6 (against Model 1) are not shown in Table 5 since they are virtually zero, but the Bayes factors associated with other sub-models are presented in this table for illustration. The Bayes factors clearly indicate support for skewness of u_1i and for dependence between the random effects, which leaves us with Models 1-3. An argument of parsimony could then lead to selecting Model 3. Notice that the credible intervals for the general model are more spread out (e.g. for β_4 in Table 6), which indicates that the sample does not contain enough information to accurately estimate all the parameters of this model. The posterior probability of {δ_ε > 10} is approximately 0.005 in Models 1 and 3 and approximately 0.015 in Model 2, which suggests that models with heavier tails than normal are favoured.
To assess the predictive performance of the different models we use the log pseudo-marginal likelihood (LPML), defined as the sum over observations of the log conditional predictive ordinate (CPO) statistic. The CPO for subject i is the predictive density of the vector of observations y_i given the rest of the observations; that is, CPO_i = π(y_i | y \ {y_i}), where the predictive density is evaluated at the vector of observations corresponding to the ith subject (see Bandyopadhyay et al., 2012). These quantities are approximated using a harmonic mean estimator, as described in Bandyopadhyay et al. (2012). Table 6 shows the LPML for the different models, which favours Model 3. A larger sample would likely provide stronger evidence for model selection.
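The harmonic-mean CPO estimator mentioned above can be sketched as follows. For simplicity the CPO is computed per observation rather than per subject vector, and all inputs are synthetic; the function name is ours:

```python
import numpy as np
from scipy import stats
from scipy.special import logsumexp

rng = np.random.default_rng(3)

def lpml(loglik):
    """LPML from an M x n matrix of pointwise log-likelihoods over M
    posterior draws.  CPO_i is the harmonic mean of the likelihood of
    unit i across draws: log CPO_i = log M - logsumexp_m(-loglik[m, i])."""
    M = loglik.shape[0]
    log_cpo = np.log(M) - logsumexp(-loglik, axis=0)
    return log_cpo.sum()

# Sanity check: with a degenerate 'posterior' (identical draws), every CPO
# reduces to the plug-in density, so LPML equals the plug-in log-likelihood.
y = rng.normal(size=20)
ll = np.tile(stats.norm.logpdf(y), (500, 1))             # 500 identical draws
print(np.isclose(lpml(ll), stats.norm.logpdf(y).sum()))  # True
```

The logsumexp form avoids the numerical overflow that a naive average of inverse likelihoods would incur.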
The analysis in Zhang and Davidian (2001) suggests departures from normality in the distribution of random intercepts, while the estimated random slope distribution can be approximated by a normal density. Our conclusions are in line with the conclusions of the parametric, semi-nonparametric and the nonparametric analyses in Zhang and Davidian (2001), Jara et al. (2008), and Jara et al. (2009). However, the models proposed here also allow us to interpret the meaning of the parameters, separating those controlling the shape of the distribution from those controlling the dependence between the random effects.

HIV-1 viral load after unstructured treatment interruption
Here we revisit the data set analysed in Vaida and Liu (2009), which concerns the study of 72 perinatally HIV-infected children and adolescents. Some of the subjects present unstructured treatment interruption, which is a common phenomenon among HIV-infected people, due mainly to treatment fatigue (Vaida and Liu, 2009). The number of observations from baseline (month 0) to month 24 ranges from 13 to 71. Out of 362 observations, 26 (7%) were below the detection limits and were censored at these values. Vaida and Liu (2009) proposed the model:

where y_ij is the log_10 HIV RNA (HIV ribonucleic acid, used to monitor the status of HIV infection) for subject i at time t_j, with t_1 = 0, t_2 = 1, t_3 = 3, t_4 = 6, t_5 = 9, t_6 = 12, t_7 = 18 and t_8 = 24 months.
We allow for Student-t tails in the random effects and the errors (with zero locations), and thus use Model 4 of Subsection 4.2 with the same prior structure as in (15). We also fit the simpler Models 1-3 as defined in Subsection 4.2.
The propriety of the posterior is ensured by Theorem 2. Table 7 reveals clear evidence of fat tails in both distributions, especially for the residual errors. The predictive criterion LPML also favours the general model and particularly penalises models with normal residual errors. Clearly, ignoring the heavy tails affects the inference on the fixed effects, especially for β_1.
Finally, the predictive performance as measured by LPML is substantially better for our Models 4 and 2 than for the best model in Bandyopadhyay et al. (2012), who obtain LPML = −366.27 on the same data using the skew-contaminated normal model of Lachos et al. (2010), which has normal tails for both the random effects and the error distributions. Also, their posterior distribution of β_1 is substantially shifted towards smaller values (consistent with our Models 1 and 3, which impose normal errors).

Discussion
This paper focuses on hierarchical linear mixed models, which are used frequently for both longitudinal and survival modelling in a number of application areas, such as medicine, epidemiology and economics. These models are often based on simple distributional assumptions, such as normality for both the residual errors and the random effects. There is ample evidence in the literature that such assumptions can seriously affect inference on the random effects and the predictive distribution (see e.g. Lee and Thompson, 2008; Zhang and Davidian, 2001), and we add to that evidence in our analysis of simulated data. Misspecification of the distribution of the residual errors and the random effects tends to have a substantial effect on the predictive distribution, which explicitly depends on the aforementioned distributions. The use of normal assumptions when the true generating model is not normal also has a marked effect on the estimation of the variance of the residual errors and the random effects, as well as a decrease in the precision of the estimation of the fixed effects. We consider Bayesian inference for these models with flexible distributions and sensible prior structures. These priors are intended to convey a lack of strong prior information; as they are based on a combination of formal arguments (such as invariance) and pragmatic common sense, they end up being improper. Our results provide a formal justification for Bayesian inference in these wide classes of models: our models accommodate any proper distribution for the random effects (which could even be dependent, if required) combined with two-piece scale mixtures of normal distributions for the residual errors, as well as potential censoring.
Our results thus cover all random effects distributions (both parametric and nonparametric) that have been proposed in the literature, except for those that are dependent on the residual errors (such as those in Lachos et al., 2010, which were used in Bandyopadhyay et al., 2012). The conditions for the propriety of the posterior distribution are rather mild and easy to check in practice, especially compared to those in previous papers (Hobert and Casella, 1996; Sun et al., 2001; Rubio, 2015), which derive sets of conditions that combine the sample size, the number of clusters, the design matrix, and the values of the hyperparameters of the priors on the random effects. This also suggests that most of the posterior impropriety issues come from assuming improper priors on the "deeper" parameters of the random effects.
The analysis of simulated datasets leads to quite promising results, which suggest that the priors are well calibrated and sensible, and that Bayes factors provide reliable indications. In addition, Appendix C in the Supplementary Material presents results for the case where both the residual errors and the random effects display skewness, which corroborate the results presented in the main text.
Inference on both real datasets strongly indicates departures from normality, in that the residual error distributions have much fatter tails than the normal. In addition, the Framingham heart data support dependence between the random effect components, with a normal random slope and a skew-normal random intercept. The HIV data present clear evidence for heavy-tailed random effects. By using simple Student-t distributions on both errors and random effects, we obtain considerably better predictive results on the HIV data than those presented by Bandyopadhyay et al. (2012) using their preferred model. We remind the reader that the distributions considered in Bandyopadhyay et al. (2012) are multivariate scale mixtures of skew-normal distributions, so they might be affected by the kind of issues we describe for multivariate scale mixtures of normals in Section 2. For the real data, evidence based on Bayes factors points in the same direction as the predictive performance measured by LPML, which is reassuring.
Here we have employed the R package 'spBayes' in all of our numerical examples, mainly for reasons of efficiency and simplicity. However, the user can also implement the models proposed here in any other "general purpose" MCMC software such as Stan (gradient-based MCMC), PROC MCMC (from SAS, which implements a random walk Metropolis), or any package which includes a Metropolis-within-Gibbs sampler.
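For concreteness, a generic Metropolis-within-Gibbs sampler of the kind mentioned above can be sketched in a few lines. This is a self-contained illustration in Python (the function name, proposal scales and toy target are our own choices, not the implementation used in the paper):

```python
import numpy as np

def metropolis_within_gibbs(logpost, init, scales, n_iter, rng):
    """Update one coordinate at a time with a Gaussian random-walk
    proposal, accepting with the Metropolis ratio on the joint
    log-posterior."""
    theta = np.array(init, float)
    draws = np.empty((n_iter, theta.size))
    lp = logpost(theta)
    for t in range(n_iter):
        for k in range(theta.size):
            prop = theta.copy()
            prop[k] += scales[k] * rng.standard_normal()
            lp_prop = logpost(prop)
            if np.log(rng.uniform()) < lp_prop - lp:
                theta, lp = prop, lp_prop
        draws[t] = theta
    return draws

# toy target: standard bivariate normal; the posterior mean should be near 0
rng = np.random.default_rng(1)
d = metropolis_within_gibbs(lambda th: -0.5 * np.sum(th**2),
                            [0.0, 0.0], [2.0, 2.0], 5000, rng)
print(abs(d[1000:].mean(axis=0)).max() < 0.2)
```

In practice the log-posterior passed to such a sampler would be the mixed-model likelihood combined with the (possibly improper) prior structure studied in this paper.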

Proof of Remark 1
The marginal density of the data is given by integrating the sampling model with respect to the priors and the mixing distribution of τ. Consider the change of variable σ̃_ε = σ_ε/τ^{1/2}, with corresponding Jacobian |J| = τ^{1/2}. Then the integrand factorises, and we can identify the first factor as the marginal density of the data associated with the model under normal distributional assumptions on ε.
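In symbols, the key step is that the scale-invariant prior π(σ_ε) ∝ 1/σ_ε absorbs the Jacobian of the change of variable. A sketch in the paper's notation (with f_N denoting the normal sampling density and P the mixing distribution of τ; this display is our reconstruction of the argument, not the paper's exact equation):

```latex
\tilde{m}(\mathbf{y})
  = \int_0^\infty \!\! \int_0^\infty
      f_N\!\left(\mathbf{y} \mid \beta, u, \sigma_\varepsilon^2/\tau\right)
      \frac{\mathrm{d}\sigma_\varepsilon}{\sigma_\varepsilon}\,
      \mathrm{d}P(\tau)
  = \int_0^\infty
      \underbrace{\int_0^\infty
        f_N\!\left(\mathbf{y} \mid \beta, u, \tilde\sigma_\varepsilon^2\right)
        \frac{\mathrm{d}\tilde\sigma_\varepsilon}{\tilde\sigma_\varepsilon}}_{\text{normal-model marginal}}\,
      \mathrm{d}P(\tau),
```

using σ̃_ε = σ_ε/τ^{1/2} and dσ_ε/σ_ε = dσ̃_ε/σ̃_ε. Since ∫ dP(τ) = 1, the marginal coincides with that of the normal model.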

Proof of Theorem 1
For practical purposes we will work with σ_ε^2 instead of σ_ε. Denote τ^+ = max_{i,j} τ_{ij} and τ^- = min_{i,j} τ_{ij}. Bounding the mixing parameters between τ^- and τ^+, we can obtain a lower bound for the marginal density of the data. Consider then a change of variable in σ_ε^2; using the resulting expression together with Theorem 1 of Fernández et al. (1997), it follows that (a) is a necessary condition for the propriety of the posterior. Now, define T = diag(τ_1, . . . , τ_n) (the diagonal matrix of mixing variables). Then, by using that π(β) ≤ K, we can obtain an upper bound for (18). Following the proof of Theorem 1 in Fernández et al. (1997) and integrating β out, we obtain an upper bound in which K_1 > 0 is a known constant. Note that det(X̃'X̃) = det(X'TX), where X̃ = T^{1/2}X. From Lemma 1 of Fernández and Steel (2000) we know that det(X'TX) has upper and lower bounds that are both proportional to τ_{m_1} · · · τ_{m_p} = max{τ_{s_1} · · · τ_{s_p} : 1 ≤ s_1 < · · · < s_p ≤ n and det(x_{s_1}, . . . , x_{s_p}) ≠ 0}, where det(x_{s_1}, . . . , x_{s_p}) denotes the determinant of the submatrix of X corresponding to the observations y_{s_1}, . . . , y_{s_p}. Define also τ_c = max{τ_i : i ∉ {m_1, . . . , m_p}}. Using these results we obtain a further inequality, where K_2 > 0 is a constant. Now, by defining the n × (p + 1) matrix L = (X : y − Zu) and subsequently applying the Binet–Cauchy theorem (Fernández and Steel, 2000), we obtain a lower bound in which the minimum is taken over the non-zero determinants; given that y − Zu is not in the column space of X, we have that c(Z, ỹ) > 0. Now, note that P(u), being a quadratic polynomial in q variables, is a continuous coercive function (see Definition 2.4 in Güler, 2010). Then, by Corollary 2.5 in Güler (2010), it follows that P(u) achieves a global minimum. Given that P(u) > 0 by assumption, it follows that there exists M > 0 such that P(u) ≥ M for all u. From this lower bound it follows that the integrand is bounded above by an expression involving a constant K_3 > 0.
Using this inequality in (19) and splitting the integral into the possible orderings of {τ_1, . . . , τ_n}, we can identify τ_c, and the result follows.
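The determinant bound invoked in the proof can be checked numerically: by the Binet–Cauchy theorem, det(X'TX) expands as a sum over p-subsets of rows, which is what makes both bounds proportional to the largest product τ_{s_1} · · · τ_{s_p}. An illustrative sketch with random matrices (our own toy example, not the paper's data):

```python
import numpy as np
from itertools import combinations

# Binet-Cauchy: det(X' T X) = sum over p-subsets s of rows of
# det(X_s)^2 * prod(tau_s), since X'TX = (T^{1/2}X)'(T^{1/2}X).
rng = np.random.default_rng(2)
n, p = 6, 3
X = rng.normal(size=(n, p))
tau = rng.uniform(0.5, 2.0, size=n)
T = np.diag(tau)

lhs = np.linalg.det(X.T @ T @ X)
rhs = sum(np.linalg.det(X[list(s)])**2 * np.prod(tau[list(s)])
          for s in combinations(range(n), p))
print(np.isclose(lhs, rhs))  # True
```

Since every term in the sum is non-negative, det(X'TX) is bounded below by its largest term and above by the number of subsets times that term, giving the two proportional bounds used above.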

Proof of Theorem 2
The result follows by noting that the censored observations contribute factors in [0, 1] to the likelihood. We can therefore obtain an upper bound for the marginal likelihood of the data which corresponds, up to a proportionality constant, to the marginal likelihood associated with the uncensored observations. The result then follows from Theorem 1.
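The bounding step is elementary: a censored observation contributes a probability (a CDF or survival value), hence a factor at most 1. A toy numerical illustration under a standard normal model (values and censoring point are our own):

```python
import numpy as np
from scipy.stats import norm

# A right-censored observation contributes P(Y > c) <= 1 to the
# likelihood, so the full likelihood is bounded above by the
# likelihood of the uncensored observations alone.
y_obs = np.array([0.3, -1.1])        # uncensored responses
c = 1.5                               # right-censoring point
lik_uncens = norm.pdf(y_obs).prod()
lik_full = lik_uncens * norm.sf(c)    # survival factor in [0, 1]
print(lik_full <= lik_uncens)  # True
```

Integrating this pointwise bound over the parameters gives the stated bound on the marginal likelihood.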

Proof of Theorem 3
By condition (d), inequality (20) is satisfied for each fixed value of y in E. Thus, the marginal likelihood of the data associated with each such y is finite (Theorem 1). Given that E is compact, we can obtain a finite upper bound for the marginal likelihood of the data by integrating y over E, and the result follows.

Proof of Theorem 4
By using the stated inequality and the change of variable σ̃_ε = σ_ε M(γ_ε), we can obtain an upper bound for the marginal likelihood of the data which is the product of the marginal likelihood of model (1), (4), (8) with prior (6) and the expression in condition (e). The result then follows using the proof of Theorem 1.

Appendix C: Simulation study III (skewed errors and random effects)
In this section we study model (13). Four simulation scenarios are considered for the distribution of the residual errors ε_ij and the random effects u_i. In the first scenario we simulate from ε_ij ∼ N(0, 0.5) and u_i ∼ N(−1.5, 0.5). In the second scenario we use ε_ij ∼ TPN(0, 0.5, 0.5) and u_i ∼ N(−1.5, 0.5). The third scenario consists of ε_ij ∼ N(0, 0.5) and u_i ∼ TPN(−1.5, 0.5, 0.5). The fourth scenario uses ε_ij ∼ TPN(0, 0.5, 0.5) and u_i ∼ TPN(−1.5, 0.5, 0.5). We simulate 100 data sets under each of these configurations. For each of these simulated samples, model (13) was fitted assuming ε_ij ∼ TPN(0, σ_ε, γ_ε) and u_i ∼ TPN(μ, σ, γ) with the prior structure π(β_1, β_2, σ_ε, γ_ε, μ, σ, γ) ∝ π(γ_ε)π(μ)π(σ)π(γ)/σ_ε, where π(μ) is a uniform prior on [−100, 100], π(σ) is a half-Cauchy density with mode 0 and unit scale parameter, and π(γ) and π(γ_ε) are uniform priors on (−1, 1). The propriety of the posterior is then guaranteed by Theorem 1. For each of the 100 simulated datasets we obtain a posterior sample of size 1,000 after a burn-in period of 7,500 iterations, thinning every 10 iterations (a total of 17,500 MCMC draws). Table 8 summarizes the posterior samples and presents the median Bayes factors in favour of γ_ε = 0 or γ = 0 (symmetric errors or random effects), calculated using the Savage–Dickey density ratio. From this table we can observe that, despite the relatively small sample size, these Bayesian model selection criteria correctly identify the true model in each scenario. In order to assess the impact of incorrectly imposing normality, we also implement the model with normal residual errors and random effects; Table 9 presents posterior results for this model. The misspecification of the distribution of the residual errors and the random effects clearly affects the estimation of μ (which can be interpreted as an intercept).
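To illustrate the TPN(μ, σ, γ) generating distributions above, one convenient two-piece parameterisation (our assumption for this sketch; the paper's exact parameterisation may differ) uses scales σ(1 + γ) and σ(1 − γ) on either side of the mode, with γ ∈ (−1, 1) and γ = 0 recovering the normal:

```python
import numpy as np

def rtpn(n, mu, sigma, gamma, rng):
    """Sample a two-piece normal: scale sigma*(1+gamma) left of mu and
    sigma*(1-gamma) right of mu (illustrative parameterisation)."""
    s1, s2 = sigma * (1 + gamma), sigma * (1 - gamma)
    # each side is chosen with probability proportional to its scale,
    # matching the two-piece density 2/(s1+s2) * phi((x-mu)/s_side)
    left = rng.uniform(size=n) < s1 / (s1 + s2)
    z = np.abs(rng.standard_normal(n))
    return np.where(left, mu - s1 * z, mu + s2 * z)

rng = np.random.default_rng(3)
x = rtpn(200_000, mu=-1.5, sigma=0.5, gamma=0.5, rng=rng)
# closed-form mean under this parameterisation:
# mu + sqrt(2/pi) * (s2 - s1) = mu - 2*sigma*gamma*sqrt(2/pi)
target = -1.5 - 2 * 0.5 * 0.5 * np.sqrt(2 / np.pi)
print(abs(x.mean() - target) < 0.01)  # True
```

Positive γ here produces a heavier left piece (negative skew), which is the kind of asymmetry the fitted TPN errors and random effects are designed to capture.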