Just Another Gibbs Additive Modeler: Interfacing JAGS and mgcv

The BUGS language oﬀers a very ﬂexible way of specifying complex statistical models for the purposes of Gibbs sampling, while its JAGS variant oﬀers very convenient R inte-gration via the rjags package. However, including smoothers in JAGS models can involve some quite tedious coding, especially for multivariate or adaptive smoothers. Further, if an additive smooth structure is required then some care is needed, in order to centre smooths appropriately, and to ﬁnd appropriate starting values. R package mgcv imple-ments a wide range of smoothers, all in a manner appropriate for inclusion in JAGS code, and automates centring and other smooth setup tasks. The purpose of this note is to describe an interface between mgcv and JAGS , based around an R function, jagam , which takes a generalized additive model (GAM) as speciﬁed in mgcv and automatically gen-erates the JAGS model code and data required for inference about the model via Gibbs sampling. Although the auto-generated JAGS code can be run as is, the expectation is that the user would wish to modify it in order to add complex stochastic model components readily speciﬁed in JAGS . A simple interface is also provided for visualisation and further inference about the estimated smooth components using standard mgcv function-ality. The methods described here will be un-necessarily ineﬃcient if all that is required is fully Bayesian inference about a standard GAM, rather than the full ﬂexibility of JAGS . In that case the BayesX package would be more eﬃcient.


Introduction
This paper is about automatically and reliably generating JAGS (Plummer 2003) model specification code and data implementing any generalized additive model (GAM, Hastie and Tibshirani 1990) that can be specified in the R (R Core Team 2016) package mgcv (Wood 2006(Wood , 2016. The purpose of this is to allow models with the complex smooth structure permitted  Figure 1: Some of the rich variety of smooths available in the mgcv package. From top right: A simple one-dimensional adaptive smooth; multidimensional thin-plate splines; multidimensional tensor product smooths; Gaussian Markov random fields; soap film finite area smoothers; splines on the sphere. by mgcv (exemplified by Figure 1) combined with the complex random structure permitted by JAGS to be produced more easily than has hitherto been the case. As the paper's title makes clear, there is nothing new about using Markov chain Monte Carlo (MCMC) in general, or Gibbs sampling in particular, for smooth modeling. The paper's purpose is simply to make this easier and more automatic and hence less susceptible to implementation error, and to document the methods used to achieve this.
In principle, the JAGS package and language allows Bayesian inference about a very wide range of models that can be written as directed acyclic graphs (DAG). This class includes GAMs as one special case. The Bayesian view of spline smoothing and additive models is almost as old as splines and additive models themselves (Kimeldorf and Wahba 1970;Wahba 1983;Silverman 1985;Hastie and Tibshirani 2000;Fahrmeir and Lang 2001), and several authors have exploited this to use JAGS or BUGS (Spiegelhalter, Thomas, Best, and Gilks 1995) for generalized additive modeling, notably Crainiceanu, Ruppert, and Wand (2005) based on Ruppert, Wand, and Carroll (2003) and Zuur, Saveliev, and Ieno (2014).
In principle the mgcv package already included all the code required to set up smoothers for use with JAGS. This is because what is required is essentially the same as what is required to use any standard mixed modeling software for GAM inference: for example mgcv function gamm based on the appendix of Wood (2004) uses the nlme package (Pinheiro, Bates, DebRoy, Sarkar, and R Core Team 2016) in this way. However, a considerable degree of user expertise is required to implement this reliably in practice.
A particular area where difficulty can arise is in the use of centring constraints on model smooth components. Usually additive smooth model structures only make statistical sense if such constraints are applied (see e.g., Hastie and Tibshirani 1990), otherwise there is a global intercept associated with each smooth. However the JAGS requirement for all priors to be proper, means that failing to implement such constraints will not cause complete failure of Gibbs sampling. Instead one may see very wide credible intervals and poor mixing, but not realize that this is a model formulation problem rather than a statistical inevitability.

The jagam function
The new mgcv function jagam is designed to be called in the same way that the modeling function gam would be called. That is, a model formula and family object specify the required model structure, while the required data are supplied in a data frame or list or on the search path. However, unlike gam, jagam does no model fitting. Rather it writes JAGS code to specify the model as a Bayesian graphical model for simulation with JAGS, and produces a list containing the data objects referred to in the JAGS code, suitable for passing to JAGS via the rjags (Plummer 2016) function jags.model.
A simple model, with two univariate smooths and one tensor product smooth, exemplifies the approach. Suppose that we have a data frame, dat, containing the response and predictor variables, have loaded the mgcv package and have used setwd to set the working directory to something appropriate. The code R> jd <-jagam(y~s(x0) + te(x1, x2) + s(x3), data = dat, + family = Gamma(link = log), file = "test.jags") would specify a simple log gamma additive model structure, where f 2 is a scale invariant tensor product smoother, appropriate for representing smooth interaction terms. jagam returns a list containing standard mgcv GAM setup information (pregam) and a list, jags.data, containing the objects required by JAGS for model simulation. The function also writes a JAGS model specification in the file test.jags, as follows. In fact code comments are also auto-generated, but have been removed from the above in accordance with journal requirements. They are designed to make it easy to locate the code relating to particular model components, and to draw attention to parts that the user might wish to modify.
In normal use the file would be edited to include the more complex stochastic components likely to have been the motivation for taking a Gibbs sampling approach. It can of course be used un-modified to simply simulate from the posterior of the model parameters, as in the following example code.
If all is in order, then many users would want to use the simulation output directly, but the utility function sim2gam can also be used to convert the simulation output into a reduced version of a fitted gam object, suitable for further use with standard mgcv functions. For example R> jam <-sim2jam(sam, jd$pregam) R> par(mfrow = c(1,3)) R> plot(jam)

Smoothers in JAGS
In mgcv, smooth functions are represented using spline like basis expansions, with quadratic penalties on the basis coefficients being used to avoid overfit. Generically a function f (x) may be represented as where the β j are unknown model parameters/coefficients, and the b j (x) are spline like basis functions. K is chosen to be large enough to avoid oversmoothing, but small enough to avoid excessive computational cost. The fitted flexibility of f is controlled less by K than by the imposition, during fitting, of a quadratic smoothing penalty of the form where the S j are matrices of known coefficients and the λ j are smoothing parameters to be estimated. Often there is only a single component in the penalty, but the general summation form is necessary in order to implement adaptive and tensor product smooths, for example, and will also be used below to ensure the propriety of priors required for Gibbs sampling with JAGS. See Chapter 4 of Wood (2006) for more detail. Kimeldorf and Wahba (1970), Wahba (1983) and Silverman (1985) provide the basis for viewing such smoothers in a Bayesian way, with the penalties induced by improper multivariate Gaussian priors, having precision matrices proportional to j λ j S j . Adopting this viewpoint it is obvious that we can make inferences about smooths using Bayesian methods.
For practical Gibbs sampling in JAGS there are two cases to distinguish: 1. Those in which the prior precision matrix can be represented as a weighted sum of matrices that are all zero, apart from some unit entries on the leading diagonal, where no component matrices of the sum have unit entries in the same place.
2. Those in which the precision matrix can not be written in the above form.
Case 1 results in i.i.d. Gaussian priors on separate subsets of the parameters. For single smoothing parameter smooths it is possible to re-parameterize to achieve this case. Case 2 results in non-independent multivariate normal priors. In both cases the prior implied by the smoothing penalty is usually improper, as the penalty usually leaves some subspace of functions unpenalized. For Gibbs sampling with JAGS we require proper priors, but this is easily arranged.

Independent Gaussian prior smooths
Some smooths, such as the tensor product smoothers constructed by Wood, Scheipl, and Faraway (2013) or the truncated power basis P-splines advocated by Ruppert et al. (2003), automatically have penalties in which the S j are identity matrices with some unit entries set to zero (and no unit entries in common between different S j ). Let β j denote the set of coefficients for which the corresponding diagonal elements of S j are 1 rather than zero. Then the prior on each element of β j is N (0, 1/λ j ) and the elements are independent a priori. A vague prior would typically be placed on λ j . If β 0 denotes the coefficients that are unpenalized, we can impose a prior N (0, 1/λ 0 ) on these, where λ 0 is so small as to force the prior to be very vague, or λ 0 itself has a vague prior.
Any smooth that only has a single S j and single λ j can be re-parameterized to have a partial identity matrix prior precision matrix following Wood (2004). Dropping the index j, we find the symmetric eigen-decomposition S = UΛU . Suppose that the final M eigenvalues on the leading diagonal of Λ are zero (the remainder being positive). Define D to be the diagonal matrix with diagonal elements consisting of the square roots of the positive eigenvalues from Λ, followed by M unit entries. If we now adopt the re-parameterization β = DU β then the penalty matrix S becomes the identity matrix, with the last M leading diagonal elements set to zero. The corresponding model matrix is then XUD −1 . The last M elements of β are now un-penalized, and, as above, this can be handled by imposing independent N (0, 1/λ 0 ) priors on these. The same result could be achieved somewhat more efficiently with a pivoted Cholesky decomposition. mgcv contains functions to perform this reparameterization automatically. Note that the preceding re-parameterization is slightly different to that employed in Crainiceanu et al. (2005), which starts from an indefinite S, so that the reparameterization step also involves an element of approximation.

General Gaussian prior smooths
For a Gaussian likelihood, independent prior smooths can result in quite fast computation, because JAGS is then able to employ conjugate samplers. Similarly in generalized linear model settings, the samplers from the JAGS glm module can also lead to efficient computation. However outside these settings the independent prior approach is slow and block updates are preferable. In any case there are several important smoother classes that are not susceptible to writing in independent prior form, notably adaptive smooths and several types of tensor product smooth.
In fact implementing any quadratically penalized smoother in JAGS is straightforward, using dmnorm, the JAGS multivariate normal density. dmnorm is parameterized in terms of a precision matrix, for which j λ j S j can be used directly.
The only potential difficulty is that j λ j S j itself is usually rank deficient, implying an improper prior for the smooth. Again we must construct a prior for the null space of the smoothing penalty, but again it is possible to re-use existing mgcv facilities. Specifically, in the context of model selection, Marra and Wood (2011) propose a simple construction of a penalty on the null space related to the re-parameterization used in the previous section. Again use a symmetric eigen-decomposition j S j = UΛU . Now let U 0 denote the columns of U (eigenvectors) corresponding to zero eigenvalues. Let S 0 = U 0 U 0 . λ 0 β S 0 β can be used to penalize the null space of the smoother by adding λ 0 S 0 to the precision matrix, hence making the prior on β proper. mgcv can generate such null space penalties automatically.

Smoothing parameter priors
jagam automates two possibilities for smoothing parameter priors: vague gamma priors on the λ j , or bounded uniform priors on ρ j = log λ j . The former will be conjugate in a fully Gaussian setting, but the latter may be considered more interpretable for the user used to thinking about log smoothing parameters.

Centring the smoothers
As constructed so far, each smooth in an additive model would include its own global intercept. The data provides no information to identify these multiple intercepts, so they are only formally identifiable because of the priors put on them, which are vague priors of convenience. This lack of statistically meaningful identifiability will serve to substantially inflate credible intervals and promote slow mixing, so it is preferable to remove the redundant intercepts from the model. In an additive model context this is usually done by centring the smooths Tibshirani 1986, 1990;Chambers and Hastie 1991;Wood 2006). That is we impose the condition that each smooth should sum to zero over the observed values its covariates. i.e., n i=1 f (x i ) = 0. Other constraints are possible, but generally give wider credible intervals for the constrained smooths (see Section 4 of Wood et al. 2013, for a discussion). mgcv has facilities to simply absorb centring constraints into the basis by reparameterization, as described in section 4.2 of Wood (2006). This absorption is done before any reparameterization or construction of priors on the null space.

Initial values
In the Gaussian likelihood case, with gamma priors for the smoothing parameters, the initial values of the model coefficients and smoothing parameters are rather unimportant. In this situation conjugate samplers are used and, although poor starting values may prolong burn-in, eventually good results will be obtained.
Beyond the simple Gaussian context, more care is needed, since poor starting values can lead to poor tuning of non-conjugate samplers, and a consequent failure to properly explore the region of high posterior probability (along then with high sensitivity to the parameters of the smoothing parameter priors). jagam adopts the mgcv default smoothing parameter initializations and then performs one step of the penalized iteratively re-weighted least squares method for GAM fitting, in order to obtain starting values for the coefficients which are compatible with the initial smoothing parameters. The initial coefficients and corresponding standard errors are also used to set the scale of any required uninformative priors on the model coefficients: the prior standard deviation is set to 10 times the sum of the absolute value of the initial coefficient estimate and its standard error.

Further inference
Having setup a GAM for use in JAGS and simulated from it, the user will typically want to visualize the smooths and predict from them. In addition some notion of the effective degrees of freedom of the smooth is useful.
An obvious way to visualize the smooths is to draw curves from the posterior, and either compute appropriate pointwise quantiles in order to produce credible intervals, or to simply plot the curves. Examples are given below. Alternatively, smooths may be plotted with 'two standard error bands', in the manner introduced in Hastie and Tibshirani (1990). To this end it is only necessary to compute the mean coefficients from the simulation, to use in place of coefficient estimates,β, and to compute the observed covariance matrix of the simulated coefficients, V β , from which the standard error bands are readily computed. In factβ and V β also complete the preliminary gam object produced by jagam sufficiently for prediction using predict.gam.
Finally some notion of the effective degrees of freedom of the model and its component smooths is useful. In a simple Gaussian additive model context a measure of the model effective degrees of freedom is tr(F) where F = (X X + j λ j S j ) −1 X X. In the generalized additive model context F = (X WX + j λ j S j ) −1 X WX, where W is the diagonal matrix of iteratively re-weighted least squares weights used in fitting. In the presence of random effects it is better to use F = V β X WX/φ, in which W is the diagonal IRLS weight matrix with the random effects set to their posterior expectations and φ is the scale parameter or its estimate. When V β is computed by simulation then this latter definition has the advantage of including a component for smoothing parameter uncertainty, however to compute it in practice requires that the expected value of the response, mu, be monitored during simulation. See chapter 4 of Wood (2006) for further discussion.
Given F, then the effective degrees of freedom of component smooths are obtained by summing the leading diagonal elements of F corresponding to the coefficients of the smooth concerned. Notice that under substantial modification of the jagam template model (involving modification of the response distribution, for example), the weighted versions of F may make no sense. It may then be better to fall back on the effective degrees of freedom that would have been computed if the model were a simple Gaussian additive model, or to use the estimate proposed by Plummer (2002).

Examples
As two simple examples consider the union wages example and the Sitka growth example from Crainiceanu et al. (2005). Both datasets are available in the SemiPar R package (Wand 2014). Loading the JAGS glm module, via load.module("glm") improves the efficiency of both examples in this section.

The union wages data
The data frame trade.union contains a binary indicator of whether or not a worker is a trade union member, along with their hourly wage in US dollars. Consider the simple logistic regression model where smooth function f is represented by a rank 20 thin plate regression spline. A jagam call sets the model up R> jd <-jagam(union.member~s(wage, k = 20), data = trade.union, + family = binomial, file = "union.jags") resulting in the following JAGS model specification file (again the auto-generated comments have been removed to comply with journal requirements).  The following commands then compile and simulate from the model. R> library("rjags") R> load.module("glm") R> jm <-jags.model("union.jags", data = jd$jags.data, + inits = jd$jags.ini, n.chains = 1) R> sam <-jags.samples(jm, c("b", "rho", "mu"), n.iter = 10000, thin = 10) On a 3GHz mid range laptop computer, simulation took 18 seconds, yielding effective sample sizes averaging around 400 for rho and 800 for b and mu, for the 1000 samples stored. Crainiceanu et al. (2005) report around 9 minutes for this model (albeit with a slightly different smoothing penalty) for the same simulation length, on a 3.6GHz PC, although they do not report effective sample sizes, so the comparison is not completely straightforward. Note that failing to supply starting values greatly increases the adaptation and burn in time required to achieve reliable results for this example.
Finally a partial gam object can be created for convenient plotting and prediction. The following then produces a plot of the modeled probability of union membership against wages with a credible interval, along with a visualization of the union membership data. The interval is wide at high wages, failing to provide a very useful indication of the range of smooth shapes compatible with the data, so the following code also adds a sample of 20 curves from the posterior. And now add 20 smooth curves drawn from the posterior, to examine variability in the smooth shape.

The Sitka growth data
This example illustrates the modification of an auto-generated JAGS model file to implement random effects. The 'sitka' data contain repeated measurements over time of log size for Sitka spruce saplings grown under conditions of enhanced ozone, or control conditions. A simple model has a smooth effect for time, a random intercept for each tree and an ozone effect, , and j(i) is the index of the tree from which the i th measurement is taken. Everything is Gaussian, so a fully conjugate setup can be employed, and it is worth diagonalizing the smoothing penalties.
R> jd <-jagam(log.size~s(days) + ozone, data = sitka, + file = "sitka0.jags", diagonalize = TRUE) creates a default JAGS model file which can be modified to include the random effect as follows, where the gray italic code has been added to the non-italic auto-generated JAGS code (and again to meet journal requirements the auto-generated comments have been removed).  The modification involves adding a tree specific random effect, d, to the linear predictor. Notice the need to add id, the vector attributing measurements to trees, and nd, the number of trees, to the JAGS data. The following code compiles and simulates from the model, produces a default plot of the smooth effect of time (given the sum to zero identifiability constraint), a histogram of draws from the posterior for β and illustration of 25 curves, α + f (days), drawn from the posterior.

Conclusion
The JAGS software offers enormous flexibility in the specification of complex random effects structures. Incorporating spline type smoothers into such models is routine, but somewhat tedious to code on a case by case basis, as well as being prone to error, especially for smooths of several variables. The jagam function offers a useful automation of the process of incorporating any smooth built into mgcv into a JAGS model, while dealing seamlessly with initialization and centring constraints and allowing straightforward posterior prediction.
The main disadvantage of the approach is computational speed. Gibbs sampling for these models can be slow, especially if covariates are correlated. Indeed if only simple random effects are required then the random effects already available in mgcv may be much faster computationally. Similarly BayesX (see e.g., Fahrmeir and Lang 2001;Fahrmeir, Kneib, and Lang 2004;Umlauf, Adler, Kneib, Lang, and Zeileis 2015) is a substantially more efficient route to fully Bayesian inference with GAMs if the flexibility of JAGS is not required, while Stan (Carpenter, Gelman, Hoffman, Lee, Goodrich, Betancourt, Brubaker, Guo, Li, and Riddell 2017) offers another alternative likely to offer efficiency advantages. In these correlated settings it is also likely that Hamiltonian Monte Carlo methods (e.g., Girolami and Calderhead 2011) would enhance efficiency.