Locally adaptive smoothing with Markov random fields and shrinkage priors

We present a locally adaptive nonparametric curve fitting method that operates within a fully Bayesian framework. This method uses shrinkage priors to induce sparsity in order-k differences in the latent trend function, providing a combination of local adaptation and global control. Using a scale mixture of normals representation of shrinkage priors, we make explicit connections between our method and kth order Gaussian Markov random field smoothing. We call the resulting processes shrinkage prior Markov random fields (SPMRFs). We use Hamiltonian Monte Carlo to approximate the posterior distribution of model parameters because this method provides superior performance in the presence of the high dimensionality and strong parameter correlations exhibited by our models. We compare the performance of three prior formulations using simulated data and find the horseshoe prior provides the best compromise between bias and precision. We apply SPMRF models to two benchmark data examples frequently used to test nonparametric methods. We find that this method is flexible enough to accommodate a variety of data generating models and offers the adaptive properties and computational tractability to make it a useful addition to the Bayesian nonparametric toolbox.


Introduction
Nonparametric curve fitting methods find extensive use in many aspects of statistical modeling such as nonparametric regression, spatial statistics, and survival models, to name a few. Although these methods form a mature area of statistics, many computational and statistical challenges remain when such curve fitting needs to be incorporated into multi-level Bayesian models with complex data generating processes. This work is motivated by the need for a curve fitting method that could adapt to local changes in smoothness of a function, including abrupt changes or jumps, and would not be restricted by the nature of observations and/or their associated likelihood. Our desired method should offer measures of uncertainty for use in inference, should be relatively simple to implement and computationally efficient. There are many methods available for nonparametric curve fitting, but few which meet all of these criteria.
Gaussian process (GP) regression (Neal, 1998;Rasmussen and Williams, 2006) is a popular Bayesian nonparametric approach for functional estimation that places a GP prior on the function of interest. The covariance function must be specified for the GP prior, and the isotropic covariance functions typically used are not locally adaptive. Nonstationary covariance functions have been investigated to make GP regression locally adaptive (Brahim-Belhouari and Bermak, 2004;Schervish, 2004, 2006). Any finite dimensional representation of GPs involves manipulations of, typically high dimensional, Gaussian vectors with mean vector and covariance matrix induced by the GP. Many GPs, including the ones with nonstationary covariance functions, suffer from high computational cost imposed by manipulations (e.g., Cholesky factorization) of the dense covariance matrix in the finite dimensional representation.
Sparsity can be imposed in the precision matrix (inverse covariance matrix) by constraining a finite dimensional representation of a GP to be a Gaussian Markov random field (GMRF), and then computational methods for sparse matrices can be employed to speed computations (Rue, 2001;Rue and Held, 2005). Fitting smooth functions with GMRFs has been practiced widely. These methods use difference equations as approximations to continuous function derivatives to induce smoothing, and have a direct relationship to smoothing splines (Speckman and Sun, 2003). GMRFs have also been used to develop Bayesian adaptive smoothing splines (Lang et al., 2002;Yue et al., 2012Yue et al., , 2014. A similar approach is the nested GP (Zhu and Dunson, 2013), which puts a GP prior on the order-k function derivative, which is in turn centered on another GP. This approach has good adaptive properties but has not been developed for non-Gaussian data.
Differencing has commonly been used as an approach to smoothing and trend estimation in time series analysis, signal processing, and spatial statistics. Its origins go back at least to Whittaker (1922), who suggested a need for a trade off between fidelity to the data and smoothness of the estimated function. This idea is the basis of some frequentist curve-fitting methods based on penalized least squares, such as the smoothing spline (Reinsch, 1967;Wahba, 1975) and the trend filter (Kim et al., 2009;Tibshirani, 2014). These penalized least-squares methods are closely related to regularization methods for high-dimensional regression such as ridge regression (Hoerl and Kennard, 1970) and the lasso (Tibshirani, 1996) due to the form of the penalties imposed.
Bayesian versions of methods like the lasso (Park and Casella, 2008) utilize shrinkage priors in place of penalties. Therefore, it is interesting to investigate how these shrinkage priors Griffin et al., 2013;Bhattacharya et al., 2015) perform when applied to differencing-based time series smoothing. Although shrinkage priors have been used explicitly in the Bayesian nonparametric regression setting for regularization of wavelet coefficients (Abramovich et al., 1998;Johnstone and Silverman, 2005;Reményi and Vidakovic, 2015) and for shrinkage of order-k differences of basis spline coefficients in adaptive Bayesian P-splines (Scheipl and Kneib, 2009), a Bayesian version of the trend filter and Markov random field (MRF) smoothing with shrinkage priors has not been thoroughly investigated. To our knowledge, only Roualdes (2015), independently from our work, looked at Laplace prior-based Bayesian version of the trend filter in the context of a normal response model. In this paper, we conduct a thorough investigation of smoothing with shrinkage priors applied to MRFs for Gaussian and non-Gaussian data. We call the resulting models shrinkage prior Markov random fields (SPMRFs).
We borrow the idea of shrinkage priors from the sparse regression setting and apply it to the problem of function estimation. We take the perspective that nonparametric curve fitting is essentially a regularization problem where estimation of an unknown function can be achieved by inducing sparsity in its order-k derivatives. We propose a few fully Bayesian variations of the trend filter (Kim et al., 2009;Tibshirani, 2014) which utilize shrinkage priors on the kth-order differences in values of the unknown target function. The shrinkage imposed by the priors induces a locally adaptive smoothing of the trend. The fully Bayesian implementation allows representation of parameter uncertainty through posterior distributions and eliminates the need to specify a single global smoothing parameter by placing a prior distribution on the smoothing parameter, although complete automation is not possible so we offer ways to parameterize the global smoothing prior. In Section 2 we provide a derivation of the models starting from penalized frequentist methods and we show the relationship to GMRF models. In Section 2 we also describe our method of sampling from the posterior distribution of the parameters using Hamiltonian Monte Carlo (HMC), which is efficient and straight forward to implement. In Section 3 we use simulations to investigate performance properties of the SPMRF models under two different prior formulations and we compare results to those for a GMRF with constant precision. We show that the choice of shrinkage prior will affect the smoothness and local adaptive properties. In Section 4 we apply the method to two example data sets which are well known in the nonparametric regression setting.

Preliminaries
We start by reviewing a locally adaptive penalized least squares approach to nonparametric regression known as the trend filter (Kim et al., 2009;Tibshirani and Taylor, 2011;Tibshirani, 2014) and use that as a basis to motivate a general Bayesian approach that utilizes shrinkage priors in place of roughness penalties. We first consider the standard nonparametric regression problem to estimate the unknown function f . We let θ represent a vector of values of f on a discrete uniform grid t ∈ {1, 2, . . . , n}, and we assume y = θ + , where ∼N(0, Iσ 2 ), and y and are vectors of length n. Here all vectors are column vectors. Following Tibshirani (2014) with slight modification, the least squares estimator of the kth order trend filtering estimateθ iŝ where · q represents the L q vector norm, and D (k) is an (n − k) × n forward difference operator matrix of order k, such that the ith element of the vector ∆ k θ = D (k) θ is the forward difference ∆ k θ i = (−1) k k j=0 (−1) j k j θ i+ j . Note that D (k) has recursive properties such that D (k) n = D (1) n−k+1 D (k−1) n , where D (h) m has dimensions (m − h) × m. The objective function in equation (1) balances the trade-off between minimizing the squared deviations from the data (the first term in the sum on the right) with minimizing the discretized roughness penalty of the function f (the second term in the sum on the right). The smoothing parameter λ ≥ 0 controls the relative influence of the roughness penalty. Setting λ to 0 we get least squares estimation. As λ gets large, the roughness penalty dominates, resulting in a function with k-th order differences approaching 0 for all t. The trend filter produces a piecewise polynomial function of t 1 , . . . , t n with degree k − 1 as an estimator of the unknown function f . Increasing the order of the difference operator will enforce a smoother function.
The L 1 penalty in equation (1) results in the trend filter having locally adaptive smoothing properties. Tibshirani (2014) shows that the trend filter is very similar in form and performance to smoothing splines and locally adaptive regression splines, but the trend filter has a finer level of local adaptivity than smoothing splines. A main difference between the trend filter and smoothing splines is that the latter uses a squared L 2 penalty, which is the same penalty used in ridge regression (Hoerl and Kennard, 1970). Note that the L 1 penalty used by the trend filter is also used by the lasso regression (Tibshirani, 1996), and the trend filter is a form of generalized lasso (Tibshirani and Taylor, 2011;Tibshirani, 2014). In the linear regression setting with regression coefficients β j s, the L 1 and L 2 penalties can be represented by the generalized ridge penalty λ j |β j | q (Frank and Friedman, 1993), where q = 2 corresponds to the ridge regression penalty, q = 1 to the lasso penalty, and sending q to zero results in all subsets selection regression (Tibshirani, 2011). Based on what we know about lasso regression, subset selection regression, and ridge regression, we expect a penalty closer to subset selection to do better for fitting functions with a small number of large jumps, a trend filter penalty (L 1 ) to do better for fitting functions with small to moderate deviations from polynomials of degree k − 1, and a smoothing spline (squared L 2 ) penalty to do better for smooth polynomial-like functions with no jumps. This distinction will become important later when we assess the performance of different Bayesian formulations of the trend filter.
One can translate the penalized least squares formulation in equation (1) into either a penalized likelihood formulation or a Bayesian formulation. Penalized least squares can be interpreted as minimizing the penalized negative log-likelihood −l p (θ | y) = −l(θ | y) + p(θ | λ), where l(θ | y) is the unpenalized log-likelihood and p(θ | λ) is the penalty. It follows that maximization of the penalized log-likelihood is directly comparable to finding the mode of the log-posterior in the Bayesian formulation, where the penalty is represented as a prior. This implies independent Laplace (double-exponential) priors on the ∆ k θ j , where j = 1, . . . , n − k, for the trend filter formulation in equation (1). That is, p(∆ k θ j | λ) = λ 2 exp −λ | ∆ k θ j | . This is a well-known result that has been used in deriving a Bayesian form of the lasso (Tibshirani, 1996;Figueiredo, 2003;Park and Casella, 2008). Note that putting independent priors on the kth order differences results in improper joint prior p(θ | λ), which can be made proper by including a proper prior on the first k θs.
The Laplace prior falls into a class of priors commonly known as shrinkage priors. An effective shrinkage prior has the ability to shrink noise to zero yet retain and accurately estimate signals . These properties translate into a prior density function that has a combination of high mass near zero and heavy tails. The high density near zero acts to shrink small values close to zero, while the heavy tails allow large signals to be maintained. A simple prior developed for subset selection in Bayesian setting is the spike-and-slab prior, which is a mixture distribution between a point mass at zero and a continuous distribution (Mitchell and Beauchamp, 1988). This prior works well for model selection, but some drawbacks are that it forces small signals to be exactly zero, and computational issues can make it difficult to use ). There has been much interest in developing priors with continuous distributions (one group) that retain variable selection properties of the spike-and-slab (twogroup) yet do so by introducing sparsity through shrinkage . This approach allows all of the coefficients to be nonzero, but most are small and only some are large. Many such shrinkage priors have been proposed, including the normal-gamma (Griffin et al., 2010), generalized double-Pareto (Armagan et al., 2013), horseshoe (Carvalho et al., 2010), horseshoe+ (Bhadra et al., 2015), and Dirichlet-Laplace (Bhattacharya et al., 2015). The Laplace prior lies somewhere between the normal prior and the spike-and-slab in its shrinkage abilities, yet most shrinkage priors of current research interest have sparsity inducing properties closer to those of the spike-and-slab. Our main interest is in comparing the Laplace prior to other shrinkage priors in the context of nonparametric smoothing.

Model Formulation
It is clear that shrinkage priors other than the lasso could represent different smoothing penalties and therefore could lead to more desirable smoothing properties. There is a large and growing number of shrinkage priors in the literature. It is not our goal to compare and characterize properties of Bayesian nonparametric function estimation under all of these priors. Instead, we wish to investigate a few well known shrinkage priors and demonstrate as proof of concept that adaptive functional estimation can be achieved with shrinkage priors. Further research can focus on improvements to these methods. What follows is a general description of our modeling approach and the specific prior formulations that will be investigated through the remainder of the paper.
We assume the n observations y i , where i = 1, . . . , n, are independent and follow some distribution dependent on the unknown function values θ i and possibly other parameters ξ at discrete points t. We further assume that the order-k forward differences in the function parameters, ∆ k θ j , where j = 1, . . . , n − k, are independent and identically distributed conditional on a global scale parameter which is a function of the smoothing parameter λ. These assumptions result in the following general hierarchical form: (2) One convenient trait of many shrinkage priors, including the Laplace, the logistic, and the t-distribution, is that they can be represented as scale mixtures of normal distributions (Andrews and Mallows, 1974;West, 1987;Polson and Scott, 2010). The conditional form of scale mixture densities leads naturally to hierarchical representations. This can allow some otherwise intractable density functions to be represented hierarchically with standard distributions and can ease computation. To take advantage of this hierarchical structure, we restrict densities p(∆ k θ j | λ) to be scale mixtures of normals, which allows us to induce a hierarchical form to our model formulation by introducing latent local scale parameters, τ j . Here the order-k differences in the function parameters, ∆ k θ j , are conditionally normally distributed with mean zero and variance τ 2 j , and the τ j are independent and identically distributed with a global scale parameter which is a function of the smoothing parameter λ. The distribution statement for ∆ k θ j in Equation (2) can then be replaced with the following hierarchical representation: To complete the model specification, we place proper priors on θ 1 , . . . , θ k . This maintains propriety and can improve computational performance for some Markov chain Monte Carlo (MCMC) samplers. We start by setting θ 1 ∼ N(µ, ω 2 ), where µ and ω can be constants or allowed to follow their own distributions. Then for k ≥ 2 and h = 1, . . . , k − 1, we let ∆ h θ 1 | α h ∼ N(0, α 2 h ) and α h | λ ∼ p(α h | λ), where p(α | λ) is the same form as p(τ | λ). That is, we assume the order-h differences are independent with scale parameters that follow the same distribution as the order-k differences. For most situations, the order of k will be less than 4, so issues of scale introduced by assuming the same distribution on the scale parameters for the lower and higher order differences will be minimal. One could alternatively adjust the scale parameter of each p(α h | λ) to impose smaller variance for lower order differences.
For the remainder of the paper we investigate two specific forms of shrinkage priors: the Laplace and the horseshoe. We later compare the performance of these two priors to the case where the order-k differences follow identical normal distributions. The following provides specific descriptions of our shrinkage prior formulations.
Laplace. As we showed previously, this prior arises naturally from an L 1 penalty, making it the default prior for Bayesian versions of the lasso (Park and Casella, 2008) and trend filter. The Laplace distribution is leptokurtic and features high mass near zero and exponential tails ( Figure 1). Various authors have investigated its shrinkage properties (Griffin et al., 2010;Kyung et al., 2010;Armagan et al., 2013). We allow the order-k differences ∆ k θ j to follow a Laplace distribution conditional on a global scale parameter γ = 1/λ, and we allow γ to follow a half-Cauchy distribution with scale parameter ζ. That is, The use of a half-Cauchy prior on γ is a departure from Park and Casella (2008), who make λ 2 follow a gamma distribution to induce conjugacy in the Bayesian lasso. We chose to use the half-Cauchy prior on γ because its single parameter simplifies implementation, it has desirable properties as a prior on a scale parameter (Gelman et al., 2006;Polson and Scott, 2012b), and it allowed us to be consistent across methods (see horseshoe specification below). The hierarchical form of the Laplace prior arises when the mixing distribution on the square of the local scale parameter τ j is an exponential distribution. Specifically, we specify τ 2 j | λ ∼ Exp(λ 2 /2) and ∆ k θ j | τ j ∼ N(0, τ 2 j ) in the hierarchical representation. Horseshoe. The horseshoe prior (Carvalho et al., 2010) has an infinite spike in density at zero but also exhibits heavy tails ( Figure 1). This combination results in excellent performance as a shrinkage prior , and gives the horseshoe shrinkage properties more similar to the spike-and-slab variable selection prior than those of the Laplace prior. We allow the order-k differences ∆ k θ j to follow a horseshoe distribution conditional on global scale parameter γ = 1/λ, and allow γ to follow a half-Cauchy distribution with scale parameter ζ. That is, The horseshoe density function does not exist in closed form, but we have derived an approximate closed-form solution using the known function bounds (see Appendix A), which could be useful for application in some settings. Carvalho et al. (2010) represent the horseshoe density hierarchically as a scale mixture of normals where the local scale parameters τ j are distributed half-Cauchy. In our hierarchical version, the latent scale parameter τ j | γ ∼ C + (0, γ) and then conditional on τ j the distribution on the order-k differences is ∆ k θ j | τ j ∼ N(0, τ 2 j ). The horseshoe prior arises when the mixing distribution on the local scale parameter τ j is half-Cauchy, which is a special case of a half-t-distribution where degrees of freedom (df ) equal 1. Setting d f > 1 would result in a prior with lighter tails than the horseshoe, and setting 0 < d f < 1 would result in heavier tails. We tested half-t formulations with d f between 1 and 5 in test scenarios, but did not find an appreciable difference in performance relative to the horseshoe. We also attempted to place a prior distribution on the d f parameter, but found the data to be insufficient to gain information in the posterior for d f in our test scenarios, so we did not pursue this further.
Normal. The normal distribution arises as a prior on the order-k differences when the penalty in the penalized likelihood formulation is a squared L 2 penalty. The normal prior is also the form of prior used in Bayesian smoothing splines. The normal is not considered a shrinkage prior and does not have the flexibility to allow locally adaptive smoothing behavior. We use it for comparison to demonstrate the local adaptivity allowed by the shrinkage priors. For our investigations, the distribution on the order-k differences and associated scale parameter is:

Connections to Markov Random Fields
Here we briefly show the models represented by (2) can be expressed with GMRF priors for θ conditional on the local scale parameters τ. It is instructive to start with the normal increments model (6), which belongs to a class of time series models known as autoregressive models of order k. Rue and Held (2005) call this model a k-th order random walk and show that it is a GMRF with respect to a k-th order chain graph -a graph with nodes {1, 2, . . . , n}, where the nodes i j are connected by an edge if and only if |i − j| ≤ k. Since the normal model (6) does not fully specify the joint distribution of θ, it is an intrinsic (improper) GMRF. We make it a proper GMRF by specifying a prior density of the first k components of θ, p(θ 1 , . . . , θ k ). The Markov property of the model manifests itself in the following factorization: p(θ) = p(θ 1 , . . . , θ k )p(θ k+1 | θ 1 , . . . , θ k ) · · · p(θ n | θ n−1 , . . . , θ n−k ).
Equipped with initial distribution p(θ 1 , . . . , θ k ), models (5) and (4) also admit this factorization, so they are k-th order Markov, albeit not Gaussian models. However, if we condition on the latent scale parameters τ, both the Laplace and horseshoe models become GMRFs, or more specifically k-th order normal random walks. One important feature of these random walks is that each step in the walk has its own precision. To recap, under prior specifications (5) and (4) Our GMRF point of view is useful in at least three respects. First, GMRFs with constant precision have been used for nonparametric smoothing in many settings (see Rue and Held (2005) for examples). GMRFs with nonconstant precision have been used much less frequently, but one important application is to the development of adaptive smoothing splines by allowing order-k increments to have nonconstant variances (Lang et al., 2002;Yue et al., 2012). The approach of these authors is very similar to our own but differs in at least two important ways. First, we specify the prior distribution on the latent local scale parameters τ j with the resulting marginal distribution of ∆ k θ j in mind, such as the Laplace or horseshoe distributions which arise as scale mixtures of normals. This allows a better understanding of the adaptive properties of the resulting marginal prior in advance of implementation. In contrast, Lang et al. (2002) and Yue et al. (2012) appear to choose the distribution on local scale parameters based on conjugacy and do not consider the effect on the marginal distribution of ∆ k θ j . Second, we allow the local scale parameters τ j to be independent, whereas Lang et al. (2002) and Yue et al. (2012) impose dependence among the scale (precision) parameters by forcing them to follow another GMRF. Allowing the local scale parameters to be independent allows the model to be more flexible and able to adapt to jumps and sharp local features. We should also note that Rue and Held (2005) in section 4.3 show that the idea of scale mixtures of normal distributions can be used with GMRFs to generate order-k differences which marginally follow a t-distribution by introducing latent local scale parameters. Although they do not pursue this further, we mention it because it bears similarity to our approach.
Second, viewing the SPMRF models as conditional GMRFs allows us to utilize some of the theoretical results and computational methods developed for GMRFs. In particular, one can take advantage of more complex forms of precision matrices such as circulant or seasonal trend matrices (see Rue and Held (2005) for examples). One can also employ the computational methods developed for sparse matrices, which speed computation times (Rue, 2001;Rue and Held, 2005). We note that simple model formulations such as the kth-order random walk models can be coded with state-space formulations based on forward differences, which speed computation times by eliminating the operations on covariance matrices necessary with multivariate Gaussian formulations.
A third advantage of connecting our models to GMRFs is that the GMRF representation allows us to connect our first-order Markov models to subordinated Brownian motion (Bochner, 1955;Clark, 1973), a type of Lévy process recently studied in the context of scale mixture of normal distributions (Polson and Scott, 2012a). Polson and Scott (2012a) use the theory of Lévy processes to develop shrinkage priors and penalty functions. Let us briefly consider a simple example of subordinated Brownian motion. Let W be a Weiner process, so that W(t + s) − W(t) ∼ N(0, sσ 2 ), and W has independent increments. Let T be a subordinator, which is a Lévy process that is non-decreasing with probability 1, has independent increments, and is independent of W. The subordinated process Z results from observing W at locations T (t). That is, Z(t) = W[T (t)]. The subordinator essentially generates a random set of irregular locations over which the Brownian motion is observed, which results in a new process. In our hierarchical representation of Laplace and horseshoe priors for the first order differences, we can define a subordinator process T j = j i=1 τ 2 i , so that the GMRF p(θ | τ) can be thought of as a subordinated Brownian motion or as a realization of a Brownian motion with unit variance on the random latent irregular grid T 1 , . . . , T n . The subordinated Brownian motion interpretation is not so straight forward when applied to higher-order increments, but we think this interpretation will be fruitful for extending our SPMRF models in the future. One example where this interpretation is useful is when observations occur on an irregularly spaced grid, which we explore in the following section.

Extension to Irregular Grids
So far we have restricted our model formulation to the case where data are observed at equallyspaced locations. Here we generalize the model formulation to allow for data observed at locations with irregular spacing. This situation arises with continuous measurements over time, space, or some covariate, or when gaps are left by missing observations. For a GMRF with constant precision (normally distributed kth-order differences), we can use integrated Wiener processes to obtain the precision matrix (see Rue and Held (2005) and Lindgren and Rue (2008) for details). However, properly accounting for irregular spacing in our models with Laplace or horseshoe kth-order differences is more difficult. To use tools similar to those for integrated Wiener processes we would need to show that the processes built on Laplace and horseshoe increments maintain their distributional properties over any subinterval of a continuous measure. Polson and Scott (2012a) show that processes with Laplace or horseshoe first-order increments can be represented as subordinated Brownian motion. However, to meet the necessary condition of an infinitely divisible subordinator, the subordinator for the Laplace process needs to be on the precision scale and the subordinator for the horseshoe process needs to be on the log-variance scale. Both resulting processes are Lévy process, which means they have independent and stationary increments, but the increments are no longer over the continuous measure of interest. This makes representation of these processes over continuous time difficult and development of the necessary theory is out of the scope of this paper.
Absent theory to properly address this problem, we instead start with our hierarchical model formulations and assume that conditional on a set of local variance parameters τ, we can use methods based on integrated Wiener processes to obtain the precision matrices for the latent GMRFs. This requires the assumption that local variances are constant within respective intervals between observations. Let s 1 < s 2 < ... < s n be a set of locations of observations, and let δ j = s j+1 − s j be the distance between adjacent locations. We assume we have a discretely observed continuous process and denote by θ(s j ) the value of the process at location s j . For the first-order model and some interval [s j , s j+1 ], we assume that conditional on local variance τ j , θ(s) follows a Wiener process where θ(s Note that the resulting marginal distribution of θ(s j + h) − θ(s j ) after integrating over τ j is therefore assumed to be Laplace or horseshoe for all h, with the form of the marginal distribution dependent on the distribution of τ j . We know this cannot be true in general given the properties of these distributions, but we assume it approximately holds for h ≤ δ. The situation becomes more complex for higher order models. We restrict our investigations to the second-order model and follow the methods of Lindgren and Rue (2008), who use a Galerkin approximation to the stochastic differential equation representing the continuous process. The resulting formula for a second-order increment becomes and the variance of a second-order increment conditional on τ j is This adjustment of the variance results in good consistency properties for GMRFs with constant precision (Lindgren and Rue, 2008), so should also perform well over intervals with locally constant precision. We show in Appendices A and B that integrating over the local scale parameter τ j maintains the distance correction as a multiplicative factor on the scale of the resulting marginal distribution. We also provide a data example involving a continuous covariate in Appendix C where we apply the methods above for irregular grids.

Posterior Computation
Since we have two general model formulations, marginal and hierarchical, we could use MCMC to approximate the posterior distribution of heights of our piecewise step functions, θ, by work-ing with either one of the two corresponding posterior distributions. The first one corresponds to the marginal model formulation: where p(θ | γ) is a Markov field induced by the normal, Laplace, or horseshoe densities, and p(γ) is a half-Cauchy density. Note that a closed-form approximation to the density function for the horsehoe prior (see Appendix A) is needed for the marginal formulation using the horseshoe. The second posterior corresponds to the hierarchical model with latent scale parameters τ: where p(θ | τ) is a GMRF and the choice of p(τ j | γ) makes the marginal prior specification for θ correspond either to a Laplace or to a horseshoe Markov random field. Notice that the unconditional GMRF (normal prior) has only the marginal specification.
Both of the above model classes are highly parameterized with dependencies among parameters induced by differencing and the model hierarchy. It is well known that high-dimensional, hierarchical models with strong correlations among parameters can create challenges for standard MCMC samplers, such as component-wise random walk Metropolis or Gibbs updates. When faced with these challenges, random walk behavior can result in inefficient exploration of the parameter space, which can lead to poor mixing and prohibitively long convergence times. Many approaches have been proposed to deal with these issues, including block updating (Knorr-Held and Rue, 2002), elliptical slice sampling , the Metropolis adjusted Langevin algorithm (MALA) (Roberts and Stramer, 2002), and Hamiltonian Monte Carlo (HMC) (Duane et al., 1987;Neal, 1993Neal, , 2011. All of these approaches jointly update some or all of the parameters at each MCMC iteration, which usually improves mixing and speeds up convergence of MCMC. Among these methods, HMC offered the most practical choice due to its ability to handle a wide variety of models and its relative ease in implementation via readily availble software such as stan (Carpenter et al., 2016). We used a modification of HMC proposed by Hoffman and Gelman (2014) which automatically adjusts HMC tuning parameters. We used the open source package rstan (Stan Development Team, 2015a), which provides a platform for fitting models using HMC in the R computing environment (R Core Team, 2014).
Even with HMC, slow mixing can still arise with hierarchical models and heavy-tailed distributions due to the inability of a single set of HMC tuning parameter values to be effective across the entire model parameter space. Fortunately this problem can often be remedied by model reparameterizations that change the geometry of the sampled parameter space. For hierarchical models, the non-centered parameterization methods described by Papaspiliopoulos et al. (2003Papaspiliopoulos et al. ( , 2007 and Betancourt and Girolami (2015) can be useful. Non-centered parameterizations break the dependencies among parameters by introducing deterministic transformations of the parameters. The MCMC algorithm then operates directly on the independent parameters. Betancourt and Girolami (2015) discuss non-centered parameterizations in the context of HMC, and further examples of these and other reparameterization methods that target heavy-tailed distributions are provided in the documentation for stan (Stan Development Team, 2015b).
We note that after employing reparameterizations, HMC with stationary distribution equal to the hierarchical model posterior (8) had good convergence and mixing properties for each of our models and in nearly all of our numerical experiments. HMC that targeted the marginal model posterior (7) had fast run times and good mixing for the normal and Laplace formulations, but we could not effectively reparameterize the (approximate) marginal horseshoe distribution to remove the effects of its heavy tails, which resulted in severe mixing problems for the marginal horseshoe-based model. Therefore, in the rest of the manuscript we work with the hierarchical model posterior distribution (8) for all models. For SPMRF and GMRF models, the computation time needed to evaluate the log-posterior and its gradient scales as O(n), where n is the grid size. However, the hierarchical SPMRF models have approximately twice as many parameters as the GMRF or marginal SPMRF models. These hierarchical SPMRF methods are therefore slower than their GMRF counterparts. Since the computational cost of evaluating the log-posterior is only one factor determining the MCMC speed, we compared run times of the SPMRF and GMRF models on simulated and real data (see Appendix G). Our results show that SPMRF models are slower than GMRFs, but not prohibitively so.
We developed an R package titled spmrf which allows for easy implementation of our models via a wrapper to the rstan tools. The package code is publicly available at https: //github.com/jrfaulkner/spmrf.

Simulation Protocol
We use simulations to investigate the performance of two SPMRF formulations using the Laplace and horseshoe shrinkage priors described in Section 2.2 and compare results to those using a normal distribution on the order-k differences. We refer to the shrinkage prior methods as adaptive due to the local scale parameters, and the method with normal prior as non-adaptive due to the use of a single scale parameter. We constructed underlying trends with a variety of characteristics following approaches similar to those of other authors (Scheipl and Kneib, 2009;Yue et al., 2012;Zhu and Dunson, 2013). We investigated four different types of underlying trend (constant, piecewise constant, smooth function, and function with varying smoothness). The first row of Figure 2 shows examples of the trend functions, each illustrated with simulated normal observations centered at the function values over a regular grid. We used three observation types for each trend type where the observations were conditionally independent given the trend function values θ i , where i = 1, . . . , n. The observation distributions investigated were 1) normal: Note that we constructed the function values for the scenarios with normally distributed observations so that each function would have approximately the same mean and variance, where the mean and variance were calculated across the function values realized at the discrete time points. This allowed us to specify observation variances which resulted in the same signalto-noise ratio for each function, where signal-to-noise ratio is defined as the standard deviation of function values divided by the standard deviation of observations. The signal-to-noise ratios for our scenarios with normal observations were 6 for σ = 1.5 and 2 for σ = 4.5. We chose the mean sizes for the Poisson scenarios and sample sizes for the binomial scenarios so that the resulting signal-to-noise ratios would be similar to those for the normal scenarios with σ = 4.5. These levels allowed us to assess the ability of the models to adapt to local features when the signal is not overwhelmed by noise. We describe the trend functions further in what follows.
Constant. This scenario uses a constant mean across all points. We use this scenario to investigate the ability of each method to find a straight horizontal line in the presence of noisy data. The values used for the constant mean were 20 for normal and Poisson observations, and 0.5 for binomial observations.
Piecewise constant. This type of function has been used by Tibshirani (2014) and others such as Scheipl andKneib (2009) andZhu andDunson (2013). The horizontal trends combined with sharp breaks offer a difficult challenge for all methods. For the scenarios with normal or Poisson observations, the function values were 25, 10, 35, and 15 with break points at t ∈ {20, 40, 60}. For the binomial observations the function values on the probability scale were 0.65, 0.25, 0.85, and 0.45 with the same break points as the other observation types.
Smooth trend. We use this as an example to test the ability of the adaptive methods to handle a smoothly varying function. We generated the function f as a GP with squared exponential covariance function. That is, , where Σ i, j is the covariance between points i and j, σ 2 f > 0 is the signal variance and ρ > 0 is the length scale. We set µ = 10, σ 2 f = 430, and ρ = 10 for the scenarios with normal or Poisson observations. For binomial observations, f was generated in logit space with µ = −0.5, σ 2 f = 3, and ρ = 10 and then back-transformed to probability space. For all scenarios the function was generated with the same random number seed.
Varying smoothness. This function with varying smoothness was initially presented by DiMatteo et al. (2001) and later used by others, including Yue et al. (2012). We adapted the function to a uniform grid, t ∈ [1, n], where n = 100 in our case, resulting in the function For the normal and Poisson observations we made the transformation f (t) = 20 + 10g(t). For binomial observations we used f (t) = 1.25g(t) on the logit scale. We generated 100 datasets for each combination of trend and observation type. This number of simulations was sufficient to identify meaningful differences between models without excessive computation time. Each dataset had 100 equally-spaced sample points over the interval [1,100]. For each dataset we fit models representing three different prior formulations for the order-k differences, which were 1) normal, 2) Laplace, and 3) horseshoe. We used the hierarchical prior representations for these models given in Section 2.2. We selected the degree of k-th order differences for each model based on knowledge of the shape of the underlying function. We fit first-order models for the constant and piecewise constant functions, and we fit second-order models for the smooth and varying smooth functions. For the scenarios with normal observations, we set σ ∼ C + (0, 5). In all cases, θ 1 ∼ N(µ, ω 2 ), where µ is set to the sample mean and ω is two times the sample standard deviation of the observed data transformed to match the scale of θ. We also set γ ∼ C + (0, 0.01) for all models.
We used HMC to approximate the posterior distributions. For each model we ran four independent chains with different randomly generated starting parameter values and initial burn-in of 500 iterations. For all scenarios except for normal observations with σ = 1.5, each chain had 2,500 posterior draws post-burn-in that were thinned to keep every 5th draw. For scenarios with normal observations with σ = 1.5, chains with 10,000 iterations post-burn-in were necessary, with additional thinning to every 20th draw. In all cases, these settings resulted in 2,000 posterior draws retained per model. We found that these settings consistently resulted in good convergence properties, where convergence and mixing were assessed with a combination of trace plots, autocorrelation values, effective sample sizes, and potential scale reduction statistics (Gelman and Rubin, 1992).
We assessed the relative performance of each model using three different summary statistics. We compared the posterior medians of the trend parameters (θ i ) to the true trend values (θ i ) using the mean absolute deviation (MAD): We assessed the width of the 95% Bayesian credible intervals (BCIs) using the mean credible interval width (MCIW): whereθ 97.5,i andθ 2.5,i are the 97.5% and 2.5% quantiles of the posterior distribution for θ i . We also computed the mean absolute sequential variation (MASV) ofθ as We compared the observed MASV to the true MASV (TMASV) in the underlying trend function, which is calculated by substituting true θ's into equation for MASV.

Simulation Results
In the interest of space, we emphasize results for the scenarios with normally distributed observations with σ = 4.5 here. This level of observation variance was similar to that for Poisson and binomial observations and therefore offered results similar to those scenarios. We follow these results with a brief summary of results for the other observation types, and we provide further summary of other results in Appendix D.
Constant. The three models performed similarly in terms of absolute value of all the metrics (Table 1 and Figure 2), but the Laplace and normal models were slightly better at fitting straight lines than the horseshoe. This is evidenced by the fact that the horseshoe had larger MCIW and larger MASV than the other methods. The first column of plots in Figure 3 provides a visual example of the extra variation exhibited by the horseshoe.
Piecewise constant. The horseshoe model performed the best in all categories for this scenario and the normal model performed the worst (Table 1 and Figure 2). The Laplace model was closer to the normal model in performance. The horseshoe was flexible enough to account for the large function breaks yet still able to limit variation in the constant segments. Example fits for the piecewise constant function are shown in the second column of plots in Figure 3.
Smooth trend. The different models were all close in value of the performance metrics for the smooth trend scenario (Table 1 and Figure 2). The normal and Laplace models had smallest MAD, but the horseshoe had MSAV closer to the true MSAV. The fact that the values of the metrics were similar for all models suggests that not much performance is lost in fitting a smooth trend with the adaptive methods in comparison to non-adaptive.
Varying Smoothness. Again the models all performed similarly in terms of absolute value of the metrics, but there was a clear ordering among models in relative performance (Table 1 and Figure 2). The horseshoe model performed the best relative to the other models on all metrics. This function forces a compromise between having large enough local variance to capture the spike and small enough local variance to remain smooth through the rest of the function. The horseshoe was more adaptive than the other two methods and therefore better able to meet the compromise. The plots in the last column of Figure 3 provide example fits for this function.
The results for the scenarios with normal observations with σ = 1.5 and Poisson and binomial observations (see Appendix D) showed similar patterns to those with normal observations and σ = 4.5. For the constant function, the normal prior performed the best and the horseshoe prior the worst, although differences in terms of absolute values of the performance metrics were small. The relative differences were more pronounced with the scenarios with normal observations with σ = 1.5. For the piecewise constant function, the horseshoe prior performed the best for all scenarios and the normal prior the worst. All methods performed similarly for the smooth function, with the normal and Laplace generally performing a little better than the horseshoe. For the function with varying smoothness, the horseshoe performed the best and the normal the worst for all scenarios.

Data Examples
Here we provide two examples of fitting SPMRF models to real data. Each example uses a different probability distribution for the observations. The first example exhibits a change point, which makes it amenable to adaptive smoothing methods. The second example has a more uniformly smooth trend but also shows a period of rapid change, so represents a test for all methods. First we address the issue of setting the hyperparameter for the global smoothing parameter.

Parameterizing the Global Smoothing Prior
The value of the global smoothing parameter λ determines the precision of the marginal distributions of the order-k differences, which influences the smoothness of the estimated trend. Selection of the global smoothing parameter in penalized regression models is typically done via cross-validation in the frequentist setting (Tibshirani, 1996) or marginal maximum likelihood in the empirical Bayes setting (Park and Casella, 2008). Our fully Bayesian formulation eliminates the need for these additional steps, but in turn requires selection of the hyperparameter controlling the scale of the prior on the smoothing parameter. The value of this hyperparameter will depend on the order of the model, the grid resolution, and the variability in the latent trend parameters. Therefore, a single hyperparameter value cannot be used in all situations. Some recent studies have focused on methods for more careful and principled specification of priors for complex hierarchical models (Fong et al., 2010;Simpson et al., 2014;Sørbye and Rue, 2014). The method of Sørbye and Rue (2014) was developed for intrinsic GMRF priors and we adapt their approach to our specific models in what follows. We wish to specify values of the hyperparameter ζ for various situations, where the global scale parameter γ ∼ C + (0, ζ) . Let Q be the precision matrix for the Markov random field corresponding to the model of interest (see Appendix E for examples), and Σ = Q −1 be the covariance matrix with diagonal elements Σ ii . The marginal standard deviation of all components of θ for a fixed value of γ is σ γ (θ i ) = γσ ref (θ), where σ ref (θ) is the geometric mean of the individual marginal standard deviations when γ = 1 . We want to set an upper bound U on the average marginal standard deviation of θ i , such that Pr(σ γ (θ i ) > U) = α, where α is some small probability. Using the cumulative probability function for a half-Cauchy distribution, we can find a value of ζ for a given value of σ ref (θ) specific to a model of interest and given common values of U and α by: By standardizing calculations to be relative to the average marginal standard deviation, the methods of Sørbye and Rue (2014) allow us to easily calculate ζ for a model of different order or a model with a different density of grid points. For practical purposes we apply the same method to the normal and SPMRF models. This is not ideal in terms of theory, however, since the horseshoe distribution has infinite variance and the corresponding SPMRF will clearly not have the same marginal variance as a GMRF. This is not necessarily problematic since GMRF approximation will result in an estimate of ζ under the horseshoe SPMRF which is less informative than would result under similar methods derived specifically for the horseshoe SPMRF, and could therefore be seen as more conservative in terms of guarding against over smoothing.
In contrast, the Laplace SPMRF has finite marginal variance that is well approximated by the GMRF methods. We apply these methods in the data examples that follow.

Coal Mining Disasters
This is an example of estimating the time-varying intensity of an inhomogeneous Poisson process that exhibits a relatively rapid period of change. The data are on the time intervals between successive coal-mining disasters, and were originally presented by Maguire et al. (1952), with later corrections given by Jarrett (1979) and Raftery and Akman (1986). We use the data format presented by Raftery and Akman (1986). A disaster is defined as an accident involving 10 or more deaths. The first disaster was recorded in March of 1851 and the last in March of 1962, with 191 total event times during the period 1 January, 1851 through 31 December, 1962. Visual inspection of the data suggests a decrease in rate of disasters over time, but it is unclear by eye alone whether this change is abrupt or gradual. The decrease in disasters is associated with a few changes in the coal industry at the time. A sharp decline in labor productivity at the end of the 1880's is thought to have decreased the opportunity for disasters, and the formation of the Miner's Federation, a labor union, in late 1889 brought added safety and protection to the workers (Raftery and Akman, 1986).
This data set has been of interest to various authors due to uncertainty in the timing and rate of decline in disasters and the computational challenge presented by the discrete nature of the observations. Some authors have fit smooth curves exhibiting gradual change (Adams et al., 2009;Teh and Rao, 2011) and others have fit change-point models with abrupt, instantaneous change (Raftery and Akman, 1986;Carlin et al., 1992;Green, 1995). An ideal model would provide the flexibility to automatically adapt to either scenario.
We assumed an inhomogeneous Poisson process for the disaster events and binned the event counts by year. We fit first-order models using the normal, Laplace, and horseshoe prior formulations. We assumed the event counts, y i , were distributed Poisson conditional on the θ i : y i | θ i ∼ Pois exp(θ i ) . The marginal prior distributions for the first-order increments were ∆θ j ∼ N(0, γ 2 ) for the Normal, ∆θ j ∼ Laplace(γ) for the Laplace, and ∆θ j ∼ HS(γ) for the horseshoe. We used the same prior specifications as those used in the simulations for the remaining parameters, except we used the guidelines in Section 4.1 to set the hyperparameter on the global scale prior. Using calculations outlined in Appendix E, we set σ ref (θ) = 6.47 and U = 0.860. Setting α = 0.05 and substituting into Equation (E.6) results in ζ = 0.0105, so γ ∼ C + (0, 0.0105) for each model. We used HMC for approximating the posterior distributions. For each model we ran four independent chains, each with a burn-in of 500 followed by 6,250 iterations thinned at every 5. This resulted in a total of 5,000 posterior samples for each model. We were interested in finding the best representation of the process over time as well as finding the most likely set of years associated with the apparent change point. For this exercise we arbitrarily defined a change point as the maximum drop in rate between two consecutive time points.
Plots of the fitted trends ( Figure 4) indicate that the horseshoe model picked up a sharper change in trend and had narrower BCIs than the other models. The normal and Laplace models did not have sufficient flexibility to allow large jumps and produced a gradual decline in accidents rate, which is less plausible than a sharp decline in light of the additional information about change in coal mining industry safety regulations. The relative qualitative performance of the normal, Laplace, and horseshoe densities is similar to that for the piecewise constant scenario from our simulation study. The posterior distributions of the change point times are shown in Figure 4. The horseshoe model clearly shows a more concentrated posterior for the break points, and that distribution is centered near the late 1880's, which corresponds to the period of change in the coal industry. Therefore, we think the Bayesian trend filter with the horseshoe prior is a better default model in cases where sharp change points are expected. It is important to point out that we tried other values for the scale parameter (ζ) in the prior distribution for γ and found that the models were somewhat sensitive to that hyperparameter for this data set. In particular, the horseshoe results for ζ = 1 looked more like those for the other two models in Figure 4, but when ζ = 0.0001, the horseshoe produced more defined break points and straighter lines with narrower BCIs compared to the results with ζ = 0.01 (see Appendix F).

Tokyo Rainfall
This problem concerns the estimation of the time-varying mean of an inhomogeneous binomial process. We are interested in estimating the seasonal trend in daily probability of rainfall. The data are binary indicators of when daily rainfall exceeded 1 mm in Tokyo, Japan, over the course of 39 consecutive years . The indicators were combined by day of year across years, resulting in a sample size of m = 39 for each of 365 out of 366 possible days, and a size of m = 10 for the additional day that occurred in each of the 10 leap years. The observation variable y is therefore a count, where y ∈ {0, 1, . . . , 39}. Data were obtained from the NOAA's National Center for Climate Information (https://www.ncdc.noaa.gov). A smaller subset of these data (1983)(1984) was initially analyzed by Kitagawa (1987) and later by several others, including Rue and Held (2005). We fit SPMRF models with Laplace and horseshoe priors and a GMRF model (normal prior). All models were based on second-order differences. The observation model was and the marginal prior distributions for the second-order differences were ∆ 2 θ j ∼ N(0, γ 2 ) for the normal prior, ∆ 2 θ j ∼ Laplace(γ) for the Laplace, and ∆ 2 θ j ∼ HS(γ) for the horseshoe. We used the same prior specifications as those used in the simulations for the remaining parameters, except we used the guidelines in Section 4.1 to set the hyperparameter on the global scale prior. Using calculations outlined in Appendix E, we set σ ref (θ) = 906.7 and U = 0.679. Setting α = 0.05 and substituting into Equation (E.6) results in ζ = 5.89 × 10 −5 , so γ ∼ C + (0, 5.89 × 10 −5 ) for each model. We ran four independent chains for each model, each with a burn-in of 500 followed by 6,250 draws thinned at every 5. This resulted in a total of 5,000 MCMC samples retained for each model. The resulting function estimates for all models reveal a sharp increase in probability of rain in June followed by a sharp decrease through July and early August and a subsequent sharp increase in late August and September ( Figure 5). Changes through the rest of the months were relatively smooth. The estimated function displays some variations in smoothness similar to the function with varying smoothness used in our simulations. All methods resulted in a similar estimated function, but the horseshoe prior resulted in a smoother function that displayed sharper features at transition points in late June and early August, yet also had narrower credible intervals over most of the function. The normal and Laplace models resulted in a little more variability in the trend in January-April and in November. In their analysis of a subset of these data, Rue and Held (2005) used a circular constraint to tie together the endpoints of the function at the beginning and end of the year. We did not use such a constraint here, although it is possible with the SPMRF models. Even so, it is evident that the horseshoe model resulted in more similar function estimates at the endpoints than did the other two models.

Discussion
We presented a method for curve fitting in a Bayesian context that achieves locally adaptive smoothing by exploiting the sparsity-inducing properties of shrinkage priors and the smoothing properties of GMRFs. We compared the performance of the Laplace prior, which simply reformulates the frequentist trend filter to a Bayesian analog, to a more aggressive horseshoe shrinkage prior by using simulations and found that the horseshoe provided the best balance between bias and precision. The horseshoe prior has the greatest concentration of density near zero and the heaviest tails among the priors we investigated. This combination allows smooth functions to be fit in regions with weak signals or noisy data while still allowing for recovery of sharp functional changes when supported by informative data. The Laplace prior allowed more functional changes of moderate value to be retained and could not accommodate large changes without compromising the ability to shrink the noisy and smaller functional changes. This resulted in greater variability in the estimated functions and wider associated credible intervals for the models with the Laplace prior in comparison to those with the horseshoe prior when the underlying true functions had jumps or varying smoothness. The Laplace prior did have adaptive ability not possessed by the normal prior, but the horseshoe prior clearly had the best adaptive properties among the priors we investigated.
The Laplace prior performed better than the horseshoe for the constant and smooth functions in our simulations, with results closer to those of the normal prior, although the differences in performance among the three methods were relatively small. These functions do not have large deviations in order-k differences, and so there are many small or medium sized values for the estimated ∆ k θ. This situation is reflective of cases described by Tibshirani (1996) where the lasso and ridge regression perform best, which helps explain why the analogous SPMRF models with Laplace or normal prior distributions do better here. We expect that non-adaptive or mildly adaptive methods will perform better when used on functions which do not exhibit jumps or varying smoothness. However, it is reassuring that an adaptive method does nearly as well as a non-adaptive method for these functions. This allows an adaptive model such as that using the horseshoe to be applied to a variety of functions with minimal risk of performance loss.
Our fully Bayesian implementation of the SPMRF models eliminates the need to explicitly select the global smoothing parameter λ , either directly or through selection methods such as cross-validation (e.g., Tibshirani (1996)) or marginal maximum likelihood (e.g., Park and Casella (2008)). However, the fully Bayesian approach does still require attention to the selection of the hyperparameter that controls the prior distribution on the smoothing parameter. We found the methods of Sørbye and Rue (2014) to offer practical guidelines for selecting this hyperparameter, and we successfully applied a modification of those methods in our data examples. A highly informative prior on the global smoothing parameter can result in oversmoothing if the prior overwhelms the information in the data, while a diffuse prior may result in a rougher function with insufficient smoothing. Noisier data are therefore more sensitive to choice of parameterization of the prior on the global smoothing parameter. We tested prior sensitivity in the coal mining example and found that the horseshoe prior was more responsive to changes in hyperparmeter values than the normal and Laplace priors (see Appendix F). However, the results for the simulations and for the Tokyo rainfall example were much more robust to the value of the hyperparameter on the global scale due to the information in the data. As a precaution, we recommend first applying methods such as those by Sørbye and Rue (2014) to set the hyperparameter, but then also paying attention to prior sensitivity when analyzing noisy data with the SPMRF models.
We only addressed one-dimensional problems here, but we think the GMRF representation of these models can allow extension to higher dimensions such as the spatial setting by incorporating methods used by Rue and Held (2005) and others. We also plan to extend these methods to semi-parametric models that allow additional covariates.

Acknowledgments
J.R.F, and V.N.M. were supported by the NIH grant R01 AI107034. J.R.F. was supported by the NOAA Advanced Studies Program and V.N.M. was supported by the NIH grant U54 GM111274. We are grateful to Soumik Pal for making us think harder about subordinated processes. We thank two anonymous reviewers for their help in improving this manuscript.

A Approximation to the Horseshoe Density
There is no exact closed-form expression available for the horseshoe density function. We present an approximation to the horseshoe density that can be used without the need for explicit specification of the nuisance local scale parameters. Following Carvalho et al. (2010), the marginal distribution of u given global scale parameter γ is found by integrating over possible values of the local scale parameter τ, where u | τ ∼ N(0, δτ 2 ) and τ | γ ∼ C + (0, γ). Here δ is a constant representing a scale factor for the distance between adjacent points when this distribution is used for the increments of a kth-order smoothing model. This leads to We let B = 2γ/( √ 2π 3 δ) and introduce the substitution ω = τ −2 , which gives dτ = −1/(2ω 3/2 ), resulting in dω.
Now we introduce the substitution z = 1 + ωγ 2 , which gives dω = γ −2 dz, and results in where E 1 is the exponential integral function. Note that lim x→0 + E 1 (x) = ∞, but for x > 0, the function E 1 (x) is bounded as follows: Then for u ∈ {R : u 0} we have It follows that the target density is bounded by Let the left bound in equation (A.1) be denoted B 1 (u) and the right bound B 2 (u). Note that as u → 0, each of B 1 (u), p(u | γ) and B 2 (u) approach ∞. It can be shown that ∞ −∞ B 1 (u)du = √ 2/π and ∞ −∞ B 2 (u)du = 2/ √ π. Since √ 2/π < 1 < 2/ √ π, these bounds can be used to find an approximate expression for p(u | γ) that integrates to 1 and still satisfies equation (A.1). We set with constraints 0 < w < 1 and ∞ −∞ wB 1 (u) + (1 − w)B 2 (u)du = 1. Using the values for the integrated bounds and solving gives w = ( √ π − 2)/( √ 2 − 2). Substituting this value for w into equation (A.2) and simplifying gives the following closed-form approximation to the horseshoe density function:

B Marginal Laplace Distribution with Irregular Grid Spacing
The following is a derivation of the marginal prior distribution for the order-k differences when grid spacing is unequal. These derivations are based on the scale-mixture representation of the Laplace distribution. These results are known to apply to the first-order and second-order models, but higher orders. Let u j = ∆ k θ j and let δ j be a constant representing a scale factor for the distance between adjacent points when this distribution is used for the increments of a kth-order smoothing model. For convenience, subscripts on u and δ are dropped from here forward. We assume u|τ, δ ∼ N(0, δτ 2 ) and τ 2 |λ 2 ∼ Exp(λ 2 /2). Here λ = 1/γ is the global shrinkage parameter. It follows that where A = λ 2 2 √ 2πδτ 2 . Now we make the substitution ω = 1/τ 2 , which gives dτ 2 = −ω −2 dω, and the marginal density for u becomes where the last line follows from the fact that the integrand in the second-to-last line is the pdf of an inverse-Gaussian distribution with mean parameter µ = √ δλ/|u| and shape parameter λ = λ 2 . The result is the pdf of a Laplace distribution with mean zero and scale parameter λ/ √ δ. Note that the variance of the Laplace distribution is 2δ/λ 2 , which implies that the grid spacing δ scales the variance of the increments u.

C Data Example with Irregular Grid
We apply the SPMRF models to a data set with a continuous covariate. The response data are rent per square meter of floor space in Munich, Germany, and the covariate is the floor space in square meters. These data were analysed by Rue and Held (2005) using a second-order GMRF with irregular spacing. Here we apply a second-order GMRF and SPMRF models using the methods described in Section 2.4 of the main text. Let x represent the floor space measurements, and let x 1 < x 2 < ... < x n be the ordered set of unique floor measurement values. Further, let δ j = x j+1 − x j be the distance between adjacent floor space measurements. The marginal prior distributions for the second-order differences were ∆ 2 θ j ∼ N(0, d j γ 2 ) for the normal prior, ∆ 2 θ j ∼ Laplace(d 1/2 j γ) for the Laplace, and ∆ 2 θ j ∼ HS(d 1/2 j γ) for the horseshoe, where Using methods described in Section 4.1 of the main text and Appendix E, we calculated the value of the hyperparameter for the global scale parameter to be ζ = 0.00094, so γ ∼ C + (0, 0.00094) for all models. The results are shown in Figure

D Additional Simulation Results
Here we display plots with simulation results for normal data with σ = 1.5 ( Figure   q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q

E Parameterizing the Global Smoothing Prior
Here we provide additional details for calculating the hyperparameter ζ for the prior on the global scale parameter γ, where γ ∼ C + (0, ζ). First let Q be the precision matrix for the Markov random field corresponding to the model of interest (see examples below), and Σ = Q −1 be the covariance matrix with diagonal elements Σ ii . Following Sørbye and Rue (2014), the marginal standard deviation of all components of θ for a fixed value of γ is σ γ ( That is, σ ref (θ) is the geometric mean of the individual marginal standard deviations when γ = 1. Before going further, let us describe two precision matrices used in the accompanying paper and their associated covariance matrices. Sørbye and Rue (2014) use intrinsic formulations of a Gaussian Markov random field (GRMF), which is also possible with our models, but we chose to use proper GMRF models. This requires specification of the variance of the first θ, which we denote ω 2 = Var(θ 1 ). For a given value of γ, the n × n precision matrix for a first-order model is: and the corresponding covariance matrix is: ω 2 ω 2 · · · ω 2 ω 2 ω 2 + γ 2 ω 2 + γ 2 · · · ω 2 + γ 2 . . . ω 2 + γ 2 ω 2 + 2γ 2 ω 2 + 2γ 2 · · · ω 2 + 2γ 2 . . . ω 2 + 2γ 2 . . . . . .
. . . ω 2 + (n − 2)γ 2 ω 2 + (n − 2)γ 2 ω 2 ω 2 + γ 2 ω 2 + 2γ 2 · · · ω 2 + (n − 2)γ 2 ω 2 + (n − 1)γ 2 Therefore, the marginal variances for the first-order model are Σ 1,ii = ω 2 + (i − 1)γ 2 . For the second-order model, the n × n precision matrix is: There is an analytical form for the covariance matrix for the second-order model, but it suffices here to know that the form of the marginal variances is: Note that if we were using an intrinsic GMRF model, we would assume that ω 2 is infinite, which would result in a covariance matrix of rank n − k. Following Sørbye and Rue (2014) we would then use the generalized inverse of the precision matrix to calculate the marginal variances.
In practice we use the variance of the data (transformed data if the θ parameters are on a transformed scale) as an estimate of ω 2 . Although this is using the data twice, this offers a reasonable constraint on the marginal variances of the θs.
We want to set an upper bound U on the average marginal standard deviation of θ i , such that Pr(σ γ (θ i ) > U) = α, where α is some small probability. Using the cumulative probability function for a half-Cauchy distribution, we can find a value of ζ for a given value of σ ref (θ) specific to a model of interest and given common values of U and α by: .

(E.6)
It may be useful to note here that the median of a half-Cauchy distribution is equal to its scale parameter, since the median may be a more intuitive measure of the effect of different values of ζ.
For our data examples in the main text, we let U be the estimated standard deviation of the data on the appropriate scale. We know that the marginal variances of the θs should not exceed the variance in the observed data, on average. We set α = 0.05 as the probability of the average marginal standard deviation exceeding U. For the coal mining example in the main text, we found an estimate of the variance of the data on the log scale by n i=1 ln(y i + 0.5)/(n − 1), where y i is the observed count at time i = 1, . . . , n. For the Tokyo rain example, we estimated the variance of the data on the logit scale as n i=1 logit((y i + q i )/m i ))/(n − 1), where y i is the number of years with rain on day i out of m i possible years, and q i = 0.005I y i =0 − 0.005I y i =1 + 0I y i {0,1} , where I is an indicator function.
Suppose we have calculated ζ o1 , the hyperparameter for a first-order model given the corresponding average marginal standard deviation σ ref (θ o1 ) using Equation (E.6). If we wish to calculate the value of ζ o2 for a second-order model we can simply use Now suppose we have a model with n equidistant nodes and want to increase the density of the grid to kn nodes without changing the range of the grid. For a first-order model, Var(∆θ new ) = 1 k Var(∆θ), and for a second-order model Var(∆ 2 θ new ) = 1 k 3 Var(∆ 2 θ) (Lindgren and Rue, 2008;Sørbye and Rue, 2014). In terms of the hyperparameter for the global smoothing prior, for the first-order model ζ o1,new = k −1/2 ζ o1 , and for the second-order model ζ o2,new = k −3/2 ζ o2 .

F Prior Sensitivity
We tested the sensitivity of the three prior formulations (normal, Laplace, and horseshoe) to the value of the hyperparameter (ζ) which controls the scale of the distribution on the smoothing parameter γ, where γ ∼ C + (0, ζ) A smaller value of ζ constricts γ to be closer to zero, which in turn constricts the scales of the priors on the order-k differences. We tested three levels for the hyperparameter: a) ζ = 1, b) ζ = 0.01, and c) ζ = 0.0001. In general, we expect noisier data sets should be more sensitive to prior settings. The coal mine disaster data offered a good test set because the observations are relatively noisy. Clearly the horseshoe prior was the most sensitive to the level of ζ ( Figure F.1). In particular, the horseshoe results for ζ = 1 looked more like those for the other two models in Figure 4, but when ζ = 0.0001, the horseshoe produced more defined break points and straighter lines with narrower BCIs compared to the results with ζ = 0.01.

G Computational Efficiency
To compare SPMRF and GMRF models' computational efficiency, we calculated the effective number of posterior samples per second of computation time (ESSps) for different model formulations and data configurations. We used the scenario with a piecewise-constant expected value from our main simulations (Section 3) to test the effect of model type, model order, and number of grid cells (n) on the ESS per second of sampling time. Here sampling time is defined as the total run time minus the time spent in the adaptation (warm-up or burn-in) phase, where time is measured in seconds of CPU time. We also calculated the ESSps for the coal mining and Tokyo rainfall examples.
There were three simulated scenarios with piecewise-constant trend: 1) order-1 with n = 100 observations and grid points (one observation per grid point), 2) order-1 with n = 200, and 3) order-2 with n = 100. The observations in these scenarios were normally distributed with standard deviation σ = 4.5. For each of these scenarios we ran 4 independent chains each with 1,000 iterations of burn-in and 2,500 iterations post burn-in thinned at every 5 for a total of 2,000 retained posterior samples combined across chains. The maximum ESS would therefore be 2,000 for these scenarios. The chains were run in sequence so that the total time (TCPU) sampling times (SCPU) times are the respective cumulative times across 4 chains. For the coal mining and Tokyo rainfall examples we used the same settings for number of iterations and thinning as was used in the main text (Section 4). We calculated effective sample size using the methods described in the documentation for stan (Stan Development Team, 2015b). Tests indicated that doubling the number of grid points (with a single observation per grid point) resulted in approximately 60% fewer ESSps for each model formulation (63% fewer for both the Normal and Laplace and 53% for the Horseshoe), and changing from a first-to secondorder model resulted in approximately 90% fewer ESSps (91% fewer for the Normal, 90% for the Laplace, and 83% for the Horseshoe). On average across the five scenarios investigated, the Laplace formulation resulted in 77% fewer ESSps compared to the Normal formulation, and the Horseshoe resulted in 90% fewer ESSps. In terms of sampling times, the Laplace formulations on average took 4.8 times longer to achieve the same number of effective samples as the Normal formulations (range: 2.5 to 5.7 times longer), and the Horseshoe formulations took an average of 12.2 times longer than the Normal (range: 4.6 to 17.4).