Bayesian adaptive B-spline estimation in proportional hazards frailty models

Abstract: Frailty models derived from the proportional hazards regression model are frequently used to analyze clustered right-censored survival data. We propose a semiparametric Bayesian methodology for this purpose, modeling both the unknown baseline hazard and the density of the random effects using mixtures of B-splines. The posterior distributions for all regression coefficients and spline parameters are obtained using Markov Chain Monte Carlo (MCMC). The methodology permits the use of weighted mixtures of parametric and nonparametric components in modeling the hazard function and frailty distribution; in addition, the spline knots may also be selected adaptively using reversible-jump MCMC. Simulations indicate that the method produces smooth and accurate posterior hazard and frailty density estimates. The Bayesian approach not only produces point estimators that outperform existing approaches in certain circumstances, but also offers a wealth of information about the parameters of interest in the form of MCMC samples from the joint posterior probability distribution. We illustrate the adaptability of the method with data from a study of congestive heart failure.


Introduction
The effects of clustering in survival data are commonly addressed using a generalization of the Cox proportional hazards regression model (Cox, 1972) known as a proportional hazards frailty model or, for the purposes of this paper, a frailty model. Specifically, dependence among cluster members is assumed to arise through an unobserved random effect, shared by all members of a cluster, that acts multiplicatively on the hazard function (Clayton and Cuzick, 1985). The corresponding marginal hazard regression model no longer exhibits a proportional hazards structure; as a result, it is no longer possible to make direct use of the partial likelihood function (Cox, 1975) in obtaining estimates of the regression parameters. However, under the assumption that the distribution of the random effect is a member of a specified parametric family of distributions (e.g., a gamma distribution) and depending on the model specification for the baseline hazard function, one may fit either a fully parametric or semiparametric model to the data (e.g., Andersen et al., 1993, Ch. IX), permitting likelihood-based estimation of all model parameters.
Extending and building upon these frequentist maximum-likelihood methods, Bayesian approaches to the frailty model have also emerged. For example, Clayton (1991) and Aslanidou, Dey and Sinha (1998) propose modeling the baseline hazard function in a gamma frailty model with processes that respectively have independent and correlated increments; see Ibrahim, Chen and Sinha (2001) and Müller and Quintana (2004) for a review. The resulting joint posterior distribution, though generally more computationally expensive to obtain than its (asymptotic) frequentist counterpart, is often better able to accommodate unusual data features and contains a wealth of information in the form of marginal and conditional posterior distributions for all parameters of interest. However, the available methods often fail to be semiparametric, involving, for example, piecewise-constant baseline hazard specifications that depend on some discretization of time.
Whether inference is frequentist or Bayesian, the common practice of specifying a parametric family of random effect distributions is a frequently cited disadvantage of the frailty model. For example, the results of Kosorok, Lee and Fine (2004, Prop. 6) show that misspecification of the frailty density can result in biased estimation of regression coefficients, with the direction of this bias being unpredictable. To counteract drawbacks related to frailty density misspecification, Naskar (2008) proposes to model the baseline hazard function nonparametrically and specifies the distribution of the frailty as a Dirichlet process. In order to carry out frequentist estimation in this model, Naskar (2008) employs a finite dimensional approximation to this process that amounts to the specification of a random discrete probability measure in which the number of point masses is (where feasible) set equal to the number of clusters. Earlier, Walker and Mallick (1997) proposed a related Bayesian methodology in which the baseline hazard function is specified similarly to Aslanidou, Dey and Sinha (1998) and the natural log of the frailty is assumed to follow an unspecified probability distribution F with a Pólya tree prior distribution. In theory, the Pólya tree specification corresponds to a continuous prior distribution; in practice, its use involves the specification of a finite binary partition of the support of the frailty distribution. A poorly chosen partition has the potential to obscure important features of the frailty distribution, or even introduce spurious ones.
In addition to parametric specification of the frailty distribution, there are some potential disadvantages to using a semiparametric model specification for the hazard function in the case of the frailty model. For example, Barker and Henderson (2005) conjecture that the use of a semiparametric model leads to systematic underestimation of the frailty variance because the survival time data primarily influence the estimation of model parameters through their rank ordering, rather than through their actual magnitudes. In a frequentist context, Barker and Henderson (2005) also provide some support for this conjecture by demonstrating that the use of local polynomial smoothing methods can reduce the bias in the estimation of the frailty variance. It is not known to the authors whether an analogous drawback holds in the case of Bayesian approaches to semiparametric frailty models.
Collectively, the observations made above suggest consideration of methodology that allows for highly flexible but smooth specifications of both the frailty density and baseline hazard function. In his Ph.D. thesis, Komárek (2006) presents a Bayesian nonparametric approach to clustered survival data, based on the accelerated failure time model formulation of Pan (2001) and the random effect density estimation methodology of Ghidey, Lesaffre and Eilers (2004). Both the random effects density and the event time distribution are modeled as smooth mixtures of G-splines, with computations facilitated by the use of the Gibbs sampler. The resulting error and frailty distributions are smooth, allowing for easy visualization and interpretation. In this paper, we propose a related model that remains within the proportional hazards framework (i.e., conditionally upon the frailty) while implementing the desirable features of the formulation in Komárek and Lesaffre (2008). Specifically, we propose to model the baseline hazard as a penalized mixture of B-splines and the frailty density as a penalized mixture of normalized B-splines. Regarding the respective penalties as corresponding to particular prior specifications and imposing prior distributions on various other model parameters, we consider a Bayesian approach to estimation using Markov Chain Monte Carlo (MCMC). Our model formulation and the deconvolution approach bear some resemblance to those of Staudenmayer, Ruppert and Buonaccorsi (2008) and Ruppert, Nettleton and Hwang (2007). The resulting posterior estimates of the baseline hazard and frailty density are smooth and accurate, and can correctly identify unusual frailty densities and baseline hazard forms given sufficiently large samples.
Additional model flexibility is achieved in two ways. First, inspired by the work of Hjort and Glad (1995) and Hjort and Jones (1996), we permit the inclusion of parametric components that can incorporate prior knowledge about the form of the baseline hazard function and frailty density; as shown later, the resulting spline component can be interpreted as an (additive) smooth but local departure from the specified parametric model. Second, we allow the number and position of knots for the hazard and frailty B-spline bases to be chosen in a data-dependent manner, facilitated here using a reversible jump MCMC procedure. These extensions respectively improve the performance of the proposed methods in situations when the hazard or frailty density can be well-modeled by standard parametric forms or are distinctly non-smooth in nature.
The proposed adaptive knot selection procedure is related to the work of Denison, Mallick and Smith (1998) and Biller (2000). Such "free-knot" spline methods are popular in Bayesian curve-fitting and nonlinear regression (e.g., Smith and Kohn, 1996; DiMatteo, Genovese and Kass, 2001; Lindstrom, 2002). In the survival analysis setting, Mallick, Denison and Smith (1999) have suggested a Bayesian MARS approach building on the methods of Kooperberg, Stone and Truong (1995) and LeBlanc and Crowley (1999) for nonparametrically modeling regression coefficient effects, but found that the associated computational complexity made it impractical. Retaining a semiparametric effects structure and modeling the baseline hazard and frailty density nonparametrically allows great model flexibility while remaining practical for use with large data sets.
The remainder of this paper is organized as described below. In Section 2, we introduce the necessary notation, develop the basic model structure and the aforementioned extensions, and obtain the log-posterior parameter density under an appropriate noninformative right censoring scheme. We summarize the posterior sampling algorithm and key implementation details in Section 3. Section 4 contains illustrative simulation results, and discusses the relative merits of our approach and existing ones. In Section 5, we use our methodology to analyze data from a study of congestive heart failure patients, to demonstrate the flexibility and utility of the approach in a practical setting. We conclude with a brief discussion in Section 6. Appendices A and B respectively contain examples of important penalties and useful parameterizations of parametric hazard functions and frailty distributions. The accompanying supplementary materials (Sharef et al., 2010) to this paper contain an expanded version of Section 3, including the various conditional posteriors used in our Gibbs sampling algorithm and adaptive knot selection stages. The methodology described herein has been implemented in the R package splinesurv, available from CRAN.

Remark:
We adopt a Bayesian philosophy in the development of estimation and inference procedures in this paper. However, we also take the pragmatic view that the proposed methods provide a valuable tool for frequentist-based inference. For example, in the simulations of Section 4, we regard the target of estimation as being the combination of a fixed but unknown set of regression parameters, a baseline hazard function, and a frailty density and assess the frequentist performance of several relevant posterior summaries.

Model structure
The basic proportional hazards frailty model, described in Section 2.1, utilizes B-spline formulations for both the baseline hazard and frailty probability density function, along with prior specifications that may be used to encourage smoothness. In Section 2.3.1, this model is extended by considering convex combinations of the aforementioned B-splines and (as appropriate) a parametric hazard or density function. The resulting model, which retains the form of a proportional hazards frailty model, permits (but does not force) shrinkage towards a specific parametric hazard function or frailty distribution and allows one to incorporate specific prior knowledge. We describe our methods for adaptive selection of the number and placement of the B-spline knots in Section 2.3.2. Each of the baseline hazard and frailty density curves can then be specified as some combination of a basic model and its optional extensions, indexed by a variety of priors and options. The result is a flexible family of models that allows prior knowledge about the form and smoothness of either curve to be incorporated into the model fit to the desired extent.

The proportional hazards frailty model
We assume that the observed data structure includes information on a failure time outcome and covariate information for m independent clusters of J_i ≥ 1 subjects, i = 1, ..., m. In the absence of any process that censors the observation of failure times, we suppose that each subject (i, j), i = 1, ..., m, j = 1, ..., J_i, experiences a failure (i.e., event) at time X_ij, and may have available p-dimensional fixed covariates Z_ij. We allow for the possibility of correlation between subjects within the same cluster by introducing a set of m positive cluster-level random effects, or frailties, U_i, i = 1, ..., m. The frailties U_1, ..., U_m are assumed to be independent and identically distributed with some density f(·) that has mean one; in addition, for i = 1, ..., m, subjects within the i-th cluster are assumed to be conditionally independent of each other given U_i. Given a baseline hazard function λ_0(·), a p-dimensional vector of regression coefficients β, the frailties U = (U_1, ..., U_m)^T and covariate information Z = {Z_ij, j = 1, ..., J_i; i = 1, ..., m}, we further assume that the distribution of X_ij, j = 1, ..., J_i, can be specified via the proportional hazards model

    λ_ij(t | U_i, Z_ij) = U_i λ_0(t) exp(β^T Z_ij),  t ≥ 0.    (2.1)

Evidently, if the variance of the frailty probability density f(·) is zero, then P{U_1 = U_2 = ... = U_m = 1} = 1 and all subjects are mutually independent; otherwise, members of the same cluster are marginally dependent. In typical applications, the observation of failure times on one or more subjects may be censored. To allow for the possibility of right-censoring, we assume that the follow-up data observed on subject (i, j) take the form T_ij = min{X_ij, C_ij} and δ_ij = I(X_ij ≤ C_ij), where C_ij, j = 1, ..., J_i; i = 1, ..., m are potential censoring times. Censoring is assumed to be independent and noninformative in the same sense as Nielsen et al. (1992) (i.e., given λ_0(·) and β).
Let T and δ denote the vectors of failure times and censoring indicators corresponding to the covariates Z and frailties U defined above. Under the indicated assumptions and conditionally upon Z, the "full data" likelihood function for (β, λ_0) may then be written

    L(β, λ_0 | T, δ, U, Z) = ∏_{i=1}^m f(U_i) ∏_{j=1}^{J_i} [U_i λ_0(T_ij) exp(β^T Z_ij)]^{δ_ij} exp{−U_i Λ_0(T_ij) exp(β^T Z_ij)},    (2.2)

where Λ_0(t) = ∫_0^t λ_0(s) ds (e.g., Andersen et al., 1993, Ch. IX).
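To fix ideas, the data structure and likelihood can be sketched numerically. The following Python/NumPy snippet simulates clustered right-censored data from model (2.1) and evaluates the conditional portion of the full-data log-likelihood (2.2); the Weibull baseline and gamma frailty used here are illustrative stand-ins (the paper models these nonparametrically), and all parameter values are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)

# Model (2.1): conditional hazard U_i * lambda_0(t) * exp(beta' Z_ij).
# For illustration only, take a Weibull baseline lambda_0(t) = a * t^(a-1),
# so that Lambda_0(t) = t^a.
a = 1.5
beta = np.array([0.7])
m, J = 50, 4                                  # m clusters, J subjects each

U = rng.gamma(shape=4.0, scale=0.25, size=m)  # mean-one frailties
Z = rng.normal(size=(m, J, 1))                # one covariate per subject
risk = U[:, None] * np.exp(Z @ beta)          # U_i * exp(beta' Z_ij)

# Inverse-transform sampling: X_ij has conditional cumulative hazard risk * t^a.
X = (-np.log(rng.uniform(size=(m, J))) / risk) ** (1.0 / a)
C = rng.exponential(2.0, size=(m, J))         # independent right-censoring
T = np.minimum(X, C)
delta = (X <= C).astype(float)                # 1 if the event was observed

def full_data_loglik(T, delta, Z, U, beta, a):
    """Log of the 'full data' likelihood (2.2), conditionally on Z and U
    (the frailty-density factor f(U_i) is omitted here for brevity)."""
    lin = np.exp(Z @ beta)
    lam0 = a * T ** (a - 1.0)
    Lam0 = T ** a
    return float(np.sum(delta * np.log(U[:, None] * lam0 * lin)
                        - U[:, None] * Lam0 * lin))
```

As expected, the log-likelihood evaluated at the data-generating β is larger than at a distant value.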

Spline specifications
As indicated earlier, we intend to use splines for modeling λ_0(·) and f(·). Initially, we propose to model λ_0(·) as a non-negative linear combination of K_λ B-spline basis functions B_λk(·), parametrized for example as per de Boor (2001). The splines are of order Q_λ, defined on N_λ = K_λ − Q_λ interior knots ξ_λ distributed over the range of the event times. The splines are indexed by parameters θ_λ, with the weight associated with each spline basis function being given by w_λk = e^{θ_λk}. That is, for t ≥ 0, the baseline hazard and cumulative hazard functions are assumed to have the representations

    λ_0(t|θ_λ) = Σ_{k=1}^{K_λ} e^{θ_λk} B_λk(t),   Λ_0(t|θ_λ) = Σ_{k=1}^{K_λ} e^{θ_λk} ∫_0^t B_λk(s) ds.    (2.3)

Similarly, the frailty density f(·) is initially modeled as a convex combination of normalized B-spline basis functions B̃_uk(·) of order Q_u defined on knots ξ_u over a sufficiently large range, where

    B̃_uk(·) = B_uk(·) / ∫ B_uk(s) ds.

The splines are indexed by parameters θ_u, and both the B-splines and weights are normalized to ensure that the density integrates to 1. That is, given θ_u, we assume the frailties U_i are independent and identically distributed with probability density function f(·|θ_u), where

    f(x|θ_u) = Σ_{k=1}^{K_u} w̃_uk B̃_uk(x),   w̃_uk = e^{θ_uk} / Σ_{l=1}^{K_u} e^{θ_ul},    (2.4)

for x ≥ 0, and f(x|θ_u) is zero otherwise. In order to ensure that (2.4) has mean one, it is further assumed that θ_u satisfies the additional constraint Σ_k w̃_uk e_uk = 1, where e_uk = ∫ s B̃_uk(s) ds denotes the mean of the k-th normalized basis function. In a frequentist approach to parameter estimation, one would now maximize the marginal likelihood function that is obtained from (2.2) by integrating out U. In practice, the EM algorithm or some related variant is typically used for this purpose (e.g., Andersen et al., 1993, Ch. IX).
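A minimal numerical sketch of the representations (2.3) and (2.4), with the B-spline basis implemented directly via the Cox-de Boor recursion (de Boor, 2001); the knot layout and coefficient values are arbitrary illustrations, and the basis is evaluated on a grid rather than in closed form:

```python
import numpy as np

def bspline_basis(x, knots, q):
    """Order-q B-spline basis functions (Cox-de Boor recursion) at points x."""
    x = np.asarray(x, dtype=float)
    # Order 1: indicator functions of the knot intervals.
    B = np.array([(knots[k] <= x) & (x < knots[k + 1])
                  for k in range(len(knots) - 1)], dtype=float)
    for r in range(2, q + 1):
        Bn = np.zeros((len(knots) - r, x.size))
        for k in range(len(knots) - r):
            d1 = knots[k + r - 1] - knots[k]
            d2 = knots[k + r] - knots[k + 1]
            if d1 > 0:
                Bn[k] += (x - knots[k]) / d1 * B[k]
            if d2 > 0:
                Bn[k] += (knots[k + r] - x) / d2 * B[k + 1]
        B = Bn
    return B                     # K = len(knots) - q rows, one per basis function

q = 4                                             # cubic splines (order Q = 4)
interior = np.linspace(0.0, 10.0, 8)              # knots over the time range
knots = np.r_[[0.0] * (q - 1), interior, [10.0] * (q - 1)]   # clamped ends
t = np.linspace(0.0, 10.0, 2001)[:-1]
Bh = bspline_basis(t, knots, q)
K = Bh.shape[0]

# Hazard (2.3): weights w_k = exp(theta_k) keep lambda_0 nonnegative.
theta_lam = np.linspace(-1.0, 1.0, K)
lam0 = np.exp(theta_lam) @ Bh

# Frailty density (2.4): convex combination of *normalized* basis functions.
dx = t[1] - t[0]
areas = ((Bh[:, 1:] + Bh[:, :-1]) / 2 * dx).sum(axis=1)   # integral of each B_k
theta_u = np.zeros(K)
w = np.exp(theta_u) / np.exp(theta_u).sum()               # weights sum to one
f = w @ (Bh / areas[:, None])                             # integrates to one
```

The additional mean-one constraint on θ_u is not imposed here; in the paper it is enforced separately during sampling.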

Bayesian modeling
The Bayesian approach to estimation for the frailty model treats U as an additional unknown set of parameters; upon introducing prior distributions for all unknown model parameters, inferences for specific parameters (including U) may be derived from the appropriate marginal posterior distribution. In the present context, a Bayesian approach requires one to specify the prior distribution for the regression parameters β and spline parameters θ = (θ_λ, θ_u).
We assume that β, θ_λ, and θ_u are independent. The prior distribution for β is assumed to be multivariate Gaussian (e.g., Staudenmayer, Ruppert and Buonaccorsi, 2008),

    π(β|σ²_β) ∝ exp{−β^T β / (2σ²_β)};    (2.5)

the prior distributions for θ_λ and θ_u also have a Gaussian structure, but may incorporate a specific penalty to induce smoothness in the B-spline coefficients and avoid overfitting. Specifically, denoting the penalty functions as p_λ(θ_λ) and p_u(θ_u), the corresponding priors may be written

    π(θ_λ|σ²_λ) ∝ exp{−p_λ(θ_λ) / (2σ²_λ)},    (2.6)
    π(θ_u|σ²_u) ∝ exp{−p_u(θ_u) / (2σ²_u)}.    (2.7)

The penalty functions p_u(·) and p_λ(·) may be chosen to follow a simple quadratic form, as in (2.5), to penalize second differences in the parameters θ_u, θ_λ or the integrated squared second derivative of the spline, or to impose some other smoothness criterion. Examples of possible penalty functions are presented in Appendix A. Lastly, we assume inverse-gamma priors for the prior variance parameters σ²_β, σ²_λ, and σ²_u; for example, we have

    π(σ²_β | α_β) ∝ (σ²_β)^{−(α_β1 + 1)} exp{−α_β2 / σ²_β}    (2.8)

for a fixed hyperparameter α_β = (α_β1, α_β2)^T, and with analogous definitions for the priors on σ²_λ and σ²_u. These prior distributions may be quite diffuse.

Remark: The prior distributions (2.6) and (2.7) corresponding to specific choices of the penalty functions p_u(·) and p_λ(·) may be improper. Improper priors, while not uniformly accepted among Bayesians, are commonly used in statistics. There are many reasons for this; Ghosh, Delampady and Samanta (2006, Ch. 5) provide a nice discussion. In our setting, the benefits of using an improper prior for either (2.6) or (2.7) stem mainly from the ability to incorporate smoothness and related qualitative criteria in a natural way. Suitably noninformative choices, provided one is led to a proper posterior distribution, may also lead to results that correspond more closely to those that would be obtained using a related frequentist procedure (e.g., penalized likelihood).
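One concrete choice among the penalties catalogued in Appendix A is a second-difference penalty. The sketch below (illustrative; the quadratic form and its use in a Gaussian-form log-prior follow the structure of (2.6)-(2.7)) shows that a smooth coefficient vector is penalized far less than a wiggly one:

```python
import numpy as np

def second_diff_penalty(theta):
    """Roughness penalty p(theta) = || D theta ||^2, with D the
    second-difference operator; plugged into (2.6) or (2.7) this yields a
    Gaussian-form smoothing prior,
    log pi(theta | s2) = -p(theta) / (2 * s2) + const."""
    d2 = np.diff(np.asarray(theta, dtype=float), n=2)
    return float(d2 @ d2)

smooth = np.linspace(0.0, 1.0, 10)        # linear trend: second differences ~ 0
wiggly = np.array([0.0, 1.0] * 5)         # alternating coefficients

# Corresponding (unnormalized) log-prior at a given variance parameter:
s2 = 0.1
log_prior_smooth = -second_diff_penalty(smooth) / (2.0 * s2)
log_prior_wiggly = -second_diff_penalty(wiggly) / (2.0 * s2)
```

Larger penalties translate directly into lower prior support, so the prior shrinks the fitted spline towards smooth shapes.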
However, because the use of improper priors can also create difficulties with the behavior and convergence of MCMC schemes (Robert and Casella, 2004), care is required in the selection of these penalty functions. Under the proposed model specification, and with σ denoting the set of prior variance parameters (σ²_β, σ²_λ, σ²_u), the log-posterior density (up to a constant) for the parameters of interest may be written

    ℓ(U, θ, σ | T, δ, Z) = Σ_{i=1}^m Σ_{j=1}^{J_i} [δ_ij log{U_i λ_0(T_ij|θ_λ) e^{β^T Z_ij}} − U_i Λ_0(T_ij|θ_λ) e^{β^T Z_ij}]    (2.9)
        + Σ_{i=1}^m log f(U_i|θ_u)    (2.10)
        + log π(β|σ²_β) + log π(θ_λ|σ²_λ) + log π(θ_u|σ²_u)    (2.11)
        + log π(σ²_β|α_β) + log π(σ²_λ|α_λ) + log π(σ²_u|α_u),    (2.12)

with the terms in (2.9) and (2.10) corresponding to the likelihood (2.2) with f(·) replaced by (2.4), the terms in (2.11) corresponding to the prior distributions (2.5)-(2.7), and the terms in (2.12) corresponding to the inverse-gamma prior distributions for σ²_β, σ²_λ, and σ²_u (e.g., see (2.8)).

Shrinking towards parametric models
One advantage of the Bayesian approach is the possibility of supplementing the observed data with prior knowledge. In fact, the selection of a specific set of prior distributions for use in (2.9)-(2.12) implicitly defines a target towards which the posterior means of (2.3) and (2.4) are respectively shrunk. For example, the use of a mean-zero multivariate Gaussian prior distribution shrinks the baseline hazard function (2.3) towards a constant function. Such targets have little influence on the model fit in cases where the data contain substantial information about the shape of the underlying curve. However, in the absence of strong guidance from the data, these implicit targets can play a more significant role, and it may be advantageous to define a more explicit target for shrinkage. Unfortunately, it is not easy to define an informative prior on the spline parameters θ that induces shrinkage towards a specified parametric target. We therefore propose a modification to the spline models of Section 2.2 that permits the inclusion of a parametrically specified basis function. Specifically, we propose to replace the baseline hazard (2.3) with

    λ_0(t|θ_λ, η_λ, φ_λ) = φ_λ Σ_k e^{θ_λk} B_λk(t) + (1 − φ_λ) λ_0p(t|η_λ),    (2.13)

where φ_λ ∈ [0, 1] and λ_0p(·|η_λ) and Λ_0p(·|η_λ) respectively denote the hazard and cumulative hazard functions for a specified parametric family of probability distributions (e.g., Weibull or lognormal). Similarly, we propose to replace the frailty density in (2.4) with

    f(x|θ_u, η_u, φ_u) = φ_u Σ_k w̃_uk B̃_uk(x) + (1 − φ_u) f_p(x|η_u),    (2.14)

where φ_u ∈ [0, 1] and f_p(·|η_u) denotes the density function for a specified parametric family of probability distributions (e.g., gamma or lognormal). The baseline hazard function in (2.13) may be suggestively rewritten as

    λ_0(t|θ_λ, η_λ, φ_λ) = λ_0p(t|η_λ) + φ_λ {Σ_k e^{θ_λk} B_λk(t) − λ_0p(t|η_λ)},

with analogous representations available for the cumulative hazard and frailty density. Consequently, we may view the proposed model extension as being equivalent to using a parametric baseline hazard while allowing for the possibility of a smooth but local deviation.
The prior on the weight φ_λ can be used to specify the degree of confidence in the parametric component. For example, a prior favoring small values of φ_λ ensures that the fit will shrink more towards the parametric specification in data-poor circumstances. The use of Beta priors on the weights φ = (φ_λ, φ_u) with fixed hyperparameters α_φλ, α_φu provides a simple but sensible choice. For example, a nonuniform prior such as Beta(1, 2) (triangular) ensures that the weight of the nonparametric component shrinks if it does not capture much information beyond that captured by the parametric portion. Alternatively, a Beta(2, 1) prior may be used to place additional weight on the nonparametric component if an unusual structure is suspected. Fixing φ = (1, 1) reduces the model to that considered in Section 2.1, whereas φ = (0, 0) leads to a purely parametric Bayesian survival model. Priors for the parametric terms η_λ and η_u necessarily depend on the desired parametric form. Denote the priors by π_λ(η_λ|τ_ηλ) and π_u(η_u|τ_ηu), where τ_ηλ, τ_ηu index the prior distributions and may themselves have priors depending on fixed hyperparameters, say, α_η = (α_ηλ, α_ηu). In practice, we have found it effective to parametrize the distributions λ_0p, f_p in a way that permits Gaussian priors. Some example choices for common parametric forms are presented in Appendix B.
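The convex-combination structure and its "local deviation" rewriting are easy to verify numerically. In the sketch below the spline and parametric parts are arbitrary stand-in curves, and the weight convention follows the text (φ = (1, 1) recovers the pure spline model, so φ weights the nonparametric component):

```python
import numpy as np

t = np.linspace(0.01, 10.0, 500)
spline_part = 0.2 + 0.05 * np.sin(t)     # stand-in for the B-spline sum in (2.3)
a = 1.5
param_part = a * t ** (a - 1.0)          # e.g. a Weibull baseline hazard
phi = 0.3                                # weight on the nonparametric component

mixture = phi * spline_part + (1.0 - phi) * param_part       # form (2.13)
local_dev = param_part + phi * (spline_part - param_part)    # rewritten form
```

The two expressions agree identically, and the endpoints φ = 1 and φ = 0 recover the pure spline and pure parametric hazards, respectively.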
The log of the resulting posterior density (up to a constant) is then given by

    ℓ(U, θ, σ | η, φ, T, δ, Z)
        + log π(φ_λ | α_φλ) + log π(φ_u | α_φu)    (2.15)
        + log π_λ(η_λ | τ_ηλ) + log π_u(η_u | τ_ηu),    (2.16)

where ℓ(U, θ, σ | η, φ, T, δ, Z) is analogous to (2.9)-(2.12), with (2.13) and (2.14) substituted for the hazard and frailty curves. The terms in (2.15) correspond to the Beta priors on the weights, and the terms in (2.16) correspond to the priors on the parametric components.

Adaptive knot selection
Little attention has been paid in Sections 2.2 and 2.3.1 to the choice of the number of spline knots N_λ, N_u and their positions ξ_λ, ξ_u. Both have a profound effect on the smoothness of the estimated hazard curve or frailty density: including only a few widely-spaced knots generally leads to very smooth curves, whereas for multiple knots in close proximity, smoothness has to be enforced by a penalty in order to avoid the risk of overfitting.
In the work of Staudenmayer, Ruppert and Buonaccorsi (2008) and Komárek and Lesaffre (2008), the number of knots is fixed and their positions are distributed evenly over the range of the data. If the underlying curve is smooth, this approach yields excellent results in conjunction with smoothing penalties in the priors of eq. (2.6) and (2.7), provided that the knots and smoothing penalties are well-chosen. However, in the survival analysis setting, such specifications can be more challenging because the hazard function is really only observable through the uncensored event times.
Most forms of penalized smoothing act globally over the range of the data; however, there is often no a priori reason to suspect that all regions of the function(s) being modeled are similarly smooth. One way to relax the imposition of global smoothness in spline smoothing is to permit the data to help determine the "best" choices for both the number and position of the knots. An advantage of the Bayesian approach is that the number and positions of knots may be treated as additional parameters to be estimated. Mallick, Denison and Smith (1999) propose a Bayesian MARS method in which the hazard is a nonparametric function of the covariates, estimated by reversible-jump MCMC (Green, 1995) following Denison, Mallick and Smith (1998). Biller (2000) introduced an approach for automatic knot selection for generalized linear models using natural cubic splines, in which the number of knots and spline weights were chosen by reversible-jump MCMC methods. These procedures make it possible to sample from the posterior of the model set consisting of different numbers and placements of knots, and corresponding different dimensionality of the spline parameters. However, such additional flexibility does not come for free: estimation via MCMC is more complicated in comparison with the fixed knot setting because the total parameter dimension is not fixed a priori, but changes across successive Monte Carlo iterations.
It is possible to extend the spline formulation of Section 2.1 in the manner described above while taking into account the added complexities of the survival model specification and frailty deconvolution. In addition to the steps required to sample from the posterior of the parameters introduced in the preceding sections, the adaptive knot selection procedure consists of three possible moves that affect the hazard and frailty density curves by changing their B-spline basis: the addition of a knot (birth step), the removal of a knot (death step), and the change in position of a knot (move step). We allow the N = (N_λ, N_u) knots ξ = (ξ_λ, ξ_u) to take positions on a larger, predetermined set of candidate positions, say ξ^c_λ = {ξ^c_λ1, ..., ξ^c_λM_λ} and ξ^c_u = {ξ^c_u1, ..., ξ^c_uM_u}. These candidate knot specifications constitute a prior on knot placement. For example, the M_λ candidate knots used for the baseline hazard might be selected as quantiles of the observed event times, making data-rich regions more likely to contain knots. During the birth step, an unoccupied candidate knot is added to the knot set; during the death step, an occupied knot is removed; and, during the move step, a knot relocates from its current position to a nearby candidate position.
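The bookkeeping for the three move types over a fixed candidate grid can be sketched as follows. The function names and the boolean occupancy representation are illustrative only, not the splinesurv implementation, and the acceptance probabilities of the actual sampler are omitted:

```python
import numpy as np

rng = np.random.default_rng(2)
event_times = rng.exponential(1.0, size=200)
M = 20
# Candidate positions at quantiles of the event times, so data-rich regions
# are more likely to receive knots.
candidates = np.quantile(event_times, np.linspace(0.02, 0.98, M))
occupied = np.zeros(M, dtype=bool)
occupied[np.linspace(0, M - 1, 6).astype(int)] = True     # current knot set

def birth(occupied, rng):
    """Add a knot at a randomly chosen unoccupied candidate position."""
    new = occupied.copy()
    new[rng.choice(np.flatnonzero(~occupied))] = True
    return new

def death(occupied, rng):
    """Remove a randomly chosen occupied knot."""
    new = occupied.copy()
    new[rng.choice(np.flatnonzero(occupied))] = False
    return new

def move(occupied, rng):
    """Relocate one knot to a free candidate slot between its neighbours."""
    occ = np.flatnonzero(occupied)
    i = rng.integers(len(occ))
    lo = occ[i - 1] if i > 0 else -1
    hi = occ[i + 1] if i < len(occ) - 1 else len(occupied)
    free = [j for j in range(lo + 1, hi) if j != occ[i]]
    if not free:
        return occupied                       # no legal relocation available
    new = occupied.copy()
    new[occ[i]] = False
    new[rng.choice(np.array(free))] = True
    return new
```

By construction, birth and death change the knot count by exactly one, while a move leaves it unchanged.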
When selecting the knots adaptively, the smoothness of the curve can be dictated by the number of knots and their positions, with no need for an additional smoothing penalty. In this case, the prior on the number of knots N plays a key role in specifying the smoothness. Denison, Mallick and Smith (1998) suggest placing Poisson priors on the number of knots, and simulation experiments conducted by Biller (2000) indicate that this gives good results in the context of nonlinear regression. The Poisson prior is strongly informative and allows great control over the smoothness of the resulting curve, at the risk of overfitting. In contrast, less informative priors such as a geometric or negative binomial can be used to penalize large numbers of knots and give preference to smoother curves. We consider these and other priors in the supplementary materials.
Let π(ξ|N) and π(N) respectively denote the prior distributions selected for knot position and parameter dimension. Then, the joint log-posterior density (up to a constant) is obtained by incorporating this prior information into (2.9)-(2.12); specifically, we have

    ℓ(U, θ, σ | T, δ, Z, ξ, N) + log π(ξ|N) + log π(N),

where ℓ(U, θ, σ | T, δ, Z, ξ, N) is as in (2.9)-(2.12). A parametric component may be included as well, in the same way as discussed in Section 2.3.1. Further details on the reversible-jump MCMC procedure are provided in Section 3.3.

Sampling from the posterior distribution
The algorithm used to sample from the joint posterior distribution consists of three types of steps: initialization, parameter updates via Gibbs sampling and Metropolis-Hastings MCMC, and, if desired, adaptive knot selection via reversible-jump MCMC. After initialization, posterior samples of all parameters can be drawn by ordinary and reversible-jump MCMC steps, repeated as long as needed to ensure convergence of the chain and a sufficient number of samples from the posterior. Figure 1 shows the structure of the algorithm. We briefly discuss initialization of the algorithm in Section 3.1 and the Metropolis-Hastings update steps in Section 3.2. The reversible-jump method used for knot selection is discussed in Section 3.3. Expanded versions of these sections are contained in the supplementary materials and in Sharef (2008).

Initialization
Even though the chain can in principle be initialized at any value, we find that good starting values hasten the convergence of the chain and reduce the risk of numerical problems. Fixing all hyperparameters α to be 0.01 and all variances σ to be 0.1 results in suitably noninformative priors. We have found that fitting a frequentist proportional hazards model (e.g., via coxph in R) provides good initial values for both the frailties U and regression coefficients β. For the hazard spline, we set the number of knots to N_λ = min(Σ_i J_i/4, 35) and distribute the knots evenly across the quantiles of the event times. The frailty density spline is specified on N_u = min(m/4, 35) knots distributed evenly across twice the range of the initial values of U as determined above, subject to positivity constraints. The spline and parametric coefficients θ_λ, θ_u, η_λ, η_u are initialized by maximizing their conditional log-likelihoods given the remaining terms.
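The default hazard-knot rule can be sketched as below. The quantile-based spacing follows the text; the function name and the exact handling of the interior quantile probabilities are illustrative choices:

```python
import numpy as np

def init_hazard_knots(event_times, cluster_sizes, cap=35):
    """Default of Section 3.1: N_lambda = min(sum_i J_i / 4, cap) knots,
    spread evenly across the quantiles of the event times."""
    n_knots = int(min(sum(cluster_sizes) // 4, cap))
    probs = np.linspace(0.0, 1.0, n_knots + 2)[1:-1]   # interior quantiles
    return np.quantile(np.asarray(event_times), probs)

cluster_sizes = [4] * 50                       # 50 clusters of 4 subjects each
times = np.random.default_rng(3).exponential(1.0, size=200)
knots = init_hazard_knots(times, cluster_sizes)
```

Here Σ_i J_i/4 = 50 exceeds the cap, so 35 knots are placed, concentrated where the event times are dense.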

Metropolis-Hastings MCMC steps
Once initial values have been obtained, the algorithm proceeds through the MCMC sampling loop of Fig. 1 for a specified number of iterations. Apart from the adaptive knot selection steps, the loop consists of successive Gibbs sampling steps, in which each set of parameters is updated in turn by Metropolis-Hastings. Cluster frailties U are updated independently, with candidates generated from gamma distributions centered at their current values. Regression coefficients β and curve coefficients θ_λ, η_λ, η_u are each updated with normal candidates. The frailty spline parameters θ_u are updated pairwise in order to ensure that the frailty density mean remains fixed at 1. That is, when generating a proposal vector, we select two elements, say θ_ur and θ_uv, to update; upon generating a new proposal value for θ_ur, the mean constraint implies the value of θ_uv. For the weights φ, a beta transition kernel is used; lastly, each of the variance parameters in σ² is sampled from its corresponding conditional posterior inverse-gamma distribution. Each of these steps depends on a distinct tuning parameter, set so as to maintain an acceptance rate of approximately 25%. Since setting so many tuning parameters by hand is impractical, they can instead be tuned automatically during the burn-in period. Further details on candidate generation and acceptance probabilities may be found in the supplementary materials and in Sharef (2008).
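The burn-in tuning idea can be illustrated on a toy random-walk Metropolis-Hastings sampler. The multiplicative tuning rule and the standard normal target below are a sketch of the general mechanism, not the scheme used in splinesurv:

```python
import numpy as np

def tune(scale, acc_rate, target=0.25):
    """One simple burn-in tuning rule (a sketch): widen the proposal when
    acceptance runs above the ~25% target, narrow it when below."""
    return scale * np.exp(acc_rate - target)

# Toy random-walk Metropolis-Hastings targeting a standard normal density.
rng = np.random.default_rng(4)
logpost = lambda x: -0.5 * x * x
x, scale, accepted = 0.0, 10.0, 0              # deliberately poor initial scale
n_iter, burn = 5000, 2000
for it in range(1, n_iter + 1):
    prop = x + scale * rng.normal()            # random-walk candidate
    if np.log(rng.uniform()) < logpost(prop) - logpost(x):
        x, accepted = prop, accepted + 1
    if it % 100 == 0 and it <= burn:           # adapt only during burn-in
        scale = tune(scale, accepted / it)
acc_rate = accepted / n_iter
```

Freezing the tuning parameter after burn-in, as above, keeps the post-burn-in chain a valid (non-adaptive) Markov chain.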

Reversible-jump MCMC for adaptive knot selection
Throughout the steps discussed in Section 3.2, the number of knots in the model, and hence the dimension of the spline parameters θ_λ, θ_u, has remained fixed. In order to enable adaptive knot selection, we not only allow knots to move, but also permit changes in dimension, such as adding a knot (birth step) or deleting a knot (death step). The basic procedure is identical for both the hazard function and frailty density; hence, we discuss the procedure below in general terms only, omitting subscripts that identify the parameters as referring to either curve.
As indicated above, adaptive knot selection requires three types of steps: the "move" step, in which the position of a single knot is changed to some new point between its neighboring knots; the "birth" step, in which a new knot is added after a randomly chosen knot and the dimension of the spline parameter θ increases; and the "death" step, in which a randomly chosen knot is removed and the dimension of the parameter decreases. The move step requires no dimension change, and a standard Metropolis-Hastings update is sufficient. However, the death and birth steps involve a change in model dimension; hence, we use reversible-jump MCMC subject to a "dimension-matching" constraint (Green, 1995). Typically, transitions between a model indexed by a parameter set θ of dimension k and a candidate model indexed by parameters θ̃ of dimension k̃ are accomplished by generating m uniform random numbers u and computing the candidate by a deterministic function θ̃ = θ̃(θ, u). For the reverse move, one generates m̃ random numbers ũ and computes the candidate as θ = θ(θ̃, ũ). To ensure reversibility, the mapping between (θ, u) and (θ̃, ũ) must be bijective; in particular, the dimension-matching constraint m + k = m̃ + k̃ must hold.
Following Denison, Mallick and Smith (1998), at each iteration we choose randomly whether to execute a birth, death, or move step. Given N = n, the probabilities of taking a birth, death or move step are respectively b_n = c min{1, π_N(n + 1)/π_N(n)}, d_n = c min{1, π_N(n − 1)/π_N(n)}, and m_n = 1 − b_n − d_n, where the constant c controls the rate of dimension-changing steps and the relation b_n π_N(n) = d_{n+1} π_N(n + 1) ensures that birth and death steps are in balance. As in Denison, Mallick and Smith (1998), we take c = 0.4. We provide further details on the move step in Section 3.3.1, the birth step in Section 3.3.2, and the death step in Section 3.3.3.
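As a concrete illustration, under a Poisson(µ) prior π_N on the number of knots, the step-type probabilities can be sketched as follows (a hedged sketch, not the package code):

```python
import math

def step_probs(n, mu=10.0, c=0.4):
    """Birth/death/move probabilities given N = n knots and a Poisson(mu)
    prior pi_N, following Denison, Mallick and Smith (1998)."""
    pmf = lambda k: math.exp(-mu) * mu**k / math.factorial(k)
    b = c * min(1.0, pmf(n + 1) / pmf(n))   # for Poisson, ratio is mu/(n+1)
    d = c * min(1.0, pmf(n - 1) / pmf(n)) if n > 1 else 0.0
    return b, d, 1.0 - b - d                # remaining mass goes to the move step
```

By construction, b_n π_N(n) and d_{n+1} π_N(n+1) both equal c · min{π_N(n), π_N(n+1)}, which is the balance relation between birth and death steps.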

Knot position change (move step)
In the move step, a single knot position ξ_k to be moved is chosen uniformly from the set of interior knots, and changed to a random new candidate position located between its neighboring knots. That is, the candidate knot position ξ̃_k is selected uniformly from the set of candidate locations ξ_c such that ξ_{k−1} < ξ_c < ξ_{k+1}. Since the prior on the knot positions is discrete uniform over the set of candidate knots, the prior probabilities for the knots ξ and the candidate ξ̃ are identical. Since no dimension change is required, the new knot positions are accepted with the usual Metropolis-Hastings probability. The spline parameters θ remain unchanged.

Knot addition (birth step)
In the birth move, a random unoccupied candidate knot ξ_c is chosen to be added to the current set of knots ξ, of length N. Denote by k the interval of the current knot set containing ξ_c, so that ξ_k < ξ_c < ξ_{k+1}. The new candidate knot set is then given by ξ̃ = {ξ_1, . . . , ξ_k, ξ_c, ξ_{k+1}, . . . , ξ_N}, of length Ñ = N + 1. The set of spline coefficients θ of length K must be updated to a candidate set θ̃ of length K̃ = K + 1. There are simple rules for non-destructively inserting a new knot into a B-spline function (de Boor, 2001), but using these directly would violate the reversibility and dimension-matching constraints between the birth and death moves mentioned earlier. Since the birth move begins in a model of dimension K and its reverse begins at dimension K + 1, we need to generate an additional random number for the birth move. Intuitively, since removing a knot is a destructive procedure and may cause the shape of the curve to change, we must during the birth move be able to generate the set of curves that would reduce to the original curve upon removal of the new knot.
To do this, we compute the candidate spline parameters θ̃ for inserting a knot ξ_c ∈ (ξ_k, ξ_{k+1}) as follows:

θ̃_j = θ_j for j ≤ k; θ̃_j = r_j θ_j + (1 − r_j) θ_{j−1} for k + 1 ≤ j ≤ k + Q − 1; θ̃_{k+Q} = u θ_{k+Q} + (1 − u) θ_{k+Q−1}; θ̃_j = θ_{j−1} for j > k + Q, (3.1)

where r_j = (ξ_c − ξ_j)/(ξ_{j+Q} − ξ_j) and u ∼ U(0, 1). These rules correspond to the deterministic rules in de Boor (2001), except that the parameter θ̃_{k+Q} is perturbed by a random amount u, rather than by the knot ratio r_{k+Q}. The candidate number of knots Ñ and spline parameters θ̃ are then accepted with probability

min{1, R_L · R_P · R_T · |J|}, (3.2)

where R_L is the relevant likelihood ratio (see (2.17)), R_P is the prior ratio encompassing the priors on the number of knots and knot positions, R_T is the transition ratio and J is the Jacobian of the transformation in (3.1). Expressions for these terms may be found in the supplementary materials.
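The deterministic rules referenced here are classical B-spline knot insertion (Boehm's algorithm), which enlarges the knot vector and coefficient set while leaving the curve itself unchanged. A minimal sketch, with our own function name and indexing, checked against scipy's BSpline:

```python
import numpy as np
from scipy.interpolate import BSpline

def insert_knot(t, c, p, x):
    """Boehm's algorithm: insert knot x into the degree-p spline with
    knot vector t and coefficients c, leaving the curve unchanged."""
    span = np.searchsorted(t, x, side="right") - 1   # t[span] <= x < t[span+1]
    t_new = np.insert(t, span + 1, x)
    c_new = np.empty(len(c) + 1)
    for j in range(len(c_new)):
        if j <= span - p:
            c_new[j] = c[j]                          # untouched below the span
        elif j > span:
            c_new[j] = c[j - 1]                      # shifted above the span
        else:
            r = (x - t[j]) / (t[j + p] - t[j])       # blending ratio
            c_new[j] = r * c[j] + (1 - r) * c[j - 1]
    return t_new, c_new
```

The birth step replaces the last of these deterministic blending ratios with a random u ∼ U(0, 1), which is what makes the move reversible.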

Knot deletion (death step)
In the death step, a single knot ξ_k is chosen uniformly from the set of knots ξ to be removed. The candidate knot set for the next iteration is then ξ̃ = {ξ_1, . . . , ξ_{k−1}, ξ_{k+1}, . . . , ξ_N}. The spline parameters are correspondingly adjusted by the inverse of the transformation in (3.1): the parameter θ_{k+Q−1} is deleted, and the remaining parameters are recovered by solving the blending relations in (3.1) for the lower-dimensional coefficient set. Because the birth and death moves are symmetrically defined, the likelihood ratio, prior ratio, transition ratio and Jacobian determinant are the inverses of those in eq. (3.2).

Simulation studies
We implemented the methodology described in Section 3 in the R package splinesurv. In order to establish the performance and flexibility of the method, we conducted simulation studies under a variety of settings intended to investigate the capacity of the method to correctly identify the form of the underlying baseline hazard and frailty density, for different numbers of clusters and cluster sizes. Furthermore, we wished to show that the regression coefficients β and the frailty variance parameter ν 2 can be accurately estimated. We remark here that ν 2 is not an explicit model parameter but rather a functional of the frailty density; for example, under (2.4), ν 2 = ν 2 (θ u ), as given in (4.1); alternatively, one may replace (2.4) with (2.14). We focus on this particular summary since it is frequently of interest in analyses utilizing a parametric frailty model. We consider three scenarios within which to test the method, differing in the form of the "true" baseline hazard and frailty density used to generate simulated data. The first, referred to as the "Parametric" scenario, is characterized by a Weibull hazard and a lognormal frailty density. In the "Smooth" scenario, the baseline hazard is a smooth curve not easily described by typical parametric forms, and the frailty density is a mixture of two lognormal distributions. In the "Stepfunction" scenario, the baseline hazard is a discontinuous step function, and the frailty distribution is a mixture of uniform densities. Figure 2 contains plots of the hazard, survival and frailty curves in each of the three scenarios.
For purposes of the simulation, a replication consists of first generating frailties U i , i = 1 . . . m from the scenario's frailty density. A single covariate is generated for each subject as Z ∼ N (0, 1). The single regression coefficient is fixed at β = 1. Given the frailty and covariate, event times can then be generated using the baseline hazard for the scenario. Censoring times are independently generated from a Weibull hazard with parameters chosen for each scenario to yield approximately a 20% censoring rate. The simulated sample generated in this way can then be fit using the splinesurv package.
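A replication of this kind can be sketched as follows; the lognormal frailty and the specific Weibull parameters below are illustrative stand-ins for the scenario-specific choices described above:

```python
import numpy as np

def simulate_replication(m, J, rng, beta=1.0, lam=1.0, rho=1.5,
                         c_lam=0.25, c_rho=1.5):
    """One simulated dataset: shared cluster frailties, Z ~ N(0,1),
    beta = 1, Weibull(lam, rho) baseline hazard and independent
    Weibull censoring (parameter values are illustrative)."""
    cluster = np.repeat(np.arange(m), J)
    # lognormal frailty with mean exp(-0.125 + 0.5**2 / 2) = 1
    u = rng.lognormal(mean=-0.125, sigma=0.5, size=m)[cluster]
    z = rng.standard_normal(m * J)
    # invert the cumulative hazard H0(t) = lam * t**rho with E ~ Exp(1):
    e = rng.exponential(size=m * J)
    t_event = (e / (lam * u * np.exp(beta * z))) ** (1.0 / rho)
    t_cens = (rng.exponential(size=m * J) / c_lam) ** (1.0 / c_rho)
    time = np.minimum(t_event, t_cens)
    status = (t_event <= t_cens).astype(int)   # 1 = observed event
    return cluster, z, time, status
```

In the actual study, the censoring parameters would be tuned per scenario to yield approximately a 20% censoring rate.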

Curve fitting performance
To explore the effects of sample size on the quality of the curve fits, we first simulate a single dataset for various sample sizes, under each scenario, and explore the effect of different model specifications. We limit ourselves to four sample sizes for each scenario, setting the number of clusters to either m = 10 or m = 500, and the cluster size to either J i = 10 or J i = 500, i = 1 . . . m.
We also include a sample size and censoring profile matching the data to be analyzed in Sec. 5. The methodology is very flexible, offering a range of choices of penalty functions, parametric distributions, prior parameters, and the option of adaptive knot selection. For brevity, we select only one model specification for purposes of demonstrating curve fitting here. For both the hazard and frailty, we include a spline component only, using a simple Gaussian prior on the spline parameters (corresponding to the penalty function in Section A.1). We allow for adaptive knot selection with a truncated Poisson prior on the number of knots, with means µ λ = µ u = 10 and a maximum of 35 knots, and 100 candidate knots distributed uniformly over the range of the data. Each fit was run for a 2000-iteration burn-in, during which tuning parameters were chosen adaptively to ensure a parameter acceptance rate of approximately 25%, followed by 3000 iterations used for constructing posterior estimates.
The results are shown in Figure 3, where we report posterior means and 95% pointwise credible intervals; they suggest that the methodology functions as intended: in each scenario, the fitted models capture the features of the underlying hazard and frailty curves with sufficiently large samples. The number of clusters appears to have a more immediate effect on the quality of the fit than does the cluster size, especially for the frailty density. In order to obtain an accurate estimate of the frailty density, a large number of clusters is required, but these clusters need not be large. With few, large clusters, the form of the hazard can be identified, while the frailty distribution cannot. Hazard estimates in the Stepfunction scenario display sharp spikes at the points of discontinuity; this is an artifact caused by the use of cubic splines in a scenario where linear splines would have been better able to capture the discontinuity.
Following these single dataset analyses, we repeated this process 1000 times. Averaging the posterior mean estimates over multiple replications shows that the baseline hazard curve is estimated with great accuracy and only minimal smoothing bias (see Fig. 4). If the number of clusters is large, the frailty density can be well-estimated also. Further such simulation results (not shown here) indicate that fixed-knot penalized splines perform well in the Parametric and Smooth scenarios, but do quite poorly in the Stepfunction scenario, as the sharp trough cannot be captured without significant smoothing error. Further, the inclusion of a correctly specified parametric component improves curve-fitting performance in the Parametric scenario, and does not significantly affect the other scenarios.

Parameter estimation performance
In order to establish the ability of the procedure to estimate the regression parameter and the frailty variance, we conduct a simulation study at smaller sample sizes. Sample sizes under all three scenarios are limited to 10, 50 and 500 clusters of size 10, 50 or 500, excluding the largest combination. We further introduce a sample with the same cluster number and sizes, and censoring characteristics similar to those of the data set to be analyzed in Sec. 5. The scenarios are specified as before.
We consider six spline model specifications. The first two are fourth-order spline-only models with adaptively chosen knots and Poisson(10) or Poisson(20) priors, respectively, on the number of knots, as described for Figure 3. The third and fourth add parametric components, consisting of a Weibull baseline and a lognormal frailty curve, with a Beta(1, 2) prior on the weight, thus giving slight preference to the parametric component. We note here that these parametric components are correctly specified in the Parametric scenario but not otherwise. The fifth is a fixed-knot penalized spline model specified according to Section 2.1, with equally spaced knots and a penalty on the integrated squared second derivative, following Section A.3. The sixth is similar, but penalizes the squared second differences between the parameters, as per Section A.2. Penalized spline fits can be sensitive to the choice of hyperparameters; these were chosen here so as to give reasonably smooth curves in several test scenarios. We intentionally did not choose the "best" settings, but instead tried to select parameters as one might if the underlying curve were unknown.
In Bayesian estimation by MCMC, the collection of posterior samples contains a great deal of useful information. However, for the purposes of this simulation study, and to enable comparison with frequentist methodology, we construct point estimators from these posterior samples and summarize their frequentist performance. Natural estimators for the regression coefficient β include the mean and median of its marginal posterior distribution, estimated for each simulated dataset from the corresponding MCMC samples from the joint posterior distribution. Similarly, for the frailty variance ν 2 , one may utilize the posterior mean or median of (4.1), obtained for each simulated dataset using the corresponding MCMC samples from the posterior distribution of the parameters defining either (2.4) or (2.14).
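Given the MCMC draws of a scalar quantity such as β or ν 2 (θ u ), these point estimators, along with an equal-tailed credible interval, are simple functionals of the sample. An illustrative sketch:

```python
import numpy as np

def posterior_summaries(samples, level=0.95):
    """Point estimates and an equal-tailed credible interval from
    MCMC draws of a scalar parameter (e.g. beta or the frailty variance)."""
    alpha = 1.0 - level
    lo, hi = np.quantile(samples, [alpha / 2, 1 - alpha / 2])
    return {"mean": samples.mean(), "median": np.median(samples),
            "ci": (lo, hi)}
```

For ν 2, the draws themselves are obtained by evaluating the variance functional at each sampled set of frailty-density parameters.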
Tables 1 and 2 respectively contain the simulated biases of the posterior mean and median point estimators based on 1000 simulations for each of the spline model specifications. For comparison, we also fit a (misspecified) gamma frailty model as described in Therneau and Grambsch (2000) using the R routine coxph. In general, the results show that the proposed Bayesian modeling scheme yields good estimates of the regression coefficient, with the similarity between means and medians reflecting symmetry in the posterior. With few exceptions, the biases are acceptably small and similar to those of the regression coefficients estimated under the gamma frailty model. When additionally accounting for estimation of the frailty variance, we see that the adaptive methods perform very well in the Parametric setting; here, increasing the prior mean number of knots has no clear effect, as evidenced by the comparable performance of the Adapt(10) and Adapt(20) methods. In fact, most of the methods perform quite well in the Parametric setting across the range of sample sizes considered, a notable exception being the Penalized (2nd diff.) prior when estimating ν 2 . The story is somewhat less consistent outside of the Parametric setting, where the Penalized (2nd deriv.) prior arguably exhibits some robustness and turns in the best overall performance. The Penalized (2nd diff.) prior is observed to perform especially well in the Stepfunction (i.e., nonsmooth) setting and acceptably well in the Smooth setting; the adaptive methods also perform well across all settings with larger total sample sizes. The adaptive methods that include a parametric component perform admirably in estimating the regression coefficients in all cases. These also do very well in estimating the frailty variance in the Parametric setting (i.e., where both parametric components are correctly specified).
Outside of this setting, the degree of smoothness and sample / cluster size have a clear impact on performance.
In general, the results suggest that the Penalized (2nd deriv.) prior is the best overall choice unless one has strong prior information to suggest that the baseline hazard and frailty distributions are particularly smooth (in which case adaptive methods with or without a parametric start perform commendably well) or particularly nonsmooth (in which case Penalized (2nd diff.) is perhaps the best choice). The frequentist performance of the posterior mean and median tends to be comparable when estimating regression coefficients; the posterior median exhibits somewhat less bias than the mean when estimating the frailty variance, with the greatest benefits observed in the Smooth and Stepfunction settings when used in combination with adaptive knot selection.
As expected, the results for the gamma frailty model in Table 1 exhibit good performance for estimating the regression coefficient in all settings, despite a misspecified frailty distribution; however, relatively poor performance is observed for estimating the frailty variance. In our opinion, the results suggest that the use of the Penalized (2nd deriv.) prior offers an acceptable and generally robust tradeoff, with the potential for mildly increased bias in the regression coefficients being offset by substantial reductions in the bias of the estimated frailty variance.

Congestive heart failure data description
We consider data from a study conducted in a 487-bed, not-for-profit community hospital located in southeast Michigan. The study population consisted of patients with either systolic or diastolic heart failure, and was originally identified for the purposes of a randomized, controlled clinical trial designed to compare a pro-active case management strategy to standard care on all-cause re-hospitalizations. A planned secondary analysis was to determine prognostic factors for readmission or mortality.
Patients were eligible for the study if they were hospitalized on an internal or family medicine service between October 29, 2002 and September 20, 2003 and received intravenous diuretics to treat possible heart failure. Intervention patients were assessed by a cardiology nurse practitioner who developed a protocol-driven discharge plan that could include telemanagement, an outpatient nurse-run heart failure clinic, or usual care. All control patients were managed by the usual discharge planning activities of hospital staff. Of the 440 patients enrolled in the study, 17 died during the index hospitalization and were removed from the sample, resulting in a cohort of 423 patients. Unfortunately, approximately one-half of the patients assigned to the intervention arm were discharged prior to receiving the complete intervention. Using an intent-to-treat analysis (e.g., Fisher et al., 1990), no difference between the intervention and control groups was found for the outcome of all-cause subsequent hospitalizations or emergency department encounters.
Below, we proceed to re-analyze the data, defining the event of interest as a patient's first rehospitalization or death during the 180 day period following the index hospitalization. Of the 423 patients, 257 such events were observed, of which 233 are rehospitalizations and the remainder are deaths. The remaining 39.2% of patients are considered to be censored as of the end of followup. Patients are clustered into 31 groups by their attending physician, ranging in size between 1 and 80 patients, with mean and median cluster sizes of 14 and 5.5 respectively. A wide range of explanatory variable data are available for each patient. Since treatment was not administered to half of the intervention group patients, treatment group membership is excluded from the set of covariates. Covariates with more than 5% missing values were not considered for the purposes of this analysis; for all remaining covariates, we imputed any missing covariate values by their respective median values. A subset of these remaining covariates has been selected using a combination of stepwise automated procedures and consultations with study clinicians. These covariates, along with basic descriptive statistics, are summarized in Table 3.

Model selection and fitting
In the absence of external information on which to base a model choice, we fit all six model specifications presented in Sec. 4.2. Prior to analysis, all covariates were centered and standardized; we found empirically that doing so improved the mixing properties of the MCMC procedure. The sampling chain is run for 50,000 iterations, discarding the first 20,000 iterates as burn-in and thinning the chain to every 10th sample. In order to provide a measure of model discrimination and a means for selection, we used a modified version of the original Deviance Information Criterion (DIC; Spiegelhalter et al., 2002). In particular, we computed DIC = D̄ + V/2, where D̄ and V are respectively the sample mean and variance of the deviance calculated at each MCMC step. Here, V/2 is used as an estimator of the effective number of parameters p_D because, unlike the usual estimator, it is guaranteed to be non-negative; see Gelman et al. (2004, p. 182). Estimated DIC values are shown in Table 4. Consistent with the simulation results, these indicate that a model fitting penalized splines to the hazard and frailty density using a prior that reflects a second-derivative-type penalty is a good choice. We focus our attention below on the indicated model; results pertinent to this model fit are summarized in Table 5 and Figure 5. Recall that the hazard and frailty are specified following Section 2.1 using fourth-order (cubic) splines. Table 5 shows the posterior means and 95% posterior intervals for the covariate effects and the variance of the random effects in the first three columns. We observe substantial agreement in the signs and magnitudes of the regression coefficients in comparison with both the gamma and lognormal frailty models fit using coxph and summarized in the lower half of Table 5. For comparison, we also provide the regression results and frailty variance obtained under the Adapt(10) spline model specification.
Comparable results are again observed for the regression parameters, with the frailty variance estimate having a substantially larger posterior mean and variation. Both of the spline-based models suggest the existence of a frailty effect, whereas the frailty variance is effectively estimated to be zero in both the gamma and lognormal frailty models. Based on the results summarized in Section 4, care is clearly required when interpreting these point estimates; all variance estimators exhibit some bias. For example, with the indicated sample size and under the Smooth setting (e.g., Table 1), we see a positive bias for the Adapt(10) setting and a negative bias for the Pen. (2nd deriv.) setting, suggesting that the posterior mean of the frailty variance may lie closer to 0.5 or 0.6.
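The modified DIC used for model selection in this section, namely the posterior mean deviance plus half the deviance variance, reduces to a one-line computation over the per-iteration deviance trace:

```python
import numpy as np

def dic(deviances):
    """Modified DIC: posterior mean deviance plus V/2, where V/2 serves
    as a non-negative estimator of the effective number of parameters
    p_D (Gelman et al., 2004, p. 182)."""
    d = np.asarray(deviances, dtype=float)
    return d.mean() + d.var(ddof=1) / 2.0
```

The deviance trace would be recorded at each retained MCMC iteration, and the model with the smallest DIC preferred.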
The top panel of Figure 5 shows the posterior mean estimate of the hazard, survival and frailty density curves, as well as pointwise 95% credible bands for each. The shape of the hazard and frailty density estimates is arguably poorly described by a parametric model. We observe that the risk for readmission or death is greatest shortly after discharge from the index hospitalization, declining rapidly during the first few weeks post-discharge, then more slowly before reaching a minimum at approximately 120 days. There is a secondary peak around 150 days. This pattern of risk suggests that the design of interventions for postponing mortality or readmissions should be targeted at the care transition from the hospital to home or to another facility, with most of the benefit being realized in the first few months following the index event.
The Bayesian approach allows the marginal posterior distributions of the physician-level effects U i , i = 1 . . . 31 (i.e., frailties) to be examined. The lower panel of Figure 5 shows boxplots that summarize the main features of these marginal posteriors. The height and width of the boxes indicate that, with increasing cluster size, the marginal posterior distributions of the frailties exhibit less dispersion and tend to be larger than the prior mean. In the absence of heterogeneity, one would expect to see each boxplot centered approximately at the prior mean (i.e., one). The observed pattern is consistent with the existence of heterogeneity, and perhaps even suggests the possibility of at least two distinct groups of physicians or defining practice characteristics.
The results summarized above rely on approximate samples from the posterior distribution obtained via MCMC. We monitor the mixing of chain parameters by examining trace plots and autocorrelation functions of the posterior samples. The trace plots for coefficients in Figure 6 indicate that the regression coefficient estimates have converged. Kernel density estimates based on the posterior samples suggest approximately normal marginal posterior distributions. However, estimates of the frailty variance do appear to mix at a lower rate than the other parameters, and show some evidence of long-term autocorrelation.
Previous studies have suggested that physician-level variables, such as physician volume and specialty, impact both the readmissions and mortality of heart failure patients (Jong et al., 2003; Cujec et al., 2005). The influence of the physician, examined in our study as the frailty effect on heart failure outcomes, may be a novel contribution. However, it is also possible that the observed heterogeneity merely reflects physician-level differences not currently adjusted for in the hazard model. For example, in this data set, the physician clusters actually represent groups of physicians who share responsibilities as attending physicians for hospital care. There are inter-cluster differences in training (specialist and generalist) and also scope of practice: academic hospitalists, private practice hospitalists, and community physicians (Halasyamani et al., 2005). The physician effect may also serve as a proxy for differences between in-hospital and post-hospital systems of care delivery. Community physicians are more likely to follow their patients personally after hospital discharge, whereas the academic and private hospitalists frequently transition such care to other physicians. The information available in this data set is not sufficient to examine these potential influences. Further research is needed to determine which, if any, of these effects contribute in a significant and modifiable way to heart failure outcomes.

Choice of model specification and priors
We now illustrate the effects of selecting an alternative model specification or different priors when analyzing the same data set. Instead of the penalized spline specification selected by the DIC criterion in Section 5.2, we fit an adaptive spline model with a parametric component, as described in Section 2.3.1. Figure 7 shows the curves and frailties obtained using adaptive splines specified per the Adapt(20)+Par specification introduced in Section 4.2. This model specification consists of a Poisson(20) prior on the number of spline knots and models the baseline hazard and frailty density as mixtures of B-splines and a parametric curve, here a Weibull hazard function and a lognormal density, respectively. The curves in the top panel show that the specification has a substantial smoothing effect on the hazard and frailty density estimates. In particular, the parametric component dominates the frailty density, making it appear nearly lognormal, but has a smaller effect on the estimated hazard curve. The lower panel in Figure 7 shows that the marginal posterior distributions of the frailties exhibit greater variation in comparison with those in Figure 5, an effect that seems particularly noticeable for smaller clusters.
We next use the congestive heart failure data to illustrate the effect of choosing different priors on the number of knots. Figure 7 used a Poisson(20) prior on the number of knots. The top panel of Figure 8 compares this fit to those resulting from different Poisson prior choices. As expected, setting the prior to Poisson(1) leads to excessively smooth fits, whereas the Poisson(50) curve is considerably more variable and shows potentially undesirable detail. The effect on the survival curve is relatively small, however, as local bumps in the hazard are smoothed out by the integration.
The second and third panels show the effect of using geometric and negative binomial priors for the number of knots. As noted by Biller (2000), these priors universally encourage smoother fits, and are relatively insensitive to the choice of parameters. This is a desirable property if a more robust fit is preferred, but if control over the smoothness of the curve is desired, the Poisson prior is preferable. DIC might be used to select the mean of the Poisson prior.
The relative prominence of the parametric and spline components can be controlled through the prior on the weights φ. The bottom panel of Figure 8 shows the effects of changing the prior to Beta(1, 10) and Beta(10, 1), which respectively place more and less emphasis on the parametric component. The frailty density is more sensitive to changes in the prior weight than the hazard curve, because with only 31 clusters, the data contain much less information regarding the shape of the frailty density.

Discussion
The proposed Bayesian methodology permits the analysis of clustered survival data when the underlying frailty distribution is unknown, helping to mitigate the impact associated with the specification of an incorrect parametric model on the assessment of heterogeneity. Our simulation results demonstrate that the method generally performs well, particularly so in cases where the data contain many clusters of reasonable size. From a frequentist perspective, levels of bias and also coverage (results not shown) for the posterior mean and median regression coefficients are generally comparable to those obtained under the gamma frailty model; in addition, substantial reductions in the bias of the frailty variance are frequently achieved. The results further demonstrate that unusual baseline hazards and frailty distributions can be correctly identified in the presence of sufficient information; the resulting gain in flexibility and information available from the proposed methodology may help to offset its concomitant (and potentially substantial) computational cost. Extensions of the methodology to accommodate stratification and time-dependent covariates are conceptually simple; however, at the time of this writing, such extensions are not supported by the accompanying R package.
The Bayesian approach provides a wealth of information in the form of the joint posterior distribution of all parameters of interest. MCMC sampling provides the joint posterior of the hazard and survival, as well as for cluster-level frailties; in addition, unlike standard frequentist methods of estimation for the proportional hazards frailty model, uncertainty in the frailty variance (i.e., density) is naturally reflected in the posterior behavior of all parameter estimates. Through the deliberate specification of priors, the Bayesian approach permits great flexibility in model specification. For example, with appropriate priors and hyperparameters, the smoothness of the baseline hazard and frailty density are mainly controlled through the prior distribution. In addition, in settings where either curve can be well-modeled using a parametric family of distributions, the inclusion of an appropriate parametric component results in smoother, less variable fits. Importantly, one is also not limited to prior specifications that are primarily designed to control the degree of smoothness in the baseline hazard function and frailty density; where available, practitioners can easily incorporate prior information on model parameters and the general shape of the baseline hazard function, decreasing the possibility of obscuring important effects.
The statistical methodology developed here may also prove to be a useful tool for public policy audiences. Re-hospitalizations of Medicare patients within 30 days (as a binary outcome) are receiving increasing scrutiny due to their frequency and costs (Jencks, Williams and Coleman, 2009). In addition, there is on-going dialogue regarding the current payment system and how it can be redesigned to align better with improved patient outcomes (Averill et al., 2009). Despite the interest in outcomes reported at the cluster (e.g., hospital, physician) level, existing multivariable models in the clinical literature provide only a modest amount of explanatory information (Keenan et al., 2008), with hospital performance typically represented by normally distributed random intercepts. Implicit in such "profiling" approaches is the assumption that the unexplained variability may signal important quality-related performance differences at the cluster level. Our flexible modeling approach may help to accelerate understanding of the nature and influence of the providers in such applications, which often involve large databases, through modeling provider-level effects in a more nonparametric way. Understanding the provider characteristics associated with the best (and worst) outcomes in such a distribution may help in identifying the best practices to replicate.

Acknowledgements
We would like to thank our two referees for their helpful and constructive comments, and also our "guest editor" Larry Wasserman for his assistance in handling this paper. Emmanuel Sharef and Robert Strawderman gratefully acknowledge the support of NIH grant GM056182; David Ruppert gratefully acknowledges the support of NIH grant CA57030 and NSF grant DMS-0805975.

A.1. Gaussian penalty
The penalty function may be chosen to yield a Gaussian prior on the parameter set, for example p_λ(θ_λ) = θ_λ^T θ_λ, and analogously for p_u(θ_u). This penalty function is recommended when adaptive knot selection is used, since in that case the smoothness of the curve is controlled through the prior on the number of knots and does not need to be explicitly penalized.

A.2. Penalty on second differences
Let D be a matrix such that Dy computes the second differences of y, and let P = D^T D. Then, for analogously defined matrices P_λ, P_u of the appropriate dimensions, the following functions penalize the second differences in the spline parameters: p_λ(θ_λ) = θ_λ^T P_λ θ_λ and p_u(θ_u) = θ_u^T P_u θ_u. While this choice of penalty function is appropriate when the knots are equally spaced, it does not necessarily result in smooth behavior otherwise. When knots are unevenly spaced, it is possible to adjust the matrix D to compute appropriately scaled differences.
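For equally spaced knots, the matrices D and P can be constructed directly; a brief sketch:

```python
import numpy as np

def second_diff_penalty(K):
    """P = D'D, where D maps a length-K vector theta to its second
    differences, so theta' P theta equals the sum of squared second
    differences sum_j (theta[j+2] - 2*theta[j+1] + theta[j])**2."""
    D = np.diff(np.eye(K), n=2, axis=0)   # (K-2) x K second-difference matrix
    return D.T @ D
```

The quadratic form θ^T P θ is then exactly the second-difference penalty described above.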

A.3. Penalty on the second derivative
In order to ensure smoothness even when knots are not equally spaced, we can construct a penalty on the second derivative of the spline. In the case of the baseline hazard spline, this is p_λ(θ_λ) = θ_λ^T P_λ θ_λ, where P_λ is a matrix whose (j, k) entry is ∫ B_λj^(2)(t) B_λk^(2)(t) dt, with B_λj^(2) denoting the second derivative of the j-th B-spline basis function.
We can construct an analogous penalty matrix for the frailty density, keeping in mind that the frailty density uses normalized splines; that is, p_u(θ_u) = (e^{θ_u})^T P̃_u e^{θ_u} / (1_{K_u}^T e^{θ_u})^2, where the (j, k) entry of P̃_u is P̃_u,jk = ∫_0^∞ B_uj^(2)(t) B_uk^(2)(t) dt.