Bayesian Nonparametric Erlang Mixture Modeling for Survival Analysis

We develop a flexible Erlang mixture model for survival analysis. The model for the survival density is built from a structured mixture of Erlang densities, mixing on the integer shape parameter with a common scale parameter. The mixture weights are constructed through increments of a distribution function on the positive real line, which is assigned a Dirichlet process prior. The model has a relatively simple structure, balancing flexibility with efficient posterior computation. Moreover, it implies a mixture representation for the hazard function that involves time-dependent mixture weights, thus offering a general approach to hazard estimation. We extend the model to handle survival responses corresponding to multiple experimental groups, using a dependent Dirichlet process prior for the group-specific distributions that define the mixture weights. Model properties, prior specification, and posterior simulation are discussed, and the methodology is illustrated with synthetic and real data examples.


Introduction
The Erlang mixture model is defined as a weighted combination of gamma densities, ∑_{m=1}^M ω_m Ga(t | m, θ), with each gamma density Ga(t | m, θ) indexed by its integer shape parameter, m, and with all densities sharing scale parameter θ. Hence, in contrast to traditional mixture models, Erlang mixtures comprise identifiable mixture components and a parsimonious model formulation built from kernels that involve a single parameter to be estimated. Indeed, it is more natural to view the model as a basis representation for densities on R+, where the Ga(t | m, θ) play the role of the basis densities and the ω_m provide the corresponding weights. The key result for Erlang mixtures stems from the construction of the weights as increments of a distribution function G on R+, in particular, ω_m = G(mθ) − G((m − 1)θ) (with the last weight adjusted such that the ω_m form a probability vector). Then, as M → ∞ and θ → 0, the Erlang mixture density converges pointwise to the density of G (e.g., Butzer 1954, Lee & Lin 2010). The Erlang mixture structure, in conjunction with the theoretical support from the convergence result, provides an appealing setting for nonparametric Bayesian modeling and inference. The key ingredient for such modeling is a nonparametric prior for the distribution G, which, along with priors for θ and M, yields the full Bayesian model. Regarding relevant existing approaches, we are only aware of Xiao et al. (2021), where the Erlang mixture is used as a prior model for inter-arrival densities of homogeneous renewal processes. Also related is the prior model for Poisson process intensities in Kim & Kottas (2022), although for that model the weights are defined as increments of a cumulative intensity function.
Finally, we note that the Erlang mixture model was used for density estimation in Venturini et al. (2008), albeit with fixed M and with a Dirichlet prior distribution for the vector of weights, i.e., without exploiting the construction through distribution G.
To our knowledge, Erlang mixtures have not been explored as a general methodological tool for nonparametric Bayesian survival analysis, and this is our motivation for the work in this article. The nonparametric Bayesian model is built from a Dirichlet process (DP) prior (Ferguson 1973) for distribution G, which defines the mixture weights, and from parametric priors for θ and M , which control the effective support and smoothness in the shape of the Erlang mixture density. The modeling approach is sufficiently flexible to handle non-standard shapes for important functionals of the survival distribution, including the survival function and the hazard function. We discuss prior specification for the model hyperparameters, and design an efficient posterior simulation method that draws from well-established techniques for DP mixture models. The model is extended to incorporate survival responses from multiple experimental groups, using a dependent Dirichlet process prior (MacEachern 2000, Quintana et al. 2022) for the group-specific distributions that define the mixture weights. The model extension retains the flexibility in the group-specific survival densities, and it also allows for general relationships between groups that bypass restrictive assumptions, such as proportional hazards.
Survival analysis is among the earliest application areas of Bayesian nonparametrics.
The literature includes modeling and inference methods based on priors on the space of survival functions, survival densities, cumulative hazard functions, or hazard functions.
Reviews can be found, for instance, in Ibrahim et al. (2001), Phadia (2013), Müller et al. (2015), and Mitra & Müller (2015). The part of this literature most closely related to our proposed methodology involves DP mixture models for the survival density. Such mixture models have been developed using kernels that include the Weibull distribution (e.g., Kottas 2006), the log-normal distribution (e.g., De Iorio et al. 2009), and the gamma distribution (e.g., Hanson 2006, Poynor & Kottas 2019). The convergence property for Erlang mixtures is the only mathematical result we are aware of that supports the choice of a particular parametric kernel in mixture modeling for densities on R+. Our main objective is to add a new practical tool to the collection of nonparametric Bayesian survival analysis methods. The DP-based Erlang mixture model may be attractive for its modeling perspective that involves a basis-density representation, its parsimonious mixture structure, and efficient posterior simulation algorithms (comparable to those for standard DP mixtures).
The rest of the article is organized as follows. Section 2 introduces the methodology, including approaches to prior specification and posterior simulation (with details for the latter given in the Appendixes). Sections 3 and 4 present results from synthetic and real data examples, respectively. Finally, Section 5 concludes with a summary.
2 Methodology

2.1 The modeling approach

Erlang Mixture Model. We propose a structured mixture model of Erlang densities for the density, f(t), of the survival distribution, aiming at more general inference for survival functionals than what specific parametric distributions can provide. Specifically, let

f(t | M, θ, ω) = ∑_{m=1}^M ω_m Ga(t | m, θ),   (1)

where ω = {ω_m : m = 1, …, M} denotes the vector of mixture weights, and Ga(· | m, θ) the density of the Erlang distribution, that is, the gamma distribution with integer shape parameter m and scale parameter θ, such that the mean is mθ and the variance mθ². Given the number of Erlang mixture components, M, the kernel densities in (1) are fully specified up to the common scale parameter θ. Hence, compared with standard mixture models, for which the number of unknown parameters increases with M, the model in (1) offers a parsimonious mixture representation.
A key component of the model specification revolves around the mixture weights. These are defined through increments of a distribution function G with support on R+, such that

ω_m = G(mθ) − G((m − 1)θ), for m = 1, …, M − 1, and ω_M = 1 − G((M − 1)θ).

This formulation for the mixture weights provides appealing theoretical results for the Erlang mixture model in (1). In particular, as M → ∞ and θ → 0, f(t | M, θ, ω) converges pointwise to the density function of G. The convergence property for the density can be derived from more general probabilistic results (e.g., Butzer 1954); a proof of the convergence of the distribution function of f(t | M, θ, ω) to G can be found in Lee & Lin (2010). This result highlights that using a prior with wide support for G is crucial to achieve the generality of the model in (1) required to capture non-standard shapes of a survival distribution. We provide details below on the nonparametric prior for G, as well as on the priors for parameters θ and M.
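The weight construction above is easy to sketch in code. The following builds the weights as increments of a distribution function G and evaluates the resulting Erlang mixture density; the choices of G (an exponential distribution function with mean 2), M = 50, and θ = 0.2 are illustrative assumptions, not values from the paper.

```python
import math

def erlang_pdf(t, m, theta):
    # Gamma density with integer shape m and common scale theta
    return t**(m - 1) * math.exp(-t / theta) / (math.factorial(m - 1) * theta**m)

def erlang_mixture_weights(G, M, theta):
    # omega_m = G(m*theta) - G((m-1)*theta) for m < M; the last weight is
    # adjusted so that the omega_m form a probability vector
    w = [G(m * theta) - G((m - 1) * theta) for m in range(1, M)]
    w.append(1.0 - G((M - 1) * theta))
    return w

def erlang_mixture_pdf(t, weights, theta):
    return sum(wm * erlang_pdf(t, m, theta)
               for m, wm in enumerate(weights, start=1))

# Illustrative choice (an assumption): G is the Exp distribution function
# with mean 2, approximated with M = 50 basis densities and theta = 0.2
G = lambda s: 1.0 - math.exp(-s / 2.0)
M, theta = 50, 0.2
w = erlang_mixture_weights(G, M, theta)
```

With Mθ = 10 covering most of the mass of G, the mixture density closely tracks the exponential density of G, consistent with the convergence result.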
The model in (1) also offers a flexible, albeit parsimonious, mixture representation for the survival function, S(t | M, θ, G), and the hazard function, h(t | M, θ, G). Note that, having defined the mixture weights ω through distribution G, we use the latter in the notation for model parameters. Denote by S_Ga(· | m, θ) and h_Ga(· | m, θ) the survival and hazard function, respectively, of the Erlang distribution with parameters m and θ. Then, the survival function associated with the model in (1) is given by

S(t | M, θ, G) = ∑_{m=1}^M ω_m S_Ga(t | m, θ),   (2)

that is, it has the same weighted combination representation as the density, replacing the Erlang basis densities by the corresponding survival functions. Moreover, the hazard function under the Erlang mixture model can be expressed as

h(t | M, θ, G) = ∑_{m=1}^M w_m(t) h_Ga(t | m, θ),   (3)

where w_m(t) = ω_m S_Ga(t | m, θ) / {∑_{l=1}^M ω_l S_Ga(t | l, θ)} are time-dependent mixture weights.

We place a DP prior on G, i.e., G | α, G_0 ∼ DP(α, G_0), where α > 0 is the total mass parameter and G_0 the centering distribution (Ferguson 1973). We work with an exponential distribution, Exp(ζ), for G_0, with random mean ζ assigned an inverse-gamma hyperprior, ζ ∼ inv-Ga(a_ζ, b_ζ). We further assume a gamma hyperprior for the total mass parameter, α ∼ Ga(a_α, b_α). Given M, the DP prior for G implies a Dirichlet prior distribution for the vector of mixture weights, ω | M, α, ζ ∼ Dir(αG_0(B_1), …, αG_0(B_M)), where B_m = ((m − 1)θ, mθ], for m = 1, …, M − 1, and B_M = ((M − 1)θ, ∞).
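The hazard mixture representation with time-dependent weights can be verified numerically: it is algebraically identical to f(t)/S(t). The following sketch checks this identity; the weight vector and scale in the test are arbitrary illustrative values.

```python
import math

def erlang_pdf(t, m, theta):
    # Gamma density with integer shape m and scale theta
    return t**(m - 1) * math.exp(-t / theta) / (math.factorial(m - 1) * theta**m)

def erlang_sf(t, m, theta):
    # Erlang survival function: P(T > t) = exp(-t/theta) * sum_{j<m} (t/theta)^j / j!
    x = t / theta
    return math.exp(-x) * sum(x**j / math.factorial(j) for j in range(m))

def mixture_hazard(t, weights, theta):
    # h(t) = sum_m w_m(t) h_Ga(t | m, theta), with time-dependent weights
    # w_m(t) = omega_m S_Ga(t|m,theta) / sum_l omega_l S_Ga(t|l,theta)
    surv_terms = [wm * erlang_sf(t, m, theta)
                  for m, wm in enumerate(weights, start=1)]
    denom = sum(surv_terms)
    return sum(s * erlang_pdf(t, m, theta) / erlang_sf(t, m, theta)
               for m, s in enumerate(surv_terms, start=1)) / denom
```

The local, time-dependent reweighting is what lets the mixture hazard deviate from any single Erlang hazard shape.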
The nonparametric prior for G is of primary importance. The DP prior allows the corresponding distribution function realizations to admit general shapes that can concentrate probability mass on different time intervals B_m, thus favoring different Erlang basis densities through the associated ω_m. The key parameter in this respect is α, as it controls the extent of discreteness for realizations of G and the variability of such realizations around G_0. As an illustration, Figure 1 plots prior realizations of the mixture weights and the corresponding Erlang mixture densities. For panel (c), Mθ = 5, resulting in noticeably smaller effective support for the realized densities relative to panels (a) and (b).
We work with a joint prior for θ and M, p(θ, M) = p(θ)p(M | θ). We assume θ ∼ Ga(a_θ, b_θ), and, conditional on θ, assign to M a discrete uniform distribution, M | θ ∼ Unif(⌈M_1/θ⌉, …, ⌈M_2/θ⌉), where ⌈a⌉ is the smallest integer larger than or equal to a. To specify the hyperparameters a_θ, b_θ, M_1 and M_2, we use a relatively conservative approach based on the range of the data. For M_1, we choose a value greater than the largest value in the data, and set M_2 = cM_1 for a relatively small integer c. The motivation for this choice is to ensure that the effective support of the Erlang mixture model is sufficiently large for the particular data application. To specify the prior hyperparameters for θ, we note that E(M_1/θ) = M_1 b_θ/(a_θ − 1), based on which we recommend selecting values for a_θ and b_θ such that E(M_1/θ) is between 10 and 50.
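A hedged sketch of this prior specification strategy: under the shape-rate convention for the gamma prior on θ, E(1/θ) = b_θ/(a_θ − 1) for a_θ > 1, so E(M_1/θ) = M_1 b_θ/(a_θ − 1), which can be checked by Monte Carlo. The data maximum, M_1, c, and hyperparameter values below are illustrative assumptions.

```python
import random

random.seed(11)
data_max = 3.2      # illustrative largest observed time (an assumption)
M1 = 4.0            # chosen greater than the largest value in the data
c = 3               # small integer multiplier
M2 = c * M1

a_theta, b_theta = 3.0, 10.0   # illustrative shape and rate for theta
# random.gammavariate takes (shape, scale); under the rate convention,
# scale = 1 / b_theta
draws = [M1 / random.gammavariate(a_theta, 1.0 / b_theta)
         for _ in range(200000)]
expected = M1 * b_theta / (a_theta - 1)   # targeted to lie in [10, 50]
```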
Posterior Simulation. The data for the i-th subject are recorded as y_i = min(t_i, c_i) and ν_i = 1(t_i ≤ c_i), where t_i is the survival time and c_i the (independent) administrative censoring time, for i = 1, …, n. Then, the likelihood function can be written as

∏_{i=1}^n f(y_i | M, θ, G)^{ν_i} S(y_i | M, θ, G)^{1−ν_i},   (4)

where f(· | M, θ, G) ≡ f(· | M, θ, ω) and S(· | M, θ, G) are given in (1) and (2), respectively.
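The censored likelihood has a direct implementation: exact observations contribute the density and right-censored ones the survival function. A minimal sketch (the single-component test case reduces to an exponential distribution):

```python
import math

def erlang_pdf(t, m, theta):
    return t**(m - 1) * math.exp(-t / theta) / (math.factorial(m - 1) * theta**m)

def erlang_sf(t, m, theta):
    x = t / theta
    return math.exp(-x) * sum(x**j / math.factorial(j) for j in range(m))

def log_likelihood(data, weights, theta):
    # data: (y_i, nu_i) pairs; nu_i = 1 for an exact survival time,
    # nu_i = 0 for a right-censored one. Each subject contributes
    # f(y_i)^{nu_i} * S(y_i)^{1 - nu_i}.
    ll = 0.0
    for y, nu in data:
        f = sum(wm * erlang_pdf(y, m, theta) for m, wm in enumerate(weights, 1))
        S = sum(wm * erlang_sf(y, m, theta) for m, wm in enumerate(weights, 1))
        ll += math.log(f) if nu else math.log(S)
    return ll
```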
We implement posterior inference via Markov chain Monte Carlo (MCMC) simulation, using standard posterior simulation methods for DP mixture models (e.g., Escobar & West 1995, Neal 2000). The Erlang mixture density in (1) can be expressed as a DP mixture by exploiting the definition of the weights ω_m through distribution G, resulting in the following alternative mixture representation:

f(t | M, θ, G) = ∫ k(t | φ; M, θ) dG(φ), where k(t | φ; M, θ) = ∑_{m=1}^M 1_{B_m}(φ) Ga(t | m, θ).

For posterior simulation, we augment the likelihood in (4) with subject-specific latent variables φ_i | G i.i.d. ∼ G, which indicate the mixture component for the associated observations. In particular, if φ_i falls into interval B_m, the i-th observation corresponds to the m-th Erlang basis density. The posterior distribution involves G, M, θ, the set of latent variables φ = {φ_i : i = 1, …, n}, and the DP hyperparameters (α, ζ). We marginalize G over its DP prior and work with the prior full conditionals for the φ_i, implied by the DP Pólya urn representation (Blackwell & MacQueen 1973), to sample from the marginal posterior distribution for all model parameters except G. To this end, we employ the MCMC method in Escobar & West (1995); the details are given in Appendix A.
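The latent-variable construction can be sketched directly: a draw φ ∼ G selects the Erlang component through the interval B_m containing it, so empirical component frequencies recover the weights ω_m. Here G = Exp(mean 2), M = 10, and θ = 0.5 are illustrative assumptions.

```python
import math
import random

random.seed(7)
M, theta, mean_G = 10, 0.5, 2.0
n = 100000
counts = [0] * M
for _ in range(n):
    phi = random.expovariate(1.0 / mean_G)   # phi ~ G
    # phi in B_m = ((m-1)theta, m*theta] maps to component m; the last
    # interval absorbs the tail, matching the adjusted final weight
    m = min(M, math.ceil(phi / theta))
    counts[m - 1] += 1

# The empirical frequency of component 1 should recover
# omega_1 = G(theta) = 1 - exp(-theta / mean_G)
omega_1 = 1.0 - math.exp(-theta / mean_G)
```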
Although we do not sample the mixture weights ω during the MCMC simulation, it is straightforward to obtain posterior samples for ω, using their definition in terms of distribution G. The conditional posterior distribution for G, given (α, ζ) and φ, is characterized by a DP with updated total mass parameter α* = α + n and centering distribution G*_0 = (αG_0 + ∑_{i=1}^n δ_{φ_i})/(α + n). Hence, using the DP definition, the conditional posterior distribution for ω, given M, (α, ζ), and φ, is a Dirichlet distribution with parameter vector (α* G*_0(B_1), …, α* G*_0(B_M)).
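A sketch of this conditional Dirichlet draw, assuming the exponential centering distribution G_0 = Exp(ζ): each Dirichlet parameter combines the prior mass αG_0(B_m) with the count of latent values falling in B_m, and the Dirichlet variate is obtained by normalizing independent gammas. The latent values in the test are illustrative.

```python
import math
import random

def posterior_omega(phi, M, theta, alpha, zeta):
    # Dirichlet parameters alpha* G0*(B_m), with alpha* = alpha + n and
    # G0* = (alpha G0 + sum_i delta_{phi_i}) / (alpha + n); here G0 = Exp(zeta)
    G0 = lambda s: 1.0 - math.exp(-s / zeta)
    params = []
    for m in range(1, M + 1):
        lo, hi = (m - 1) * theta, m * theta
        if m < M:
            prior_mass = G0(hi) - G0(lo)
            count = sum(1 for p in phi if lo < p <= hi)
        else:  # last interval B_M = ((M-1)theta, inf) absorbs the tail
            prior_mass = 1.0 - G0(lo)
            count = sum(1 for p in phi if p > lo)
        params.append(alpha * prior_mass + count)
    # Dirichlet draw via normalized independent gamma variates
    g = [random.gammavariate(a, 1.0) for a in params]
    tot = sum(g)
    return [x / tot for x in g], params
```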
Two points about the posterior simulation method are worth making. First, note that the model parameters do not explicitly contain the vector of mixture weights. The mixture weights are estimated through the posterior distribution of G, which plays the role of the relevant parameter. This is practically important in that the dimension of the parameter space does not change with M , and we thus do not need to resort to more complex transdimensional MCMC algorithms. Second, the DP-based Erlang mixture model offers an interesting example where full posterior inference can be obtained from a DP mixture model without the need to truncate or approximate the DP prior. This is a result of the use of a marginal MCMC method, as well as of the fact that distribution G enters the model only through increments of its distribution function, which define the mixture weights.

Model extension for control-treatment studies
A practically relevant scenario in studies where survival responses are collected involves data from multiple experimental groups, typically associated with different treatments.
Evidently, it is of interest in these settings to compare survival distributions across different groups. We develop an extension of the Erlang mixture model in this direction, focusing on the case of two groups for, say, a generic control-treatment study. Our objective is to retain the flexible modeling approach for the survival distributions, avoiding restrictions to specific parametric shapes or rigid relationships, such as proportional hazards. We also seek a prior probability model that allows for dependence, and thus borrowing of information, between the two distributions.
We use the dependent DP (DDP) prior structure (MacEachern 2000) that extends the DP prior for distribution G to a prior model for a collection of covariate-dependent distributions, G x , where x indexes the distributions in terms of values in the covariate space.
Our context involves a binary covariate x ∈ X = {C, T }, where C and T represent control and treatment groups, respectively. The DDP prior builds from the DP stick-breaking representation (Sethuraman 1994) by utilizing covariate-dependent weights and/or atoms.
We work with a common-weights DDP prior model: G_x = ∑_{l=1}^∞ p_l δ_{ϕ_{xl}}, for x ∈ X, where the weights arise from the DP stick-breaking construction, p_1 = v_1 and p_l = v_l ∏_{r=1}^{l−1}(1 − v_r), for l ≥ 2, with the v_l i.i.d. from a Be(1, α) distribution, and the atoms ϕ_l = (ϕ_{Cl}, ϕ_{Tl}) arise i.i.d. from a bivariate distribution G_0.
Note that, under this construction, G_x follows marginally a DP(α, G_{0x}) prior, where G_{0x}, for x ∈ X, are the marginals of G_0 associated with the control and treatment groups. For G_0, we consider a bivariate log-normal distribution, such that ϕ_l | µ, Σ i.i.d. ∼ LN_2(µ, Σ).
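A truncated sketch of the common-weights DDP construction: a single stick-breaking weight sequence is shared across groups, while each atom is a dependent pair (ϕ_C, ϕ_T) from a bivariate lognormal. The values of α, µ, the Cholesky entries of Σ, and the truncation level are illustrative assumptions.

```python
import math
import random

random.seed(3)
alpha = 1.0
L = 200               # truncation level, for illustration only
mu = (0.5, 0.7)       # illustrative atom means on the log scale
s11, s21, s22 = 1.0, 0.6, 0.8   # Cholesky factor of Sigma; s21 induces
                                # dependence between the two groups

# Shared stick-breaking weights: v_l ~ Be(1, alpha)
v = [random.betavariate(1.0, alpha) for _ in range(L)]
p, remaining = [], 1.0
for vl in v:
    p.append(vl * remaining)
    remaining *= 1.0 - vl

# Atoms (phi_C, phi_T) i.i.d. from a bivariate lognormal G0
atoms = []
for _ in range(L):
    z1, z2 = random.gauss(0, 1), random.gauss(0, 1)
    phi_C = math.exp(mu[0] + s11 * z1)
    phi_T = math.exp(mu[1] + s21 * z1 + s22 * z2)
    atoms.append((phi_C, phi_T))
```

Because the weights are shared, borrowing of information across groups comes through both the common p_l and the correlated atom pairs.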
Allowing also for a group-specific number of Erlang basis densities, M_x, as well as a group-specific Erlang scale parameter, θ_x, the extension of the Erlang mixture model in (1) can be expressed as

f_x(t | M_x, θ_x, G_x) = ∑_{m=1}^{M_x} ω_{xm} Ga(t | m, θ_x),   (5)

where ω_{xm} = G_x(mθ_x) − G_x((m − 1)θ_x), for m = 1, …, M_x − 1, and ω_{xM_x} = 1 − G_x((M_x − 1)θ_x). Similar to the model in (1), the group-specific Erlang basis densities are fully specified given M_x and θ_x. Furthermore, the survival functions, S_x(t), and hazard functions, h_x(t), under the extended model have mixture representations similar to (2) and (3), where now both the mixture components and weights are indexed by x. Again, the time-dependent weights in the hazard mixture form allow for local adjustment, and thus for flexible group-specific hazard rate shapes. Importantly, the prior model allows for general relationships between the control and treatment group hazard functions. In particular, inference is not restricted by the proportional hazards assumption implied by several commonly used parametric or semiparametric survival regression models.
To complete the full Bayesian model, we place priors on θ_x and M_x, using again the role of these parameters (discussed in Section 2.1). More specifically, for each x, the joint prior is p(θ_x, M_x) = p(θ_x)p(M_x | θ_x). We further assume θ_x ind. ∼ Ga(a_xθ, b_xθ), and M_x | θ_x ind. ∼ Unif(⌈M_x1/θ_x⌉, …, ⌈M_x2/θ_x⌉). We use an approach similar to the one described in Section 2.1 to specify M_x1 and M_x2, and the hyperparameters for θ_x.
Posterior simulation for the DDP-based Erlang mixture model proceeds with a relatively straightforward extension of the MCMC simulation method in Section 2.1. The details are provided in Appendix B.
The primary focus of this paper is on the DP-based Erlang mixture model for survival analysis and its extension for the control-treatment setting.

Simulation Study
We use three simulation scenarios to illustrate the models developed in Section 2. For the Erlang mixture model for a single distribution, we consider simulated data from: a two-component log-normal mixture, to demonstrate the model's capacity to estimate non-standard density and hazard function shapes (Section 3.1); and a log-normal distribution sampled with different levels of censoring (Section 3.2). The DDP-based extension of the model is illustrated in Section 3.3 with a synthetic data example based on a log-normal control distribution and a two-component log-normal mixture treatment distribution.

We examined convergence and mixing of the MCMC algorithms using standard diagnostic techniques. In our experiments, we observed that parameters θ and M are highly correlated, and moderate thinning was used to improve efficiency. A general approach we take is to run the MCMC chain for 100,000 iterations, then discard the first 25% of the posterior samples and keep every 38th iteration for posterior inference.

Example 1: Bimodal density
We simulate n = 200 survival times from a mixture of two log-normal distributions, 0.4 LN(1, 0.4) + 0.6 LN(2, 0.2), which yields a bimodal density and a non-monotonic hazard function. The true underlying functions f (t), S(t) and h(t) are plotted in Figure 3.
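A sketch of this data-generating mechanism, treating the second argument of each LN component as the standard deviation on the log scale (an assumption about the notation), with a large sample size used only to check the mixture mean by Monte Carlo:

```python
import math
import random

random.seed(42)
n = 200000   # large n for a Monte Carlo check; the paper uses n = 200
# Mixture 0.4 LN(1, 0.4) + 0.6 LN(2, 0.2)
samples = [random.lognormvariate(1.0, 0.4) if random.random() < 0.4
           else random.lognormvariate(2.0, 0.2) for _ in range(n)]

# Theoretical mixture mean: sum of component means 0.4 e^{1 + 0.4^2/2}
# and 0.6 e^{2 + 0.2^2/2}
mean_true = 0.4 * math.exp(1.0 + 0.4**2 / 2) + 0.6 * math.exp(2.0 + 0.2**2 / 2)
```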
Posterior inference is summarized in Figure 3. As shown in Figure 5, the model estimates well the density, survival, and hazard functions.

Example 2: Unimodal density with censoring
The point estimate for the hazard function is less accurate beyond t = 400, which is to be expected given the very few observations that are greater than that time point, although the interval estimate contains the true function throughout the observation time window.
In addition, we examine the model's performance for data with censored observations.
We simulate censoring times c_i from an exponential distribution with mean κ, and define the observed times as y_i = min(t_i, c_i), with binary censoring indicators ν_i = 1(t_i ≤ c_i). We generate the c_i under two different values of κ, resulting in two datasets with censored observation proportions of 12% and 33.5%. Figure 6 plots posterior mean and 95% interval estimates for the density, survival, and hazard functionals.
We note that censoring does not substantially affect the quality of the inference results, with the widening of the interval estimates under heavier censoring more noticeable for the hazard function.
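The censoring mechanism above can be sketched as follows. With exponential survival and censoring times, the censoring proportion is available in closed form, so κ can be tuned to a target rate; the Exp(1) survival distribution here is an illustrative stand-in for the log-normal used in the paper.

```python
import random

random.seed(5)
n = 100000
target = 0.12
# For t ~ Exp(mean 1) and c ~ Exp(mean kappa), P(censored) = 1 / (1 + kappa)
kappa = 1.0 / target - 1.0

data = []
for _ in range(n):
    t = random.expovariate(1.0)           # survival time (illustrative)
    c = random.expovariate(1.0 / kappa)   # independent censoring time
    y, nu = min(t, c), int(t <= c)        # observed time and indicator
    data.append((y, nu))

prop_censored = 1.0 - sum(nu for _, nu in data) / n
```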

Example 3: A control-treatment synthetic data set
Here, we examine the performance of the DDP-based Erlang mixture model of Section 2.2.
We consider a binary covariate x ∈ {C, T}; the true group-specific functionals are plotted in Figure 7. The control group density is unimodal, whereas the treatment group has a bimodal density and a non-standard, non-monotonic hazard function. The truth is specified such that we have crossing hazard functions for the two groups, a scenario that traditional proportional hazards models cannot accommodate.

Small cell lung cancer data
To illustrate the DDP-based Erlang mixture model with real data, we consider the data set from Ying & Wei (1995) on survival times (in days) of patients with small cell lung cancer.
The data correspond to a study designed to evaluate two treatment regimens of the drugs etoposide (E) and cisplatin (P), given in a different sequence, with Arm A denoting the regimen where P is followed by E, and Arm B the regimen where E is followed by P. For the prior specification, we take M_x | θ_x ∼ Unif(⌈2500/θ_x⌉, …, ⌈10000/θ_x⌉); µ ∼ N_2((6.7, 6.3)′, 10 I_2); and Σ = 3 I_2. Here, (6.7, 6.3) are the averages of the observed survival times for each treatment after logarithmic transformation.
Posterior mean and interval estimates for the density, survival, and hazard function are compared across the two treatments in Figure 9. The estimates strongly suggest that the proportional hazards assumption is not suitable for this study.
For a more focused comparison of the two treatments, Figure 10 plots the posterior distribution of the difference between the two groups in survival probability and in hazard rate at selected time points, which can thus be contrasted with the horizontal reference line at 0. Based on the 95% interval estimates, treatment A outperforms treatment B at t = 300, 500, and 700 days with respect to survival probability, and at t = 300 days according to hazard rate.

Summary
We developed a Bayesian nonparametric Erlang mixture model for survival analysis, mixing Erlang densities on the integer shape parameter with weights defined through increments of a distribution function that is assigned a DP prior, and we extended the model to multiple experimental groups through a dependent DP prior. The model balances flexibility with efficient posterior computation, and it implies a mixture representation for the hazard function with time-dependent weights.

Appendix A: MCMC algorithm for the DP-based Erlang mixture model

α
We use the augmentation method in Escobar & West (1995) to update α. We first introduce an auxiliary variable η, with η | α, n ∼ Be(α + 1, n), and then sample α from a mixture of two gamma distributions.

φ

Let φ*_{−i} = (φ*_1, …, φ*_{n*}) be the set of distinct values in φ_{−i} = (φ_1, …, φ_{i−1}, φ_{i+1}, …, φ_n), where n* is the number of distinct elements, and let n^−_j be the number of elements in φ_{−i} that are equal to φ*_j. The full conditional for φ_i is a mixture that places point mass at each distinct value φ*_j and, with the remaining probability, draws a new value from h(φ_i | y_i, θ, M, ζ), with G_Exp(· | ζ) denoting the exponential distribution function with mean ζ. Here, h(φ_i | y_i, θ, M, ζ) is a mixture of truncated exponential distributions, where T-Exp_m(φ | ζ) denotes the density of the exponential distribution with mean ζ truncated to the interval B_m; an inverse-cdf sampling method can be used to draw a sample from h(φ_i | y_i, θ, M, ζ).
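A minimal sketch of the Escobar & West (1995) auxiliary-variable update for α; the values of n, the number of distinct latent values, and the gamma hyperparameters in the usage are illustrative assumptions.

```python
import math
import random

def update_alpha(alpha, n, n_star, a_alpha, b_alpha):
    # Escobar & West (1995): draw eta | alpha, n ~ Be(alpha + 1, n), then
    # alpha | eta, n_star from a two-component gamma mixture, where n_star
    # is the current number of distinct latent values
    eta = random.betavariate(alpha + 1.0, n)
    rate = b_alpha - math.log(eta)
    odds = (a_alpha + n_star - 1.0) / (n * rate)
    eps = odds / (1.0 + odds)
    shape = a_alpha + n_star if random.random() < eps else a_alpha + n_star - 1.0
    return random.gammavariate(shape, 1.0 / rate)   # (shape, scale)
```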

Appendix B: MCMC algorithm for the DDP-based Erlang mixture model
We present here the posterior simulation details for the model developed in Section 2.2.
The augmented model uses latent variables ϕ_i = (ϕ_{Ci}, ϕ_{Ti}), for i = 1, …, n, with x_i ∈ {C, T}. The likelihood function for the augmented model for observation i is L_i(M_x, θ_x, ϕ_i; D), where D = {(y_i, ν_i, x_i) : i = 1, …, n} denotes the data. Similar to the algorithm in Appendix A, we use the adaptive Metropolis-within-Gibbs algorithm of Roberts & Rosenthal (2009) for the Metropolis-Hastings updates; mixing and convergence of the Markov chains were monitored using standard diagnostics. The full conditionals are given below. We update M_C by sampling from its discrete full conditional, with probabilities proportional to L_C(j, θ_C, ϕ; D) = ∏_{i: x_i = C} L_i(j, θ_C, ϕ_{Ci}; D), for j in the support of M_C | θ_C. We then draw M_T in a similar way.

ϕ
Let ϕ*_{−i} = (ϕ*_1, …, ϕ*_{n*}) be the set of distinct values in ϕ_{−i} = (ϕ_1, …, ϕ_{i−1}, ϕ_{i+1}, …, ϕ_n), where n* is the number of distinct elements, and let n^−_j be the number of elements in ϕ_{−i} that are equal to ϕ*_j. The full conditional of ϕ_i is a mixture of point masses at the ϕ*_j and a continuous component, where, for x_i = C, Σ_{C|T} = Σ_{11} − Σ_{12}Σ_{21}/Σ_{22}, with G_LN(· | µ_{C|T}, Σ_{C|T}) denoting a lognormal distribution function with parameters µ_{C|T} and Σ_{C|T}, and h(ϕ_i | y_i, µ, Σ, θ, M) proportional to LN(ϕ_{Ti} | µ_1, Σ_{11}) times a mixture of truncated lognormal densities over the components m = 1, …, M_C − 1. Similar to the algorithm for updating φ_i in the DP-based Erlang mixture model, we set ϕ_i = ϕ*_j with probability n^−_j q_j / A, where A = αq_0 + ∑_{h=1}^{n*} n^−_h q_h, or draw a new ϕ_i from h(ϕ_i | y_i, µ, Σ, θ, M) with probability αq_0 / A. To draw a sample from h(ϕ_i | y_i, µ, Σ, θ, M), we first draw ϕ_{Ti} from LN(µ_1, Σ_{11}) and then, conditional on ϕ_{Ti}, draw ϕ_{Ci} from a mixture of truncated lognormal distributions using an inverse-cdf sampling method, where each component, T-LN_m, is a lognormal distribution truncated to the interval ((m − 1)θ_C, mθ_C]. The same method is applied for the observations with x_i = T by simply switching C with T.
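The inverse-cdf draw from a single truncated lognormal component can be sketched as follows; the component index, scale, and lognormal parameters in the test are illustrative assumptions.

```python
import math
import random
from statistics import NormalDist

def sample_trunc_lognormal(m, theta_c, mu, sigma):
    # Inverse-cdf draw from LN(mu, sigma^2) truncated to ((m-1)*theta_c, m*theta_c]:
    # draw u uniformly between the cdf values at the endpoints, then invert
    nd = NormalDist(mu, sigma)
    lo, hi = (m - 1) * theta_c, m * theta_c
    F_lo = nd.cdf(math.log(lo)) if lo > 0 else 0.0
    F_hi = nd.cdf(math.log(hi))
    u = random.uniform(F_lo, F_hi)
    u = min(max(u, 1e-12), 1.0 - 1e-12)  # guard the inverse cdf at the boundaries
    return math.exp(nd.inv_cdf(u))
```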