Modelling and computation using NCoRM mixtures for density regression

Normalized compound random measures are flexible nonparametric priors for related distributions. We consider building general nonparametric regression models using normalized compound random measure mixture models. Posterior inference is made using a novel pseudo-marginal Metropolis-Hastings sampler for normalized compound random measure mixture models. The algorithm makes use of a new general approach to the unbiased estimation of Laplace functionals of compound random measures (which includes completely random measures as a special case). The approach is illustrated on problems of density regression.


Introduction
The problem of Bayesian nonparametric inference for distributions at different regressor values has been an extremely active area of research. Many approaches use dependent nonparametric mixture models and build on the idea of dependent Dirichlet process mixture models (MacEachern, 1999), which generalized the commonly-used Dirichlet process mixture model. A generic dependent nonparametric mixture model assumes that a sample y_1, ..., y_n observed at regressor values x_1, ..., x_n (where x ∈ X for some measurable space X) is modelled as

y_i | x_i ∼ ∑_{k=1}^∞ w_k(x_i) q(y_i | θ_k(x_i)),   (1.1)

where q(y|θ) is a distribution for y (where y ∈ Y for some measurable space Y) with parameter θ, w_k(x) ≥ 0 for all k and x ∈ X, ∑_{k=1}^∞ w_k(x) = 1 almost surely for all x ∈ X, and θ_1(x), θ_2(x), θ_3(x), ... are independent realisations of a stochastic process.
We refer to θ 1 (x), θ 2 (x), θ 3 (x), . . . as the locations of the mixture components. The model simplifies to a nonparametric mixture model if the sample is observed at a single regressor value.
Many approaches to constructing specific models of the form (1.1) generalize the stick-breaking construction of the Dirichlet process (Sethuraman, 1994), and these were reviewed in Dunson (2010). Alternatively, models can be constructed by normalising dependent random measures. This generalizes the approach introduced by Regazzini, Lijoi and Prünster (2003) to an arbitrary dimension. These constructions have several advantages. Firstly, the weights w_1(x), w_2(x), ... are not ordered, as is the case with many stick-breaking constructions. Secondly, dependence is defined at the level of the weights w_k(x) rather than, as is typical in stick-breaking constructions, through a non-linear transformation of the weights. Foti and Williamson (2012) defined a wide class of such processes using normalized kernel-weighted random measures, which generalize the approach to time-dependent random measures in Griffin (2011). Griffin et al. (2013) developed an approach to modelling a finite set of dependent random measures using superpositions of completely random measures (see also Lijoi and Nipoti, 2014; Lijoi, Nipoti and Prünster, 2014a,b; Chen et al., 2013). Alternatively, dependence can be modelled through a Lévy copula (Leisen and Lijoi, 2011; Leisen, Lijoi and Spano, 2013; Zhu and Leisen, 2015). Compound random measures (CoRMs) (Griffin and Leisen, 2017) are a unifying framework for many dependent random measures, including many of the superposition and Lévy copula approaches. They have been applied to modelling graphs for overlapping communities by Todeschini and Caron (2016). Griffin and Leisen (2017) described posterior sampling methods for a particular class of normalized compound random measure mixtures which exploit a representation of the Laplace transform of a CoRM through a univariate integral of a moment generating function.
Ranganath and Blei (2015) independently developed a normalized CoRM model where the weights depend on a Gaussian process and described a variational Bayesian algorithm for inference.
In this paper, we will consider extending the class of compound random measures (CoRM) from finite collections of distributions to infinite collections of distributions.
This allows us to define CoRM models where the weights follow a time series model, a regression model or a hierarchical model. The computational algorithms in Griffin and Leisen (2017) cannot be used in this wider class of models since moment generating functions are not available in closed form. Therefore, we develop a new MCMC algorithm for CRM-based nonparametric mixture models which uses a novel pseudo-marginal MCMC method (Andrieu and Roberts, 2009). For simplicity, we will consider mixture models of the form in (1.1) with θ_k(x) = θ_k for all x ∈ X, leading to a mixture model with weights which vary over X (many of the ideas in this paper could be extended to the model where θ_k(x) follows a stochastic process, such as a Gaussian process, over X). The model is

y_i | x_i ∼ ∑_{k=1}^∞ w_k(x_i) q(y_i | θ_k).   (2.1)

We consider the weights

w_k(x) = m_k(x) J_k / ∑_{l=1}^∞ m_l(x) J_l,

where m_k(x) is a random function on X for which m_k(x) ≥ 0 for all x ∈ X, the function m_k is independent of m_l for k ≠ l, J_1, J_2, J_3, ... are the jumps of the process with directing Lévy process ν, and θ_1, θ_2, θ_3, ... are i.i.d. We will refer to m_k(x) as a score or score function and to ν as the directing Lévy process. The model reduces to the NCoRM models considered by Griffin and Leisen (2017) if X is a finite set. In particular, they introduced a class of dependent random probability measures p_1, p_2, ..., p_d which can be represented as

p_j = ∑_{k=1}^∞ m_{j,k} J_k δ_{θ_k} / ∑_{l=1}^∞ m_{j,l} J_l,   j = 1, ..., d.

We will concentrate on models where m_l(x) = exp{r_l(x)} and r_l(x) is a random function on X taking values in R. Griffin and Leisen (2017) considered using the variance of the ratio of the same jump at values x, x′ ∈ X as a simple measure of the strength of dependence between the (unnormalized) random measures at x and x′. In this case, the ratio is ζ(x, x′) = m_l(x)/m_l(x′) = exp{r_l(x) − r_l(x′)} and the distribution of ζ(x, x′) will often be easy to work with.
For example, ζ(x, x′) will be log-normally distributed if r_l(x) and r_l(x′) have a bivariate normal distribution.
In this paper, we will consider models in which r l (x) is a stochastic process for which E[r l (x)] = 0 for all x ∈ X. This gives CoRM models a high degree of flexibility.
To illustrate the use of NCoRM mixtures in a regression context, we will consider a choice of r_l(x) which is suitable for continuous regressors and a choice which is suitable for categorical regressors:

• Continuous regressors: we define r_1(x), r_2(x), r_3(x), ... to be independent Gaussian processes with covariance function σ_0² κ(·, ·), where κ(·, ·) is a correlation function. This implies that log ζ(x, x′) is normally distributed with mean zero and variance 2σ_0²(1 − κ(x, x′)).
• Categorical regressors: suppose that we have two categorical regressors x_{i,1} and x_{i,2}. We could assume a different parameter for each combination of levels (x_{i,1}, x_{i,2}). Alternatively, we could use the specification

r_l(x_i) = α^{(l)}_{x_{i,1}} + β^{(l)}_{x_{i,2}} + γ^{(l)}_{x_{i,1}, x_{i,2}},

where α^{(l)}_j ∼ N(0, σ_1²), β^{(l)}_j ∼ N(0, σ_2²) and γ^{(l)}_{j,j′} ∼ N(0, σ_{1,2}²) independently. Then, the α^{(l)} and β^{(l)} parameters act as main effects and γ^{(l)} as interactions, which can be interpreted in a similar way to a logistic regression model. For example, log ζ(x, x′) is normally distributed with mean 0 and variance 2(σ_1² + σ_2² + σ_{1,2}²) if both levels of x are different to the levels of x′, whereas log ζ(x, x′) is normally distributed with mean 0 and variance 2(σ_2² + σ_{1,2}²) if only the second level of x and x′ are different. This shows how the dependence of the jump sizes depends on the levels of the regressors.
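The stated variances of log ζ(x, x′) under the categorical specification can be checked by simulation. The sketch below uses hypothetical variance parameters (σ_1² = 0.5, σ_2² = 0.7, σ_{1,2}² = 0.3 are illustrative choices, not values from the paper) and verifies that the empirical variance is 2(σ_1² + σ_2² + σ_{1,2}²) when both levels differ and 2(σ_2² + σ_{1,2}²) when only the second level differs.

```python
import random
import statistics

random.seed(1)

# Hypothetical variance parameters for main effects and interaction.
s1sq, s2sq, s12sq = 0.5, 0.7, 0.3

def r_score(alpha, beta, gamma, lev1, lev2):
    # r_l(x) = alpha_{x1} + beta_{x2} + gamma_{x1,x2}
    return alpha[lev1] + beta[lev2] + gamma[(lev1, lev2)]

n = 200_000
log_zeta_both = []    # both levels of x and x' differ
log_zeta_second = []  # only the second level differs
for _ in range(n):
    alpha = {j: random.gauss(0.0, s1sq ** 0.5) for j in (0, 1)}
    beta = {j: random.gauss(0.0, s2sq ** 0.5) for j in (0, 1)}
    gamma = {(j, k): random.gauss(0.0, s12sq ** 0.5)
             for j in (0, 1) for k in (0, 1)}
    log_zeta_both.append(r_score(alpha, beta, gamma, 0, 0)
                         - r_score(alpha, beta, gamma, 1, 1))
    log_zeta_second.append(r_score(alpha, beta, gamma, 0, 0)
                           - r_score(alpha, beta, gamma, 0, 1))

print(statistics.variance(log_zeta_both))    # ~ 2*(0.5 + 0.7 + 0.3) = 3.0
print(statistics.variance(log_zeta_second))  # ~ 2*(0.7 + 0.3) = 2.0
```

The shared main effect cancels in the ratio when only one regressor changes, which is exactly why the variance drops from 2(σ_1² + σ_2² + σ_{1,2}²) to 2(σ_2² + σ_{1,2}²).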
Posterior inference is impossible using existing methods, and the following section describes a general-purpose algorithm for NCoRM mixture models.

Computational methods
Posterior inference for nonparametric mixture models is challenging due to the infinite-dimensional random probability measure in the model. To address this problem, two main MCMC approaches to defining a finite-dimensional target have been developed.
Firstly, marginal methods integrate the random probability measure out of the posterior. Secondly, conditional methods truncate the random probability measure. These methods can be further divided into exact methods, which use a random truncation to sample exactly from the posterior, and methods which fix the level of truncation, leading to some truncation error. Griffin and Leisen (2017) suggest a marginal method and an exact conditional method (a slice sampler). The availability of an analytical expression for the moment generating function of the score distribution is key to their sampling methods, but it cannot be evaluated in closed form for the more general NCoRM models described in this paper. We propose a hybrid conditional-marginal sampler using a pseudo-marginal Metropolis-Hastings algorithm (Andrieu and Roberts, 2009).
We assume that we observe data (x_1, y_1), ..., (x_n, y_n) and wish to fit the model in (2.1). Without loss of generality, we also assume that the values x_1, x_2, ..., x_n are distinct and write m_{k,i} = m_k(x_i) and m_k = (m_{k,1}, ..., m_{k,n}). Following Griffin and Leisen (2017), it is convenient to use an augmented form of the likelihood which introduces an allocation variable for each observation. Let n_k be the number of observations allocated to the k-th jump; we order the jumps so that J_1, ..., J_K have points allocated to them (i.e. n_k > 0 for 1 ≤ k ≤ K) and J_{K+1}, J_{K+2}, ... have no points allocated to them (i.e. n_k = 0 for k > K). Marginalizing over the jumps which have no points allocated and over the locations of all atoms, and writing M = α(Y), gives the likelihood of the data in terms of the Laplace functional L. Griffin and Leisen (2017) use the analytical expression for L and integrals over J_1, ..., J_K to define a marginal sampler. In general, these integrals are not analytically available to us. We replace L by an unbiased estimate L̂ (a possible unbiased estimator is discussed in the next section) to define a pseudo-marginal target. Finally, we assume that h has parameters τ and ν has parameters ξ, on which we want to make inference, and define the target accordingly. We propose a novel sampling strategy for the variable s in a nonparametric mixture model and a novel computational algorithm to deal with the Laplace transform component of the target above. This algorithm can be applied to posterior inference for a wide variety of Bayesian nonparametric processes beyond NCoRM processes.

Updating c
To update c_i, the full conditional distribution is known up to proportionality. A new value of J_{K^{-i}+1} is sampled from its full conditional distribution, leading to an algorithm which is similar to Algorithm 8 of Neal (2000). See James et al. (2009), Lijoi and Prünster (2010) and Favaro and Teh (2013) for extensions to non-conjugate normalized random measure mixtures.
The form of the update depends on whether or not the i-th observation was allocated to a singleton cluster at the previous iteration. In Appendix B we provide the details of the full conditional distributions for the variables J_k, m_k, v_i, ξ, M and τ. The next section will introduce the novel pseudo-marginal Metropolis-Hastings algorithm used to address the intractability of the Laplace transform part of the target distribution.
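Since the exact full conditional for c_i is not reproduced above, the following sketch shows the analogous Algorithm-8 allocation update for an ordinary Dirichlet process mixture of N(φ_k, 1) kernels. All concrete choices (the data y, total mass M, number of auxiliary components m, and the N(0, 3²) base measure) are illustrative assumptions, not taken from the paper; the point is the mechanism of reusing a singleton's location as an auxiliary and weighting auxiliaries by M/m.

```python
import math
import random

random.seed(2)

# Toy data with two well-separated groups.
y = [-2.1, -1.9, -2.3, 3.0, 3.2, 2.8]
M, m = 1.0, 3  # total mass; number of auxiliary (empty) components

def lik(yi, ph):
    # N(ph, 1) kernel, up to a constant
    return math.exp(-0.5 * (yi - ph) ** 2)

c = [0] * len(y)   # allocation variables
phi = {0: 0.0}     # locations of occupied clusters

for _ in range(200):
    for i in range(len(y)):
        old = c[i]
        counts = {}
        for j, cj in enumerate(c):
            if j != i:
                counts[cj] = counts.get(cj, 0) + 1
        # If observation i was in a singleton, reuse its location as the
        # first auxiliary (as in Algorithm 8); fill the rest from the prior.
        aux = [] if old in counts else [phi[old]]
        while len(aux) < m:
            aux.append(random.gauss(0.0, 3.0))
        labels = [("old", k) for k in counts] + [("new", ph) for ph in aux]
        weights = [counts[k] * lik(y[i], phi[k]) for k in counts] \
            + [(M / m) * lik(y[i], ph) for ph in aux]
        u = random.random() * sum(weights)
        acc, choice = 0.0, labels[-1]
        for w, lab in zip(weights, labels):
            acc += w
            if u <= acc:
                choice = lab
                break
        if choice[0] == "old":
            c[i] = choice[1]
        else:
            k_new = max(phi) + 1
            phi[k_new] = choice[1]
            c[i] = k_new
        if old not in c:
            phi.pop(old, None)
    # (updates of phi_k given the allocations are omitted from this sketch)

print(len(set(c)))  # number of occupied clusters after the sweeps
```

In the NCoRM sampler the cluster weights additionally involve the scores m_k(x_i) and the jumps, but the bookkeeping for singletons and auxiliary components is the same.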

Unbiased estimation of the Laplace functional
Andrieu and Roberts (2009) introduced a sampling scheme, called pseudo-marginal Metropolis-Hastings, which allows sampling from distributions which cannot be evaluated pointwise. The main idea of the method is to replace the target distribution with a nonnegative unbiased estimator.
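The pseudo-marginal idea can be demonstrated in a few lines. In the sketch below (a toy setup, not the paper's model), the target is a standard normal density which we pretend can only be evaluated through a nonnegative unbiased estimator, here the exact density multiplied by independent lognormal noise with mean one. The chain still targets N(0, 1) exactly, provided the estimate for the current state is recycled rather than refreshed.

```python
import math
import random

random.seed(0)

NOISE_SD = 0.5  # standard deviation of the log of the multiplicative noise

def noisy_density(theta):
    # unbiased: E[eps] = 1 because log eps ~ N(-sd^2/2, sd^2)
    eps = math.exp(random.gauss(-NOISE_SD ** 2 / 2, NOISE_SD))
    return math.exp(-theta ** 2 / 2) * eps

theta, est = 0.0, noisy_density(0.0)
draws = []
for _ in range(100_000):
    prop = theta + random.gauss(0.0, 1.0)
    prop_est = noisy_density(prop)
    # Accept/reject using the *estimates*; crucially, the estimate for the
    # current state is recycled, not recomputed, which keeps the chain exact.
    if random.random() < prop_est / est:
        theta, est = prop, prop_est
    draws.append(theta)

mean = sum(draws) / len(draws)
var = sum((d - mean) ** 2 for d in draws) / len(draws)
print(mean, var)  # ~ 0 and ~ 1
```

Refreshing the current state's estimate at every iteration would instead target a distorted distribution; recycling it is what makes the algorithm "exact-approximate".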
In our framework, we are often interested in evaluating objects such as the expectation in (3.2). We will use the Poisson estimator (Papaspiliopoulos, 2011), which has been successfully used in MCMC approaches for diffusions (see e.g. Fearnhead et al., 2010). Consider the Laplace-functional form

L_φ = exp{ −∫ φ(x) dx }.   (3.3)

The Poisson estimator of (3.3) is introduced in the following theorem, where some of its properties are described.
The proof of the theorem can be found in the Appendix. We denote the Poisson distribution with parameter λ by Pn(λ).
Theorem 3.1. Let N ∼ Pn(aC) and let x_1, ..., x_N be independent draws from a density κ satisfying φ(x)/κ(x) < C for all x. Consider the estimator

L̂_φ = ∏_{i=1}^{N} ( 1 − φ(x_i)/(aCκ(x_i)) ).   (3.4)

Then E[L̂_φ] = L_φ. The estimator has the useful property that it is always positive (each factor lies in (1 − 1/a, 1] when a > 1). This contrasts with other approaches which define unbiased estimators of infinite sums using random truncation, where it is difficult to ensure that estimates are always positive (see e.g. Rhee and Glynn, 2015; Lyne et al., 2015).
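A runnable sketch of a positive Poisson estimator of L_φ = exp{−∫ φ(x) dx}. The product form used here, N ∼ Pn(aC) factors of 1 − φ(x_i)/(aCκ(x_i)) with φ/κ bounded by C, is an assumed concrete form consistent with the properties described in this section (positivity, an expected aC evaluations of φ/κ); unbiasedness follows from the Poisson-process identity E[∏ g(x_i)] = exp{aC(E_κ[g] − 1)}.

```python
import math
import random

random.seed(3)

def draw_poisson(lam):
    # count unit-rate exponential arrivals before time lam
    n, t = 0, -math.log(random.random())
    while t < lam:
        n += 1
        t += -math.log(random.random())
    return n

def poisson_estimator(phi_over_kappa, sample_kappa, a, C):
    # requires phi(x)/kappa(x) < C for all x, so every factor is positive
    est = 1.0
    for _ in range(draw_poisson(a * C)):
        est *= 1.0 - phi_over_kappa(sample_kappa()) / (a * C)
    return est

# Example: phi(x) = 0.1*exp(-x) on (0, inf), so the integral is 0.1 and
# L_phi = exp(-0.1); kappa = Exp(1) gives phi/kappa = 0.1 < C = 0.2.
reps = 200_000
vals = [poisson_estimator(lambda x: 0.1,
                          lambda: -math.log(random.random()),
                          a=8.0, C=0.2)
        for _ in range(reps)]
print(min(vals) > 0.0)   # True: the estimator is always positive
print(sum(vals) / reps)  # ~ exp(-0.1) = 0.905
```

Taking κ proportional to φ, as in this example, is the variance-minimizing choice discussed later in the section.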
Returning to the expression in (3.2) and, again, assuming that x_1, x_2, ..., x_n are distinct, this can be re-expressed in terms of T_ν(z) = ∫_z^∞ ν(y) dy, the tail mass function for the Lévy process with Lévy intensity ν. The expression for L_k has the form of (3.3).
To use the Poisson estimator, a suitable density κ_ν must satisfy T_ν(z)/κ_ν(z) < B for all z. Suitable forms of κ_ν for some popular nonparametric processes are given in Section 3.1.2.
In computation for more usual normalized random measures (Griffin and Walker, 2011; Favaro and Teh, 2013), we are interested in a similar expectation, where ν(z) is the Lévy intensity and the expectation is taken over all jumps on R_+. This expectation can, similarly, be re-expressed in this form, and the Poisson estimator provides an unbiased estimator of (3.5) and (3.7) which can be used in the sampler described in this section.

Controlling the variability of L̂_φ
Pseudo-marginal Metropolis-Hastings algorithms converge to the correct distribution, but the asymptotic variance of an average calculated using the algorithm depends on the variance of the unbiased estimator. For example, suppose that the unbiased estimator is an importance sampler. Andrieu and Vihola (2016) show that the asymptotic variance of the pseudo-marginal sampler decreases as the number of samples in the importance sampler increases (which lowers the variance of the estimator). Although we do not use an importance sampler, the Poisson estimator is closely related, and it is intuitively reasonable that the asymptotic variance of averages calculated using the pseudo-marginal Metropolis-Hastings sampler will decrease as the variance of the Poisson estimator in (3.4) decreases.
The variability of the Poisson estimator is controlled by a, with larger values of a leading to a smaller variance. However, larger values of a will also lead to longer computational times since the mean number of terms in L̂_φ is aC. In this section, we will assume that the expected number of evaluations of the ratio φ(x)/κ(x) is d; therefore, d = aC for the estimator in (3.4). An alternative method for controlling the variability defines the estimator L̂^{AVE}_φ as the average of N independent copies of the Poisson estimator, each using d/N expected evaluations. The variance of this estimator grows with N for fixed d, which suggests that we should use L̂^{AVE}_φ with N = 1, i.e. the Poisson estimator in (3.4). In this case, the choice of κ(x) which minimizes the variance for fixed d is κ(x) = φ(x)/(−log L_φ), which provides a criterion for choosing κ(x).
As we have already mentioned, the asymptotic variance of averages calculated using the pseudo-marginal algorithm will typically decrease as a increases, but the computational time will increase. Therefore, there is an optimal value of a which provides the lowest asymptotic variance for a fixed computational budget (number of evaluations of φ(x)/κ(x)). Doucet et al. (2015) established an upper bound for the asymptotic variance under certain assumptions which allows this optimal value to be derived, and demonstrated that the value can be close to optimal even when the assumptions are violated. In our context, their main assumption is that L̂_φ = L_φ ε, where log ε ∼ N(−σ²/2, σ²). They refer to σ² as the noise variance, and it is straightforward to compute σ² for the Poisson estimator. Doucet et al. (2015) showed that the optimal value of the noise variance (in terms of asymptotic variance), σ²_opt, depends on the properties of the chain, but provide guidelines on how it can be approximated. Following the derivation of Doucet et al. (2015), an optimal value of a, for fixed κ, can be derived. In practice, we have found that the value a = 8 works well for the processes considered in this paper.
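The trade-off between a and the noise variance can be checked empirically. The sketch below assumes a product-form Poisson estimator (N ∼ Pn(aC) factors 1 − φ(x_i)/(aCκ(x_i)), an assumed form consistent with this section) with a toy constant ratio φ/κ = 0.1 and bound C = 0.2, and estimates σ² = Var(log L̂_φ) over a grid of a values.

```python
import math
import random
import statistics

random.seed(4)

def log_estimate(a, C=0.2, ratio=0.1):
    # log of the product-form estimator; with a constant ratio phi/kappa,
    # every factor equals 1 - ratio/(a*C)
    lam = a * C
    n, t = 0, -math.log(random.random())
    while t < lam:  # draw N ~ Pn(lam) via exponential arrivals
        n += 1
        t += -math.log(random.random())
    return n * math.log(1.0 - ratio / lam)

noise_var = []
for a in (2.0, 4.0, 8.0, 16.0):
    logs = [log_estimate(a) for _ in range(50_000)]
    noise_var.append(statistics.variance(logs))
print(noise_var)  # decreasing in a
```

The noise variance shrinks as a grows, while the expected number of φ/κ evaluations d = aC grows proportionally, which is exactly the tension the Doucet et al. (2015) guidelines resolve.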

Examples
Brix (1999) provided a bound for the tail-mass integral of the generalized gamma process, which was extended to the stable-Beta process by Arbel and Prünster (2016). However, neither bound is tight, and we suggest tighter bounds for both processes. Indeed, the estimator introduced in Theorem 3.1 with the proposal in (3.6) requires draws from a Poisson distribution whose mean is proportional to B, where T_ν(z)/κ_ν(z) < B for all z. Therefore, better choices of κ_ν can improve the computational efficiency of the method by allowing a smaller value of B.
As σ → 0 with λ = 1, the generalized gamma process converges to the gamma process, which has Lévy density ν(z) = z^{−1} exp{−z} and T_ν(t) = ∫_t^∞ ν(y) dy = E_1(t), where E_1(t) is the exponential integral function. Both the bounding p.d.f. and the simulation scheme for t < b also converge. It is straightforward to show that the limiting proposal κ_ν has p.d.f. proportional to z exp{−z}, i.e. z ∼ Ga(2, 1) truncated to z > −log b.
If λ = 0, the generalized gamma process is a stable process. However, the tail mass function is infinite for a stable process and this simulation scheme is not possible.
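For the gamma-process limit above, both the identity T_ν(t) = E_1(t) and the truncated Ga(2, 1) proposal are easy to check numerically. The sketch below approximates the tail-mass integral by a trapezoidal rule (the quadrature details are implementation choices, not from the paper) and samples the truncated proposal by rejection.

```python
import math
import random

random.seed(5)

def tail_mass(t, upper=40.0, steps=200_000):
    # trapezoidal approximation of the integral of y^{-1} e^{-y}
    # over [t, t + upper]; the remaining tail beyond t + upper is negligible
    h = upper / steps
    total = 0.5 * (math.exp(-t) / t
                   + math.exp(-(t + upper)) / (t + upper))
    for i in range(1, steps):
        y = t + i * h
        total += math.exp(-y) / y
    return total * h

print(tail_mass(1.0))  # ~ E_1(1) = 0.219384

def sample_truncated_gamma2(b):
    # Ga(2,1) as a sum of two Exp(1) draws, rejected until z > -log(b)
    thresh = -math.log(b)
    while True:
        z = -math.log(random.random()) - math.log(random.random())
        if z > thresh:
            return z

zs = [sample_truncated_gamma2(b=0.5) for _ in range(10_000)]
print(min(zs) > -math.log(0.5))  # True: all draws exceed the truncation point
```

Rejection sampling is adequate here because Ga(2, 1) places substantial mass above −log b for moderate b; for b very close to zero an inverse-CDF scheme would be more efficient.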

Example 1: Discrete regressors
The algorithms developed in this paper are illustrated using an analysis of hematological data arising from a dose-escalation study which has previously been analysed by Müller and Rosner (1997). The data are white blood cell counts over time for a sample of 52 patients receiving different levels of two treatments: cyclophosphamide (CTX) and a second drug (GM-CSF). The data for each patient are summarized as the maximum likelihood estimates from a non-linear regression model with seven parameters fitted to that patient's time profile. The model assumes that the mean response at time t with parameters θ = (z_1, z_2, z_3, τ_1, τ_2, β_1) is given by a piecewise function in which r = (τ_2 − t)/(τ_2 − τ_1) and g(θ, t) = z_2 + z_3/[1 + exp{2.0 − β_1(t − τ_2)}]. The model implies that the white blood cell count is constant (at level z_1) before τ_1, followed by a linear progression between τ_1 and τ_2 and a logistic recovery after τ_2. The parameters z_2 and z_3 control the white blood cell count at the start and end of recovery. De Iorio et al. (2004) applied an ANOVA-DDP model to these data, which assumes a mixture model with constant weights and an ANOVA model for the locations for each treatment. In contrast, we fitted a mixture model with weights that vary with the treatment combination but with locations that do not depend on the treatment level. Specifically, we assume that y_i are the estimated parameters for the i-th patient and that x_{i,1} is the level of CTX and x_{i,2} is the level of GM-CSF. The directing Lévy process is taken to be a gamma process, and the scores follow the two-way ANOVA model with interaction for the logarithm of the weights. This does not place restrictions on the combination of weights but does encourage similar weights for similar combinations of levels. The priors were σ_1² ∼ Ga(1, 2), σ_2² ∼ Ga(1, 2), σ_{1,2}² ∼ Ga(1, 2) and M ∼ Ga(1, 1).
For the purposes of illustration, we set µ_0 equal to the sample mean of the data and Ψ = (ν − 8)Σ̂/9, where Σ̂ is the sample covariance of the data, which implies that the prior mean of Σ is Σ̂/9, and we choose λ = 0.01. The MCMC algorithm was run for a total of 35 000 iterations. The first 5 000 were used as a burn-in, with the subsequent values thinned to every fifth sample. This gave a sample of 6 000 values.
Inference about the marginal densities of the two parameters z_1 and z_2 is shown in Figures 1 and 2. The parameter z_1 is the initial white blood cell count. Its distribution is bi-modal, with the size of the smaller mode increasing with GM-CSF.

Example 2: Continuous Regressors
A regression model is used to define an infinite mixture model with regressor-dependent weights. We observe pairs (x_1, y_1), ..., (x_n, y_n), where x_i ∈ R^p and y_i ∈ R, and use the model in (2.1) where r_1(x), r_2(x), r_3(x), ... are independent Gaussian processes. A generalized gamma directing Lévy process is used with λ = 1 and three values of σ: σ = 0 (a gamma process), σ = 0.1 and σ = 0.5. We apply the model to data from a simulated motorcycle accident used to test crash helmets (Silverman, 1985), which are available as the mcycle data frame in the R package MASS. The data are head accelerations (in g) measured at different times in milliseconds after impact. We assume that the Gaussian processes have a covariance function which depends on the Euclidean distance between regressor values and a lengthscale L. The priors are α ∼ U(0, 1), p(µ, σ²) ∝ σ^{−2}, L ∼ Ga(1, 1), M ∼ Ga(1, 1), and φ^{−1} ∼ Ga(1, 4). The prior for φ is chosen so that log m_k(x_i) typically takes values in (−4, 4). The MCMC algorithm was run for a total of 33 000 iterations. The first 3 000 were used as a burn-in, with the subsequent values thinned to every third sample. This gave a sample of 10 000 values. Figure 4 shows the posterior mean of the conditional density of head acceleration given time from impact for the three values of σ, with the data superimposed. In each case, the model was able to follow the data and capture the changing heterogeneity in the variance. The inference seems robust to the choice of σ.
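The construction of regressor-dependent weights w_k(x) = m_k(x)J_k / ∑_l m_l(x)J_l with Gaussian process scores can be sketched directly. The squared-exponential covariance below is an assumed illustrative choice (the paper's covariance function is not reproduced above), and the jumps are a fixed decreasing sequence at an arbitrary truncation level rather than draws from a generalized gamma process.

```python
import math
import random

random.seed(6)

xs = [i / 20.0 for i in range(21)]  # regressor grid on [0, 1]
phi_var, L = 1.0, 0.2               # assumed GP variance and lengthscale
K = 15                              # illustrative truncation level
J = [math.exp(-0.5 * k) for k in range(K)]  # illustrative decreasing jumps

def sample_gp(xs):
    # sample a zero-mean GP on the grid via a Cholesky factor of the
    # (jittered) squared-exponential covariance matrix
    n = len(xs)
    cov = [[phi_var * math.exp(-((xs[i] - xs[j]) / L) ** 2)
            + (1e-6 if i == j else 0.0) for j in range(n)]
           for i in range(n)]
    chol = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = cov[i][j] - sum(chol[i][t] * chol[j][t] for t in range(j))
            chol[i][j] = math.sqrt(s) if i == j else s / chol[j][j]
    z = [random.gauss(0.0, 1.0) for _ in range(n)]
    return [sum(chol[i][t] * z[t] for t in range(i + 1)) for i in range(n)]

r = [sample_gp(xs) for _ in range(K)]   # r_k(x); scores are exp{r_k(x)}
w = []
for i in range(len(xs)):
    unnorm = [math.exp(r[k][i]) * J[k] for k in range(K)]
    tot = sum(unnorm)
    w.append([u / tot for u in unnorm])

print(all(abs(sum(wi) - 1.0) < 1e-12 for wi in w))  # True: weights normalize
```

Because the scores vary smoothly in x, the normalized weights change smoothly too, which is what lets the mixture track the changing heterogeneity in the mcycle data.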
Trace plots for the three parameters M, φ and L for the case σ = 0 (the gamma process) are shown in Figure 5. These show that the parameters are mixing well across the MCMC chain.

Comparison of predictive performance
We ran a simulation exercise to understand how the NCoRM regression model developed in this paper compares to two commonly used dependent nonparametric priors: the single-p dependent Dirichlet process (De Iorio et al., 2004) and a probit stick-breaking process mixture (Rodriguez and Dunson, 2011). The methods were compared by 10-fold cross-validation using simulated data sets of size 100 (leading to training data sets with 90 observations) according to the out-of-sample log-predictive score (LPS), in which the predictive density of each testing sample y^test_{i,1}, ..., y^test_{i,10} is evaluated conditional on the i-th training sample y^train_{i,1}, ..., y^train_{i,90}. Three sets of data were simulated to cover different modelling situations.
The first two data sets used regressor-dependent mixture models covering the simple case of a two-component mixture and a more complicated scenario with four components. The third data set used a non-linear regression model which depends on four parameters a, b, c and d. Different values of the parameters lead to different features of the data, such as homoscedasticity or heteroscedasticity, jumps, or different levels of smoothness. In all cases, there was a single regressor which was generated uniformly on (0, 1). Detailed descriptions of the data sets are given below.
We consider two values of σ (0.1 and 0.5), which allow for different levels of separation between the two mixture components, and two values of r (1 and 2), which control the rate at which p(s = 1) changes over the range of x.
• Simulated Data Sets III: the responses were simulated from the non-linear regression model with parameters a, b, c and d described above.

Across all simulation scenarios, the methods were ranked in the same order, with the DDP giving the best performance and the NCoRM mixture outperforming the probit stick-breaking mixtures.
The difference between the DDP and the NCoRM was largest for the model without jumps and with homoscedasticity (c = 0 and d = 0). This is not surprising, since a Gaussian process with normal errors would provide a good approximation of the sine curve, and models which only allow dependence through the weights can only approximate the curve using piecewise constant fits. If there are jumps, the advantage of the DDP over the NCoRM was reduced. In all cases, the NCoRM mixture substantially outperformed the probit stick-breaking mixture. This reflects the construction of the probit stick-breaking process. In all stick-breaking processes, the weights are stochastically ordered, and the probit stick-breaking process assumes that the atom with the largest a priori expected weight does not depend on x. Although the data can change the order a posteriori, this ordering persists in data sets of the size considered in these simulated examples.
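The log-predictive scores underlying this comparison are simple to compute once each model's predictive density has been evaluated at the held-out points. The sketch below assumes an LPS defined as the negative log-predictive density summed over held-out points and averaged over the 10 folds (the exact normalisation used in the paper is an assumption here).

```python
import math

def lps(pred_dens):
    # pred_dens[i][j] = predictive density at the j-th held-out point of
    # fold i, conditional on that fold's training data
    n_folds = len(pred_dens)
    total = sum(math.log(p) for fold in pred_dens for p in fold)
    return -total / n_folds

# Sanity check with a constant predictive density of 0.5 everywhere:
# each of the 10 folds contributes 10 * log(2).
fake = [[0.5] * 10 for _ in range(10)]
print(lps(fake))  # 10 * log(2) = 6.931...
```

Lower LPS values indicate better out-of-sample predictive performance, which is the sense in which the DDP "wins" in the comparison above.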
The results of these simulations suggest some guidelines which can be used in more general situations. The model displayed in equation (1.1) is very general, but in this full form it is rarely used in real situations. By allowing both the parameters θ_k and the weights w_k to depend on the regressor, the model becomes extremely flexible and prone to overfitting the data. Therefore, the models used in practice are typically special cases of the model displayed in equation (1.1).

Conclusions
Normalized compound random measures are a large class of dependent nonparametric processes. The jumps of the processes are expressed as the product of a jump from a Lévy process and a random score. This allows the dependence of the nonparametric processes to be modelled through the dependence in the random scores.

A Proofs
Proof of Theorem 3.1.
B Additional details of computational methods

Updating J_k

Update J_k from its full conditional distribution. This variable can be updated in closed form if ν_ξ is the Lévy density of a generalized gamma process, or by an adaptive Metropolis-Hastings random walk (Atchadé and Rosenthal, 2005) if the full conditional does not have a closed form.
Updating m_k

Update m_k from its full conditional distribution. This variable can be updated using an adaptive Metropolis-Hastings random walk (Atchadé and Rosenthal, 2005) if the full conditional does not have a closed form.
Updating v_i

The parameter v_i is updated using an interweaving step (Yu and Meng, 2011). The first part of the step updates v_i from its full conditional distribution using an adaptive Metropolis-Hastings random walk (Atchadé and Rosenthal, 2005). The second part of the step uses the re-parameterization J̃_i = v_1 J_i and ṽ_i = v_i/v_1 for i > 1, which implies that J̃_j ṽ_i = J_j v_i. The full conditional of v_1 (conditioning on ṽ_i and J̃_i) is known up to proportionality, and v_1 is updated using an adaptive Metropolis-Hastings random walk (Atchadé and Rosenthal, 2005), where L̂ is calculated conditional on v_1 and ṽ_i for i > 1. The proposed values are accepted with the usual Metropolis-Hastings acceptance probability.

Updating ξ
The full conditional distribution of ξ involves L̂ and does not have a closed form. This parameter can be updated using an adaptive Metropolis-Hastings random walk (Atchadé and Rosenthal, 2005).

Updating M
The full conditional distribution of M is known up to proportionality. This parameter can be updated using an adaptive Metropolis-Hastings random walk (Atchadé and Rosenthal, 2005).

Updating τ
The full conditional distribution of τ has density proportional to p(τ) L̂ ∏_{k=1}^{K} p(m_k | τ). This parameter can be updated using an adaptive Metropolis-Hastings random walk (Atchadé and Rosenthal, 2005). We have found that an interweaved update (Yu and Meng, 2011) can lead to much better mixing. If m_k(x) is a Gaussian process and τ is the stationary variance, then we can write m_{k,i} = √τ m^{(s)}_{k,i}. The interweaved update uses an adaptive Metropolis-Hastings step in which L̂ is calculated for the proposed value of τ under the corresponding full conditional density.