Posterior sampling from ε-approximation of normalized completely random measure mixtures

This paper adopts a Bayesian nonparametric mixture model where the mixing distribution belongs to the wide class of normalized homogeneous completely random measures. We propose a truncation method for the mixing distribution by discarding the weights of the unnormalized measure smaller than a threshold. We prove convergence in law of our approximation, provide some theoretical properties, and characterize its posterior distribution so that a blocked Gibbs sampler is devised. The versatility of the approximation is illustrated by two different applications. In the first, the normalized Bessel random measure, encompassing the Dirichlet process, is introduced; goodness-of-fit indexes show its good performance as a mixing measure for density estimation. The second describes how to incorporate covariates in the support of the normalized measure, leading to a linear dependent model for regression and clustering.


Introduction
One of the liveliest topics in Bayesian Nonparametrics concerns mixtures of parametric densities where the mixing measure is an almost surely discrete random probability measure. The basic model is what is known as the Dirichlet process mixture model, which first appeared in [33], where the mixing measure is indeed the Dirichlet process. Dating back to [25] and [32], many alternative mixing measures have been proposed; the former paper replaced the Dirichlet process with stick-breaking random probability measures, while the latter focused on normalized completely random measures. These hierarchical mixtures play a pivotal role in modern Bayesian Nonparametrics, and their popularity is mainly due to their high flexibility in density estimation problems as well as in clustering, which is naturally embedded in the model.
In some statistical applications, the clustering induced by the Dirichlet process as mixing measure may be restrictive. In fact, it is well-known that the latter allocates observations to clusters with probabilities depending only on the cluster sizes, leading to the "rich get richer" behavior. Within some classes of more general processes, such as stick-breaking and normalized processes, the probability of allocating an observation to a specific cluster depends also on extra parameters, as well as on the number of groups and on the cluster size. We refer to [4] for a recent review of the state of the art on Bayesian nonparametric mixture models and clustering.
Since posterior inference for Bayesian nonparametric mixtures involves an infinite-dimensional parameter, this may lead to computational issues. However, there is a recent and lively literature focusing mainly on two different classes of MCMC algorithms, namely marginal and conditional Gibbs samplers. The former integrates out the infinite dimensional parameter (i.e. the random probability), resorting to generalized Polya urn schemes; see [17] or [34]. The latter includes the nonparametric mixing measure in the state space of the Gibbs sampler, updating it as a component of the algorithm; this class includes the slice sampler [see 23]. Among conditional algorithms there are truncation methods, where the infinite parameter (i.e. the mixing measure) is approximated by truncating the infinite sums defining the process, either a posteriori [5,9] or a priori [3,24].
In this work we introduce an almost surely finite dimensional class of random probability measures that approximates the wide family of homogeneous normalized completely random measures [41,29]; we use this class as the building block in mixture models and provide a simple but general truncation algorithm to perform posterior inference. Our approximation is based on the constructive definition of the weights of the completely random measure as the points of a Poisson process on R^+. In particular, we consider only points larger than a threshold ε, controlling the degree of approximation. Conditionally on ε, our process is finite dimensional both a priori and a posteriori.
Here we illustrate two applications. In the first one, a new choice for the Lévy intensity ρ, characterizing the normalized completely random measure, is proposed: the Bessel intensity function that, to the best of our knowledge, has never been applied in a statistical framework, although it is known in finance [see 7, for instance]. We call this new process the normalized Bessel random measure. In the second application, we set ρ to be the well-known generalized gamma intensity and consider a centering measure P_0x depending on a set of covariates x, yielding a linear dependent normalized completely random measure. For a recent survey on dependent nonparametric processes in the Statistics and Machine Learning literature see [20].
In this paper, since the main objective is the approximation of the nonparametric process arising from the normalization of completely random measures, we fix ε to a small value. However, it is worth mentioning that it is possible to choose a prior for ε, although the computational cost might greatly increase for some choices of the intensity ρ.
The new achievements of this paper can be summarized as follows: (i) a generalization of the ε-approximation given in [3] for the NGG process to the whole family of normalized homogeneous completely random measures, (ii) a different technique providing the posterior distribution (and the exchangeable partition probability function) of this new random probability measure, making use of Palm's formula, and (iii) the introduction of the normalized Bessel random measure as mixing measure in Bayesian nonparametric mixtures.
In particular, after the introduction of the finite dimensional ε-approximation of a normalized completely random measure, we derive its posterior and show that the ε-approximation converges to its infinite dimensional counterpart (Section 3). Then we provide a Gibbs sampler for the ε-approximation hierarchical mixture model (Section 4). Section 5 illustrates some criteria to choose the approximation parameter ε. Section 6.1 is devoted to the introduction of the normalized Bessel random measure and some of its properties, while Section 6.2 discusses an application of ε-Bessel mixture models to both simulated and real data. Section 7 defines linear dependent ε-NGGs, and considers linear dependent ε-NGG mixtures to fit the AIS data set. To complete the set-up of the paper, Section 2 is devoted to a summary of basic notions about homogeneous normalized completely random measures, and Section 8 contains a conclusive discussion.

Preliminaries on normalized completely random measures
Let us briefly recall the definition of a homogeneous normalized completely random measure. Let Θ ⊂ R^m for some positive integer m. A random measure μ on Θ is completely random if for any finite sequence B_1, B_2, ..., B_k of disjoint sets in B(Θ), the random variables μ(B_1), μ(B_2), ..., μ(B_k) are independent. A purely atomic (with no fixed atoms) completely random measure (c.r.m.) is defined [see 30, Section 8.2] by

μ(B) = ∫_{R^+ × B} s N(ds, dτ), B ∈ B(Θ), (2.1)

where N is a Poisson process on R^+ × Θ with mean intensity ν(ds, dτ). A completely random measure is homogeneous if ν(ds, dτ) = ρ(s)ds κP_0(dτ), where ρ(s) is the density of a non-negative measure on R^+, and κP_0 is a finite measure on Θ with total mass κ > 0. If

∫_0^{+∞} min{1, s} ρ(s) ds < +∞, (2.2)

then μ is characterized by the so-called Lévy–Khintchine representation: for any non-negative function f, the Laplace functional Ψ_μ of μ is given by

Ψ_μ(f) = E[e^{−∫_Θ f dμ}] = exp{ −κ ∫_Θ ∫_0^{+∞} (1 − e^{−s f(τ)}) ρ(s) ds P_0(dτ) }; (2.3)

in this case ν(ds, dτ) is called the Lévy intensity measure. Furthermore, we assume that ρ satisfies the following regularity condition:

∫_0^{+∞} ρ(s) ds = +∞, (2.4)

so that the total number of points of the process, N(R^+ × Θ), is Poisson distributed with mean ∫_{R^+ × Θ} ν(ds, dτ) = κ ∫_{R^+} ρ(s) ds = +∞. This implies that any homogeneous completely random measure under (2.2) and (2.4) can be represented as μ(·) = Σ_{j≥1} J_j δ_{τ_j}(·). Since μ is homogeneous, the support points {τ_j} and the jumps {J_j} of μ are independent; the τ_j's are independent and identically distributed (iid) random variables from P_0, while the {J_j} are the points of a Poisson process on R^+ with mean intensity κρ. Moreover, if T := μ(Θ) = Σ_{j≥1} J_j, then by (2.2) and (2.4) we have P(0 < T < +∞) = 1.
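As a concrete check of (2.3), take the gamma-process intensity ρ(s) = s^{−1}e^{−s} (used here only as an illustration) and f = u·1_B for u > 0 and B ∈ B(Θ); the inner integral is then a classical Frullani-type integral:

```latex
\int_0^{+\infty}\bigl(1-\mathrm{e}^{-us}\bigr)\,\frac{\mathrm{e}^{-s}}{s}\,ds
  \;=\; \log(1+u),
\qquad\text{so}\qquad
\Psi_\mu(u\mathbf{1}_B)
  \;=\; \mathrm{e}^{-\kappa P_0(B)\log(1+u)}
  \;=\; (1+u)^{-\kappa P_0(B)},
```

i.e. μ(B) ∼ gamma(κP_0(B), 1), recovering the gamma c.r.m., whose normalization (2.5) below is the Dirichlet process.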
Therefore, a random probability measure (r.p.m.) P can be defined through normalization of μ:

P(·) = μ(·)/T = μ(·)/μ(Θ). (2.5)

We refer to P in (2.5) as a (homogeneous) normalized completely random measure with parameter (ρ, κP_0). As an alternative notation, following [27], P is referred to as a homogeneous normalized measure with independent increments. The definition of normalized completely random measures first appeared in [41]. An alternative construction of normalized completely random measures can be given in terms of Poisson–Kingman models as in [39].

ε-approximation of normalized completely random measures
The goal of this section is the definition of a finite dimensional random probability measure that approximates a general normalized completely random measure with Lévy intensity ν(ds, dτ) = ρ(s)ds κP_0(dτ), introduced above. First of all, by the Restriction Theorem for Poisson processes, for any ε > 0 the jumps {J_j} of μ larger than a threshold ε are still a Poisson process, with mean intensity γ_ε(s) := κρ(s)I_(ε,+∞)(s). Moreover, the total number of these points is Poisson distributed, i.e. N_ε ∼ Poisson(Λ_ε), where Λ_ε := κ ∫_ε^{+∞} ρ(s) ds. Since Λ_ε < +∞ for any ε > 0 by (2.2), N_ε is almost surely finite. In addition, conditionally on N_ε, the points {J_1, ..., J_{N_ε}} are iid from the density

ρ_ε(s) := ρ(s) I_(ε,+∞)(s) / ∫_ε^{+∞} ρ(v) dv, (3.1)

so that μ̃_ε := Σ_{j=1}^{N_ε} J_j δ_{τ_j} is a c.r.m. with Lévy intensity

ν_ε(ds, dτ) := γ_ε(s) ds P_0(dτ). (3.2)

However, it is not worth trying to normalize μ̃_ε, since μ̃_ε(B) = 0 for any B if N_ε = 0. We consider, instead, the c.r.m. μ_ε defined by

μ_ε := J_0 δ_{τ_0} + μ̃_ε = J_0 δ_{τ_0} + Σ_{j=1}^{N_ε} J_j δ_{τ_j}, (3.3)

where (J_0, τ_0) is independent of {(J_j, τ_j), j ≥ 1}, and J_0 and τ_0 are independent with densities ρ_ε and P_0, respectively. Summing up, normalizing μ_ε we define

P_ε(·) := μ_ε(·)/T_ε, where T_ε := μ_ε(Θ). (3.4)

Increasing Lévy processes are completely random measures for Θ = R (or R^+). Therefore, it is worth mentioning some literature on ε-approximation of such processes in the financial context. In particular, the book by Asmussen and Glynn [6, Chapter XII] provides a justification for the approximation of infinite activity Lévy processes by compound Poisson processes: any Lévy jump process J on R can be represented as the sum of two independent Lévy processes J = J^1 + J^2, where the Lévy measures of J^1 and J^2 are the restrictions of the "whole" Lévy measure to (−ε, ε) and (−∞, −ε] ∪ [ε, +∞), respectively.
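The construction above is straightforward to simulate. The sketch below (a minimal illustration, not the paper's code) draws the unnormalized jumps of μ_ε for the gamma-process intensity ρ(s) = s^{−1}e^{−s}: Λ_ε is computed by quadrature, N_ε is Poisson, and the N_ε + 1 jumps (including J_0) are drawn from ρ_ε by rejection from a shifted exponential; the values eps = 0.05 and kappa = 1.0 are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)

def rho(s):
    # Illustrative gamma-process intensity rho(s) = s^{-1} e^{-s};
    # its normalization gives (an approximation of) the Dirichlet process.
    return np.exp(-s) / s

def lambda_eps(eps, kappa, upper=60.0, n_grid=100_000):
    # Lambda_eps = kappa * int_eps^infty rho(s) ds, by trapezoidal quadrature
    # on a geometric grid (the tail beyond `upper` is negligible).
    s = np.geomspace(eps, upper, n_grid)
    f = rho(s)
    return kappa * float(np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(s)))

def sample_jumps(eps, kappa):
    """Unnormalized jumps of mu_eps: N_eps ~ Poisson(Lambda_eps) points from
    the density rho_eps in (3.1), plus the extra jump J_0 (also from rho_eps)."""
    n_eps = rng.poisson(lambda_eps(eps, kappa))
    jumps = []
    while len(jumps) < n_eps + 1:        # the +1 accounts for J_0
        s = eps + rng.exponential(1.0)   # proposal: Exp(1) shifted to (eps, inf)
        if rng.uniform() < eps / s:      # rejection step: rho_eps(s) / envelope
            jumps.append(s)
    return np.array(jumps)

J = sample_jumps(eps=0.05, kappa=1.0)
weights = J / J.sum()                    # weights of P_eps; atoms are iid from P_0
```

The rejection step is valid because the ratio of ρ_ε to the shifted-exponential proposal is proportional to 1/s, which is bounded by 1/ε on (ε, +∞).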
When considering the homogeneous completely random measure μ under (2.2) and (2.4), as here, this theory yields that μ is the sum of two independent homogeneous completely random measures μ_(0,ε] and μ̃_ε, corresponding to mean intensities ρ(s)I_(0,ε](s) and ρ_ε as in (3.1), respectively. Note that μ̃_ε is the c.r.m. in the right-hand side of (3.3). The basic idea of the ε-approximation is that, if ε is small enough, μ_(0,ε] can be neglected and μ can be approximated by μ̃_ε; see [6, Chapter XII] and [43]. The approach to ε-approximation taken here is similar, though not identical, since we first add the random mass J_0 at the random point τ_0 to μ̃_ε to define the c.r.m. μ_ε as in (3.3). The r.p.m. P_ε in (3.4) is then defined by normalization of μ_ε. We will show in Proposition 3.3 that P_ε converges in distribution to P as ε goes to 0, but the basic idea of the approximation is that the point mass we add to μ̃_ε is negligible; see Section 5.
Several other methods have been proposed in order to approximate a normalized measure. First of all, we mention the inverse Lévy measure method, referred to as the Ferguson–Klass representation [19] in this context, representing the Poisson process of the jumps of a subordinator as a series of transformed (via the survival function of the Lévy intensity) points of a unit rate Poisson process. Of course, to get implementable simulation algorithms, the series expansion has to be truncated at a fixed and large integer N, or whenever the new jump to be added to the series is smaller than a threshold ε. In the latter case, the truncation rule would yield only jumps of size greater than ε, obtaining an algorithm that is similar to that proposed here; see [6, Chapter XII]. On the other hand, [2] proposes a truncation rule for the series representation at a fixed integer N, quantifying the error through a moment-matching criterion, i.e. evaluating a measure of discrepancy between the actual moments of the whole series and the moments of the truncated sum based on the simulation output. More series representations of the jump process can be considered, with corresponding truncation rules; see [11] and [42]. Alternatively, [43] proposed a novel class of r.p.m.'s which is dense in the class of homogeneous normalized completely random measures. These authors first approximate any c.r.m. μ with μ̃_ε which, as we have already mentioned, has finite Lévy measure. Then, resorting to the "denseness" of the novel class, they approximate μ̃_ε with an element of this class, with Lévy intensity given by the weighted sum of a finite number of intensities of finite activity processes, plus the intensity of the gamma process.
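For comparison, the Ferguson–Klass representation mentioned above can be sketched as follows, again for the illustrative gamma intensity ρ(s) = s^{−1}e^{−s}, with the tail mass N(v) = ∫_v^{+∞} ρ(s) ds inverted numerically; the ε-stopping rule keeps exactly the jumps larger than ε, in decreasing order. Grid sizes and seeds are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)

def tail_mass(v, upper=60.0, n_grid=20_000):
    # N(v) = int_v^infty rho(s) ds for the gamma intensity rho(s) = s^{-1} e^{-s}
    s = np.geomspace(v, upper, n_grid)
    f = np.exp(-s) / s
    return float(np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(s)))

def inv_tail(xi, lo=1e-12, hi=60.0, iters=100):
    # Solve N(v) = xi for v by bisection on the log scale (N is decreasing)
    for _ in range(iters):
        mid = np.sqrt(lo * hi)
        if tail_mass(mid) > xi:
            lo = mid
        else:
            hi = mid
    return hi

def ferguson_klass(eps):
    """Decreasing jumps J_j = N^{-1}(xi_j), with xi_1 < xi_2 < ... the arrival
    times of a unit-rate Poisson process; stop once a jump falls below eps."""
    jumps, xi = [], 0.0
    while True:
        xi += rng.exponential(1.0)
        j = inv_tail(xi)
        if j < eps:
            return np.array(jumps)
        jumps.append(j)

J = ferguson_klass(eps=0.05)
```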
The remaining values are non-allocated jumps. We use the superscript (na) for random variables related to non-allocated jumps. The first result is a characterization of the posterior law of the random measure μ_ε, not yet normalized; however, we first need to introduce two more ingredients. We consider an auxiliary random variable U such that U | μ_ε ∼ Gamma(n, T_ε), so that the marginal density of U is given in (3.5), where the last equality follows easily from the definition of T_ε and (2.3), using the notation defined in (3.8). We also formulate the following lemma, whose proof is straightforward.
Let μ̃_ε be the c.r.m. with Lévy intensity ν_ε as in (3.2), and let μ_ε be defined as in (3.3). Consider a c.r.m. μ̄ such that, for any positive f, its Laplace functional factorizes as in (3.6), where the second factor is the Laplace functional of the random measure J_0 δ_{τ_0}.
The posterior distribution of μ ε has the following characterization.
Then the conditional distribution of P_ε, given θ* and U = u, is obtained by normalization of the random measure in (3.8), where:

2. the process of non-allocated jumps μ^(na)_{ε,u}(·) is distributed as the c.r.m. μ̄ defined in (3.6), corresponding to the Lévy intensity in (3.7) given by e^{−us} ν_ε(ds, dτ);
3. the processes of allocated and non-allocated jumps are independent, conditionally on l* = (l*_1, ..., l*_k), the vector of locations of the allocated jumps;
4. the posterior law of U given θ* has density on the positive real numbers given by (3.9).
The proof of the above theorem, as well as of all the others in this section, is in Appendix B. An immediate consequence of Theorem 3.1 is the next proposition.
Both the infinite and finite dimensional processes defined in (2.5) and (3.4), respectively, belong to the wide class of species sampling models, deeply investigated in [38], and we use some of the results there to derive ours. Let (θ_1, ..., θ_n) be a sample from (2.5) or (3.4) (or, more generally, from a species sampling model); since it is a sample from a discrete probability, it induces a random partition p_n, and the marginal law of (θ_1, ..., θ_n) is uniquely characterized by the exchangeable partition probability function (eppf) p associated with the random probability. The eppf p is a probability law on the set of the partitions of N_n = {1, ..., n}. The following proposition provides an expression for the eppf of a general ε-NormCRM.
Convergence of the sequence of eppfs yields convergence of the sequence of ε-NormCRMs, generalizing a result obtained for ε-NGG processes.
The proof of the above proposition is along the same lines as the proof of Proposition 1 in [3], and therefore it is omitted here.
Furthermore, the m-th moment of P_ε, m = 1, 2, ..., is equal to

E[(P_ε(B))^m] = E[(P_0(B))^{K_m}], (3.12)

where B ∈ B(Θ) and K_m is the number of distinct values in a sample of size m from P_ε. In particular, when m = 2, K_m assumes values in {1, 2}, and the probability that K_2 = 1 is the probability that, in a sample of size 2 from P_ε, the two sampled values coincide, i.e. p_ε(2). Therefore E(P_ε(B)^2) = P_0(B) p_ε(2) + (P_0(B))^2 (1 − p_ε(2)), and consequently

Var(P_ε(B)) = p_ε(2) P_0(B)(1 − P_0(B)). (3.13)

Analogously, the covariance structure of P_ε is as follows:

Cov(P_ε(B_1), P_ε(B_2)) = p_ε(2) ( P_0(B_1 ∩ B_2) − P_0(B_1) P_0(B_2) ), (3.14)

for any B_1, B_2 ∈ B(Θ). Proofs of (3.12) and (3.14) are given in Appendix B.

ε-NormCRM process mixtures
We consider mixtures of parametric kernels as the distribution of the data, where the mixing measure is the ε-NormCRM(ρ, κP_0). The model we assume is the following:

Y_i | θ_i ∼ f(·; θ_i) independently, i = 1, ..., n;  θ_i | P_ε iid ∼ P_ε;  P_ε ∼ ε-NormCRM(ρ, κP_0), (4.1)

where f(·; θ) is a parametric family of densities on Y ⊂ R^p, for all θ ∈ Θ ⊂ R^m. Recall that P_0 is a non-atomic probability measure on Θ, such that E(P_ε(A)) = P_0(A) for all A ∈ B(Θ) and ε ≥ 0. Model (4.1) will be addressed here as the ε-NormCRM hierarchical mixture model. The design of a Gibbs scheme to sample from the posterior distribution of model (4.1) is straightforward, once we have augmented the state space with the variable u, by using the posterior characterization in Theorem 3.1. The Gibbs sampler generalizes the one provided in [3] for ε-NGG mixtures, but it is designed for any Lévy intensity ρ under (2.2) and (2.4). The full-conditionals are described below, and further details can be found in Appendix A.
1. Sampling from L(u | Y, θ, P_ε, ε): it is clear that, conditionally on P_ε, u is independent of the other variables and gamma-distributed with parameters (n, T_ε).

2. Sampling from L(θ | u, P_ε, ε, Y): since P_ε is almost surely discrete, each θ_i is drawn from a discrete distribution supported at the atoms of P_ε; see Appendix A.
3. Sampling from L(P ε , ε|u, θ, Y ): this step is not straightforward and can be split into two consecutive substeps: 3.a Sampling from L(ε|u, θ, Y ): see Appendix A.
3.b To put this into practice, we have to sample (i) the number N^(na) of non-allocated jumps, (ii) the vector of unnormalized non-allocated jumps J^(na), (iii) the vector of unnormalized allocated jumps J^(a), (iv) the support points of the allocated jumps, and (v) the support points of the non-allocated jumps. See Appendix A for a fuller description.
We highlight that, when sampling from non-standard distributions, Accept-Reject or Metropolis-Hastings algorithms have been exploited.

Some ideas on the choice of ε
A brief discussion on the choice of the approximation parameter ε is worthwhile. We could also consider it random, as we did in [3], where the ε-NGG mixture model was proposed. In our general view, this parameter can be considered either as a true parameter, in which case it should be fixed on the grounds of the prior information we have, or as a tuning parameter to approximate the "exact" model (normalized completely random measure mixtures). If we prefer the latter alternative, as we do here, ε has to be small. However, since the result on ε-approximation (Theorem 3.3) concerns the prior distribution in (4.1), the only suggestions we can give refer to a priori criteria. Here we suggest setting ε such that the sum of the masses μ((0, ε]) and J_0 we perturb μ with, obtaining μ_ε, is small. In particular, since the interest is in normalized random measures, "small" is fixed with respect to the expectation E(T) of the total mass of μ, i.e. we choose ε such that

r(ε) := ( E[μ((0, ε])] + E(J_0) ) / E(T) ≤ ν, (5.1)

where ν is typically a small value. Alternative criteria are also available; for instance, as in [3], we could choose ε to achieve a prefixed value of E(N_ε) or Var(N_ε). As far as (5.1) is concerned, observe that

E[(μ((0, ε]))^2] = κ ∫_0^ε s^2 ρ(s) ds + ( κ ∫_0^ε s ρ(s) ds )^2 → 0 as ε → 0,

i.e. the r.v. μ((0, ε]) converges to 0 in L^2, which implies convergence in probability. Besides, we have that

E(J_0) = ∫_ε^{+∞} s ρ(s) ds / ∫_ε^{+∞} ρ(s) ds = κ ∫_ε^{+∞} s ρ(s) ds / Λ_ε.

Consequently, when ε → 0, Λ_ε (and hence E(N_ε)) diverges to +∞, while the numerator stays finite by (2.2), and thus E(J_0) converges to 0.
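For the illustrative gamma intensity ρ(s) = s^{−1}e^{−s}, the quantities entering this criterion are almost in closed form (E[μ((0, ε])] = κ(1 − e^{−ε}), E(T) = κ, and E(J_0) = e^{−ε}/E_1(ε), with E_1 the exponential integral), so the ratio can be tabulated directly; the function name `r` and the choice of intensity are ours, not the paper's.

```python
import numpy as np

def e1(eps, upper=60.0, n_grid=100_000):
    # Exponential integral E1(eps) = int_eps^infty e^{-s}/s ds (quadrature)
    s = np.geomspace(eps, upper, n_grid)
    f = np.exp(-s) / s
    return float(np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(s)))

def r(eps, kappa):
    """Ratio of the mean perturbing mass to E(T) for rho(s) = s^{-1} e^{-s}:
    E[mu((0, eps])] = kappa * (1 - e^{-eps}),  E(T) = kappa, and
    E(J_0) = (int_eps^inf s rho ds) / (int_eps^inf rho ds) = e^{-eps} / E1(eps)."""
    mass_below = kappa * (1.0 - np.exp(-eps))
    mean_j0 = np.exp(-eps) / e1(eps)
    return (mass_below + mean_j0) / kappa

# The ratio shrinks as eps decreases, so eps can be tuned to meet a target nu
print(r(1e-2, 1.0), r(1e-6, 1.0))
```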

Normalized Bessel random measure mixtures: density estimation
In this section we introduce a new normalized process, called normalized Bessel random measure. Section 6.1 describes theoretical results: in particular, we show that this family encompasses the well-known Dirichlet process. Then we fit the mixture model to synthetic and real datasets in Section 6.2. Results are illustrated through a density estimation problem.

Definition
Let us consider a normalized completely random measure corresponding to the mean intensity

ρ(s) = e^{−ωs} I_0(s) / s, s > 0, (6.1)

where ω ≥ 1 and I_ν(s) = Σ_{m≥0} (s/2)^{2m+ν} / (m! Γ(m + ν + 1)) denotes the modified Bessel function of order ν [see 15, Sect. 7.2.2]. It is straightforward to see that, for s > 0,

ρ(s) = (1/s) e^{−ωs} + Σ_{m≥1} ρ_m(s), where ρ_m(s) := s^{2m−1} e^{−ωs} / (2^{2m} (m!)^2), (6.2)

i.e. ρ is the sum of the intensity of a gamma process and of the intensities ρ_m of finite activity processes. Accordingly, the total mass can be decomposed as

T = T_0 + Σ_{m≥1} T_m, (6.3)

where T_0 is the total mass of a gamma process and T_m = Σ_{j=1}^{N_m} J_j^{(m)}, the {J_j^{(m)}} being the points of a Poisson process on R^+ with intensity κρ_m. By this notation we mean that T_m is equal to 0 when N_m = 0, while, conditionally on N_m > 0, the J_j^{(m)} are iid gamma(2m, ω). We can write down the density function of T via (2.3); the same expression is obtained when considering the Bessel function density [18]. By (3.10), the eppf of the normalized Bessel random measure is given by (6.4), which involves the hypergeometric series 2F1 [see 22, formula (9.100)].
The following proposition shows that the eppf of the normalized Bessel random measure converges to the eppf of the Dirichlet process as the parameter ω increases. The proof is given in Appendix B. The prior distribution of K_n, the number of distinct values in a sample of size n from the normalized Bessel random measure, could be derived from its eppf in (6.4). However, this is not an easy task from a computational point of view, so we prefer to use a Monte Carlo strategy to simulate from the prior of K_n. The simulation strategy is also useful for understanding the meaning of the parameters of the normalized Bessel random measure: κ has the usual interpretation of a mass parameter, since, when fixing ω, E(K_n) increases with κ. On the other hand, the effect of ω is quite peculiar: decreasing ω (thus drifting apart from the Dirichlet process), with κ fixed, shifts the prior distribution of K_n towards smaller values. However, when E(K_n) is kept fixed, the distribution has heavier tails if ω is small (see Figures 2 and 4(a)).
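The Monte Carlo strategy for the prior of K_n can be sketched generically: draw the weights of P_ε, sample n labels from them, and count the distinct values. In the sketch below we use the gamma intensity as a stand-in jump sampler (replacing it with a sampler for the Bessel intensity (6.1) is the only change needed); all numerical settings are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)

EPS, KAPPA = 0.05, 1.0

# Lambda_eps for the stand-in gamma intensity rho(s) = s^{-1} e^{-s},
# computed once by trapezoidal quadrature on a geometric grid
_s = np.geomspace(EPS, 60.0, 100_000)
_f = np.exp(-_s) / _s
LAM = KAPPA * float(np.sum(0.5 * (_f[1:] + _f[:-1]) * np.diff(_s)))

def sample_weights():
    # One prior draw of the weights of P_eps: N_eps Poisson jumps plus J_0,
    # all drawn from rho_eps via rejection sampling
    n_eps = rng.poisson(LAM)
    jumps = []
    while len(jumps) < n_eps + 1:
        s = EPS + rng.exponential(1.0)
        if rng.uniform() < EPS / s:
            jumps.append(s)
    j = np.array(jumps)
    return j / j.sum()

def prior_mean_kn(n, n_rep=500):
    """Monte Carlo estimate of E(K_n): draw weights, sample n cluster labels,
    count the distinct ones, and average over replications."""
    ks = []
    for _ in range(n_rep):
        w = sample_weights()
        labels = rng.choice(w.size, size=n, p=w)
        ks.append(np.unique(labels).size)
    return float(np.mean(ks))

est = prior_mean_kn(n=50)
```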
The Lévy intensity (6.1) of the normalized Bessel completely random measure is similar in form to the intensity corresponding to an element of the class C in [43]. Both intensities are linear combinations of the gamma-process intensity and of terms of the type s^{i−1} e^{−ωs} I_(0,+∞)(s), corresponding to finite activity Poisson processes. Here, the intensity of the Bessel r.p.m. corresponds to an infinite mixture with fixed weights, where the indexes i are even integers (see (6.2)), while [43] assumes a linear combination of a finite number of components, through a vector of parameters.

Application
In this section we consider the hierarchical mixture model (4.1), where the mixing measure is P_ε, the ε-approximation of the normalized Bessel random measure introduced above (the ε-NB(ω, κP_0) mixture model). Of course, when ε is small, this model approximates the corresponding mixture with mixing measure P; to the best of our knowledge, this normalized Bessel completely random measure has never been considered in the Bayesian nonparametric literature. By decomposition (6.3), we argue that this model is suitable when the unknown density shows many different components, a few of which are very spiky (these should correspond to the Lévy intensities (6.2)), while a group of flatter components is explained by the intensity (1/s)e^{−ωs} of the gamma process. For this reason, we consider a simulated dataset, a sample from a mixture of 5 Gaussian distributions with means and standard deviations equal to {(15, 1.1), (50, 1), (20, 4), (30, 5), (40, 5)} and weights proportional to {10, 9, 4, 5, 5}. The histogram of the simulated data, for n = 1000, is reported in Figure 3.
We report posterior estimates for different sets of hyperparameters of the ε-NB mixture model, where f(·; θ) is the Gaussian density on R and θ = (μ, σ^2) stands for its mean and variance. Moreover, P_0(dμ, dσ^2) = N(dμ; ȳ_n, σ^2/κ_0) × inv-gamma(dσ^2; a, b); here N(ȳ_n, σ^2/κ_0) is the Gaussian distribution with mean ȳ_n (the empirical mean) and variance σ^2/κ_0, and inv-gamma(dσ^2; a, b) is the inverse-gamma distribution with mean b/(a − 1) (if a > 1). We set κ_0 = 0.01, a = 2 and b = 1, as first proposed in [16]. We consider three sets of hyperparameters in order to understand the sensitivity of the estimates under different conditions of variability; indeed, each set has a different value of p_ε(2), which tunes the a priori variance of P_ε, as reported in (3.13). We tested three values: p_ε(2) = 0.9 in set (A), p_ε(2) = 0.5 in set (B) and p_ε(2) = 0.1 in set (C). Moreover, in each scenario we let the parameter 1/ω range in {0.01, 0.25, 0.5, 0.75, 0.95}; note that the extreme case ω = 100 (equivalently, 1/ω = 0.01) corresponds to an approximation of the DPM model. The mass parameter κ is then fixed to achieve the desired level of p_ε(2). As far as the choice of ε is concerned, we set it equal to 10^{−6}: this provides a fairly good approximation a priori (see Section 5); moreover, posterior inference proved to be fairly robust with respect to ε. In total, we obtain the 15 tests listed in Table 1. As mentioned before, it is possible to choose a prior for ε, even if, for the ρ in (6.1), the computational cost would greatly increase due to the evaluation of the functions 2F1 in (6.4).
We have implemented our Gibbs sampler in C++. All the tests in Sections 6 and 7 were run on a laptop with an Intel Core i7 2670QM processor and 6GB of RAM. Every run produced a final sample size of 5000 iterations, after a thinning of 10 and an initial burn-in of 5000 iterations. Convergence was checked each time with standard tools from the R package CODA.
Here, we focus on density estimation: all the tests provide similar estimates, quite faithful to the true density. Figure 3 shows the density estimate and pointwise 90% credibility intervals for case A5; the true density is superimposed as a dashed line. Figure 4 (a) and (b) display the prior and posterior distributions, respectively, of the number K_n of groups, i.e. the number of unique values among (θ_1, ..., θ_n) in (4.1), under two sets of hyperparameters: A1, representing an approximation of the DPM model, and A5, where the parameter ω is nearly 1. From Figure 4 it is clear that A5 is more flexible than A1: for case A5, the prior variance of K_n is larger and, on the other hand, the posterior probability mass at 5 (the true value) is larger.
In order to compare different priors, we consider five predictive goodness-of-fit indexes: (i) the sum of squared errors (SSE), i.e. the sum of the squared differences between the y_i and the predictive means E(Y_i | data) (yes, we are using the data twice!); (ii) the sum of standardized absolute errors (SSAE), given by the sum of the standardized errors |y_i − E(Y_i | data)| / √Var(Y_i | data); (iii) the log-pseudo marginal likelihood (LPML), quite standard in the Bayesian literature, defined as the sum of log(CPO_i), where CPO_i is the conditional predictive ordinate of y_i, i.e. the value of the predictive distribution evaluated at y_i, conditioning on the training sample given by all data except y_i. The last two indexes, (iv) WAIC_1 and (v) WAIC_2, as denoted here, were proposed in [44] and analyzed in depth in [21]: they are generalizations of the AIC, adding two types of penalization, both accounting for the "effective number of parameters". The bias correction in WAIC_1 is similar to the bias correction in the definition of the DIC, while WAIC_2 uses the sum of the posterior variances of the conditional density of the data. See [21] for their precise definitions. Table 1 shows the values of the five indexes for each test; the optimal (according to each index) tests are highlighted in bold for the experiments (A), (B) and (C). It is apparent that the different tests provide similar values of all the indexes except SSE, indicating that, from a predictive viewpoint, there are no significant differences among the priors. However, especially when the value of κ is small, i.e. in all tests A and B, a model with a smaller ω tends to outperform the Dirichlet process case (approximately, ω = 100). On the other hand, the SSE index shows quite different values among the tests: it is well known that this is an index favoring complex models and leading to better results when data are over-fitted. Therefore, tests with a higher value of κ are always preferable according to this criterion. We also fitted our model to a real dataset, the Hidalgo stamps data of [45], consisting of n = 485 measurements of stamp thickness in millimeters (here multiplied by 10^3). The stamps were printed between 1872 and 1874 on different paper types; see the data histogram in Figure 5. This dataset has been analyzed by several authors in the context of mixture models: see, for instance, [37].
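As an aside on computing index (iii) from MCMC output: CPO_i admits the standard harmonic-mean estimator over the sampled parameters, so LPML requires only the matrix of conditional densities f(y_i; θ_i^(t)) saved along the run. A minimal sketch with synthetic input, not tied to any specific model above:

```python
import numpy as np

rng = np.random.default_rng(4)

def lpml(cond_dens):
    """LPML = sum_i log CPO_i, where CPO_i is estimated by the harmonic mean
    of the conditional densities f(y_i; theta_i^(t)) across MCMC iterations t
    (the standard Monte Carlo estimator of the leave-one-out predictive)."""
    cpo = 1.0 / np.mean(1.0 / cond_dens, axis=0)   # one CPO per observation
    return float(np.sum(np.log(cpo)))

# Illustrative input: T = 2000 iterations by n = 10 observations; in practice
# each entry is the kernel density of y_i at the current parameter values,
# saved along the Gibbs sampler run.
cond_dens = rng.uniform(0.1, 1.0, size=(2000, 10))
print(lpml(cond_dens))
```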
We report posterior inference for the set of hyperparameters most in agreement with our prior beliefs: the mean distribution is P_0(dμ, dσ^2) = N(dμ; ȳ_n, σ^2/κ_0) × inv-gamma(dσ^2; a, b) as before, with κ_0 = 0.005, a = 2 and b = 0.1. The approximation parameter ε of the ε-NB(ω, κP_0) random measure is fixed to 10^{−6}; on the other hand, in order to set the parameters ω and κ, we argue as follows: ω ranges in {1.05, 5, 10, 1000}, and we choose the mass parameter κ such that the prior mean of the number of clusters, E(K_n), takes the desired value. As noted in Section 6.1, a closed form of the prior distribution of K_n is not available, so we resort to Monte Carlo simulation to estimate it. Table 2 shows the four couples (ω, κ) yielding E(K_n) = 7: indeed, according to [26] and [36] and references therein, there are at least 7 different groups (although the true number is unknown), corresponding to the number of types of paper used. For an in-depth discussion of the appropriate number of groups in the Hidalgo stamps data, we refer the reader to [10]. Table 2 also reports the prior standard deviations of K_n: even if the a priori differences are small, the posteriors appear quite different among the 4 tests. All the posterior distributions of K_n support the conjecture of at least seven distinct modes in the data; in particular, Figure 5 (b) displays the posterior distribution of K_n for Test 4. A modest amount of mass is given to fewer than 7 groups, and the mode is at 11. Even Test 1, corresponding to the Dirichlet process case, does not give mass to fewer than 7 groups, with mode at 9. Density estimates seem quite good; an example is given in Figure 5 (a), with the 90% credibility band for Test 4. As in the simulated data example, some predictive goodness-of-fit indexes are reported in Table 2: the optimal value for each index is indicated in bold.
The SSE is significantly lower when ω is small, thus suggesting a greater flexibility of the model with small values of ω. The other indexes assume the optimal value in Test 4 as well, even if those values are similar along the tests.
Our ε-approximation method turned out to be accurate and fast when compared with competitors (the slice sampler and an a posteriori truncation method) when the mixing r.p.m. is the NGG process and the kernel is Gaussian; see [3], Section 5.

Linear dependent NGG mixtures: application to sports data
Let us consider a regression problem where the response Y is univariate and continuous, for ease of notation. We model the relationship (in distributional terms) between the vector of covariates x = (x_1, ..., x_p) and the response Y through a mixture density, where the mixing measure is a collection {P_x, x ∈ X} of ε-NormCRMs, X being the space of all possible covariates. We follow the same approach as in [35] and [14] for the dependent Dirichlet process. We define the dependent ε-NormCRM process {P_x, x ∈ X}, conditionally on x, as

P_x(·) = Σ_{j=0}^{N_ε} P_j δ_{γ_j(x)}(·). (7.1)

The weights P_j are the normalized jumps as in (3.4), while the locations γ_j(x), j = 1, 2, ..., are independent stochastic processes with index set X and P_0x marginal distributions. Model (7.1) is such that, marginally, P_x follows an ε-NormCRM process with parameter (ρ, κP_0x), where ρ is the intensity of a Poisson process on R^+, κ > 0, and P_0x is a probability on R. Observe that, since N_ε and the P_j do not depend on x, (7.1) is a generalization of the single weights dependent Dirichlet process [see 8, for this terminology]. We also assume the functions x → γ_j(x) to be continuous. The dependent ε-NormCRM process in (7.1) takes the vector of covariates x into account only through γ_j(x). In particular, when the kernel of the mixture (4.1) belongs to the exponential family, for each j, γ_j(x) = γ(x; τ_j) can be taken as the link function of a generalized linear model, so that (4.1) specializes to the regression model (7.2). This last formulation is convenient because it facilitates parameter interpretation as well as numerical posterior computation. We analyze the Australian Institute of Sport (AIS) data set [12], which consists of 11 physical measurements on 202 athletes (100 females and 102 males).
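A draw from the single-weights construction (7.1) can be sketched in a few lines: the weights are shared across covariate values, and only the atoms γ_j(x) = x^t θ_j move with x. The weights below are placeholders for a normalized ε-CRM draw, and the Gaussian law for the θ_j is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)

n_atoms, p = 8, 3

# Shared weights: placeholder for a normalized eps-CRM draw (any positive
# jumps would do here); they do NOT depend on the covariate x
raw_jumps = rng.gamma(shape=0.5, scale=1.0, size=n_atoms)
P = raw_jumps / raw_jumps.sum()

# Independent coefficients theta_j (Gaussian law assumed for illustration),
# inducing the atom processes gamma_j(x) = x^t theta_j
theta = rng.normal(0.0, 1.0, size=(n_atoms, p))

def atoms(x):
    # Atom locations of P_x under the linear link gamma_j(x) = x^t theta_j
    return theta @ np.asarray(x, dtype=float)

x1 = np.array([1.0, 0.5, 2.0])   # two illustrative covariate vectors
x2 = np.array([1.0, 1.5, 0.0])
loc1, loc2 = atoms(x1), atoms(x2)
# P_x1 = sum_j P[j] * delta_{loc1[j]} and P_x2 share the same weights P
```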
Here the response is the lean body mass (lbm), while three covariates are considered: the red cell count (rcc), the height in cm (Ht) and the weight in kg (Wt). The data set is contained in the R package DPpackage [28]. In the specific instance of model (7.2) considered here, f(·; μ, η²) is the Gaussian density with mean μ and variance η²; moreover, μ = γ(x, θ) = x^t θ, and the mixing measure P_ε is the ε-NGG(κ, σ, P_0) process introduced in [3]. We have considered two cases: either the variance η² is mixed with respect to the NGG process, or it is given a parametric prior density; in both cases, by linearity of the mean x^t θ, the model (here called linear dependent NGG mixture) can be interpreted as an NGG process mixture model, and inference can be achieved via an algorithm similar to that in Section 4. We set ε = 10^{-6}, which provides a moderate value for the ratio r(ε) in (5.1), σ ∈ {0.001, 0.125, 0.25}, and κ such that E(K_n) ≃ 5 or 10. When the variance η² is included in the location points of the ε-NGG process, then P_0 is N_4(b_0, Σ_0) × inv-gamma(ν_0/2, ν_0 η_0²/2); on the other hand, when η² is given a parametric density, then η² ~ inv-gamma(ν_0/2, ν_0 η_0²/2). We fixed the hyperparameters in agreement with the least squares estimate: b_0 = (-50, 5, 0, 0), Σ_0 = diag(100, 10, 10, 10), ν_0 = 4, η_0² = 1. For all the experiments, we computed the posterior of the number of groups, the predictive densities at different values of the covariate vectors, and the cluster estimate via posterior maximization of Binder's loss function [see 31]. Moreover, we compared the different prior settings by computing predictive goodness-of-fit indexes, specifically the log pseudo-marginal likelihood (LPML) and the sum of squared errors (SSE), as introduced in Section 6.2. The minimum value of SSE among our experiments was achieved when η² is included in the locations of the ε-NGG process, σ = 0.001 and κ = 0.8, so that E(K_n) ≃ 5.
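For concreteness, a single draw from the base measure P_0 = N_4(b_0, Σ_0) × inv-gamma(ν_0/2, ν_0η_0²/2), with the hyperparameters fixed above, can be sketched as follows; the inverse-gamma draw uses the standard reciprocal-of-gamma construction.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hyperparameters as fixed in the text
b0 = np.array([-50.0, 5.0, 0.0, 0.0])
Sigma0 = np.diag([100.0, 10.0, 10.0, 10.0])
nu0, eta2_0 = 4.0, 1.0

def draw_P0():
    """One atom from P0 = N4(b0, Sigma0) x inv-gamma(nu0/2, nu0*eta2_0/2)."""
    theta = rng.multivariate_normal(b0, Sigma0)
    # inverse-gamma(a, scale=b) draw as b / Gamma(a, 1)
    eta2 = (nu0 * eta2_0 / 2.0) / rng.gamma(nu0 / 2.0)
    return theta, eta2

theta, eta2 = draw_P0()
```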
On the other hand, the optimal LPML was achieved when σ = 0.125, κ = 0.4, and E(K_n) ≃ 5. The posterior of K_n and the cluster estimate under this last hyperparameter setting are in Figure 6 ((a) and (b), respectively); in particular, the cluster estimate is displayed in the scatterplot of Wt vs lbm. In spite of the vague prior, the posterior of K_n is almost degenerate at 2, giving evidence of two linear relationships between lbm and Wt. Finally, Figure 7 displays predictive densities and 95% credibility bands for three athletes, a female (Wt=60, rcc=3.9, Ht=176, lbm=53.71) and two males (Wt=67.1 and 113.7, rcc=5.34 and 5.17, Ht=178.6 and 209.4, lbm=62 and 97), under the same hyperparameter setting as in Figure 6; the dashed lines are the observed values of the response. Depending on the covariate values, the distribution shows one or two peaks: this reflects the dependence of the grouping of the data on the value of x. This figure highlights the versatility of nonparametric priors in a linear regression setting with respect to customary parametric priors: indeed, the model is able to capture in detail the behavior of the data, even when several clusters are present.
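As a reminder of how the LPML criterion used above is typically computed: the conditional predictive ordinate CPO_i is estimated by the harmonic mean of the likelihood of observation i across MCMC iterations, and LPML is the sum of the log CPO_i. The sketch below uses a synthetic likelihood matrix purely for illustration; in practice its entries would be the kernel evaluations produced by the Gibbs sampler.

```python
import numpy as np

rng = np.random.default_rng(2)

# lik[g, i]: likelihood of observation i at MCMC iteration g (synthetic stand-in)
lik = rng.uniform(0.1, 1.0, size=(500, 50))

# CPO_i = (mean over g of 1/lik[g, i])^{-1}  (harmonic-mean estimator)
cpo = 1.0 / np.mean(1.0 / lik, axis=0)

# LPML = sum_i log CPO_i; larger (less negative) values indicate better fit
lpml = float(np.sum(np.log(cpo)))
```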

Discussion
We have proposed a new model for density and cluster estimation in the Bayesian nonparametric framework. In particular, we have defined a finite dimensional process, the ε-NormCRM, which converges in distribution to the corresponding normalized completely random measure as ε tends to 0; here the ε-NormCRM is the mixing measure in a mixture model. In this paper we have fixed ε very small, but one could also place a prior on ε and include this parameter in the Gibbs sampler scheme. Among the achievements of the work, we have generalized all the theoretical results obtained in the special case of the NGG in [3], including the expression of the eppf of an ε-NormCRM process, its convergence to the eppf of the corresponding nonparametric underlying process, and the posterior characterization of P_ε. Moreover, we have provided a general Gibbs sampler scheme to sample from the posterior of the mixture model. To show the performance of our algorithm and the flexibility of the model, we have illustrated two examples of normalized completely random measure mixtures. In the first application, we have introduced a new normalized completely random measure, the normalized Bessel random measure; we have studied its theoretical properties and used it as the mixing measure in a model to fit simulated and real datasets. The second example is a linear dependent ε-NGG mixture, where the dependence enters through the support points of the mixing random probability, applied to a well-known dataset. Current and future research is devoted to the use of our approximation in more complex dependence structures.

Appendix A: Details on full-conditionals for the Gibbs sampler
Here we provide some details on Step 3 of the Gibbs sampler in Section 4. As far as Step 3a is concerned, the full-conditional L(ε | u, θ, Y) is obtained by integrating out N_ε (or, equivalently, N_na) from the law L(N_ε, u, θ, Y); in doing so we used the identity

Σ_{N_na=0}^{+∞} Λ_{ε,u}^{N_na} (N_na + k)/N_na! = e^{Λ_{ε,u}} (Λ_{ε,u} + k).

Moreover, f_ε(u; n_1, ..., n_k) is defined in (B.7). This step depends explicitly on the expression of ρ(s).
Step 3b consists in sampling from L(P_ε | ε, u, θ), as reported in Corollary 3.1. To draw from the posterior distribution of the (unnormalized) measure, we follow Theorem 3.1. The component μ_{ε,u} satisfies the distributional identity described at point 1 of the proposition, and therefore we simulate it as follows: 1. Draw x from the Bernoulli distribution with parameter p = Λ_{ε,u}/(Λ_{ε,u} + k).

Conditionally on the unnormalized measure μ_ε (see (3.3)), the law of θ is given by (B.1); by considering the variable U in (3.5), we then express the joint conditional distribution of θ and U. The posterior distribution of μ_ε can be characterized by its Laplace functional, which leads to the ratio in (B.2). Let us focus on the numerator in (B.2): it can be rewritten using (B.1). Moreover, if P_0 is an absolutely continuous probability, then a factorization holds for each j = 1, ..., k, so that the expected value on the right-hand side of (B.3) can be computed as follows. Representation (2.1) can be extended to μ_ε(dθ*_j)^{n_j} = ∫_{R^+×Θ} s^{n_j} δ_τ(dθ*_j) N(ds, dτ), where N is a Poisson process with mean intensity ν_ε(ds, dτ). Applying Palm's formula [see 13, Proposition 13.1.IV] to μ_ε(dθ*_k)^{n_k}, we obtain that the numerator of (B.2) is equal to

(u^{n-1}/Γ(n)) [ ∫_{R^+×Θ} e^{-s(f(τ)+u)} ν_ε(ds, dτ) + k/Λ_ε ] e^{-∫_{R^+×Θ} (1-e^{-s(f(τ)+u)}) ν_ε(ds, dτ)} ∏_{j=1}^{k} ∫ e^{-(f(θ*_j)+u)s} s^{n_j} κρ(s) ds.   (B.4)

B.4. Proof of formula 3.12
First of all, observe that the quantity in (3.12) can be expanded and identified as E[ Σ_{k=1}^{m} P_0(B)^k P(K_m = k | N_ε) ], where K_m is the number of distinct values in a sample of size m from P_ε. Hence, we have proved the claim.