Determinantal point process mixtures via spectral density approach

We consider mixture models where location parameters are a priori encouraged to be well separated. We explore a class of determinantal point process (DPP) mixture models, which provide the desired notion of separation or repulsion. Instead of using the rather restrictive case where analytical results are available, we adopt a spectral representation from which approximations to the DPP intensity functions can be readily computed. For the sake of concreteness the presentation focuses on a power exponential spectral density, but the proposed approach is in fact quite general. We later extend our model to incorporate covariate information in the likelihood and also in the assignment to mixture components, yielding a trade-off between repulsiveness of locations in the mixtures and attraction among subjects with similar covariates. We develop full Bayesian inference, and explore model properties and posterior behavior using several simulation scenarios and data illustrations.

from some suitable prior p 0 . However, the weights π may be constructed differently, e.g. using a stick-breaking representation (finite or infinite), which poses a well-known connection with more general models, including nonparametric ones. See, e.g., Ishwaran and James (2001) and Miller and Harrison (2017). A popular class of Bayesian nonparametric models is the Dirichlet process mixture (DPM) model, introduced in Ferguson (1983) and Lo (1984). It is well-known that this class of mixtures usually overestimates the number of clusters, mainly because of the "rich gets richer" property of the Dirichlet process. By this we mean that both prior and posterior distributions are concentrated on a relatively large number of clusters, but a few are very large, and the rest of them have very small sample sizes. Mixture models may even be inconsistent; see Rousseau and Mengersen (2011), where concerns about over-fitted mixtures are illustrated, and Miller and Harrison (2013), for inconsistency features of DPMs.
Despite their success, mixture models like (1) tend to use excessively many mixture components. As pointed out in Xu et al. (2016), this is due to the fact that the componentspecific parameters are a priori i.i.d., and therefore, free to move. This motivated Petralia et al. (2012), Fúquene et al. (2016) and Quinlan et al. (2017) to explicitly define joint distributions for θ having the property of repulsion among its components, i.e. that p(θ 1 , . . . , θ k ) puts higher mass on configurations such that components are well separated. For a different approach, via sparsity in the prior, see Malsiner-Walli et al. (2016). Xu et al. (2016) explored a similar way to accomplish separation of mixture components, by means of a Determinantal Point Process (DPP) acting on the parameter space. DPPs have recently received increased attention in the statistical literature (Lavancier et al., 2015). DPPs are point processes having a product density function expressed as the determinant of a certain matrix constructed using a covariance function evaluated at the pairwise distances among points, in such a way that higher mass is assigned to configurations of well-separated points. We give details below. DPPs have been used in a number of different modeling efforts. Bardenet and Titsias (2015) and Affandi et al. (2014) applied DPPs to model spatial patterns of nerve fibers in diabetic patients, a basic motivation being that such fibers become more clustered as diabetes progresses. The latter discussed also applications to image search, showing how such processes could be used to study human perception of diversity in different image categories. Similarly, Kulesza and Taskar (2012) show how DPPs can be applied to various problems such as finding diverse sets of highquality search results, building informative summaries by selecting diverse sentences from documents, modeling non-overlapping human poses in images or video, and automatically building timelines of important news stories.
We discuss full Bayesian inference for a class of mixture densities where the locations follow stationary DPPs. The approach is based on the spectral representation of the covariance function defining the determinant as the joint distribution of component-specific parameters. While our methods can be used with any such valid spectral representation, we focus for the sake of concreteness on the power exponential spectral case; see examples with different spectral densities in the on-line Supplementary Material. This particular specification allows for flexible repulsion patterns, and we discuss how to set up different types of prior behavior, shedding light on the practical use of our approach in that particular scenario. We discuss applications in the context of synthetic and real data applications.
We later extend our model to incorporate covariate information in the likelihood and also in the assignment to mixture components. In particular, subjects with similar covariates are a priori more likely to co-cluster, just as in mixtures of experts models (see, e.g., McLachlan and Peel, 2005), where weights are defined as normalized exponential functions.
We implement a reversible jump (RJ) MCMC posterior simulation scheme for each of the models we propose. In all cases, we estimate clusters for the subjects in the sample, by considering the partition that minimizes the posterior expectation of Binder's loss function (Binder, 1978) under equal misclassification costs. This is a common choice in the applied Bayesian nonparametric literature (Lau and Green, 2007).
The first mixture model we propose is as in Xu et al. (2016). They consider, as a prior for the location points, a particular case of DPPs, called L-ensambles. These are defined as a direct generalization from the same definition but for the finite state space case. Our proposal does not use L-ensembles. Instead, we stick to the more general case, following the spectral approach discussed in Lavancier et al. (2015). Although we limit ourselves to the case of isotropic DPPs, inhomogeneous DPPs can be obtained by transforming or thinning a stationary process; see Lavancier et al. (2015), Section 3 and Appendix A. A crucial point in our models and algorithms is the DPP density expression, which is only defined for DPPs restricted to compact subsets S of the state space, with respect to the unit rate Poisson process. When this density exists, it explicitly depends on S. A sufficient condition for the existence of the density is that all the eigenvalues of the covariance function, restricted to S, are smaller than 1. We follow the spectral approach and assume that the covariance function defining the DPP has a spectral representation. A basic motivation for our choice is that conditions for the existence of a density become easier to check. We review here the basic theory on DPPs, making an effort to be as clear and concise as possible in the presentation of our subsequent models.
We acknowledge that the RJ scheme for our first mixture model is as in Xu et al. (2016).
In both cases the algorithm requires computing the DPP density with respect to the unit rate Poisson process. We explain how to adapt the calculations to our case, and discuss the need to restrict the process to (any) compact subset. When extending the model to incorporate covariate information in both, likelihood and prior assignment to mixture components, the RJ MCMC algorithm requires modifications, as discussed below.
The rest of this article is organized as follows. Section 2 presents notation and theoretical background necessary to understand how DPP mixture models are constructed. Section 3 illustrates the covariate-dependent extension. Here we build on regular mixture models, incorporating covariate dependence in the mixture weights and optionally in the likelihood, which still allows for repulsion among components after correcting for the regression effect.
Section 4 presents results from a simulation study and for the Galaxy dataset, while data applications are discussed in Section 5. We conclude in Section 6 with final comments and discussion.

Using DPPs to induce repulsion
We review here the basic theory on DPPs to the extent required to explain our mixture model. We use the same notation as in Lavancier et al. (2015), where further details on this theory may be found.

Basic theory on DPPs
Let B ⊆ R d ; we mainly consider the cases B = R d and B = S, a compact subset in R d . By X we denote a simple locally finite spatial point process defined on B, i.e. the number of points of the process in any bounded region is a finite random variable, and there is at most one point at any location. See Daley and Vere-Jones (2003; for a general presentation on point processes. The class of determinantal point processes we consider is defined in terms of their moments, expressed by their product density functions ρ (n) : B n → [0, +∞), n = 1, 2, . . .. Intuitively, for any pairwise distinct points x 1 , . . . , x n ∈ B, ρ (n) (x 1 , . . . , x n )dx 1 · · · dx n is the probability that X has a point in an infinitesimal small region around x i of volume dx i , for each i = 1, . . . , n. More formally, X has n-th order product density function ρ (n) : B n → [0, +∞) if this function is locally integrable (i.e. S |ρ (n) (x)|dx < +∞ for any compact S) and, for any Borel-measurable function where the = sign over the summation means that x 1 , . . . , x n are pairwise distinct. See also Møller and Waagepetersen (2007). Let us consider a covariance function C : B × B → R.
The theory recalled here is valid also for complex-valued covariance functions, but such extensions are not needed in what follows.

Definition.
A simple locally finite spatial point process X on B is called a determinantal point process with kernel C if its product density functions are where [C](x 1 , . . . , x n ) is the n × n matrix with entries C(x i , x j ). We write X ∼ DP P B (C); Remark. If A is a Borel subset of B, then the restriction X A := X ∩ A of X to A is a DPP with kernel given by the restriction of C to A × A.
By Theorem 2.3 in Lavancier et al. (2015), first proved by Macchi (1975), such DPP's exist under the two following conditions: • C is a continuous covariance function; hence, by Mercer's Theorem, where {λ S k } and {φ k (x)} are the eigenvalues and eigenfunctions of C restricted to S × S, respectively.
• λ S k ≤ 1 for all compact S in R d and all k.
Formula (2.9) in Lavancier et al. (2015) reports the distribution of the number N(S) of points of X in S, for any compact S: where B k ind ∼ Be(λ S k ), i.e. the Bernoulli random variable with mean λ S k . When restricted to any compact subset S, the DPP has a density with respect to the unit rate Poisson process which, when λ S k < 1 for all k = 1, 2, . . ., has the following expression: where |S| = S dx, D S = − +∞ 1 log(1 − λ S k ) and When n = 0 the density (as well as the determinant) is defined to be equal to 0. See Møller and Waagepetersen (2007) for a thorough definition of absolute continuity of a spatial process with respect to the unit rate Poisson process. However, note that from the first part of (3) we have P(N(S) = 0) = +∞ k=1 (1 − λ S k ); this probability could be positive due to the assumption λ S k < 1 for all k = 1, 2, . . .. From now on we restrict our attention to stationary DPP's, that is, when C(x, y) = C 0 (x − y), where C 0 ∈ L 2 (R d ) is such that its spectral density ϕ exists, i.e. C 0 (x) = R d ϕ(y) cos(2πx · y)dy, x ∈ R d and x · y is the scalar product in R d . If ϕ ∈ L 1 (R d ) and 0 ≤ ϕ ≤ 1, then the DP P (C) process exists. Summing up, the distribution of a stationary DPP can be assigned by its spectral density; see Corollary 3.3 in Lavancier et al. (2015).
To explicitly evaluate (4) over S = − 1 2 , 1 2 d , we approximate C as suggested in Lavancier et al. (2015). In other words, we approximate the density of X on S by where The approximation C(x, y) ≈ C app,0 (x − y) (x − y ∈ S) follows because the exact Fourier expansion of C 0 (x − y) in S is as in (7) with the real part of S C 0 (y)e −2πik·y dy instead of ϕ(k); if we assume C 0 such that C 0 (t) ≈ 0 for t ∈ S, then  When S = R is a rectangle in R d , we can always find an affine transformation T such is the approximate density of Y as in (6), we can then approximate the density of X R by f app ({x 1 , . . . , x n }) = |R| −n e |R|−|S| f app Y (T ({x 1 , . . . , x n })), {x 1 , . . . , x n } ⊂ R.
In practice, the summation over Z d in (7)  One particular example of spectral density that we found useful is for fixed s ∈ (0, 1) (e.g. s = 1 2 ) and x is the Euclidean norm of x ∈ R d . This function is the spectral density of a power exponential spectral model (see (3.22) in Lavancier et al. (2015) when α = s α max (ρ, ν)). In this case, we write X ∼ P ES − DP P (ρ, ν). The corresponding spatial process is isotropic. When ν = 2, the spectral density is corresponding to the Gaussian spectral density.
Remark. Note that ϕ(x; ρ, ν) < 1 when 0 < s < 1 for any x ∈ R d , ρ, ν > 0, so that X S has a density as described in (4). Figure S.8 in the Supplementary material shows a plot of the power exponential spectral density (10) for different values of parameters ρ, ν. Note that ν controls the shape of ϕ(x; ρ, ν), which ranges from a slowly decreasing function of x to an indicator function. On the other hand ρ plays the role of a centering parameter, with higher values retarding the decay speed of ϕ(x; ρ, ν). As discussed above, knowledge about the spectral density is all that is needed for the approximations to work. Moreover, even if the analytic expression of C(x, y), (x, y) ∈ R d × R d is known, as in, e.g. (10), we still need to compute the eigenvalues and eigenfunctions of C restricted to S × S for any compact S, and this may be analytically impossible. A potential disadvantage derived from this is that parameter interpretation in the spectral domain becomes unclear. A possible way to alleviate this difficulty consists of conducting extensive simulations that help understanding the role of such parameters. We do that for the case of (10); see Section 4.
Typically, DPPs have been used to make inference mostly on spatial data (including images, videos and point patterns related to towns, trees and cells); see, for instance, Shirota and Gelfand (2016) who describe an approximate Bayesian computation method to fit DPPs to spatial point pattern data. Historically, the first paper where DPPs were adopted as a prior for statistical inference in mixture models is Affandi et al. (2013).
The statistical literature includes a number of papers illustrating theoretical properties for estimators of DPPs from a non-Bayesian viewpoint. Biscio and Lavancier (2017) study asymptotic properties of minimum contrast estimators for parametric stationary determinantal point processes. Biscio and Lavancier (2016) quantify the repulsiveness of a DPP, based on its second-order properties, and address the question of how repulsive a stationary DPP can be. Moreover, Bardenet and Titsias (2015) derive bounds on the likelihood of a DPP, and Kulesza and Taskar (2012) provide an introduction to DPPs, focusing on the intuitions, algorithms, and extensions that are most relevant to the machine learning community.

The mixture model with repulsive means
To deal with limitations of model (1) or DPMs, we consider repulsive mixtures. Our aim is to estimate a random partition of the available subjects, and we want to do so using "few" groups. By repulsion we mean that cluster locations are a priori encouraged to be well separated, thus inducing fewer clusters than if they were allowed to be independently selected. We start from parametric densities f (·; θ), which we take to be Gaussian, and assume that the collection of location parameters follows a DPP. We specify a hierarchical model that achieves the goals previously described. Concretely, we propose: where the PES-DPP(ρ, ν) assumption (12) is regarded as a default choice that could be replaced by any other valid DPP alternative. We note that, as stated, the prior model may assign a positive probability to the case K = 0. This case of course makes no sense from the viewpoint of the model described above. Nevertheless, we adopt the working convention of redefining the prior to condition on K ≥ 1, i.e., truncating the DPP to having at least one point. In practice, the posterior simulation scheme later described simply ignores the case K = 0, which produces the desired result. Note also that we have assumed prior independence among blocks of parameters not involving the locations µ k .
Indeed, we both use DPPs as priors for location points in the mixture of parametric densities. However, the specific DPP priors are different, as they restrict to a particular case of DPPs (L-ensambles), and choose a Gaussian covariance function for which eigenvalues and eigenfunctions are analytically available. We adopt instead a spectral approach for assigning the prior (12). Similar to Xu et al. (2016), we carry out posterior simulation using a reversible jump step as part of the Gibbs sampler. However, when updating the location points µ 1 , . . . , µ K we refer to formulas (6)

Generalization to covariate-dependent models
The methods discussed in Section 2 were devised for density estimation-like problems. We now extend the previous modeling to the case where p-dimensional covariates x 1 , . . . , x n are recorded as well. We do so by allowing the mixture weights to depend on such covariates. In this case, there is a trade-off between repulsiveness of locations in the mixtures and attraction among subjects with similar covariates. We also entertain the case where covariate dependence is added to the likelihood part of the model. Our modeling choice here is akin to mixtures of experts models (see, e.g., McLachlan and Peel, 2005), i.e., the weights are defined by means of normalized exponential function.
Building on the model from Section 2.2, we assume the same likelihood (11) and the DPP prior for X = {µ 1 , µ 2 , . . . , µ K , K} in (12)-(13), but change (14) and (15) to where the β 1 = 0 assumption is to ensure identifiability. To complete the model, we assume (16) as the conditional marginal for σ 2 k ; the prior for (ρ, ν) in (13) is later specified. Here β 0 ∈ R p , and to choose Σ 0 , we use a g-prior approach, φ is fixed, typically of the same order of magnitude of the sample size (see Zellner, 1986).
Assuming now (11) on top of (17)-(18) rules out the case of a likelihood explicitly depending on covariates, which instead would generally achieve a better fit than otherwise.
Of course, there are many ways in which such dependence may be added. For the sake of concreteness, we assume here a Gaussian regression likelihood, where only the intercept parameters arise from the DPP prior. More precisely, we assume where the γ k 's are p-dimensional regression coefficients. The notation in (20) means that , where γ 0 ∈ R p and Λ 0 is a covariance matrix. The prior for {s i } and β j 's is given in (17)-(18) as in the previous model. Note that (19) implies that only the intercept term is distributed according to the repulsive prior. Thus, we allow the response mean to be corrected by a linear combination of the covariates with cluster-specific coefficients, with the repulsion acting only on the residual of this regression. The result is a more flexible model than the repulsive mixture (11)-(16).
Observe that there is no need to assume the same covariate vector in (19) and (17), but we do so for illustration purposes only.
The Gibbs sampler algorithm employed to carry out posterior inference for this model is detailed in Section S.4 of the Supplementary material. However, it is worth noting that the reversible jump step related to updating the number of mixture components K and the update of the coefficients {β 2 , β 3 , . . . , β K } are complicated by the presence of the covariates.
For the β coefficients, we resort to a Metropolis-Hastings step, with a multivariate Gaussian proposal centered in the current value. For K, we employ an ad hoc Reversible Jump move.

Simulated data and reference datasets
Before illustrating the application of our models to specific datasets, we discuss some general choices that apply to all examples. Every run of the Gibbs sampler (implemented in R) produced a final sample size of 5,000 or 10,000 iterations (unless otherwise specified), after a thinning of 10 and initial burn-in of 5,000 iterations. In all cases, convergence was checked using both visual inspection of the chains and standard diagnostics available in the CODA package. Elicitation of the prior for (ρ, ν) requires some care, as the role of these parameters is difficult to understand. Therefore, an extensive robustness analysis with respect to π(ρ, ν) for those datasets was carried out. See Sections 4.1 and S.2 of the Supplementary material. We point out that an initial prior independence assumption π(ρ, ν) = π(ρ) π(ν) produced bad mixing of the chain. In particular, when ρ is small with respect to ν, the spectral function ϕ(·) has a very narrow support, concentrated near the origin, forcing the covariance function C app (x, y) to become nearly constant for x, y ∈ S and thus producing nearly singular matrices. We next investigated the case π(ρ, ν) Here, M(s, ε, ν) is a constant that is the minimum value of ρ such that ϕ(2) > ε (here ϕ (2) is a reference value chosen to avoid a small support), and ε is a threshold value, assumed to be small (0.05, for instance). From Figure S.8 in the Supplementary Material, it is clear that ϕ(·; ρ, ν) goes to 0 too fast when ν is small relative to ρ. It follows that On the other hand, two different choices for π ν were considered: a gamma distribution, which gave a bad chain mixing, and a discrete distribution on V 2 = {0.5, 1, 2, 3, 5, 10, 15, 20, 30, 50} (or on one of its subsets). In this case, the mixing of the chain was better, but the posterior for ν did not discriminate among the values in the support. For this reason, in Sections 5.3, 6 and 7, we assume ν = 2, s = 1/2 and

Galaxy data
This popular dataset contains n = 82 measured velocities of different galaxies from six well-separated conic sections of space. Values are expressed in Km/s, scaled by a factor of 10 −3 . We set the hyperparameters in this way: for the variance σ 2 k of the components, (a 0 , b 0 ) = (3, 3) (such that the mean is 1.5 and the variance is 9/4) and for the weights {w k } the Dirichlet has parameter (1, 1, . . . , 1). The other hyperparameters are modified in the tests, as in Table 1, where we report summaries of interest, such as the prior and posterior mean and variance for the number of components K. In addition, we also display the mean squared error (MSE) and the log-pseudo marginal likelihood (LPML) as indices , whereŷ i is the posterior predictive mean and f (y i | y (−i) ) is the i−th conditional predictive ordinate, that is the predictive distribution obtained using the dataset without the i−th observation. Figure 1 (left) shows density estimates and the estimated partition of the data, obtained as the partition that minimizes the posterior expectation of Binder's loss function under equal misclassification costs (see Lau and Green, 2007). The points at the bottom of the plots represent observations, while colors refer to the corresponding cluster.  Table 1. As a comparison, the same posterior quantities than in Table 1 were computed using  Table 1, including 90% credibility bands (light blue). Posterior distribution (right) of the number K of components under Test 4 (black) and 6 (blue) in Table 1 and under the DPM model (red) as in Test 7 in Table 2. the DPM, fitted via the function DPdensity available from DPpackage (Jara et al., 2011); see Table 2. The same prior for σ 2 k as in our model was assumed. Note that α is fixed in Test 8 in such a way that a priori E(K) = 2, while for Test 3, a 0 and b 0 are chosen so that E(α) = 2 and Var(α) = 1. Clearly, Test 10 (DPM) shows the best indexes of goodness of fit but at the cost of overestimating the number of clusters. It is well-known that, in general, clustering in the context of nonparametric mixture models as DPMs is strongly affected by the base measure (see, e.g. Miller and Harrison, 2017). Our model, on the other hand, avoids the delicate choice of the base measure leading to more robust estimates of K. See also Figure 1 (right) which displays the posterior distribution of K under the DPM mixture and our models.
Finally, we recall that in Section S.3 in the Supplementary material we report some further tests on the Galaxy dataset to show the influence of various choices of spectral

Simulated data with covariates
We consider the same simulated dataset as in Müller et al. (2011), Section 5.2; the simulation "truth" consists of 12 different distributions, corresponding to different covariate settings (see Figure 1 of that paper). Model (11)-(13), (16)-(18) was fitted to the dataset, assuming β 0 = 0, Σ 0 = 400 × X T X −1 , a ρ = 1, b ρ = 1.2, and a 0 , b 0 such that the prior mean of σ 2 k is 50 and variance is 300. Recall also that here we assume ν = 2. As an initial step, inference for the complete dataset (1000 observations Computational burden over multiple repetitions was controlled by limiting the posterior sample sizes to 2,000. Table 3 displays the root MSE for estimating E(y | x 1 , x 2 , x 3 ) for each of the 12 covariate combinations defining the different clusters for our model and for the PPM x , as in Table 1 of Müller et al. (2011). The computations also include evaluation of the root MSE and LPML for all the 100 datasets for estimating the data used to train the model, with MSE train = n i=1 (y i −ŷ i ) 2 , whereŷ i is the expected value of the estimated predictive distribution, and for a test dataset of 100 new data, In addition, we report LP ML train , value of the Log Pseudo Marginal Likelihood for the training dataset.   In summary, our extensive simulations suggest that the proposed approach tends to require less mixture components than other well-known alternative models, while still providing a reasonably good fit to the data. vary as determined in Table 5, where m and v denote the prior mean b 0 /(a 0 − 1) and variance b 2 0 /((a 0 − 1) 2 (a 0 − 2)), respectively, of the inverse gamma distribution for σ 2 k as in (20). We also assume γ 0 equal to the vector of all 0's, while Λ 0 is such that the marginal a priori variance of γ k is equal to diag(0.01, 0.1, 0.1, 0.1), in accordance to the variances of the corresponding frequentist estimators.  Table 5: Prior specification for β k 's and σ 2 k 's parameters and posterior summaries for the Biopics dataset; m and v are prior mean and variance, respectively, of σ 2 k . Posterior mean and variance of the number K of mixture components are in the fifth and sixth columns, respectively, while the last two columns report MSE and LPML, respectively.

Biopic movies dataset
The posterior of K is robust with respect to the choice of prior hyperparameters; on the other hand, our results show that by not including covariates in the likelihood, i.e. setting all γ k 's are equal to 0, inference on K is much more sensitive to the choice of (a 0 , b 0 ) (results not shown here).
Predictive inference was also considered, by evaluating the posterior predictive distribution at the following combinations of covariate values: (i) (mean value for covariate year, US, white); (ii) (mean value for covariate year, US, color); (iii) (mean value for covariate year, UK, white); (iv) (mean value for covariate year, UK, color); (v) (mean value for covariate year, "other", white); and (vi) (mean value for covariate year, "other", color).
Corresponding plots are shown in Figure S.11 in the Supplementary material. These distributions appear to be quite different in the six cases: in particular, we can observe that  in cases (i) and (ii), the posterior is shifted towards higher values. This is quite easy to interpret, since the measurements are given by the earnings in the US box offices; therefore, we expect that in general US movies will be more profitable in that market. The difference due to the race is, on the other hand, less evident. However, the predictive densities show slightly higher earnings for movies where the subject is a person of color, if the origin is "other" ((v) and (vi)). Movies from the UK, on the other hand, exhibit the opposite behavior ((iii) and (iv)).
We report here the posterior cluster estimate for Test B in Table 5. We found three groups, with sizes 10, 193, 234, respectively; see Figure  a random covariance matrix which is given a non-informative prior and the inverse-gamma distribution for the variances of the mixture components has parameters such that mean and variance are equal to 5, 1, respectively similarly as in Table 5. Posterior summaries can be found in Table 6.
As a comparison between the estimated partitions under our model ( Figure 2) and the LDDP mixture model, Figure S.

Conclusion
This work deals with mixture models where the prior has the property of repulsion across location parameters. Specifically, the discussion is centered on mixtures built on determinantal point processes (DPPs), that can be constructed using a general spectral representation. The methods work with any valid spectral density, but for the sake of concreteness, illustrations were discussed in the context of the power exponential case.
Though we limit ourselves to the case of isotropic DPPs, inhomogeneous DPPs can be obtained by transforming or thinning a stationary process. However, we believe that this case is not very interesting, unless there is a strong reason to assume non-homogeneous locations a priori.
Our computational experiments and data illustrations show that the repulsion induced by the DPP priors indeed tends to eliminate the annoying case of very small clusters that commonly arises when using models that do not constrain location/centering parameters.
This happens with very small sacrifice of model fit compared to the usual mixture models.
Another advantage of our model over DPMs is that we avoid the delicate choice of the base measure of the Dirichlet process, leading to more robust estimates on the number K of components in the mixture.
• The variances in each component of the mixture {σ 2 1 , . . . , σ 2 K } are generated independently according to the following distribution: • Sampling the means {µ 1 , . . . , µ K } needs more care: following the reasoning in Xu et al. (2016), this full-conditional can be written as thanks to the Schur determinant identity. Note that det[C]({μ 1 , . . . ,μ K }; ρ, ν) in the above expression follows from the expression of the density of a DPP on a compact set; see (9). Then, b is a vector defined as is the transformed variable that takes values on the set S = [−1/2, 1/2] d . Typically, the rectangle R such that T (R) = S is fixed in such a way that it is large and contains abundantly all the datapoints.
We update each mean µ k separately for k = 1, . . . , K using a Metropolis-Hastings update.
• The full-conditional for the parameters (ρ, ν) is as follows (1 +φ(k; ρ, ν)) π(ρ, ν). (2009) is employed in this case, in order obtain a better mixing of the chains and to avoid the annoying choice of the parameters for the proposal distribution.

The adaptive Metropolis-Hastings algorithm of Roberts and Rosenthal
• In order to sample K we need a Reversible Jump step: standard proposals to estimate mixtures of densities with a variable number of components are based on moment matching (Richardson and Green, 1997) and have been relatively often used in the literature. The idea is to build a proposal that preserves the first two moments before and after the move, as in Xu et al. (2016). In particular, the only possible moves are the splitting move, passing from K to K +1, and the combine move, from K to K −1.
(i) Choose move type: uniformly choose among split and combine move (however, if K = 1 the only possibility is to split) (ii.a) Combine: randomly select a pair (j 1 , j 2 ) to merge into a new parameter indexed with j 1 . The following relations must hold: (ii.b) Split: randomly select a component j to split into two new components. In this case, we need to impose the following relationships: where α ∼ Beta(1, 1), β ∼ Beta(1, 1) and r ∼ Beta(2, 2).
posterior density of ρ under Tests S 2 and S 7 is shown in Figure S.5.

Galaxy dataset
We consider the proposed model with different spectral densities, to check its robustness on the inference. All the models presented in this paper are, as a matter of fact, general and in principle any spectral density ϕ(·) satisfying the conditions for the existence of the DPP process can be employed. The choice of the spectral density in (10) is motivated by its strong repulsiveness (see Lavancier et al., 2015). However, in this section we show inference on the Galaxy dataset obtained when spectral representations other than the power spectral density, drive the DPP.
We choose isotropic covariance functions that are well-known in the spatial statistics literature: the Whittle-Matérn and the generalized Cauchy. Both densities depend on three parameters: intensity ρ > 0, scale α > 0 and shape ν > 0. In order to assure ϕ(x) < 1 for all x, ρ must be smaller than ρ max = α −d M, where M needs to be specified for each of the and for the generalized Cauchy where d is the dimension of the space where x lives (d = 1 in what follows) and K ν (·) is the modified Bessel function of the second kind.

S.4 Gibbs sampler in presence of covariates
The Gibbs sampler algorithm employed to carry out posterior inference for model (19)- (20), (12)-(13), (17)- (18) is different from the one in Section S.1 except for the full conditionals of (ρ, ν). The sampling of labels {s i } n i=1 differs from (S.22), since now The sampling of the {µ k } K k=1 is similar as the same step in Section S.1, but now However the substantial change from the model without covariates to the model with covariates is due to the update of K, the number of components in the mixture, and of {β 2 , . . . , β K } (recall that β 1 = 0 for identifiability reasons); these are indeed complicated by the presence of the covariates. Moreover, the update of {σ 2 k } is now replaced by where n k = #{i : s i = k}; here we assume the vector of the prior mean of γ k , γ 0 , to be equal to the 0-vector. This full-conditional is the posterior of the standard conjugate normal likelihood, normal -inverse gamma regression model. In particular, we have that The full-conditional for the coefficients β k , k = 2, . . . , K is as follows: which has no known form. Therefore we resort to a Metropolis Hastings step with a multivariate Gaussian proposal, centered in the current value of the vector and with a diagonal covariance matrix, i.e. ζI p×p , where ζ is a tuning parameter chosen to guarantee convergence of the chains.
On the other hand, the update of K requires a Reversible Jump-type move. However, the approach used in Section S.1 above is difficult to implement when mixing weights depend on covariates, as in this case, so that we need to find another way to define a transition probability. Our approach is similar to that of Norets (2015), with some differences that will be highlighted in the next lines.
As before, there are two available moves: split or combine. The probability of proposing one of them is 0.5, except if K = 1, when only the split move can be proposed.
Split: if this move is picked, K prop = K + 1, so that we need to create a new group and its corresponding parameters (the other parameters are kept fixed): (i) randomly pick one cluster, say j, containing at least two items (ii) randomly divide data associated to this group,ỹ j , into two subgroups,ỹ j 1 andỹ j 2 ; (iii) set γ j 1 = γ j , σ 2 j 1 = σ 2 j , β j 1 = β j , µ j 1 = µ j . Now we need to choose a value for γ j 2 , σ 2 j 2 , β j 2 and µ j 2 . In Norets (2015), this is done by sampling the new values from the posterior, conditioning also on the other parameters (even if, for practical purposes, Gaussian approximations for conditional posteriors are used in the implementation of the algorithm). Instead, we sample µ j 2 , γ j 2 , σ 2 j 2 from the posterior of the following auxiliary modelỹ wherex j 2 andỹ j 2 represent covariates and responses in the new group with label j 2 , respectively. Parameter β j 2 is sampled from a p-dimensional Gaussian distribution with mean β mode and variance covariance matrix Σ mode . In particular, β mode is the argmax of the following expression which corresponds to an approximation of the full-conditional of the β k (we dropped the dependence on the other β j 's by considering the expected value in the denominator). Note that E exp β T j x i is not other than the moment generating function, thus it is equal to exp For the purpose of this illustration, we investigate the spatial relations in measurements of the AQI made on September 13th, 2015, at 16pm. We consider 1147 locations scattered around North and South America (the values of AQI have been standardized).
We ran the MCMC algorithm to fit model (11)-(13), (16)- (18), with a burn-in of 10,000, a thinning of 10 and a final sample size of 5,000. As before, β 0 = 0 and ν = 2.  Table S.9: Prior specification for the Air quality index dataset. The scale parameter φ appears in the g-prior specification of (18), while m and v denote prior mean and variance, respectively, of σ 2 k as in (16).    Table S.9 for the Air Quality Index dataset.
Sacramento, which shows the lowest predicted values of AQI, New York, where the environmental conditions are worse, and Monterrey, that presents an intermediate situation.
Similarly as for the Biopics dataset, we compare the inference under our model with the linear dependent Dirichlet process mixture model introduced in De Iorio et al. (2004). Prior information is fixed as follows: α is distributed according to the gamma(1, 1) distribution for Test AQ 5 , so that the prior mean and variance of K are 7.15 and 36, respectively, i.e.
the prior of K is vague. On the other hand, in Test AQ 6 the mass parameter α of the Dirichlet process is set equal to 0.15 such that E(K) = 2.09 and V ar(K) = 1.02, which approximately matches the prior information given on K (mean 1.996 and variance 1.29).
The baseline distribution is a multivariate Gaussian with mean vector 0 and a random covariance matrix which is given a non-informative prior and the hyperparameters of the inverse-gamma distribution for the variances of the mixture components are such that prior mean and variance are equal to 5 and 1, respectively. Posterior summaries can be found in S.6 Additional plots • Figure S.8 displays the graph of the power exponential spectral density ϕ(x; ρ, ν) in (10) when ρ is 2 (left) and 100 (right) and ν varies.
• Figure S.9 displays the value of C 0 (t) corresponding to the Gaussian spectral density where s = 0.5 and ρ varies as in the legend. The vertical dashed line represents the upper limit of the set S = − 1 2 , 1 2 . The approximation C 0 (t) ≃ 0 for t / ∈ S is perfectly adhered to when ρ is small. The higher the ρ is, the slower is the decay rate of the function C 0 (t).
• Figure  • Figure S.11 shows the predictive distribution for the response "box-office earnings" in the Biopics dataset application for cases (i) − (vi) with parameter setting E in Table 5.
• Figure S.12 diplays the cluster estimate for the Biopics dataset obtained under a linear dependent Dirichlet process model with prior specification G in Table 6 of the paper.  Table 5 for the Biopics dataset.  Table 6 of the paper.