Simple approximate MAP inference for Dirichlet process mixtures

The Dirichlet process mixture model (DPMM) is a ubiquitous, flexible Bayesian nonparametric statistical model. However, full probabilistic inference in this model is analytically intractable, so that computationally intensive techniques such as Gibbs sampling are required. As a result, DPMM-based methods, which have considerable potential, are restricted to applications in which computational resources and time for inference are plentiful. For example, they would not be practical for digital signal processing on embedded hardware, where computational resources are at a serious premium. Here, we develop a simplified yet statistically rigorous approximate maximum a-posteriori (MAP) inference algorithm for DPMMs. This algorithm is as simple as DP-means clustering and solves the MAP problem as well as Gibbs sampling, while requiring only a fraction of the computational effort. Unlike related small variance asymptotics (SVA), our method is non-degenerate and so inherits the "rich get richer" property of the Dirichlet process. It also retains a non-degenerate closed-form likelihood which enables out-of-sample calculations and the use of standard tools such as cross-validation. We illustrate the benefits of our algorithm on a range of examples and contrast it to variational, SVA and sampling approaches from both a computational complexity perspective as well as in terms of clustering performance. We demonstrate the wide applicability of our approach by presenting an approximate MAP inference method for the infinite hidden Markov model whose performance contrasts favorably with a recently proposed hybrid SVA approach. Similarly, we show how our algorithm can be applied to approximate MAP inference in a semiparametric hierarchical model of longitudinal data.


Introduction
Bayesian nonparametric (BNP) models have been successfully applied to a wide range of domains but despite significant improvements in computational hardware, statistical inference in most BNP models remains infeasible in the context of large datasets, or for moderate-sized datasets where computational resources are limited. The flexibility gained by such models is paid for with severe decreases in computational efficiency, and this makes these models somewhat impractical. This is an important example of the emerging need for approaches to inference that simultaneously minimize both empirical risk and computational complexity (Bousquet and Bottou, 2008). Towards that end we study a simple, statistically rigorous and computationally efficient approach for the estimation of BNP models that significantly reduces the computational burden involved, while keeping most of the model properties intact. In this work, we concentrate on inference for the Dirichlet process mixture model (DPMM) and for the infinite hidden Markov model (iHMM) (Beal et al., 2002) but our arguments are more general and can be extended to many BNP models.
DPMMs are mixture models which use the Dirichlet process (DP) (Ferguson, 1973) as a prior over the mixing distribution of the model parameters. The data are modeled with a mixture distribution with potentially infinitely many components. The DP is an adaptation of the discrete Dirichlet distribution to an infinite, uncountable sample space: where the Dirichlet distribution is defined over a finite, K-element sample space, taking the limit K → ∞ yields the DP. A draw from a DP is itself a probability distribution. A DP is the Bayesian conjugate prior to the empirical probability distribution, much as the discrete Dirichlet distribution is conjugate to the categorical distribution. Hence, DPs have value in Bayesian probabilistic models because they are priors over completely general probability distributions. DPs can also be used as building blocks for more complex hierarchical models; an example being the hierarchical DP hidden Markov model (HDP-HMM) for time series data, obtained by modeling the transition density in a standard HMM with a hierarchical Dirichlet process (HDP) (Teh et al., 2006).
An interesting property of DP-distributed functions is that they are discrete in the following sense: they are formed of an infinite, but countable, mixture of Dirac delta functions. Since each Dirac delta has zero measure everywhere except at a single point, the support of such a function is a set of discrete points. This discreteness means that draws from such distributions have a non-zero probability of being repeats of previous draws. Furthermore, the more often a sample is repeated, the higher the probability of that sample being drawn again, an effect known as the "rich get richer" property (known as preferential attachment in the network science literature (Barabási and Albert, 1999)). This repetition, coupled with preferential attachment, leads to another valuable property of DPs: samples from DP-distributed densities have a strong clustering property whereby N draws can be partitioned into K representative draws, where K ≤ N and K is not fixed a-priori.
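The rich-get-richer clustering behaviour described above is easy to see by simulating the sequential CRP construction directly. The following is a minimal sketch (our own illustrative code, not from the paper); `n0` plays the role of the concentration parameter N_0:

```python
import random

def sample_crp(n, n0, seed=0):
    """Draw one partition of n items from a Chinese restaurant process
    with concentration parameter n0 (N_0 in the text)."""
    rng = random.Random(seed)
    counts = []            # counts[k] = number of items in cluster k
    labels = []            # cluster indicator for each item
    for i in range(n):
        total = n0 + i     # normalizer: N_0 + (number already seated)
        r = rng.uniform(0.0, total)
        acc = 0.0
        for k, c in enumerate(counts):
            acc += c       # existing cluster chosen w.p. N_k / (N_0 + i)
            if r < acc:
                counts[k] += 1
                labels.append(k)
                break
        else:
            counts.append(1)           # new cluster w.p. N_0 / (N_0 + i)
            labels.append(len(counts) - 1)
    return labels, counts

labels, counts = sample_crp(600, 3.0)
```

With N = 600 and N_0 = 3, the number of occupied clusters is far smaller than N (it grows roughly like N_0 log N in expectation), and the cluster sizes are highly unequal, exactly the behaviour exploited by the DPMM.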
Inference in probabilistic models for which closed-form statistical estimation is intractable is often performed using computationally demanding Markov chain Monte Carlo (MCMC) techniques (Neal, 2000a; Teh et al., 2006; Van Gael et al., 2008), which generate samples from the distribution of the model parameters given the data. Despite the asymptotic convergence guarantees of MCMC, in practice MCMC often takes too long to converge and this can severely limit the range of applications. A popular alternative is to cast the inference problem as an optimization problem for which variational Bayes (VB) techniques can be used. Blei and Jordan (2004) first introduced VB inference for the DPMM, but their approach involves truncating the variational distribution of the joint DPMM posterior. Subsequently, collapsed variational methods (Teh et al., 2008) reduced the inevitable truncation error by working in a reduced-dimensional parameter space, but they are based on a sophisticated family of marginal likelihood bounds for which optimization is challenging. Streaming variational methods (Broderick et al., 2013a) obtain significant scaling by optimizing local variational bounds on batches of data, visiting data points only once, but as a result they can easily become trapped in poor local fixed points. Similarly, stochastic variational methods (Chong et al., 2011) also allow for a single pass through the data, but sensitivity to initial conditions increases substantially. Alternatively, methods which learn memoized statistics of the data in a single pass (Hughes and Sudderth, 2013; Hughes et al., 2015) have recently shown significant promise. Daumé (2007) describes a related approach for inference in DPMMs based on a combinatorial search that is guaranteed to find the optimum for objective functions which have a specific computational tractability property.
As the DPMM complete data likelihood does not have this particular tractability property, their algorithm is only approximate for the DPMM, and this also makes it sample-order dependent. On the other hand, Dahl (2009) describes an algorithm that is guaranteed to find the global optimum in N (N + 1) computations, but only in the case of univariate product partition models with non-overlapping clusters. By contrast, our approach does not make any further assumptions beyond the model structure and, being derived from the Gibbs sampler, does not suffer from sample-order dependency. Wang and Dunson (2011) present another approach for fast inference in DPMMs which discards the exchangeability assumption of the data partitioning and instead assumes the data are in the correct ordering. A greedy, repeated "uniform resequencing" is then proposed to maximize a pseudo-likelihood that approximates the DPMM complete data likelihood. This procedure has no guarantee of convergence, even to a local optimum. Zhang et al. (2014) extend this SUGS algorithm, introducing a variational approximation of the cluster allocation probabilities. This allows replacement of the greedy allocation updates with updates of an approximation of the allocation distribution. However, this extension also lacks optimality guarantees and is mostly useful in streaming data applications. Broderick et al. (2013b) propose a general approach to solving the MAP problem for a wide set of BNP models by forcing the spread of the likelihood of BNP models to zero. By making some additional simplifying assumptions, this approach reduces MCMC updates to a fast optimization algorithm that converges quickly to an approximate MAP solution.
However, this small variance asymptotics (SVA) reasoning breaks many of the key properties of the underlying probabilistic model: SVA applied to the DPMM (Jiang et al., 2012) loses the rich-get-richer effect of the infinite clustering, as the prior term over the partition drops from the likelihood; and degeneracy in the likelihood forbids any kind of rigorous out-of-sample prediction and thus, for example, cross-validation. Roychowdhury et al. (2013) impose somewhat more flexible SVA assumptions to derive an optimization algorithm for inference in the infinite hidden Markov model (iHMM). Although this approach overcomes some of the drawbacks of the SVA of Broderick et al. (2013b), the algorithm departs from the assumptions of the underlying probabilistic graphical model. The method is shown to be efficient for clustering time-dependent data, but essentially no longer has an underlying probabilistic model. Furthermore, Roychowdhury et al. (2013) demonstrate that there is more than one way of applying the SVA concept to a given probabilistic model, and therefore, under different choices of SVA assumptions, one obtains entirely different inference algorithms that find different structures in the data, even though the underlying probabilistic model remains the same. For example, HDP-means (Jiang et al., 2012) in the context of time series, and the alternative SVA approach of Roychowdhury et al. (2013), optimize different objective functions, even though they address inference for identical probabilistic models. To clarify this and other issues, we present a novel, unified exposition of the SVA approach in Section 4, highlighting some of its deficiencies, and we show how these can be overcome using the non-degenerate MAP inference algorithms proposed in this paper.
In Section 2 we review the collapsed Gibbs sampler for DPMMs and in Section 3 we show how the collapsed Gibbs sampler may be exploited to produce simplified MAP inference algorithms for DPMMs. As with DP-means, our algorithm provides only point estimates of the joint posterior. However, while DP-means follows the close relationship between K-means and the (finite) Gaussian mixture model (GMM) to derive a "nonparametric K-means", we exploit the concept of iterated conditional modes (ICM) (Kittler and Föglein, 1984). Experiments on both synthetic and real-world datasets are used to contrast the MAP-DP, collapsed Gibbs, DP-means and variational DP approaches in Section 5. In Section 6 we demonstrate how the MAP DPMM approach can be extended to the iHMM and contrast it to the hybrid SVA approach of Roychowdhury et al. (2013) in a simulation study. Finally, we demonstrate an application of our new algorithm to a hierarchical model of longitudinal health data in Section 7 and conclude with a discussion of future directions for this MAP approach for BNP models in Section 8.

Collapsed Gibbs sampling for Dirichlet process mixtures
The DPMM is arguably the most popular Bayesian nonparametric model; it extends finite mixture models to the infinite setting by use of the DP prior. In this work we will restrict ourselves to mixture models with exponential family data likelihoods. We will denote by X the full data matrix formed of the observed data points x_i, which are D-dimensional vectors x_i = (x_{i,1}, \ldots, x_{i,d}, \ldots, x_{i,D}); N_0 is the concentration parameter of the DP prior and G_0 is its base measure. The DPMM is then often written as:

G \sim \mathrm{DP}(N_0, G_0), \qquad \vartheta_i \,|\, G \sim G, \qquad x_i \,|\, \vartheta_i \sim F(\vartheta_i), \qquad i = 1, \ldots, N

where G is a mixing distribution drawn from a DP; the \vartheta_i are the atoms of G, which take repeated values, and F is the distribution of each data point given its atom. We can also write the mixing distribution G in terms of mixture weights \pi and the distinct values taken from \vartheta, denoted \theta: G = \sum_{k=1}^{\infty} \pi_k \delta_{\theta_k} and x_i \sim \sum_{k=1}^{\infty} \pi_k F(\theta_k), where \delta(\cdot) denotes the Dirac delta function. The probability of the data follows an infinite mixture distribution and, because this likelihood is not available in closed form, a Gibbs sampling procedure over it is not tractable. A widely used approach to overcome this issue is to collapse the mixture weights and model the data in terms of the cluster indicator variables z_1, \ldots, z_N:

z \sim \mathrm{CRP}(N_0, N), \qquad \theta_k \sim G_0, \qquad x_i \,|\, z_i, \theta \sim F(\theta_{z_i})

where, to simplify notation, we denote z = (z_1, \ldots, z_N) and \theta = (\theta_1, \ldots, \theta_K); CRP stands for the Chinese restaurant process, which is a discrete stochastic process over the space of partitions, or equivalently a probability distribution over cluster indicator variables. It is fully specified by the integer N (the number of observed data points) and a positive, real concentration parameter N_0. A draw from a CRP has probability:

p(z_1, \ldots, z_N) = \frac{\Gamma(N_0)}{\Gamma(N_0 + N)} \, N_0^K \prod_{k=1}^{K} \Gamma(N_k) \qquad (2.3)

where K is the (unknown) number of clusters in the partition and N_k = |\{i : z_i = k\}| is the number of indicators taking value k, with \sum_{k=1}^{K} N_k = N. For any finite N we will have K \le N, and usually K will be much smaller than N, so the CRP returns a partition of N elements into some smaller number of groups K.
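The CRP partition probability above is straightforward to evaluate in log space using log-gamma functions. The sketch below is our own illustrative code, not the paper's; `n0` stands for the concentration parameter N_0:

```python
import math
from collections import Counter

def crp_log_prob(z, n0):
    """Log of the CRP partition probability (Equation (2.3)):
    p(z) = Gamma(N_0) / Gamma(N_0 + N) * N_0^K * prod_k Gamma(N_k)."""
    counts = Counter(z)            # N_k for each of the K occupied clusters
    N, K = len(z), len(counts)
    return (math.lgamma(n0) - math.lgamma(n0 + N)
            + K * math.log(n0)
            + sum(math.lgamma(nk) for nk in counts.values()))
```

Because p(z) depends on the labels only through the multiset of cluster sizes N_1, ..., N_K, the function returns the same value for any relabeling of the same partition, reflecting the exchangeability of the CRP.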
The probability over indicators is constructed in a sequential manner using the following conditional probability:

p(z_n = k \,|\, z_1, \ldots, z_{n-1}) = \begin{cases} \dfrac{N_k}{N_0 + n - 1} & \text{for an existing cluster } k \\[4pt] \dfrac{N_0}{N_0 + n - 1} & \text{for a new cluster} \end{cases}

By increasing the value of n from 1 to N and multiplying the corresponding conditional probabilities, we obtain the joint distribution over indicators from Equation (2.3). The probability density function of x_i \sim F(\theta_{z_i}), associated with the component indicated by the value of z_i, is an exponential family distribution:

F(x_i; \theta_{z_i}) = \exp\left( \langle g(x_i), \theta_{z_i} \rangle - \psi(\theta_{z_i}) - h(x_i) \right)

where g(\cdot) is the sufficient statistic function, \psi(\theta_{z_i}) = \log \int \exp\left( \langle g(x), \theta_{z_i} \rangle - h(x) \right) dx is the log-partition function and h(x_i) is the base measure of the distribution. An important property of exponential family distributions is that the conjugate prior over the natural parameters \theta_k \sim G_0 exists and can be obtained in closed form:

p(\theta_k \,|\, \tau, \eta) = \exp\left( \langle \theta_k, \tau \rangle - \eta \, \psi(\theta_k) - \psi_0(\tau, \eta) \right)

where (\tau, \eta) are the prior hyperparameters of the base measure G_0 and \psi_0 is the log-partition function of the parameter distribution. From Bayesian conjugacy, the posterior p(\theta_k \,|\, X, \tau_k, \eta_k) will take the same form as the prior, where the prior hyperparameters \tau and \eta are updated to \tau_k = \tau + \sum_{j : z_j = k} g(x_j) and \eta_k = \eta + N_k. Inference can be accomplished via collapsed Gibbs sampling, presented as Algorithm 3 in Neal (2000b). This MCMC algorithm iteratively samples each component indicator z_i for i = 1, \ldots, N, conditional on all others, until convergence:

p(z_i = k \,|\, z_{-i}, X) \propto \begin{cases} N_{k,-i} \, p(x_i \,|\, \tau_k^{-i}, \eta_k^{-i}) & \text{for existing } k = 1, \ldots, K \\[2pt] N_0 \, p(x_i \,|\, \tau, \eta) & \text{for some new } k = K + 1 \end{cases} \qquad (2.7)

where the subscript -i denotes the removal of point i from consideration, z_{-i} = (z_1, \ldots, z_{i-1}, z_{i+1}, \ldots, z_N), the hyperparameters (\tau_k^{-i}, \eta_k^{-i}) are computed with point i removed, and p(x_i \,|\, \tau, \eta) is the posterior predictive density of point i obtained after integrating out the cluster parameters, p(x_i \,|\, \tau, \eta) = \int p(x_i \,|\, \theta) \, p(\theta \,|\, \tau, \eta) \, d\theta. Examples of posterior predictive densities for different exponential family likelihoods are presented in Appendix B.
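The collapsed Gibbs update of Equation (2.7) can be sketched for a toy conjugate model, here a univariate Gaussian likelihood with known variance and a conjugate normal prior on the cluster mean, rather than the general exponential family of the text. All names (`log_pred`, `gibbs_sweep`) and the model choice are our own illustrative assumptions:

```python
import math, random

def log_pred(x, n, s, sigma2, m0, s02):
    """Log posterior predictive p(x | tau_k, eta_k) for a Gaussian likelihood
    with known variance sigma2 and conjugate N(m0, s02) prior on the mean;
    the cluster sufficient statistics are the count n and the sum s."""
    prec = 1.0 / s02 + n / sigma2
    mean = (m0 / s02 + s / sigma2) / prec
    var = sigma2 + 1.0 / prec
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def gibbs_sweep(x, z, n0, sigma2, m0, s02, rng):
    """One sweep of the collapsed Gibbs sampler: each z_i is drawn with
    probability proportional to N_{k,-i} times the cluster predictive, or
    N_0 times the prior predictive for a new cluster (Equation (2.7))."""
    for i in range(len(x)):
        # sufficient statistics of every cluster with point i removed
        stats = {}
        for j, (xj, zj) in enumerate(zip(x, z)):
            if j != i:
                n, s = stats.get(zj, (0, 0.0))
                stats[zj] = (n + 1, s + xj)
        ks = list(stats)
        logw = [math.log(stats[k][0]) +
                log_pred(x[i], stats[k][0], stats[k][1], sigma2, m0, s02)
                for k in ks]
        logw.append(math.log(n0) + log_pred(x[i], 0, 0.0, sigma2, m0, s02))
        mx = max(logw)                      # stabilize before exponentiating
        w = [math.exp(v - mx) for v in logw]
        r = rng.uniform(0.0, sum(w))
        acc, choice = 0.0, len(ks)          # default: the new-cluster slot
        for idx, wk in enumerate(w):
            acc += wk
            if r < acc:
                choice = idx
                break
        z[i] = ks[choice] if choice < len(ks) else max(z) + 1
    return z

rng = random.Random(1)
x = [0.0, 0.2, -0.1, 6.0, 5.8, 6.1]
z = [0] * len(x)
for _ in range(25):
    z = gibbs_sweep(x, z, n0=1.0, sigma2=0.5, m0=0.0, s02=25.0, rng=rng)
```

Running several sweeps on this toy data typically separates the points near 0 from those near 6; being a sampler, the partition continues to fluctuate from sweep to sweep rather than settling at a fixed point.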

Introducing MAP-DP: A novel approximate MAP algorithm for collapsed DPMMs
In this section we propose a novel DPMM inference algorithm based on iteratively updating the cluster indicators with the values that maximize their posterior (MAP values). The cluster parameters are integrated out. This algorithm can also be seen as an "exact" version of the maximization-expectation (M-E) algorithm presented in Welling and Kurihara (2006). It is exact in the following sense: while the M-E algorithm is a kind of VB and, as with VB, makes a factorization assumption which departs from the underlying probabilistic model purely for computational simplicity and tractability, our algorithm is derived directly from the Gibbs sampler for the probabilistic model. Therefore, our algorithm does not introduce or require this simplifying factorization assumption. In Section 3.1 we describe how parameter inference for the DPMM is accomplished and in Section 3.2 we consider out-of-sample prediction.

Inference
As a starting point, we consider the DPMM introduced in Section 2. In our algorithm, we iterate through each of the cluster indicators z_i and update them with their respective MAP values. For each observation x_i, we compute the negative log probability for each existing cluster k and for a new cluster K + 1:

q_{i,k} = -\ln N_{k,-i} - \ln p\left(x_i \,|\, \tau_k^{-i}, \eta_k^{-i}\right), \quad k = 1, \ldots, K, \qquad q_{i,K+1} = -\ln N_0 - \ln p\left(x_i \,|\, \tau, \eta\right)

where terms independent of k may be omitted as they do not change with k.
For each observation x_i we compute the above (K+1)-dimensional vector q_i and select the cluster number according to:

z_i = \arg\min_{k \in \{1, \ldots, K+1\}} q_{i,k}

where N_{k,-i} is the number of data points assigned to cluster k, excluding data point x_i, and, for notational convenience, we define N_{K+1,-i} \equiv N_0. The algorithm proceeds to the next observation x_{i+1} by updating the cluster component statistics to reflect the new value of the cluster assignment z_i and to remove the effect of data point x_{i+1}. To check convergence of the algorithm we compute the negative log of the complete data likelihood:

E = -\sum_{i=1}^{N} \sum_{k=1}^{K} \delta(z_i, k) \ln p\left(x_i \,|\, \tau_k, \eta_k\right) - \ln p\left(z_1, \ldots, z_N\right)

where \delta(z_i, k) is the Kronecker delta and p(z_1, \ldots, z_N) is the CRP partition function (Pitman, 1995) given in Equation (2.3). We show in Algorithm 1 all the steps involved in approximately maximizing this complete data likelihood. It is worth pointing out that, unlike MCMC approaches, MAP-DP does not increase the negative log of the complete data likelihood at any step and as a result is guaranteed to converge to a fixed point. Where the convergence reached by MCMC sampling is convergence in distribution to the stationary posterior measure, the convergence of MAP-DP is only to a local maximum of the posterior measure, and so it is much quicker to reach. The main disadvantages are that the solution at convergence is only guaranteed to be a local maximum and that information about the whole distribution of the posterior is lost. Multiple restarts using random permutations of the data can be used to overcome poor local maxima. With MAP-DP it is possible to learn all model hyperparameters, as we discuss in Appendix A, and this is a strong advantage over the fast SVA approaches.
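The iteration just described can be sketched in a few lines for a toy conjugate model (a univariate Gaussian likelihood with known variance and conjugate normal prior on the cluster mean, not the general exponential family of the text). This is our own illustrative code, not the paper's Algorithm 1, and it omits the hyperparameter learning of Appendix A:

```python
import math

def log_pred(x, n, s, sigma2, m0, s02):
    # Posterior predictive for the toy conjugate model: Gaussian likelihood
    # with known variance sigma2, conjugate N(m0, s02) prior on the mean.
    prec = 1.0 / s02 + n / sigma2
    mean = (m0 / s02 + s / sigma2) / prec
    var = sigma2 + 1.0 / prec
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def map_dp(x, n0, sigma2, m0, s02, max_iter=100):
    """MAP-DP sketch: each z_i is set to the argmin of the K+1 values
    q_{i,k} = -ln N_{k,-i} - ln p(x_i | tau_k, eta_k), with N_{K+1,-i} = N_0,
    iterating until no indicator changes."""
    z = [0] * len(x)
    for _ in range(max_iter):
        changed = False
        for i in range(len(x)):
            # sufficient statistics of every cluster with point i removed
            stats = {}
            for j, (xj, zj) in enumerate(zip(x, z)):
                if j != i:
                    n, s = stats.get(zj, (0, 0.0))
                    stats[zj] = (n + 1, s + xj)
            best_k, best_q = None, float("inf")
            for k, (n, s) in stats.items():
                q = -math.log(n) - log_pred(x[i], n, s, sigma2, m0, s02)
                if q < best_q:
                    best_k, best_q = k, q
            # the (K+1)-th entry: open a new cluster, weighted by N_0
            q_new = -math.log(n0) - log_pred(x[i], 0, 0.0, sigma2, m0, s02)
            if q_new < best_q:
                best_k = max(z) + 1
            if best_k != z[i]:
                z[i], changed = best_k, True
        if not changed:
            break
    return z

z = map_dp([0.0, 0.1, -0.1, 5.0, 5.1, 4.9],
           n0=1.0, sigma2=1.0, m0=0.0, s02=25.0)
```

Note that the only change relative to the collapsed Gibbs sampler is replacing the random draw from the K+1 probabilities with their argmin, which is what makes the procedure deterministic and guarantees a non-increasing objective.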

Out-of-sample prediction
To compute the out-of-sample likelihood for a new observation x_{N+1} we consider two approaches that differ in how the indicator z_{N+1} is treated:

1. Mixture predictive density. The unknown indicator z_{N+1} can be integrated out, resulting in a mixture density:

p\left(x_{N+1} \,|\, z, X\right) = \sum_{k=1}^{K+1} p\left(z_{N+1} = k \,|\, z, N_0, X\right) p\left(x_{N+1} \,|\, z, X, z_{N+1} = k\right)

The assignment probability p(z_{N+1} = k \,|\, z, N_0, X) is \frac{N_k}{N_0 + N} for an existing cluster and \frac{N_0}{N_0 + N} for a new cluster. The second term corresponds to the predictive distribution of point N + 1 according to the predictive densities p(x_{N+1} \,|\, z, X, \tau_k, \eta_k, z_{N+1} = k) and p(x_{N+1} \,|\, \tau, \eta, z_{N+1} = K + 1) for an existing and a new cluster, respectively.

2. MAP cluster assignment. We can also use a point estimate for z_{N+1} by picking the minimum negative log posterior of the indicator p(z_{N+1} \,|\, x_{N+1}, N_0, z, X); equivalently:

z_{N+1}^{\mathrm{MAP}} = \arg\min_{k} \left[ -\ln p\left(x_{N+1} \,|\, z, X, z_{N+1} = k\right) - \ln p\left(z_{N+1} = k \,|\, N_0, z, X\right) \right]

where p(x_{N+1} \,|\, z, X, z_{N+1} = k) and p(z_{N+1} = k \,|\, N_0, z, X) are exactly as above. This approach is useful for clustering applications when we are interested in estimating z_{N+1}^{MAP} explicitly. Once the MAP assignment for point N + 1 is updated, we estimate the probability of x_{N+1} given that it belongs to the component indicated by z_{N+1}^{MAP}.

The first (marginalization) approach is used in Blei and Jordan (2004) and is more "robust" as it incorporates the probability of all cluster components, while the second (modal) approach can be useful in cases where only a point cluster assignment is needed. Integrating over the variable z_{N+1} for more robust estimation of x_{N+1} is an example of the well-studied process known as Rao-Blackwellization (Blackwell, 1947), which is often used in Bayesian inference for improving the quality of statistical estimation of uncertainty. Even when using the first approach, however, the mixture density is still computed assuming point assignments for the training data z_1, \ldots, z_N. Therefore the predictive density obtained using MAP-DP will be comparable to the one obtained using Gibbs sampler inference only when the sufficient statistics N_1, \ldots, N_K of the categorical likelihood for the assignment variables estimated from a Gibbs chain are similar to the ones estimated from the modal estimates for z_1, \ldots, z_N. Empirically, we have observed this often to be the case. Furthermore, we have noticed that the predictive density for popular (with a lot of points) cluster components tends to be well approximated by MAP-DP, whereas the effect of the smaller cluster components diminishes when using only modal estimates for z. Note that the DPMM usually models data with a lot of inconsistent small spurious components (Miller and Harrison, 2013); those, and any consistent components with small effect, are likely to be ignored when using MAP-DP, as we later show in Section 5.1. To summarize, using only modal estimates for the cluster assignments we are likely to infer correctly only larger components which have a large effect on the model likelihood and which will also affect the estimated predictive density accordingly.
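For concreteness, the mixture predictive density of the first approach can be sketched for a toy univariate conjugate Gaussian model (our own illustrative code; the paper works with general exponential families, and the names `log_pred` and `mixture_predictive` are ours):

```python
import math

def log_pred(x, n, s, sigma2, m0, s02):
    # Posterior predictive for the toy conjugate model: Gaussian likelihood
    # with known variance sigma2, conjugate N(m0, s02) prior on the mean.
    prec = 1.0 / s02 + n / sigma2
    mean = (m0 / s02 + s / sigma2) / prec
    var = sigma2 + 1.0 / prec
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def mixture_predictive(x_new, x, z, n0, sigma2, m0, s02):
    """Out-of-sample mixture predictive: z_{N+1} is summed out with weights
    N_k / (N_0 + N) for existing clusters and N_0 / (N_0 + N) for a new one,
    conditional on the fitted point assignments z."""
    N = len(x)
    stats = {}
    for xi, zi in zip(x, z):
        n, s = stats.get(zi, (0, 0.0))
        stats[zi] = (n + 1, s + xi)
    # new-cluster term: prior predictive weighted by N_0 / (N_0 + N)
    dens = n0 / (n0 + N) * math.exp(log_pred(x_new, 0, 0.0, sigma2, m0, s02))
    for n, s in stats.values():
        dens += n / (n0 + N) * math.exp(log_pred(x_new, n, s, sigma2, m0, s02))
    return dens

x = [0.0, 0.1, -0.1, 5.0, 5.1, 4.9]
z = [0, 0, 0, 1, 1, 1]
p_near = mixture_predictive(0.0, x, z, 1.0, 1.0, 0.0, 25.0)
p_far = mixture_predictive(50.0, x, z, 1.0, 1.0, 0.0, 25.0)
```

Because the weights and component predictives are all proper densities, this predictive remains non-degenerate, which is exactly what SVA-style degenerate likelihoods lose.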

Another look at small variance asymptotics (SVA)
The novel MAP-DP algorithm presented here has the "flavor" of an SVA-like algorithm, but there are critical differences and advantages, which we discuss in detail in this section.
Firstly, some background to the SVA approach is required. There exists a well-known connection between the expectation-maximization (E-M) algorithm for the finite Gaussian mixture model (GMM) and K-means: by assuming a GMM with equal, spherical (diagonal) component covariance matrices, we can obtain K-means from the E-M algorithm for the corresponding GMM by shrinking the component variance in each dimension to 0. This approach is more recently referred to as the small variance asymptotic (SVA) derivation of the K-means algorithm (Bishop, 2006, page 423).
Using Bregman divergences D_\phi(\cdot, \cdot) (see Appendix C), Banerjee et al. (2005) extended the SVA reasoning to any finite mixture of exponential family distributions, from which K-means-like clustering procedures can be derived. Banerjee et al. (2005) showed that the likelihood of point x_i from component k given the component parameter \theta_k, and the posterior of parameter \theta_k given its posterior hyperparameters, can be rewritten in terms of Bregman divergences between x_i and the expectation parameter \mu_k, with h(\cdot) and \psi_0(\cdot) denoting the base measures of the corresponding distributions and \mu the expectation parameter satisfying \mu = \nabla\psi(\theta). Jiang et al. (2012) extended this more compact form to the nonparametric DPMM and, with some further assumptions, derived a nonparametric K-means-like algorithm that we now review in detail. Consider the DPMM above (with non-integrated component parameters), but with a scaled exponential family likelihood \tilde{F}(\tilde\theta) that is parameterized by a scaled natural parameter \tilde\theta = \xi\theta and log-partition function \tilde\psi(\tilde\theta) = \xi\psi(\tilde\theta/\xi), for some \xi > 0. Further assume that the prior hyperparameters of the natural parameter are also scaled appropriately, such that \tilde\tau = \xi\tau and \tilde\eta = \xi\eta. It is then straightforward to see that the convex conjugate of \tilde\psi will also be scaled, so that \tilde\phi = \xi\phi. Jiang et al. (2012) have shown that \tilde{F}(\tilde\theta) has the same mean as F(\theta), but scaled covariance, \mathrm{cov}(\tilde\theta) = \mathrm{cov}(\theta)/\xi. Let us also assume that N_0 is a function of \xi, \eta and \tau, taking the form N_0 \propto \exp(-\xi\lambda), up to factors depending on \tau, \eta and the data dimension D (which is unrelated to D_\phi(\cdot, \cdot)), for some free parameter \lambda that will replace the concentration parameter in the new formulation. We can then let \xi \to \infty in the scaled exponential family DPMM, so that \mathrm{cov}(\tilde\theta) \to 0. Following Jiang et al. (2012), we can write the Gibbs sampler probabilities in terms of D_\phi(\cdot, \cdot) (see Appendix C), after canceling the f_{\tilde\phi}(x_i) terms from all probabilities:

p(z_i = k \,|\, z_{-i}, X) \propto \begin{cases} N_{k,-i} \exp\left( -\xi D_\phi(x_i, \mu_k) \right) & \text{for existing } k \\[2pt] N_0 \, \gamma(x_i, \xi) & \text{for new } k = K + 1 \end{cases}

where \gamma(x_i, \xi) approaches a positive, finite constant for a given x_i as \xi \to \infty, and we have used the fact that, for a Bregman divergence, D_{\xi\phi}(\cdot, \cdot) = \xi D_\phi(\cdot, \cdot).
Now, both of the above probabilities become binary (taking only the values 0 or 1) as \xi \to \infty, so that all the probability mass concentrates on the smallest of the K + 1 exponents. That is, the data point x_i will be assigned to the nearest cluster in terms of Bregman divergence, provided that divergence is at most \lambda. If the closest mean has a divergence greater than \lambda, we create a new cluster containing only x_i.
The posterior distribution over the cluster parameters for some component k concentrates around the sample mean of the points assigned to that component as \xi \to \infty, so we update the cluster means with the sample mean of the data points in each cluster, as with the corresponding parameter update step in K-means. The resulting algorithm approximately minimizes the following objective function over (z, \mu):

\sum_{k=1}^{K} \sum_{i : z_i = k} D_\phi(x_i, \mu_k) + \lambda K

A similar objective function, omitting the penalty term \lambda K, was utilized in Banerjee et al. (2005) in the context of finite mixture models. Although this algorithm is straightforward, it has various drawbacks in practice. The most troublesome is that the functional dependency between the concentration parameter and the covariances destroys the rich-get-richer property of the DPMM, because the counts of assignments to components N_{k,-i} no longer influence which component an observed data point is assigned to. Only the geometry in the data space matters. A new cluster is created by comparing the parameter \lambda against the distances between cluster centers and data points, so that the number of clusters is controlled by the geometry alone, and not by the number of data points already assigned to each cluster. So, for high-dimensional datasets, it is not clear how to choose the parameter \lambda. By contrast, in the DPMM Gibbs sampler, the concentration parameter N_0 controls the rate at which new clusters are produced in a way which is, for fixed geometries, independent of the geometry. Another problem is that shrinking diagonal covariances to zero variance means that the component likelihoods become degenerate Dirac point masses, which causes likelihood comparisons to be meaningless since the likelihood becomes infinite. So, we cannot choose parameters such as \lambda using standard model selection methods such as cross-validation.
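For squared Euclidean distance, the Bregman divergence of the spherical Gaussian, this SVA limit is exactly the DP-means algorithm. A minimal sketch (our own illustrative code, univariate for brevity) makes the purely geometric assignment rule explicit:

```python
def dp_means(x, lam, max_iter=100):
    """DP-means sketch: assign each point to the nearest centre in squared
    Euclidean distance, opening a new cluster whenever that distance exceeds
    lam; then update centres with the sample means, as in K-means."""
    mu = [x[0]]                      # initialise with a single centre
    z = [0] * len(x)
    for _ in range(max_iter):
        changed = False
        for i, xi in enumerate(x):
            d = [(xi - m) ** 2 for m in mu]
            k = min(range(len(d)), key=d.__getitem__)
            if d[k] > lam:           # no centre within lam: open a new cluster
                mu.append(xi)
                k = len(mu) - 1
            if k != z[i]:
                z[i], changed = k, True
        for k in range(len(mu)):     # K-means-style centre update
            pts = [xi for xi, zi in zip(x, z) if zi == k]
            if pts:
                mu[k] = sum(pts) / len(pts)
        if not changed:
            break
    return z, mu

z, mu = dp_means([0.0, 0.1, -0.1, 5.0, 5.1, 4.9], lam=1.0)
```

Notice that the cluster counts appear nowhere in the assignment step: two centres at equal distance are indistinguishable regardless of how many points they already contain, which is precisely the lost rich-get-richer property discussed above.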
While Jiang et al. (2012) can be seen as a nonparametric extension of the well-known derivation of K-means from the E-M algorithm, Roychowdhury et al. (2013) have suggested an alternative SVA approach in the context of the iHMM. Here we review their approach in the context of DPMMs and discuss their original formulation in Section 6.3. Instead of simply reducing the diagonal likelihood covariance to zero variance in each dimension, Roychowdhury et al. (2013) represent the categorical distribution over the latent variables z_1, \ldots, z_N in the more general exponential family form. The conditional distribution of the cluster indicator for point i given the mixture weights is:

p(z_i \,|\, \pi) = \prod_{k=1}^{K} \pi_k^{\mathbb{1}[z_i = k]} \qquad (4.4)

where \pi = (\pi_k)_{k=1}^{K} is the vector of mixture weights. Written in exponential family form, we can now also scale the variance of the categorical distribution over z. Furthermore, Roychowdhury et al. (2013) assume an additional dependency (which is not part of the DPMM) between the distribution of the cluster indicators and the component mixture distribution, in order for their diagonal variances to approach zero simultaneously. That is, while Jiang et al. (2012) change the underlying DPMM structure only by assuming shrinking covariance, Roychowdhury et al. (2013) modify the underlying DPMM such that the conditional independence of the cluster parameters and cluster indicators no longer holds. Let us replace the distribution from Equation (4.4) with a scaled one, with convex conjugate \tilde\phi = \tilde\xi\phi, which keeps the same mean as in Equation (4.4). Then, following Roychowdhury et al. (2013), we assume that the likelihood \tilde{F}(\tilde\theta) is scaled with \xi and that the equality \tilde\xi = \lambda_1 \xi holds for some real \lambda_1 > 0. Taking \xi \to \infty then results in the appropriate simultaneous scaling. After taking the limit and removing the constant terms, we obtain the objective function of this new SVA approach:

\sum_{i=1}^{N} \left[ D_\phi(x_i, \mu_{z_i}) + \lambda_1 D_\phi(z_i, \pi) \right] + \lambda K

which is optimized with respect to (z, \mu, \pi), and where D_\phi(z_i, \pi) = -\log \pi_{z_i}.
Optimization with respect to the mixture weights yields the empirical probabilities \pi_k = \frac{N_k}{N}. This objective function can then be rewritten as:

\sum_{i=1}^{N} \left[ D_\phi(x_i, \mu_{z_i}) - \lambda_1 \log \frac{N_{z_i}}{N} \right] + \lambda K

The E-M-like procedure that optimizes this objective function computes, for each observation x_i, the K divergences to each of the existing clusters, D_\phi(x_i, \mu_k) for k = 1, \ldots, K. Then, it takes into account the number of data points in each component by adjusting the corresponding divergence for cluster k, subtracting \lambda_1 \log \frac{N_k}{N}. After computing these adjusted distances, observation x_i is assigned to the closest cluster, unless \lambda is smaller than all of these adjusted distances, in which case a new cluster is created. Then the cluster means are updated with the sample means of the observations assigned to each cluster and, in addition, we now have to update the counts N_1, \ldots, N_K.
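The adjusted-distance assignment step can be sketched as follows (our own illustrative code, using squared Euclidean distance as D_\phi; `lam` and `lam1` stand for \lambda and \lambda_1):

```python
import math

def sva_assign(xi, mu, counts, n_total, lam, lam1):
    """One assignment step of the count-aware SVA rule: the divergence to each
    existing cluster k is adjusted by -lam1 * log(N_k / N); a new cluster is
    opened when lam is smaller than every adjusted distance."""
    adj = [(xi - m) ** 2 - lam1 * math.log(c / n_total)
           for m, c in zip(mu, counts)]
    k = min(range(len(adj)), key=adj.__getitem__)
    if lam < adj[k]:
        return len(mu)       # signal to the caller: create a new cluster
    return k
```

Unlike plain DP-means, two equidistant centres are no longer tied: the more populated cluster wins, restoring a reinforcement effect, though still not the one induced by the CRP prior of the DPMM.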
By contrast to the SVA algorithm proposed by Jiang et al. (2012), the SVA algorithm of Roychowdhury et al. (2013) no longer clusters the data purely on geometric considerations, but also takes into account the number of data points in each cluster. In this respect the method has greater flexibility but, at the same time, unlike MAP-DP, we can see that SVA algorithms do not actually optimize the complete data likelihood of the original underlying probabilistic DPMM which motivates their derivation. By assuming additional ad-hoc dependencies between the likelihood distribution and the distribution over the indicator variables, SVA algorithms effectively start from a different underlying probabilistic model which is not explicitly given. This makes them less principled and more heuristic than the MAP-DP algorithm we present here. So, while SVA algorithms are quite simple, they sacrifice several key statistical principles, including structural interpretability and the existence of an underlying probabilistic generative model.

DPMM experiments
This section provides some empirical results so that we can compare the performance of MAP-DP against existing approaches.

Synthetic CRP parameter estimation
We examine the performance of MAP-DP, collapsed Gibbs, DP-means and variational DP (Blei and Jordan, 2004) on CRP-partitioned, non-spherical Gaussian data, in terms of estimation error and computational effort. We generate 100 samples from a two-dimensional DPMM. The partitions are sampled from a CRP with fixed concentration parameter N_0 = 3 and data size N = 600. Gaussian component parameters are sampled from a normal-Wishart (NW) prior with parameters \mu_0 = [2, 3], c_0 = 0.5, \nu_0 = 30 and \Lambda_0 = \begin{pmatrix} 2 & 1 \\ 1 & 3 \end{pmatrix}. This prior ensures a combination of both well-separated and overlapping clusters. We fit the model using the MAP-DP, variational DP and Gibbs algorithms, using the ground truth model values for the NW prior and the N_0 used to generate the data. Convergence of the Gibbs algorithm is tested using the Raftery diagnostic (q = 0.025, r = 0.1, s = 0.95) (Raftery and Lewis, 1992). We use a high convergence acceptance tolerance of r = 0.1 to obtain less conservative estimates of the number of iterations required. We use the most likely value from the Gibbs chain after the burn-in samples (1/3 of the samples) have been removed. Clustering estimation accuracy is measured using the normalized mutual information (NMI) metric (Vinh et al., 2010). The parameter \lambda for DP-means is set using a binary search procedure such that the algorithm gives rise to the correct number of partitions (see Appendix D). This approach favours DP-means, as it is given knowledge of the true number of clusters which is not available to the other algorithms. For variational DP we set the truncation limit to ten times the number of clusters in the current CRP sample.
Both MAP-DP and Gibbs achieve similar clustering performance in terms of NMI whilst variational DP and DP-means have lower scores (Table 1). MAP-DP requires the smallest number of iterations to converge, with the Gibbs sampler requiring, on average, 140 more iterations and DP-means 1.8 times as many. In Figure 5.1(a) the median partitioning is shown in terms of the partitioning N_k/N and the number of clusters. As expected when using a CRP prior, the sizes of the different clusters vary significantly, with many small clusters containing only a few observations. MAP-DP and variational DP fail to identify the smaller clusters whereas the Gibbs sampler is able to do so to a greater extent. This is a form of underfitting where the algorithm captures the mode of the partitioning distribution but fails to put enough mass on the tails (the smaller clusters). The NMI scores do not reflect this effect as the impact of the smaller clusters on the overall measure is minimal. The poorer performance of the DP-means algorithm can be attributed to the non-spherical nature of the data as well as the lack of the reinforcement effect, which leads to underestimation of the larger clusters and overestimation of the smaller clusters. This poor performance of DP-means is confirmed by modifying the CRP experiment to sample from spherical clusters (Figure 5.1(b)). The CRP is again sampled 100 times; the MAP-DP algorithm attains an NMI score of 0.88 (0.1) and DP-means 0.73 (0.1). As the clusters are spherical, the lower performance of DP-means is solely explained by the lack of the reinforcement effect.

UCI datasets
Next, we compare DP-means, MAP-DP and Gibbs sampling on six UCI machine learning repository datasets (Blake and Merz, 1998): Wine, Iris, Breast cancer, Soybean, Pima and Vehicle. We assess the performance of the methods using the same NMI measure as in Section 5.1. Class labels in the datasets are treated as cluster numbers. There is either no or a negligibly small number of missing values in each of the datasets. The data types vary between datasets and features: Wine consists of integer and real data; Iris contains real data; Breast cancer consists of integer and categorical data; Soybean is categorical data; Pima is real data and Vehicle consists of integer data. As in Section 5.1 we stop the Gibbs sampler using the Raftery diagnostic (Raftery and Lewis, 1992). For DP-means, we choose λ to give the true number of clusters in the corresponding dataset. For the Gibbs algorithm, we report the NMI of the most likely clustering from the whole chain of samples (Table 2). We also report two standard deviations of the NMI computed at each sample of the chain after burn-in.
On almost all of the datasets (5 out of 6), MAP-DP is comparable to, or even better than, the Gibbs sampler, and on 4 out of 6 datasets it performs as well as or better than DP-means (Table 2). DP-means performs well on lower-dimensional datasets with a small number of clusters. In higher dimensions, it is more likely for the clusters to be elliptical rather than spherical, and in such cases the other algorithms outperform DP-means because of their more flexible model assumptions. In addition, for higher-dimensional data it is more often the case that the different features have different numerical scaling, so the squared Euclidean distance used in DP-means is inappropriate. Furthermore, MAP-DP and the Gibbs sampler are more robust to smaller clusters due to the longer tails of the Student-t predictive distribution (Appendix B) and the rich-get-richer effect of existing clusters assigned many observations. DP-means is particularly sensitive to geometric outliers and can easily produce excessive numbers of spurious clusters for poor choices of λ. Even though MAP-DP only gives a point estimate of the full joint posterior distribution, in practice it can achieve higher NMI scores than Gibbs due to MCMC convergence issues.
We emphasize that these algorithms attempt to maximize the model fit rather than to maximize NMI. The true labels would not be available in practice and it is not always the case that maximizing the likelihood also maximizes NMI. Furthermore, if we choose the model hyperparameters for each dataset separately, by minimizing the negative log likelihood with respect to each parameter, higher NMI can be achieved, but choosing empirical estimates for the model parameters simplifies the computations.
In all cases, the MAP-DP algorithm converges more rapidly than the other algorithms (Table 3). The Gibbs sampler typically takes more than 1000 additional iterations compared to MAP-DP to achieve comparable NMI scores. The computational complexity per iteration of Gibbs and MAP-DP is comparable, requiring the computation of the same quantities. This makes the Gibbs sampler significantly less efficient than MAP-DP in finding a good labeling for the data. The computational price per iteration of DP-means can often be considerably smaller than for MAP-DP or the Gibbs sampler, as one iteration often does not include a scan through all N data points: the scan ends whenever a new cluster has to be created, unlike in MAP-DP and Gibbs. However, this also implies that DP-means requires more iterations to converge than MAP-DP.

MAP-DP for infinite hidden Markov models
The simplicity of MAP-DP makes it straightforward to extend to more complex nonparametric models, such as the popular time series hidden Markov model. Mirroring the approach taken in Section 3, we can obtain an approximate MAP algorithm for the infinite HMM (Beal et al., 2002) (also known as the HDP-HMM (Teh et al., 2006)) for modeling sequential, time-series data.
HMMs can be seen as a generalization of finite mixture models where the cluster indicators that denote mixture component assignments are not independent of each other, but related through a Markov process. That is, each data point x_t of a sequence of observations (x_1, ..., x_T) is drawn independently of the other observations when conditioned on the state variable for time t, z_t. The state variables are linked through a state transition matrix where each row defines a mixture model for one of the values of the categorical distribution on the states. The current state z_t indexes a specific row of the transition matrix, with the probabilities in this row serving as the mixing proportions for the choice of the next state z_{t+1}. Therefore the HMM does not involve a single mixture model, but rather a set of different mixture models, one for each value of the current state. Beal et al. (2002) showed that if we replace the mixture models with a set of DPs, one for each value of the current state, a nonparametric variant of the HMM is obtained that allows an unbounded set of states. In order to obtain sharing of available states across the sequence, the atoms associated with the state-conditional DPs are shared, and so the transition matrix is modeled with an HDP (Teh et al., 2006).
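The mechanics above, where a row of the transition matrix acts as the mixing distribution over next states, can be sketched with a toy finite-state simulator (an illustrative example with hypothetical names, not the iHMM itself):

```python
import random

def simulate_hmm(T, trans, means, sigma, rng=None):
    """Simulate a Gaussian-emission HMM: the current state z indexes a row
    of the transition matrix, and that row acts as the mixing distribution
    for the choice of the next state."""
    rng = rng or random.Random(1)
    z = 0
    states, obs = [], []
    for _ in range(T):
        # Draw the next state from the categorical given by row trans[z].
        r, acc = rng.random(), 0.0
        for k, p in enumerate(trans[z]):
            acc += p
            if r < acc:
                z = k
                break
        states.append(z)
        obs.append(rng.gauss(means[z], sigma))
    return states, obs

states, obs = simulate_hmm(200, [[0.9, 0.1], [0.1, 0.9]], [0.0, 5.0], 0.5)
```

Replacing each row of `trans` by a draw from a DP (with atoms shared through the HDP) is precisely the step that makes the number of states unbounded.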
Let us denote the base measure of the HDP by H, where we restrict H to an exponential family distribution (for Bayesian conjugacy); N_0 and M_0 are the local and global concentration parameters; z_{t−1} indicates the state chosen at time t − 1. The HDP-HMM can then be written as:

G_0 | M_0, H ∼ DP(M_0, H),
G_k | N_0, G_0 ∼ DP(N_0, G_0) for each state k,
ϑ_t | z_{t−1} ∼ G_{z_{t−1}},  x_t | ϑ_t ∼ F(ϑ_t),

where the distribution over the data F(ϑ_t) is a mixture distribution whose mixing distribution G_{z_{t−1}} over ϑ is determined by the state pointed to by z_{t−1}. We can also write the mixing distribution in terms of the transition matrix π and the base measure atoms θ: G_{z_{t−1}} = Σ_{k=1}^∞ π_{z_{t−1},k} δ_{θ_k} and x_t ∼ Σ_{k=1}^∞ π_{z_{t−1},k} F(θ_k). The rows of the transition matrix π denote the mixture weights for each of the local DPs, while θ_1, ..., θ_K are the shared atoms which are the same across G_1, ..., G_K, where K denotes the number of represented components, which in a nonparametric setting is unknown. The global DP is then characterized by G_0 = Σ_{k=1}^∞ κ_k δ_{θ_k}, where the elements of κ are the mixture weights of this global DP.

Gibbs sampler
As in the case of DPMMs (Section 2), exact inference is not available and Gibbs sampling is a standard choice for inference. A popular approach is based upon the direct assignment sampling scheme for the HDP (Teh et al., 2006), where the cluster parameters θ and the transition matrix π are integrated out. The resulting Gibbs sampler then iterates between sampling the state indicators z and the global mixture weights κ. The mixture weights can be sampled from the corresponding Dirichlet posterior:

κ | z ∼ Dir(M_1, ..., M_K, M_0),

where M_k counts how many times the transition to state k has been drawn from the global DP. In this representation, we do not keep the assignment variables for the global DP and some extra effort is needed to compute the global counts; we only use the assignments z_t of a point to its corresponding state, and those assignments are used to compute the local DP counts N_{pk} = |{t : z_{t−1} = p, z_t = k}|, which count how many times a transition occurred from state p to state k. Let m_{pk} be the count of how many times a transition from state p to state k has been drawn as a new transition from the global DP. We then have M_k = Σ_p m_{pk}. It is straightforward to see that if N_{pk} = 0 then m_{pk} = 0, and that if N_{pk} > 0 then m_{pk} ≥ 1, because the first time the transition occurs from state p to state k it is sampled from the global DP. Furthermore, the count m_{pk} is bounded above by N_{pk}, as at most all transitions from state p to state k are sampled from the global DP. We can then use the Polya urn scheme underlying the transition probability from state p to state k to sample m_{pk}: at the j-th draw,

p(draw) ∝ { j − 1    for an existing transition from p to k,
            N_0 κ_k  for a new transition from p to k.    (6.2)

Due to the exchangeability of the rows in the transition matrix, we can sample from Equation (6.2) sequentially N_{pk} times, gradually increasing j, and keeping count of how many times the second term (a new transition) has been sampled. The recorded count m_{pk} is unbiased and can be summed over p to obtain M_k (Van Gael, 2012).
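The sequential urn sampling of m_pk can be sketched as below; this is an illustrative replay of Equation (6.2), under the convention that the first transition always constitutes a new draw from the global DP, and the function name is ours:

```python
import random

def sample_table_count(N_pk, N0_kappa_k, rng=None):
    """Sample m_pk, the number of times the transition p -> k was drawn as
    a new draw from the global DP, by replaying the Polya urn over the
    N_pk observed transitions (Equation (6.2))."""
    rng = rng or random.Random(0)
    if N_pk == 0:
        return 0          # if N_pk = 0 then m_pk = 0
    m = 1                 # the first p -> k transition is always new
    for j in range(1, N_pk):
        # j draws made so far: existing w.p. j / (j + N0*kappa_k),
        # new draw from the global DP w.p. N0*kappa_k / (j + N0*kappa_k).
        if rng.random() < N0_kappa_k / (j + N0_kappa_k):
            m += 1
    return m
```

The invariants 0 ≤ m_pk ≤ N_pk and (m_pk ≥ 1 whenever N_pk > 0) hold by construction.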
In order to update the state indicators, for t = 1, ..., T we sample z_t from the corresponding probability:

p(z_t = k | z_{−t}, x_t, κ) ∝ p(x_t | z_t = k, X_{−t}) p(z_t = k | z_{−t}, κ),

where z_{−t} denotes all indicators excluding the one for data point t. The first term is the posterior predictive distribution of data point x_t given its state, and it is obtained after integrating out the state parameters θ, assuming a conjugate prior for the exponential family likelihood. The resulting distribution is computed in the same way as in Section 3.1. To compute the second term, recall that the DPs for each row of the transition matrix are independent given the shared base measure κ. Then, if N_{k·} denotes the number of transitions from state k and N_{·k} the number of transitions to state k, we can write:

p(z_t = k | z_{−t}, κ) ∝
{ (N_{z_{t−1},k} + N_0 κ_k)(N_{k,z_{t+1}} + N_0 κ_{z_{t+1}}) / (N_{k·} + N_0)          for existing k with z_{t−1} ≠ k,
  (N_{z_{t−1},k} + N_0 κ_k)(N_{k,z_{t+1}} + 1 + N_0 κ_{z_{t+1}}) / (N_{k·} + 1 + N_0)  for z_{t−1} = k = z_{t+1},
  (N_{z_{t−1},k} + N_0 κ_k)(N_{k,z_{t+1}} + N_0 κ_{z_{t+1}}) / (N_{k·} + 1 + N_0)      for z_{t−1} = k ≠ z_{t+1},
  N_0 κ_k κ_{z_{t+1}}                                                                  for k = K + 1.    (6.5)

MAP-DP for iHMMs
In our proposed iterative MAP approach, we sweep through the latent variables and at each iteration update them one at a time with their respective MAP values. Following the direct assignment construction presented in the previous section, the random variables to be updated are the global DP mixture weights κ and the state indicators z. The mode of the Dirichlet posterior is available in closed form for concentration parameter M_0 ≥ 1, so we can update κ using:

κ_k = (M_k − 1) / (Σ_{j=1}^K M_j + M_0 − (K + 1)) for k = 1, ..., K,   κ_{K+1} = (M_0 − 1) / (Σ_{j=1}^K M_j + M_0 − (K + 1)).

The counts m_{pk} can be computed by numerical optimization of the corresponding distribution over partitions provided in Antoniak (1974). However, the expression involves Stirling numbers of the first kind, which may be numerically challenging to compute when the method is applied to time series with a large number of observations. One approach that avoids this issue is to use Equation (6.2) to sequentially compute the corresponding counts.
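As a sketch, assuming the standard closed-form mode of a Dirichlet distribution applied to the posterior Dir(M_1, ..., M_K, M_0), the κ update can be written as (function name ours; assumes the denominator is positive):

```python
def dirichlet_mode(M, M0):
    """Mode of the Dirichlet posterior Dir(M_1, ..., M_K, M_0) over the
    global weights kappa; requires M0 >= 1 and each M_k >= 1."""
    K = len(M)
    denom = sum(M) + M0 - (K + 1)     # assumed positive
    kappa = [(Mk - 1) / denom for Mk in M]
    kappa.append((M0 - 1) / denom)    # mass reserved for an unseen state
    return kappa

kappa = dirichlet_mode([3, 2], M0=2.0)   # -> weights summing to one
```

Note that each M_k ≥ 1 holds automatically for represented states, since a represented state has been drawn from the global DP at least once.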
In order to update the state indicators, we sweep through each observation x_t and compute the negative log probability q_{t,k} for each existing state k and for a new state K + 1:

q_{t,k} = −log p(x_t | z_t = k, X_{−t}) − log p(z_t = k | z_{−t}, κ),   (6.6)

where again, and without loss of generality, we can omit the terms independent of k. For each observation in the time series, we compute the (K + 1)-dimensional vector q_t and select the state according to:

z_t = arg min_k q_{t,k}.   (6.7)

As in the MAP-DP case, the algorithm proceeds to the next point in the time series x_{t+1} by updating the state sufficient statistics to reflect the new value of the cluster assignment z_t and removing the effect of data point x_{t+1}. The scheme is still guaranteed to converge to a fixed-point solution, as no step increases the NLL. However, when the Polya urn scheme in Equation (6.2) is used to compute the global DP counts, its stochastic nature will result in minor fluctuations in the model NLL. The algorithm still settles into a local optimum, but minor fluctuations occur between iterations. As the NLL is influenced far more by the likelihood terms log p(x_t | z_t = k, X_{−t}) and the assignment probabilities, stopping the scheme at the local minimum is straightforward.

Jiang et al. (2012) propose an alternative inference algorithm for efficient estimation of iHMMs based on the SVA approach (Section 4). Mirroring the simplified inference assumptions explained in Section 4, Jiang et al. (2012) extend the SVA approach to HDP mixtures, and it can therefore readily be applied for inference in the HDP-HMM model. The resulting algorithm extends the objective function derived for the simpler DPMM with an additional term penalizing the number of transitions leading out of each state:

min_{z,μ} Σ_{t=1}^T ||x_t − μ_{z_t}||² + λ_1 Σ_p K_p^local + λ_2 K^global.

SVA for iHMMs
In the case of the iHMM, K_p^local denotes the number of states for which a transition from state p exists, and K^global denotes the total number of represented states in the finite time series. Then λ_2 penalizes the creation of new states, while λ_1 controls how likely new transitions are between existing states. As in the DPMM SVA algorithm (Section 4), standard model selection techniques cannot be applied to select the values of λ_1 and λ_2, and the reinforcement terms have been stripped away. Alternatively, by introducing some additional assumptions which imply an entirely ad-hoc coupling between the state indicator distribution and the data likelihood probabilities, Roychowdhury et al. (2013) derive an SVA algorithm that optimizes a modified objective function. Roychowdhury et al. (2013) also introduce a further algorithm, unrelated to Gibbs sampling, that attempts to optimize this objective function whilst also taking advantage of dynamic programming. The method is referred to as asymp-iHMM in our synthetic experiment below.

Synthetic study
We compare the performance of the Gibbs, SVA and MAP-DP algorithms for iHMMs on synthetically generated 3-dimensional data from a five-state HMM with Gaussian data models. The five Gaussian components are parameterized by component means [..., 1.8, 3.4, 4.8] for each dimension d = 1, 2, 3, each with shared spherical covariance σI_3 where σ = 0.9. At each time step the HMM has probability 0.96 of self-transition and equal probability of transitioning to any of the remaining four states. 4000 data points are generated and the performance of the algorithms is measured using the NMI between the true and estimated state assignments. The Gibbs sampler is evaluated at the state assignments that maximize the model likelihood. As shown in Table 4, MAP-iHMM performs similarly to the asymp-iHMM of Roychowdhury et al. (2013) whilst keeping the underlying probabilistic iHMM model intact and retaining a non-degenerate complete data likelihood.
Inferring states in this data is difficult due to overlapping observation likelihood models, and we find that the Gibbs sampler significantly outperforms both point estimation approaches, but at an even greater computational cost than for the simpler DPMM. This example illustrates the computational challenges of using MCMC inference for more complex hierarchical models. In Figure 6.1 we observe that the state sequence obtained by MAP and SVA is similar and underestimates the true number of states.
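For reference, the NMI score used throughout can be sketched as follows; this is one of the several normalisations surveyed by Vinh et al. (2010) (here, by the maximum of the two entropies), and may differ in detail from the variant used in our experiments:

```python
import math
from collections import Counter

def nmi(a, b):
    """Normalized mutual information between two labelings of the same
    points, normalised by the larger of the two label entropies."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    def entropy(counts):
        return -sum(c / n * math.log(c / n) for c in counts.values())
    # Mutual information from the joint and marginal empirical frequencies.
    mi = sum(c / n * math.log((c * n) / (pa[x] * pb[y]))
             for (x, y), c in pab.items())
    denom = max(entropy(pa), entropy(pb))
    return mi / denom if denom > 0 else 1.0
```

NMI is invariant to label permutations, which is what makes it suitable for comparing a clustering or state sequence to ground truth.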

MAP-DP for semiparametric mixed effects models
Hierarchical modeling is commonly used in the analysis of longitudinal health data. A particular model that is widely used in practice is the linear mixed effects model:

y_i = X_i β_i + ε_i,   ε_i ∼ N(0, τ_σ^{−1} I),   β_i ∼ P,   (7.1)

where y_i is the observation vector for individual i ∈ {1, ..., N}, ε_i is the subject-specific observation noise with τ_σ the within-subject precision, and P is the distribution of the random effects β_i (Dunson, 2010). X_i are the inputs for the random effects β_i, and the fixed effect regression parameters are equal to the mean of the distribution P. The distribution P is commonly specified to be Gaussian due to analytical tractability and computational simplicity. However, the assumption of normality is seldom justified, and the assumptions of symmetry and unimodality are often found to be inappropriate (Dunson, 2010).
Semiparametric mixed effects models have been proposed to relax the normality assumption by placing a DPMM prior on P (Kleinman and Ibrahim, 1998). However, inference for such models is usually performed using MCMC, requiring large computational resources and careful tuning of algorithmic parameters. This makes MCMC approaches particularly difficult to apply to large data sets. The increasing availability of large longitudinal data sets warrants the investigation of computationally efficient inference approaches such as MAP-DP. Here, in order to construct a semiparametric mixed effects model, we use an iterative MAP algorithm similar to MAP-DP above, with the only difference being that we do not integrate out the component parameters. We construct the model by placing a DPMM prior on β_i in Equation (7.1). As we are interested in the interpretation of the clusters, we do not collapse out the cluster parameters, and the update steps described for MAP-DP are slightly altered (as in non-collapsed Gibbs): the random effects β_i are substituted for the individual data points x_i, and an additional step updating the component means and precisions needs to be included. Two further steps are needed to update the random effects β_i and the within-subject precision τ_σ. The conditional p(β_i | τ_σ, z_i = k, μ_k, R_k) for the random effects β_i is:

β_i | τ_σ, z_i = k, μ_k, R_k ∼ N( (τ_σ X_i^T X_i + R_k)^{−1} (τ_σ X_i^T y_i + R_k μ_k), (τ_σ X_i^T X_i + R_k)^{−1} ),

where the conditioning is on the assigned cluster k with mean μ_k and precision R_k. We place a conjugate Gamma prior on the within-subject precision, τ_σ ∼ Gamma(a_{σ²}, b_{σ²}), allowing for the calculation of the conditional posterior:

τ_σ | B, y ∼ Gamma( a_{σ²} + (1/2) Σ_{i=1}^N n_i,   b_{σ²} + (1/2) Σ_{i=1}^N ||y_i − X_i β_i||² ),

where B is the collection of all random effects (β_i)_{i=1}^N and n_i is the length of y_i. The modes of both conditionals needed for MAP-DP are easily calculated, in addition to the negative log likelihood necessary to check for convergence.
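The β_i update is the standard conjugate Gaussian linear-model calculation; a minimal sketch (assuming numpy, with hypothetical function names) is:

```python
import numpy as np

def beta_conditional_mode(y_i, X_i, tau, mu_k, R_k):
    """Mode (= mean) of the Gaussian conditional for the random effects
    beta_i: the cluster prior N(mu_k, R_k^{-1}) combined with the
    likelihood y_i ~ N(X_i beta_i, tau^{-1} I)."""
    prec = tau * X_i.T @ X_i + R_k        # conditional precision
    rhs = tau * X_i.T @ y_i + R_k @ mu_k
    return np.linalg.solve(prec, rhs)

# Toy check: identity design, unit precisions, zero prior mean.
beta = beta_conditional_mode(np.array([2.0, 4.0]), np.eye(2),
                             1.0, np.zeros(2), np.eye(2))
```

With identity design and unit precisions the mode is simply the average of the data and the prior mean, which provides a quick sanity check of the update.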

English longitudinal survey of ageing
We apply the semiparametric mixed effects model above to the English longitudinal survey of ageing (ELSA), a large multi-purpose health study which follows individuals in the United Kingdom aged 50 years or older (Netuveli et al., 2006). Collected health-related factors include clinical, physical, financial and general well-being measures. Of primary interest is the effect of the different factors on quality of life (QoL), measured using a compound measure of several health and socio-economic indicators. The ELSA survey has been conducted in five waves spanning ten years. In this preliminary study we look at the responses of 6,805 individuals across all five waves.
We wish to check the hypothesis that measures of cognition such as memory and executive mental function, as estimated by verbal fluency, are useful predictors of QoL and whether they are more informative than standard measures of depression and activities of daily living (ADL) that have been found to be statistically significant predictors of QoL (Netuveli et al., 2006). We propose to answer these two questions via selection of two models with different sets of covariates. The first model includes depression and ADL as inputs whereas the second model includes measures of cognitive ability, specifically prospective memory and verbal fluency. The models are assessed using 5-fold cross-validation and computing the average held-out likelihood (Equation (3.4) in Section 3.2).
The model that includes ADL and depression as covariates achieves a significantly lower average held-out likelihood than the competing model containing cognitive measures, suggesting that ADL and depression are more informative predictors of QoL than the cognitive measures we considered (Table 5).

Table 5. Cross-validation average held-out likelihood for two models.

Depression + ADL: 0.364
Cognitive measures: 3.834

The average elapsed time for fitting the models using MAP inference is 11.05 seconds and 16.29 seconds, respectively. For comparison, we performed inference using a truncated DP random effects model with MCMC and 100,000 iterations to ensure convergence, on less than half of the data (3,000 individuals); the resulting time to convergence is in excess of five hours, making inference on larger data sets impractical. On the other hand, the rapid inference obtained using MAP-DP enables a wide array of diagnostic and validation methods to be exploited, which suggests the approach can be scaled up to very large datasets.

Discussion and future directions
We have presented a simple algorithm for inference in DPMMs based on nondegenerate MAP, and demonstrated its efficiency and accuracy by comparison to the ubiquitous Gibbs sampler, variational DP, and SVA algorithms. The attractiveness of SVA lies in the simplicity and scalability of the resulting algorithms, but as we have shown, it entails significant structural departures from the DPMM as well as removing from the modeler's arsenal standard tools of model comparison and selection. We believe our approach is highly relevant to applications since, unlike SVA, it retains the preferential attachment (rich-get-richer) property while needing two orders of magnitude fewer iterations than Gibbs. Unlike with SVA, the out-of-sample likelihood may be computed, allowing the use of standard model selection and model fit diagnostic procedures. Lastly, this non-degenerate MAP approach does not require the approximation inherent in the factorization assumptions of VB.
As with most MAP methods, MAP-DP can get trapped in local minima; however, standard heuristics such as multiple random restarts can be employed to mitigate this risk. This increases the total computational cost of the algorithm somewhat, but even with random restarts it would still be far more efficient than the Gibbs sampler.
Although not reported here due to space limitations, we point out that different implementations of the Gibbs sampler can lead to different MAP inference algorithms for DPMMs. For example, different MAP procedures can be derived from the different Gibbs samplers presented in Neal (2000b). In general, we have found these alternative algorithms to be less robust in practice, as they do not integrate over the uncertainty in the cluster parameters. However, when such assumptions are justified, our MAP approach can be readily applied to different constructions of the DPMM, for example to allow for non-conjugate choices of priors (extending Algorithm 7 of Neal (2000b)).
The Bregman divergence between any two vectors x and θ is defined as D_φ(x, θ) = φ(x) − φ(θ) − ⟨x − θ, ∇φ(θ)⟩, for a function φ : S → R that is differentiable and strictly convex on a closed convex set S ⊆ R^D. Bregman divergences can be efficiently used to provide a compact parameterization of exponential family distributions in terms of their expectation parameters. This generalizes the result that a group of points is summarized by its mean in Euclidean space to all spaces whose distortion measure is a Bregman divergence.
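A direct transcription of this definition (illustrative only, with the squared Euclidean norm as the generating function φ):

```python
def bregman(phi, grad_phi, x, theta):
    """Bregman divergence D_phi(x, theta) = phi(x) - phi(theta)
    - <x - theta, grad_phi(theta)> for strictly convex, differentiable phi."""
    inner = sum((xi - ti) * gi
                for xi, ti, gi in zip(x, theta, grad_phi(theta)))
    return phi(x) - phi(theta) - inner

# phi(x) = ||x||^2 recovers the squared Euclidean distance.
phi = lambda v: sum(vi * vi for vi in v)
grad = lambda v: [2.0 * vi for vi in v]
```

With φ(x) = ||x||², D_φ(x, θ) = ||x − θ||², which is the sense in which DP-means implicitly restricts the DPMM to spherical Gaussian components.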

Appendix D: DP-means λ parameter binary search
In our experiments with the DP-means algorithm, it is necessary to have an automatic way of obtaining the parameter λ for synthetic experiments where we wish to obtain a specific number of clusters K_target. We set λ in a sequence of binary search steps:

1. Initialisation: Set λ to the mid-point of the range L_1 = 0, U_1 = M², where L_1, U_1 are respectively the lower and upper bounds of the range for the first iteration. M² is the maximal squared Euclidean distance, M² = Σ_{d=1}^D (max(x_d) − min(x_d))², where max(x_d), min(x_d) are respectively the upper and lower bounds of the data for dimension d. (The use of the maximal squared Euclidean distance originates in the DP-means algorithm step which creates a new cluster when d_{i,k} > λ, where d_{i,k} is the squared Euclidean distance of data point i to the mean of cluster k.)
2. For iteration i = 1, 2, ...:
   (a) Run the DP-means algorithm with λ = (U_i + L_i)/2, which returns K_obtained.
   (b) If K_obtained > K_target, there are too many clusters, so we increase λ: update the lower bound L_{i+1} = λ and leave the upper bound unchanged, U_{i+1} = U_i.
   (c) If K_obtained < K_target, there are too few clusters, so we decrease λ: update the upper bound U_{i+1} = λ and leave the lower bound unchanged, L_{i+1} = L_i.
   (d) Stop when K_obtained = K_target.
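The binary search above can be sketched as follows; run_dp_means is a stand-in for a full DP-means run that reports K_obtained, and the loop mirrors steps (a)-(d):

```python
def find_lambda(run_dp_means, K_target, M2, max_iter=50):
    """Binary search for the DP-means penalty lambda yielding K_target
    clusters. run_dp_means(lam) is assumed to return the number of
    clusters K_obtained found with penalty lam; M2 is the maximal
    squared Euclidean distance of the data."""
    lo, hi = 0.0, M2
    lam = 0.5 * (lo + hi)
    for _ in range(max_iter):
        lam = 0.5 * (lo + hi)
        K = run_dp_means(lam)
        if K == K_target:
            break
        if K > K_target:
            lo = lam      # too many clusters: increase lambda
        else:
            hi = lam      # too few clusters: decrease lambda
    return lam

# Stand-in for DP-means: cluster count decreases as lambda grows.
mock = lambda lam: max(1, int(10 - lam))
```

The search terminates because K_obtained is monotonically non-increasing in λ, so the bracket [L_i, U_i] always contains a value attaining K_target when one exists.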