Model-based clustering with envelopes
W. Wang, X. Zhang and Q. Mai

: Cluster analysis is an important unsupervised learning technique in multivariate statistics and machine learning. In this paper, we propose a set of new mixture models called CLEMM (short for Clustering with Envelope Mixture Models) that is based on the widely used Gaussian mixture model assumptions and the nascent research area of envelope methodology. Formulated mostly for regression models, envelope methodology aims for simultaneous dimension reduction and efficient parameter estimation, and includes a very recent formulation of the envelope discriminant subspace for classification and discriminant analysis. Motivated by the envelope discriminant subspace pursuit in classification, we consider parsimonious probabilistic mixture models where the cluster analysis can be improved by projecting the data onto a latent lower-dimensional subspace. The proposed CLEMM framework and the associated envelope-EM algorithms thus provide foundations for envelope methods in unsupervised and semi-supervised learning problems. Numerical studies on simulated data and two benchmark data sets show significant improvement of our proposed methods over classical methods such as Gaussian mixture models, K-means, and hierarchical clustering algorithms. An R package is available at https://github.com/kusakehan/CLEMM .


Introduction
Cluster analysis (or clustering) is a cornerstone of multivariate statistics and unsupervised learning. The goal of clustering is to divide the observed samples into groups or clusters according to their similarity/dissimilarity (see, for example, Hartigan [22], Jain et al. [25], Kaufman and Rousseeuw [28], Chi and Lange [8] for background on clustering). Among various clustering approaches, two of the most widely used algorithms are K-means clustering [33] and hierarchical clustering [26]. Both methods and their variations are iterative algorithms with different convergence criteria (e.g. minimized dissimilarity within clusters) and starting points (e.g. start the algorithm by assigning all observations to one cluster or assigning each observation to its own cluster). On the other hand, to facilitate statistical interpretation and inference, cluster analysis can also be based upon probabilistic models [e.g. 4,20,34]. By assuming mixture distributions, clusters can be determined based on the maximum likelihood estimators in the model. See Lindsay [32] and Yao and Lindsay [44] for general background on mixture models. In this paper, we focus on the Gaussian mixture models (GMM) because of their popularity and effectiveness in approximating (non-Gaussian) multimodal distributions (see [48,7] for example).
Clustering algorithms suffer from the curse of dimensionality, as distance measures and parameter estimation become more challenging with increased dimensions [38]. Dimension reduction of the data thus may often improve clustering accuracy and also provide informative visualization. However, due to the unsupervised nature of clustering, many supervised dimension reduction methods for classification [e.g., 13,40,37,47] or for regression [see 30, for an overview] are not directly applicable to clustering. As a widely used unsupervised dimension reduction technique, principal component analysis (PCA) is often used as a pre-processing step in clustering, especially when variables are highly correlated. For example, Ding and He [18] explain the connections between PCA and K-means clustering by showing that principal components are the continuous solutions to the cluster membership indicators in K-means clustering. In model-based clustering, there have been several proposals that share the same spirit of modeling clusters by constrained estimation in a lower-dimensional subspace. In particular, our parsimonious modeling approach is conceptually similar to the factor analysis approaches in Rubin and Thayer [35] and Baek et al. [3], but notably differs in several ways. First, nearly all the existing latent-subspace or factor analysis methods are built upon the idea that the covariance matrix in each cluster has a "low-rank plus diagonal" structure, where the low-rank part is driven by a low-dimensional latent variable. In contrast, our method is built on the envelope principle, which is more flexible and general. It assumes that there exists a subspace such that observations projected onto this subspace share a common structure that is invariant as the underlying cluster varies. In other words, data projected onto this subspace contain no information about the cluster differences and are thus immaterial to clustering.
Therefore, clustering becomes more efficient if we eliminate these extraneous variations. Second, the subspace learning in our method is completely data-driven and integrated into the likelihood framework and the EM algorithms. The targeted dimension reduction subspace in our approach, i.e. the envelope, always exists and is a natural object for inference and estimation in dimension reduction for clustering. Finally, unlike PCA and factor analysis, the envelope method is more adaptive and direct. The components useful for clustering are not necessarily the leading components identified by PCA and factor analysis. In the presence of highly correlated variables, it is likely that some components with large variability are actually not useful for clustering. The envelope, on the other hand, is a more targeted dimension reduction subspace whose goal is to improve efficiency in Gaussian mixture model parameter estimation and thus to obtain better clustering results.
Our proposed Clustering with Envelope Mixture Models (CLEMM) framework advances the recent development of envelope methodology, which was first proposed in the context of multivariate linear regression by Cook et al. [12] and then further developed in a series of regression problems [e.g., 39,9,45,19,29] and general multivariate parameter estimation problems [14]. See [11] for an overview on envelopes and [10] for more detailed background. Whilst all existing envelope methods concentrate on supervised learning, particularly regression problems, our work differs markedly by tackling an unsupervised learning problem. Such an extension is far from trivial, and new techniques are required throughout its development. Moreover, our development of envelope-EM algorithms enriches the envelope computational techniques [15,16]. The CLEMM approach, which is essentially a subspace-regularized clustering, also complements the sparse penalized solutions in the literature [e.g., 42,6].
The rest of the article is structured as follows. In Section 2 we formally introduce the definition of CLEMM and illustrate the working mechanism of the envelope in clustering. We also connect with the recent study on envelope discriminant analysis [47]. In Section 3, we derive the maximum likelihood estimators and develop the envelope-EM algorithms for CLEMM. In Section 4, we explore an important special case of CLEMM that further assumes a shared covariance structure across clusters. Under this shared covariance assumption, the envelope-EM algorithm can be even faster than the standard EM estimation in Gaussian mixture models, which does not require subspace estimation. Model selection is discussed in Section 5. Numerical analyses, including simulations and two benchmark data sets, are given in Section 6. Finally, Section 7 contains a summary and a short discussion of some future research directions. Proofs and technical details are given in the Appendix.

Notation and Definitions
We first introduce the following notation and definitions to be used in this paper. For a matrix $B \in \mathbb{R}^{p\times q}$, the subspace of $\mathbb{R}^p$ spanned by the columns of $B$ is denoted by $\mathcal{B} = \mathrm{span}(B)$. When $B^T B$ is positive definite, we use $P_{\mathcal{B}} = P_B = B(B^T B)^{-1}B^T$ to denote the orthogonal projection onto the subspace $\mathcal{B} = \mathrm{span}(B)$. The orthogonal complement $\mathcal{B}^\perp$ of $\mathcal{B}$ is constructed with respect to the usual inner product, so that $\mathcal{B} \oplus \mathcal{B}^\perp = \mathbb{R}^p$ and $\mathcal{B} \cap \mathcal{B}^\perp = \{0\}$. The projection onto $\mathcal{B}^\perp$ is then written as $Q_{\mathcal{B}} = Q_B = I_p - P_B$.
We will use the following definitions of a reducing subspace and an envelope. The definitions are equivalent to those given by Cook et al. [12] and contain constructive properties of reducing subspaces and envelopes.

Definition 1 (Cook et al. [12]). A subspace $\mathcal{R} \subseteq \mathbb{R}^p$ is a reducing subspace of a symmetric matrix $M \in \mathbb{R}^{p\times p}$ if $M\mathcal{R} \subseteq \mathcal{R}$, or equivalently if $M = P_{\mathcal{R}} M P_{\mathcal{R}} + Q_{\mathcal{R}} M Q_{\mathcal{R}}$. The $M$-envelope of $\mathcal{B} = \mathrm{span}(B)$, denoted $\mathcal{E}_M(\mathcal{B})$, is the intersection of all reducing subspaces of $M$ that contain $\mathcal{B}$.

In the following sections, we provide an intuitive construction of the envelope in clustering, where the matrix $M$ in the above definition is replaced by the covariance of the predictor $X \in \mathbb{R}^p$, and the subspace $\mathcal{B}$ is the subspace that captures the location and shape changes across clusters.

CLEMM: Clustering with Envelope Mixture Models
In a multivariate Gaussian mixture model (GMM), the observed data $X_i \in \mathbb{R}^p$, $i = 1, \dots, n$, are assumed to be i.i.d. following the finite Gaussian mixture distribution
$$X_i \sim \sum_{k=1}^{K} \pi_k \, N(\mu_k, \Sigma_k), \qquad (2.1)$$
where $\pi_k \in (0,1)$ with $\sum_{k=1}^{K} \pi_k = 1$ are the mixing weights, and $N(\mu_k, \Sigma_k)$ denotes the multivariate normal distribution with mean $\mu_k \in \mathbb{R}^p$ and positive definite covariance matrix $\Sigma_k$. The key to model-based clustering with GMM is to estimate the parameters $\theta \equiv (\pi_1, \dots, \pi_K, \mu_1, \dots, \mu_K, \Sigma_1, \dots, \Sigma_K)$. The expectation-maximization (EM) algorithm [17] is a popular and standard approach for estimating these parameters. Specifically, the maximum likelihood estimator (MLE) for $\theta$ is obtained by iterative updates within the EM algorithm. We discuss the EM algorithm and the estimation procedure further in Section 3.
Motivated by envelope modeling techniques in regression and classification, we assume that there exists a low-dimensional subspace that fully captures the variation of the data across all clusters. Let $(\Gamma, \Gamma_0) \in \mathbb{R}^{p\times p}$ be an orthogonal matrix, where $\Gamma \in \mathbb{R}^{p\times u}$, $u \leq p$, is a semi-orthogonal basis for the subspace of interest. In particular, we refer to $X_M = \Gamma^T X \in \mathbb{R}^u$ as the material part of $X$ — the part that contains all the information about clusters; and we refer to $X_{IM} = \Gamma_0^T X \in \mathbb{R}^{p-u}$ as the immaterial part of $X$ — the part that is homogeneous and does not vary across clusters. Without loss of generality, we assume $E(X) = 0$ and propose the CLEMM as follows,
$$\Gamma^T X \sim \sum_{k=1}^{K} \pi_k \, N(\alpha_k, \Omega_k), \qquad \Gamma_0^T X \sim N(0, \Omega_0), \qquad \Gamma^T X \perp\!\!\!\perp \Gamma_0^T X, \qquad (2.2)$$
where $\alpha_k \in \mathbb{R}^u$, $\pi_k \in (0,1)$ is as defined previously, and $\Omega_k \in \mathbb{R}^{u\times u}$ and $\Omega_0 \in \mathbb{R}^{(p-u)\times(p-u)}$ are symmetric positive definite matrices. The above model assumes that $X_M$, which follows the GMM with parameters $\pi_k$, $\alpha_k$, and $\Omega_k$, $k = 1, \dots, K$, is multimodal and heterogeneous. In contrast, $X_{IM}$ is unimodal and follows a multivariate normal distribution. In other words, the distribution of the material part $X_M$ changes in both mean and covariance across clusters, while the immaterial part $X_{IM}$ does not vary. Furthermore, the last statement in (2.2) implies that the material part $X_M$ and the immaterial part $X_{IM}$ are independent of each other. This ensures that the immaterial part is not associated with the material part and can be eliminated completely.
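The two-part model above can be simulated directly: draw the material part from a Gaussian mixture, the immaterial part from a single Gaussian, and recombine them through the orthogonal basis. A minimal Python sketch (all parameter values below are hypothetical illustrations, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
p, u, K, n = 5, 2, 3, 500

# Random orthogonal basis: the first u columns span the material subspace.
Q, _ = np.linalg.qr(rng.standard_normal((p, p)))
Gamma, Gamma0 = Q[:, :u], Q[:, u:]

pi = np.array([0.3, 0.2, 0.5])                     # mixing weights
alpha = rng.standard_normal((K, u))                # material cluster centers
Omega = [np.eye(u) * (k + 1.0) for k in range(K)]  # material covariances
Omega0 = np.eye(p - u)                             # shared immaterial covariance

y = rng.choice(K, size=n, p=pi)                    # latent cluster labels
XM = np.stack([rng.multivariate_normal(alpha[k], Omega[k]) for k in y])
XIM = rng.multivariate_normal(np.zeros(p - u), Omega0, size=n)
X = XM @ Gamma.T + XIM @ Gamma0.T                  # X = Gamma X_M + Gamma0 X_IM
```

Only the material coordinates `XM` carry the cluster labels; the immaterial coordinates are drawn once from a single normal for all observations.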
To better understand the connections between CLEMM and GMM, we note that the CLEMM in (2.2) is equivalent (the proof is given in the Appendix) to the following parsimonious parameterization in the original GMM setting,
$$\mu_k = \Gamma \alpha_k, \qquad \Sigma_k = \Gamma \Omega_k \Gamma^T + \Gamma_0 \Omega_0 \Gamma_0^T, \qquad k = 1, \dots, K. \qquad (2.3)$$
From (2.3), it can be seen that the centers of the clusters lie within the low-dimensional subspace $\mathrm{span}(\Gamma)$, which is a reducing subspace of each $\Sigma_k$. As a direct consequence, the marginal covariance of $X$ can also be written as
$$\Sigma_x = \Gamma \Omega_x \Gamma^T + \Gamma_0 \Omega_0 \Gamma_0^T,$$
where $\Omega_x$ is the marginal covariance of $X_M = \Gamma^T X$. Therefore, the subspace $\mathrm{span}(\Gamma)$ is not only a reducing subspace of the intracluster covariances $\Sigma_k$ but also reduces the marginal covariance $\Sigma_x$. These observations will help us construct the CLEMM-Shared estimation in Section 4. The parameterization in (2.3) also links the two-part (material and immaterial) model in (2.2) with the original GMM in (2.1), and helps derive the EM algorithms for CLEMM in Section 3. Similar to the envelope discriminant subspace in Zhang and Mai [47], the smallest such subspace $\mathrm{span}(\Gamma)$ is uniquely defined and always exists. We establish properties of the smallest such $\mathrm{span}(\Gamma)$ in the following section.
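The parameterization can be checked numerically: with $\mu_k = \Gamma\alpha_k$ and $\Sigma_k = \Gamma\Omega_k\Gamma^T + \Gamma_0\Omega_0\Gamma_0^T$, the centers lie in $\mathrm{span}(\Gamma)$ and $\mathrm{span}(\Gamma)$ reduces every $\Sigma_k$, i.e. $\Sigma_k$ commutes with the projection onto $\mathrm{span}(\Gamma)$. A small sketch with randomly generated parameters:

```python
import numpy as np

rng = np.random.default_rng(1)
p, u, K = 6, 2, 3
Q, _ = np.linalg.qr(rng.standard_normal((p, p)))
Gamma, Gamma0 = Q[:, :u], Q[:, u:]

alpha = rng.standard_normal((K, u))
Omega = [A @ A.T + np.eye(u) for A in rng.standard_normal((K, u, u))]
A0 = rng.standard_normal((p - u, p - u))
Omega0 = A0 @ A0.T + np.eye(p - u)

mu = [Gamma @ a for a in alpha]                    # centers lie in span(Gamma)
Sigma = [Gamma @ Om @ Gamma.T + Gamma0 @ Omega0 @ Gamma0.T for Om in Omega]

P = Gamma @ Gamma.T                                # projection onto span(Gamma)
for m, S in zip(mu, Sigma):
    assert np.allclose(P @ m, m)                   # mu_k is in span(Gamma)
    assert np.allclose(S @ P, P @ S)               # span(Gamma) reduces Sigma_k
```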

Envelope in clustering: a latent variable interpretation
In this section, we recast the smallest subspace $\mathrm{span}(\Gamma)$ that satisfies (2.3) as an envelope (cf. Definition 1). First, we introduce the latent variable $Y \in \{1, \dots, K\}$ as the cluster indicator; the GMM (2.1) can then be expressed as
$$X \mid (Y = k) \sim N(\mu_k, \Sigma_k), \qquad \Pr(Y = k) = \pi_k, \qquad k = 1, \dots, K, \qquad (2.4)$$
where $Y$ is latent and unobservable in clustering. When $Y$ consists of observed class labels, (2.4) is commonly known as the quadratic discriminant analysis (QDA) model; if we further assume a shared covariance structure across classes, $\Sigma_1 = \cdots = \Sigma_K$, then (2.4) becomes the linear discriminant analysis (LDA) model. In classification, the ultimate goal is to obtain the Bayes' rule $\phi(X) = \arg\max_{k=1,\dots,K} \Pr(Y = k \mid X)$, which achieves the lowest possible error rate of any classifier (i.e. the Bayes error rate). Zhang and Mai [47] introduced the envelope discriminant subspace as the smallest subspace that is a reducing subspace of $\Sigma_x$ and retains the Bayes' rule when the data are projected onto it. With observable $Y$ in (2.4), the envelope discriminant subspace leads to the same parameterization for $\mu_k$ and $\Sigma_k$ as CLEMM in (2.3), where $\mathrm{span}(\Gamma)$ is the envelope discriminant subspace. This connection leads to the following properties of CLEMM, which are straightforward derivations from Proposition 3 in Zhang and Mai [47] (and hence we omit the proof). Let $\mathcal{L} = \mathrm{span}(\mu_2 - \mu_1, \dots, \mu_K - \mu_1)$ and let $\mathcal{Q}$ be the subspace spanned by the differences among the cluster covariances $\Sigma_1, \dots, \Sigma_K$. Then $\mathcal{L}$ contains the location changes across clusters and $\mathcal{Q}$ contains the spectral changes in the cluster-specific covariance matrices. The smallest subspace $\mathrm{span}(\Gamma)$ defined in CLEMM (2.2), or equivalently (2.3), is the envelope $\mathcal{E}_{\Sigma_x}(\mathcal{L} + \mathcal{Q})$. The subspace $\mathcal{S} = \mathrm{span}(\Gamma)$ in CLEMM (2.3) can be defined equivalently by the coordinate-free statements
$$Q_{\mathcal{S}} X \perp\!\!\!\perp Y \qquad \text{and} \qquad Q_{\mathcal{S}} X \perp\!\!\!\perp P_{\mathcal{S}} X. \qquad (2.5)$$
The intersection of any two subspaces that satisfy (2.5) is a subspace that still satisfies (2.5).
Therefore, the intersection of all such subspaces is uniquely defined, has minimal dimension, and satisfies (2.5). It is in fact the $\Sigma_x$-envelope of $\mathcal{L} + \mathcal{Q}$, denoted $\mathcal{E}_{\Sigma_x}(\mathcal{L} + \mathcal{Q})$. With a slight abuse of notation, we henceforth use $\Gamma \in \mathbb{R}^{p\times u}$ as a semi-orthogonal basis for the envelope $\mathcal{E}_{\Sigma_x}(\mathcal{L} + \mathcal{Q})$, which has dimension $u$.

Working mechanism of CLEMM
Clearly, CLEMM can help reduce the total number of free parameters. For GMM, there are $(K-1) + Kp + Kp(p+1)/2$ free parameters, while the CLEMM parameterization (2.3) requires only $(K-1) + Ku + Ku(u+1)/2 + (p-u)(p-u+1)/2 + u(p-u)$, where $u(p-u)$ counts the dimension of the Grassmann manifold on which $\mathrm{span}(\Gamma)$ lives. By reducing the number of free parameters and thus the model complexity, the CLEMM parameterization leads to potential efficiency gains in parameter estimation with EM algorithms. To provide more intuition about the working mechanism of CLEMM and its potential advantages over the classical GMM, we next consider some visualizations and an illustrative simulation example. Figure 1a is a schematic plot of the envelope on bivariate tri-cluster data.
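The free-parameter comparison can be made concrete. The helper below assumes the standard GMM count $(K-1) + Kp + Kp(p+1)/2$ and a CLEMM count that includes the Grassmann dimension $u(p-u)$ for $\mathrm{span}(\Gamma)$ (an assumption consistent with the order statements in Section 4):

```python
def gmm_df(K, p):
    # (K - 1) mixing weights + K mean vectors + K covariance matrices
    return (K - 1) + K * p + K * p * (p + 1) // 2

def clemm_df(K, p, u):
    # (K - 1) weights + K material means + K material covariances
    # + one immaterial covariance + Grassmann dimension u(p - u) of span(Gamma)
    return ((K - 1) + K * u + K * u * (u + 1) // 2
            + (p - u) * (p - u + 1) // 2 + u * (p - u))

# The simulation setting of Section 6: p = 15, u = 1, K = 3.
print(gmm_df(3, 15), clemm_df(3, 15, 1))  # 407 vs 127
```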
From this plot, we see clearly that the centers of the clusters vary along the envelope direction, and that the heteroscedasticity is also captured by this direction. On the other hand, if we project the data onto Γ0, the three clusters collapse into one. By eliminating the immaterial variation $\Gamma_0^T X$, or equivalently, by projecting the data onto Γ, we expect a substantial improvement in distinguishing the three clusters.
To further verify the actual efficiency gain by CLEMM, we consider a simulation model (M1) in our numerical studies (see Section 6 for more details), where $X \sim \sum_{k=1}^{3} \pi_k N(\mu_k, \Sigma_k)$ has $p = 15$ variables and three clusters of relative sizes $(\pi_1, \pi_2, \pi_3) = (0.3, 0.2, 0.5)$. The envelope has dimension $u = 1$, and each $\Sigma_k$ has a relatively complicated form such that the predictors are all highly correlated with each other. Figure 1b plots the simulated data after projection onto a two-dimensional plane consisting of the true envelope direction and an arbitrary direction from the orthogonal complement of the envelope. Clearly, we see that the distinctions among the three clusters lie within the envelope (horizontal axis), while the distributions are the same along the immaterial direction (vertical axis). Figures 1c and 1d show the actual estimation results from the classical GMM and our proposed CLEMM. The parameter estimates of $\mu_k$ and $\Sigma_k$ are reflected by the three ellipses in each plot. Compared to the true distribution (ellipses) in Figure 1b, CLEMM clearly improves the parameter estimation substantially.
Not surprisingly, if we compare the cluster labels by GMM and by CLEMM, the mis-clustering error rate is also reduced drastically by CLEMM.

Connections with factor analyzers approaches
Baek et al. [3] proposed the Mixture of Common Factor Analyzers (MCFA) model, a popular approach among factor analysis-type clustering methods, which is thus included as a competitor in our numerical studies (Section 6). Using our notation, the MCFA model can be summarized as
$$\mu_k = \Gamma \alpha_k, \qquad \Sigma_k = \Gamma \Phi_k \Gamma^T + \sigma^2 I_p, \qquad k = 1, \dots, K. \qquad (2.6)$$
Therefore, this model can be viewed as a special case of our CLEMM model (2.3): if we let $\Omega_k = \Phi_k + \sigma^2 I_u$ and $\Omega_0 = \sigma^2 I_{p-u}$, then (2.3) reduces to (2.6). The MCFA model assumes that the common covariance is isotropic and also restricts the shared subspace to contain the leading eigenvectors, since $\Omega_k = \Phi_k + \sigma^2 I_u$ now has larger eigenvalues than $\Omega_0 = \sigma^2 I_{p-u}$. The CLEMM approach is therefore more flexible than the factor analyzer approaches. The flexibility of CLEMM, however, leads to a more involved EM algorithm, which we carefully derive in the following section.
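The reduction from (2.3) to (2.6) rests on the identity $\Gamma\Gamma^T + \Gamma_0\Gamma_0^T = I_p$, which can be verified numerically:

```python
import numpy as np

rng = np.random.default_rng(2)
p, u = 6, 2
Q, _ = np.linalg.qr(rng.standard_normal((p, p)))
Gamma, Gamma0 = Q[:, :u], Q[:, u:]

A = rng.standard_normal((u, u))
Phi = A @ A.T                  # factor covariance of the material part
s2 = 0.5                       # isotropic noise variance

mcfa = Gamma @ Phi @ Gamma.T + s2 * np.eye(p)              # MCFA covariance
clemm = (Gamma @ (Phi + s2 * np.eye(u)) @ Gamma.T          # CLEMM with
         + Gamma0 @ (s2 * np.eye(p - u)) @ Gamma0.T)       # Omega_0 = s2 * I
assert np.allclose(mcfa, clemm)   # holds since Gamma Gamma' + Gamma0 Gamma0' = I
```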

A brief review of the EM algorithm for GMM
Dempster et al. [17] introduced the EM algorithm, which later became the most popular technique for fitting GMM. In this section, we first give a brief review of the EM algorithm for GMM. To make our envelope-EM algorithm easier to comprehend, we present it in a way that parallels the classical EM algorithm for fitting GMM.
By introducing a latent variable $Y$, the GMM log-likelihood can be written as
$$\ell(\theta) = \sum_{i=1}^{n} \log\Big\{\sum_{k=1}^{K} \pi_k f(X_i; \mu_k, \Sigma_k)\Big\},$$
where $f(\cdot\,; \mu_k, \Sigma_k)$ is the density function of $N(\mu_k, \Sigma_k)$ and $\theta = (\pi_1, \dots, \pi_K, \mu_1, \dots, \mu_K, \Sigma_1, \dots, \Sigma_K)$ is the set of all parameters. Directly maximizing this log-likelihood is difficult, and the EM algorithm instead iteratively updates the estimator by treating the $Y_i$ as missing data. The complete-data log-likelihood is
$$\ell_c(\theta) = \sum_{i=1}^{n} \sum_{k=1}^{K} y_{ik}\big\{\log \pi_k + \log f(X_i; \mu_k, \Sigma_k)\big\},$$
where $y_{ik} = 1$ if the $i$-th observation belongs to the $k$-th cluster and $0$ otherwise. The EM algorithm then estimates $\theta$ by iteratively maximizing the conditional expectation of the complete-data log-likelihood given the observed data. The EM algorithm for GMM is summarized as follows. Initialization: Choose an initial value $\theta^{(0)}$ and set the iteration number $m = 0$. We can simply use the clustering results from K-means or hierarchical clustering as starting values. See [27] for more discussion on the choice of initial values for GMM.
Iterate over the E-step and the M-step below to generate a sequence of estimators $\theta^{(1)}, \theta^{(2)}, \dots$. E-step: Compute the expectation $Q(\theta \mid \theta^{(m)}) = E\{\ell_c(\theta) \mid \theta^{(m)}, X_i, i = 1, \dots, n\}$ for $m = 1, 2, \dots$, which is equivalent to
$$Q(\theta \mid \theta^{(m)}) \doteq \sum_{i=1}^{n} \sum_{k=1}^{K} \eta_{ik}^{(m)}\big\{\log \pi_k + \log f(X_i; \mu_k, \Sigma_k)\big\},$$
where the symbol "$\doteq$" means equal up to an additive constant and the membership weight for each data point satisfies $\sum_{k=1}^{K} \eta_{ik}^{(m)} = 1$. Specifically, we have
$$\eta_{ik}^{(m)} = \frac{\pi_k^{(m)} f(X_i; \mu_k^{(m)}, \Sigma_k^{(m)})}{\sum_{l=1}^{K} \pi_l^{(m)} f(X_i; \mu_l^{(m)}, \Sigma_l^{(m)})}. \qquad (3.1)$$
M-step: Maximize $Q(\theta \mid \theta^{(m)})$ over $\theta$ to obtain $\theta^{(m+1)}$. It is well established in the statistical literature that the EM algorithm is guaranteed to converge monotonically to a local maximum of the log-likelihood under mild conditions [17,43,5].
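The E- and M-steps above translate into a short routine. Below is a minimal numpy-only sketch of EM for GMM (not the paper's R implementation); the membership weights are computed in log-space for numerical stability:

```python
import numpy as np

def mvn_logpdf(X, mu, Sigma):
    """Row-wise log-density of N(mu, Sigma) via a Cholesky factorization."""
    p = len(mu)
    L = np.linalg.cholesky(Sigma)
    z = np.linalg.solve(L, (X - mu).T)
    return -0.5 * (p * np.log(2 * np.pi)
                   + 2 * np.log(np.diag(L)).sum() + (z ** 2).sum(axis=0))

def em_gmm(X, K, n_iter=50, seed=0, ridge=1e-6):
    """Plain EM for a K-component Gaussian mixture (a minimal sketch)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    pi = np.full(K, 1.0 / K)
    mu = X[rng.choice(n, K, replace=False)].copy()   # random initial centers
    Sigma = np.stack([np.cov(X.T) + ridge * np.eye(p)] * K)
    for _ in range(n_iter):
        # E-step: membership weights eta_ik of (3.1), computed in log-space
        logw = np.column_stack([np.log(pi[k]) + mvn_logpdf(X, mu[k], Sigma[k])
                                for k in range(K)])
        logw -= logw.max(axis=1, keepdims=True)
        eta = np.exp(logw)
        eta /= eta.sum(axis=1, keepdims=True)
        # M-step: weighted updates of pi, mu, Sigma
        nk = eta.sum(axis=0)
        pi = nk / n
        mu = (eta.T @ X) / nk[:, None]
        for k in range(K):
            R = X - mu[k]
            Sigma[k] = (eta[:, k, None] * R).T @ R / nk[k] + ridge * np.eye(p)
    return pi, mu, Sigma, eta
```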

Envelope-EM algorithm for CLEMM
In this section, we develop the envelope-EM algorithm for estimating the CLEMM parameters. The estimation problem in CLEMM is considerably more complicated than fitting GMM, or than envelope estimation in regression or classification. First, we introduce the latent variable $Y$ into the CLEMM assumptions (2.2):
$$\Gamma^T X \mid (Y = k) \sim N(\alpha_k, \Omega_k), \qquad \Gamma_0^T X \sim N(0, \Omega_0), \qquad \Pr(Y = k) = \pi_k.$$
If $\Gamma$ were known, the EM algorithm for CLEMM would be a straightforward extension of the EM algorithm for GMM. Unfortunately, $\Gamma$ is unknown, and its estimation involves minimizing a non-convex objective function over a Grassmann manifold. Therefore, we have to efficiently integrate the optimization for $\Gamma$ into the EM algorithm.
For our envelope-EM algorithm, and in all of our numerical studies, we use the same initialization as for GMM. The E-step of envelope-EM is the same as the usual E-step for GMM, except that the GMM estimator $\theta^{(m)}$ is replaced with the CLEMM estimator $\theta^{(m)} = \theta(\widehat\phi^{(m)})$, where $\phi$ collects the CLEMM parameters $(\Gamma, \pi_k, \alpha_k, \Omega_k, \Omega_0)$. The key step is the maximization of $Q(\theta(\phi) \mid \theta^{(m)})$, which is now an over-parameterized function under the CLEMM parameterization (2.3), with $\eta_{ik}^{(m)}$ again the membership weight for each point. We carefully derived the maximizer of this Q-function under the CLEMM constraints; the results are summarized in the following proposition. Define the intermediate estimators for the envelope-EM algorithm,
$$\pi_k^{(m+1)} = \frac{1}{n}\sum_{i=1}^{n}\eta_{ik}^{(m)}, \qquad \mu_k^{(m)} = \frac{\sum_{i=1}^{n}\eta_{ik}^{(m)} X_i}{\sum_{i=1}^{n}\eta_{ik}^{(m)}}, \qquad S_k^{(m)} = \frac{\sum_{i=1}^{n}\eta_{ik}^{(m)}(X_i - \mu_k^{(m)})(X_i - \mu_k^{(m)})^T}{\sum_{i=1}^{n}\eta_{ik}^{(m)}}.$$
Then $\widehat\Gamma \in \mathbb{R}^{p\times u}$ is the minimizer, under the semi-orthogonal constraint $\Gamma^T\Gamma = I_u$, of the objective function
$$G^{(m)}(\Gamma) = \sum_{k=1}^{K} \pi_k^{(m+1)} \log\big|\Gamma^T S_k^{(m)} \Gamma\big| + \log\big|\Gamma^T (S_x^{(m)})^{-1} \Gamma\big|, \qquad (3.7)$$
where $S_x^{(m)}$ denotes the marginal sample covariance of the data. The objective function (3.7) for estimating the envelope is very similar to that of the envelope QDA model [47]. In discriminant analysis, where the $Y_i$'s are fully observed, the calculation simplifies because the membership weights reduce to the observed class indicators. Based on the results in Proposition 1, we can now summarize the envelope-EM algorithm in Algorithm 1. In each iteration of the M-step, the CLEMM estimators $\widehat\mu_k$ and $\widehat\Sigma_k$ are coordinate-free, since they depend on $\widehat\Gamma$ only through $\mathrm{span}(\widehat\Gamma)$ (i.e. only through the projection matrices $\widehat\Gamma\widehat\Gamma^T$ and $\widehat\Gamma_0\widehat\Gamma_0^T$). Therefore, the optimization of $\Gamma$ in the CLEMM estimation is defined on the set of all $u$-dimensional linear subspaces of $\mathbb{R}^p$, known as the Grassmann manifold and denoted $\mathcal{G}_{p,u}$. We discuss this optimization in the next section.
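An objective of the form $G(\Gamma) = \sum_k \pi_k \log|\Gamma^T S_k \Gamma| + \log|\Gamma^T S_x^{-1}\Gamma|$ (an assumed form, patterned after the envelope QDA objective) is easy to evaluate, and it exhibits the coordinate-free property noted above: its value is unchanged by any orthogonal rotation of the basis $\Gamma$. A sketch:

```python
import numpy as np

def envelope_objective(Gamma, S_list, pi_hat, Sx):
    """G(Gamma) = sum_k pi_k log|Gamma' S_k Gamma| + log|Gamma' Sx^{-1} Gamma|."""
    val = sum(w * np.linalg.slogdet(Gamma.T @ S @ Gamma)[1]
              for w, S in zip(pi_hat, S_list))
    return val + np.linalg.slogdet(Gamma.T @ np.linalg.inv(Sx) @ Gamma)[1]

# Coordinate-free check: the value depends on Gamma only through span(Gamma).
rng = np.random.default_rng(0)
p, u = 5, 2
S_list = []
for _ in range(2):
    A = rng.standard_normal((p, p))
    S_list.append(A @ A.T + np.eye(p))
Sx = 0.5 * (S_list[0] + S_list[1])
Gamma, _ = np.linalg.qr(rng.standard_normal((p, u)))
theta = 0.7
O = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])   # rotation within span(Gamma)
g1 = envelope_objective(Gamma, S_list, [0.5, 0.5], Sx)
g2 = envelope_objective(Gamma @ O, S_list, [0.5, 0.5], Sx)
assert abs(g1 - g2) < 1e-8
```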

Envelope subspace estimation
The constrained minimization of the objective function $G^{(m)}(\Gamma)$ can be carried out through gradient descent on a Grassmann manifold. In particular, ignoring the orthogonality constraint $\Gamma^T\Gamma = I_u$, the gradient of $G^{(m)}(\Gamma)$ has the closed-form expression
$$\frac{\partial G^{(m)}}{\partial \Gamma} = 2\sum_{k=1}^{K} \pi_k^{(m+1)} S_k^{(m)} \Gamma \big(\Gamma^T S_k^{(m)} \Gamma\big)^{-1} + 2\,(S_x^{(m)})^{-1}\Gamma\big(\Gamma^T (S_x^{(m)})^{-1}\Gamma\big)^{-1}. \qquad (3.9)$$
Many manifold optimization packages require only this "unconstrained" matrix derivative, with the geometric constraints handled by different techniques. For example, our implementation (more details can be found in the Appendix) uses the curvilinear search algorithm proposed by [41], while manifold gradient descent methods [1] update along the tangent space of the manifold and then project back onto the manifold at each iteration. However, direct optimization is computationally expensive and requires good starting values. We adopt the idea of the 1D algorithm from [15], which is fast and stable for obtaining good initial estimators of envelopes in linear models [e.g., 14,31]. For our problem, we propose the following modified 1D algorithm to obtain an initial estimator for $\arg\min_{\Gamma\in\mathcal{G}_{p,u}} G^{(m)}(\Gamma)$. We start with $g_0 = G_0 = 0$ and $G_{00} = I_p$; let $G_l = (g_1, \dots, g_l)$ denote the matrix of sequential directions obtained so far, and let $(G_l, G_{0l})$ be an orthogonal matrix. Then, for $l = 0, \dots, u-1$, we obtain the sequential directions as follows.
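The tangent-space update followed by a retraction, as used by the manifold gradient descent methods cited above, can be sketched in a few lines; here a QR decomposition serves as the retraction (a simplified stand-in for the curvilinear search of [41]), tested on a toy two-matrix envelope objective:

```python
import numpy as np

def grassmann_descent(grad, Gamma0, step=0.01, n_iter=300):
    """Gradient descent with tangent-space projection and QR retraction."""
    Gamma = Gamma0
    for _ in range(n_iter):
        D = grad(Gamma)
        D = D - Gamma @ (Gamma.T @ D)              # project onto tangent space
        Gamma, _ = np.linalg.qr(Gamma - step * D)  # retract back to the manifold
    return Gamma

# Toy check on G(Gamma) = log|Gamma' M Gamma| + log|Gamma' M^{-1} Gamma|,
# whose minimum (zero) is attained at eigenvector subspaces of M.
M = np.diag([2.0, 1.5, 1.0, 0.8, 0.5])
Minv = np.linalg.inv(M)
obj = lambda G: (np.linalg.slogdet(G.T @ M @ G)[1]
                 + np.linalg.slogdet(G.T @ Minv @ G)[1])
grad = lambda G: (2 * M @ G @ np.linalg.inv(G.T @ M @ G)
                  + 2 * Minv @ G @ np.linalg.inv(G.T @ Minv @ G))
rng = np.random.default_rng(4)
G0, _ = np.linalg.qr(rng.standard_normal((5, 2)))
G = grassmann_descent(grad, G0)
assert obj(G) <= obj(G0) + 1e-10   # the objective does not increase
assert np.allclose(G.T @ G, np.eye(2))
```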
• Compute $S_{k,l}^{(m)} = G_{0l}^T S_k^{(m)} G_{0l}$ and $(S_{x,l}^{(m)})^{-1} = G_{0l}^T (S_x^{(m)})^{-1} G_{0l}$, and define the following stepwise objective function for $w \in \mathbb{R}^{p-l}$,
$$f_l(w) = \sum_{k=1}^{K} \pi_k^{(m+1)} \log\big(w^T S_{k,l}^{(m)} w\big) + \log\big(w^T (S_{x,l}^{(m)})^{-1} w\big). \qquad (3.10)$$
• Solve $w_{l+1} = \arg\min_{w\in\mathbb{R}^{p-l}} f_l(w)$ under the constraint $w^T w = 1$, and set $g_{l+1} = G_{0l} w_{l+1}$.
After the above sequential steps, we obtain an initial estimator $G_u \in \mathbb{R}^{p\times u}$ for the optimization of $G^{(m)}(\Gamma)$ using its gradient (3.9). It is worth mentioning that in real data applications, the within-class sample covariance matrix $S_k^{(m)}$ can be very close to singular when the probability $\pi_k$ is very small. This near-singularity can make the optimization of $G^{(m)}(\Gamma)$ unstable. In our experience, the estimation often improves (in stability, and sometimes in speed) if we add a small diagonal matrix such as $0.01 I_p$ to the sample covariance estimate $S_k^{(m)}$ of cluster $k$ when that cluster has relatively small size.

CLEMM-Shared: A special case of CLEMM
Recall that the number of free parameters in GMM is of the order $O(Kp^2)$, which can be much larger than the order $O(p^2 + Ku^2)$ for CLEMM. When the dimension $p$ is moderately high, GMM fitting becomes ineffective or even problematic. As we see in our real data analysis, even when the true clusters exhibit different covariance structures, it is often beneficial to fit a more restrictive GMM by assuming $\Sigma_1 = \cdots = \Sigma_K = \Sigma$ to reduce the number of parameters in estimation. More specifically, the numbers of free parameters in GMM and CLEMM are both reduced to order $O(p^2)$ under such an assumption. In this section, we consider the special case of CLEMM under the shared covariance assumption $\Sigma_1 = \cdots = \Sigma_K = \Sigma$, that is,
$$\mu_k = \Gamma\alpha_k, \qquad \Sigma = \Gamma\Omega\Gamma^T + \Gamma_0\Omega_0\Gamma_0^T, \qquad (4.1)$$
where $\Omega \in \mathbb{R}^{u\times u}$ is a symmetric positive definite matrix that remains the same across all clusters.
Under this shared-covariance CLEMM model (4.1), the clusters share the same shape and are distinguishable only by their centroids. We refer to this model as CLEMM-shared (and its counterpart as GMM-shared) throughout our discussion. In terms of the envelope, only the subspace $\mathcal{L} = \mathrm{span}(\mu_2 - \mu_1, \dots, \mu_K - \mu_1)$ is relevant, and the envelope degenerates to $\mathcal{E}_{\Sigma_x}(\mathcal{L})$, the smallest subspace $\mathrm{span}(\Gamma)$ that satisfies (4.1). Compared to the shared-covariance GMM model, introducing the envelope structure and assuming that the mean differences across clusters lie within a low-dimensional subspace reduces the total number of free parameters by $(K-1)(p-u)$. Similar to the general case, CLEMM-shared performs dimension reduction and efficient parameter estimation under the more restrictive shared-covariance GMM.
Another more practical benefit of model (4.1) is that the envelope-EM algorithm for CLEMM-shared can be further simplified and sped up over Algorithm 1. In fact, by utilizing a special form of Grassmann manifold optimization for shared covariance, we can accelerate the convergence of envelope-EM algorithm to be even faster than the standard GMM with shared covariance. The computational cost comparisons can be found in the Appendix E.2. We see that the CLEMM is generally slower than GMM because of the Grassmann manifold optimization involved, but the CLEMM-shared estimation can be faster than the GMM-shared estimation.
The envelope-EM algorithm for CLEMM-shared is analogous to Algorithm 1. We thus omit the details and only summarize the different M-step in the following proposition; the full description of the algorithm is given in Appendix D, Algorithm 2. Define the pooled within-cluster covariance estimator
$$S_W^{(m)} = \sum_{k=1}^{K} \pi_k^{(m+1)} S_k^{(m)},$$
so that $\widehat\Gamma \in \mathbb{R}^{p\times u}$ is the minimizer, subject to $\Gamma^T\Gamma = I_u$, of the objective function
$$F^{(m)}(\Gamma) = \log\big|\Gamma^T S_W^{(m)} \Gamma\big| + \log\big|\Gamma^T (S_x^{(m)})^{-1} \Gamma\big|.$$
From Proposition 2, we see that the two major differences between CLEMM and CLEMM-shared lie in the cluster-specific covariances $\{\Omega_k\}_{k=1}^K$ versus the shared $\Omega$, and in the objective function for solving for the envelope. Since only two matrices appear in $F^{(m)}(\Gamma)$, we can adopt the envelope coordinate descent (ECD) algorithm recently proposed by [16]. The ECD algorithm has been shown to be much faster than the 1D algorithm and than full Grassmann manifold updates, without much loss of accuracy. Therefore, we are able to improve the computation and speed up the envelope-EM algorithm for CLEMM-shared.
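With only two matrices, the shared-covariance objective is cheap to evaluate. The form below is an assumption consistent with the statement that only two matrices enter $F^{(m)}(\Gamma)$: the pooled within-cluster covariance plus the inverted marginal covariance. Like the general objective, it is rotation-invariant in the basis of $\Gamma$:

```python
import numpy as np

def shared_objective(Gamma, Sw, Sx):
    """F(Gamma) = log|Gamma' Sw Gamma| + log|Gamma' Sx^{-1} Gamma| (assumed form)."""
    return (np.linalg.slogdet(Gamma.T @ Sw @ Gamma)[1]
            + np.linalg.slogdet(Gamma.T @ np.linalg.inv(Sx) @ Gamma)[1])

# Hypothetical inputs: a pooled within-cluster covariance and a marginal covariance.
rng = np.random.default_rng(5)
p, u = 6, 2
A = rng.standard_normal((p, p)); Sw = A @ A.T + np.eye(p)
B = rng.standard_normal((p, p)); Sx = B @ B.T + np.eye(p)
Gamma, _ = np.linalg.qr(rng.standard_normal((p, u)))
```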

Model Selection
Information criteria are often used to determine the number of clusters in GMM. Examples include the Akaike Information Criterion (AIC, Akaike [2]), the Bayesian Information Criterion (BIC, Schwarz [36]), and the Approximate Weight of Evidence Criterion (AWE, Banfield and Raftery [4]), among others. We adopt the AWE criterion to select the number of clusters, an approximate Bayesian solution based on the "complete" log-likelihood:
$$\mathrm{AWE}(K) = -2\,\ell_c(\widehat\theta) + 2\,\mathrm{df}(K)\Big(\frac{3}{2} + \log n\Big), \qquad (5.1)$$
where $\mathrm{df}(K)$ is the number of free parameters that we have discussed earlier.
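The AWE score in the Banfield-Raftery form assumed here, $-2\ell_c + 2\,\mathrm{df}\,(3/2 + \log n)$, is a one-line computation; the helpers below are hypothetical utilities, not part of the CLEMM package:

```python
import numpy as np

def awe(neg2_complete_loglik, df, n):
    """AWE = -2 * complete log-likelihood + 2 * df * (3/2 + log n)."""
    return neg2_complete_loglik + 2.0 * df * (1.5 + np.log(n))

def select_best(scores):
    """scores: dict mapping a candidate (K or u) to its AWE score; pick the minimum."""
    return min(scores, key=scores.get)
```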
Minimizing AWE(K) over $K \in \{1, 2, \dots\}$ gives the estimated number of clusters. After determining $K$, which is the same for both GMM and CLEMM, we then determine the envelope dimension $u \in \{1, \dots, p\}$ as follows. For a given $K$, we replace $-2\ell_c(\widehat\theta)$ in (5.1) with the objective function value $nG^{(m)}(\widehat\Gamma)$ from Proposition 1, because $G^{(m)}(\widehat\Gamma)$ is essentially the partially minimized negative log-likelihood (see Appendix B for the derivations). As such, the AWE criterion becomes a function of the envelope subspace dimension $u$,
$$\mathrm{AWE}(u) = n\,G^{(m)}(\widehat\Gamma) + 2\,\mathrm{df}(u)\Big(\frac{3}{2} + \log n\Big).$$
For the special case of CLEMM-shared, we replace $G^{(m)}(\widehat\Gamma)$ with $F^{(m)}(\widehat\Gamma)$ from Proposition 2. These envelope objective functions, $G^{(m)}(\widehat\Gamma)$ and $F^{(m)}(\widehat\Gamma)$, can also be interpreted as quasi-likelihoods for model-free envelope estimation and consistent dimension selection [46]. This modified AWE criterion performs very well in our simulations and real data applications.

Simulations
In this section, we empirically compare the clustering results of GMM and CLEMM. We also include K-means, hierarchical clustering (HC), and the Bayes' error of classification as benchmarks. The Bayes' error is the lowest possible classification error, computed from the true parameters and using the class labels. The built-in R functions are used for the K-means (kmeans) and HC (hclust) algorithms. We also include the mixtures of factor analyzers with common component-factor loadings [3, MCFA], where we set the number of component-factors equal to the envelope dimension. In addition, we include the results of LDA and QDA as supervised-learning comparisons. For LDA and QDA, we generate an independent testing data set with the same sample size as the training data set and report the classification error rates on the testing data set.
To evaluate a clustering result, we assign observations to clusters by plugging the estimated parameters into the Bayes' rule. The clustering error rate is then defined as [6]
$$\min_{\pi\in\mathcal{P}} E\{I(\phi(X) \neq \pi(Y))\},$$
where $\mathcal{P} = \{\pi : \{1, \dots, K\} \to \{1, \dots, K\}\}$ is the set of permutation functions, $Y$ is the latent label of observation $X$, and $\phi(X)$ is the Bayes' rule of classification.
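Because cluster labels are only identified up to relabeling, the error rate minimizes over all $K!$ permutations; for the small $K$ used here, brute force is feasible. A sketch:

```python
from itertools import permutations

import numpy as np

def clustering_error(labels_true, labels_pred, K):
    """Mis-clustering rate, minimized over all K! relabelings of the clusters."""
    labels_true = np.asarray(labels_true)
    n = len(labels_true)
    best = n
    for perm in permutations(range(K)):
        mapped = np.array([perm[z] for z in labels_pred])
        best = min(best, int(np.sum(mapped != labels_true)))
    return best / n
```

For example, predictions that are a pure relabeling of the truth incur zero error.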
We focus on the relative improvements of CLEMM over GMM in terms of parameter estimation and mis-clustering error rate. The parameter estimates from the EM algorithm and the envelope-EM algorithm are expected to depend on the initialization. In order to have reasonably good initializations, we consider the following initial values: the clustering results from the K-means and hierarchical clustering algorithms, as well as the true parameter values. For each of the five simulation models, we generate 100 independent data sets with sample size $n = 1000$. Models (M1)-(M3) follow the general covariance structure in (2.2), and Models (S1) and (S2) follow the shared covariance structure in (4.1). The model parameters are set in the following way. The entries of $\Gamma \in \mathbb{R}^{p\times u}$ are first generated randomly from Uniform(0, 1); we then orthogonalize $\Gamma$ such that $(\Gamma, \Gamma_0)$ is an orthogonal matrix. The mean vectors are $\mu_k = \Gamma\eta_k$, with each $\eta_k \in \mathbb{R}^u$ filled with independent $N(0, 1)$ random numbers. The symmetric positive definite matrices $\Omega_k \in \mathbb{R}^{u\times u}$ and $\Omega_0 \in \mathbb{R}^{(p-u)\times(p-u)}$ are generated as $AA^T/\|AA^T\|_F$, where $A$ is a square matrix of compatible dimension with entries randomly generated from Uniform(0, 1). Finally, we let $\Sigma_k^* = \Gamma\Omega_k\Gamma^T + \Gamma_0\Omega_0\Gamma_0^T$ and standardize it to $\Sigma_k = \sigma^2\cdot\Sigma_k^*/\|\Sigma_k^*\|_F$, where the scalar $\sigma^2$ is chosen such that the Bayes' error is around 5-10%. Other model-specific parameters are summarized in the following. Table 1 summarizes the mis-clustering error rates of each method. Clearly, CLEMM improves over GMM significantly regardless of the initial values. If we use the true parameters as initial values for the EM and envelope-EM algorithms, the CLEMM estimator almost achieves the Bayes' error rate, while the GMM estimator may still have an error rate more than twice the Bayes' error rate. In four of the five models (all except Model (S1)), the K-means and HC algorithms actually fail to provide a good initial estimator. However, CLEMM can drastically reduce the clustering error from those poor initial estimators.
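The parameter-generation recipe above (uniform $\Gamma$ followed by orthogonalization, Frobenius-normalized $AA^T$ covariances) can be sketched as follows, with $\sigma^2$ fixed at a hypothetical value rather than calibrated to the Bayes error as in the paper:

```python
import numpy as np

rng = np.random.default_rng(6)
p, u, K = 15, 1, 3

# Gamma: Uniform(0,1) entries, then orthogonalized so (Gamma, Gamma0) is orthogonal.
Q, _ = np.linalg.qr(rng.uniform(size=(p, p)))
Gamma, Gamma0 = Q[:, :u], Q[:, u:]

eta = rng.standard_normal((K, u))
mu = eta @ Gamma.T                       # mu_k = Gamma eta_k

def rand_spd(d):
    """A A' normalized by its Frobenius norm, with A having Uniform(0,1) entries."""
    A = rng.uniform(size=(d, d))
    M = A @ A.T
    return M / np.linalg.norm(M, 'fro')

Omega = [rand_spd(u) for _ in range(K)]
Omega0 = rand_spd(p - u)
sigma2 = 1.0     # hypothetical; the paper calibrates it to a 5-10% Bayes error
Sigma = []
for Om in Omega:
    S = Gamma @ Om @ Gamma.T + Gamma0 @ Omega0 @ Gamma0.T
    Sigma.append(sigma2 * S / np.linalg.norm(S, 'fro'))
```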
This demonstrates that CLEMM is robust to initialization. In all these models, CLEMM shows substantial advantages over K-means and HC because of its model-based nature and its parsimonious modeling of the means as well as the covariance matrices. Moreover, CLEMM has performance comparable to, or even better than, LDA and QDA in these models. This indicates that CLEMM can exploit the latent lower-dimensional structure of the data for better clustering results.
Next, we compare the parameter estimation accuracy of GMM and CLEMM. The results are summarized in Tables 2, 3 and 4 for the estimation of the means µ_k, the cluster sizes π_k, and the covariances Σ_k, respectively. Again, not surprisingly, CLEMM achieves much more accurate parameter estimation than GMM, especially in the more general cases (M1)-(M3), where the reduction in the number of free parameters is large. Even in the relatively simpler shared-covariance Models (S1) and (S2), CLEMM still improves significantly over GMM. Moreover, as shown in Appendix E.2, the computational time cost of the envelope-EM algorithm for CLEMM is actually less than that of the EM algorithm for GMM under these two models.
Finally, we demonstrate the envelope dimension selection results by our AWE criterion in Table 5, where we report the percentage of simulated data sets for which the true dimension is selected by AWE. The dimension selection procedure appears promising, and we further illustrate the AWE criterion on the following real data sets.

Table 1: Clustering error rates (%). The reported numbers are averaged results and their standard errors (in parentheses) over 100 replications.

Table 3: Estimation errors in cluster sizes, ∑_{k=1}^{K} ||π̂_k − π_k||_F. The reported numbers are averaged results and their standard errors (in parentheses) over 100 replications.

The Waveform data (see [23] for more background) is a simulated three-class classification data set with 21 predictors and 800 observations. The three classes of waveforms are random convex combinations of two triangular base waveforms plus independent Gaussian noise. It is commonly used as a benchmark data set in machine learning studies to demonstrate the robustness of methods, because its distribution is not actually a mixture of Gaussian distributions. First of all, we determine the envelope dimension u using the AWE criterion proposed in Section 5. Figure 2 visualizes the AWE scores for all candidate envelope dimensions on the three data sets. The plots clearly suggest u = 7 for the Forest Type data and u = 2 for the Waveform data.
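Selecting u by an AWE-type criterion amounts to minimizing a penalized likelihood over candidate dimensions. The sketch below uses the classical AWE penalty, −2·loglik + 2d(3/2 + log n), which may differ in detail from the criterion proposed in Section 5; the (loglik, n_params) pairs are supplied by the fitted model at each candidate u:

```python
import math

def awe_score(loglik, n_params, n):
    """Classical AWE form: -2*loglik + 2*d*(3/2 + log n)."""
    return -2.0 * loglik + 2.0 * n_params * (1.5 + math.log(n))

def select_dimension(fits, n):
    """fits: dict mapping candidate u -> (loglik, n_params); pick the argmin of AWE."""
    return min(fits, key=lambda u: awe_score(*fits[u], n))
```

In practice one would fit the model for each candidate u, record the maximized log-likelihood and the effective number of parameters, and choose the u with the smallest score, as visualized in Figure 2.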
In Table 6, we report the clustering error rates of each method. In addition to the clustering methods, we also include the discriminant analysis results from LDA and QDA obtained using the true class labels. For the LDA and QDA error rates, we use the R functions lda and qda (from the MASS package) to obtain the leave-one-out cross-validation prediction errors of class membership. In both data sets, we see substantial improvements by CLEMM over the other clustering methods. It is very encouraging that CLEMM (without knowing the class labels) achieves results comparable to the discriminant analysis methods (which make use of the class labels in estimation) on the Sound and Forest Type data. Moreover, for the Waveform data, where X | Y is no longer Gaussian, the clustering accuracy of CLEMM is even better than the classification accuracy of LDA and QDA. This indicates that CLEMM is more robust to non-Gaussian distributions than the other clustering and discriminant analysis methods. Overall, compared with other clustering methods, CLEMM is better at capturing the information in real data.

Table 6: Clustering and classification error rates (%) on the two data sets.

In Figure 3, we visualize the true classes and the clusters estimated by CLEMM, GMM and MCFA, projected onto the first two principal components of the data. We see clearly that CLEMM better captures the variability of the clusters, especially the "Sugi" cluster of the Forest Type data. For the Waveform data, since the selected envelope dimension is u = 2, we next compare visualizations on different subspaces: the envelope subspace, the LDA subspace, and the principal component subspace. In Figure 4, we visualize the clusters on the envelope subspace estimated by CLEMM. This produces much better separated clusters than the visualization of the GMM-estimated clusters on the principal component subspace, or of the true classes on the LDA subspace.
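The leave-one-out evaluation used for LDA above can be sketched as follows. This is an illustrative plain-NumPy implementation of the LDA rule on synthetic two-class data, rather than the R functions from the MASS package used in the paper:

```python
import numpy as np

def lda_predict(Xtr, ytr, x):
    """Classify x by linear discriminant scores with a pooled covariance estimate."""
    classes = np.unique(ytr)
    n, K = len(ytr), len(classes)
    pooled = sum((np.sum(ytr == c) - 1) * np.cov(Xtr[ytr == c].T)
                 for c in classes) / (n - K)
    Sinv = np.linalg.inv(pooled)
    best, best_score = None, -np.inf
    for c in classes:
        m = Xtr[ytr == c].mean(axis=0)
        # linear discriminant score with empirical class prior
        score = x @ Sinv @ m - 0.5 * m @ Sinv @ m + np.log(np.mean(ytr == c))
        if score > best_score:
            best, best_score = c, score
    return best

def loo_error(X, y):
    """Leave-one-out misclassification rate of the LDA rule."""
    idx = np.arange(len(y))
    errs = sum(lda_predict(X[idx != i], y[idx != i], X[i]) != y[i] for i in idx)
    return errs / len(y)

# Illustrative synthetic two-class Gaussian data (not one of the paper's data sets)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (50, 4)), rng.normal(2.0, 1.0, (50, 4))])
y = np.repeat([0, 1], 50)
err = loo_error(X, y)
print(f"LDA leave-one-out error rate: {err:.3f}")
```

The QDA analogue replaces the pooled covariance with per-class covariances and adds the corresponding log-determinant term to each score.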
The improvement in visualization by CLEMM over the true classes on the LDA subspace is consistent with our earlier findings on the error rates (Table 6): CLEMM clustering (without knowing Y) is even more accurate than LDA classification (knowing Y). We can also see that the first envelope direction correctly characterizes the location and the main variation of one of the clusters (the blue dots). The results in Figure 4 suggest that CLEMM can assist data visualization when the estimated envelope dimension is small.

Discussion
In this paper, we extend the envelope methodology to unsupervised learning problems by considering model-based clustering. The proposed parsimonious model, CLEMM, can simultaneously achieve dimension reduction, visualization, and improved parameter estimation and clustering. Compared to the standard GMM, CLEMM is much more effective in capturing the cluster location changes and the heterogeneous variation across different clusters. The envelope-EM algorithm developed in this paper can also be modified to handle missing data in general. As future research directions, the proposed clustering framework can be extended to mixture discriminant analysis [23,24], to incorporate regularization techniques such as the ones from Friedman [21] and Cai et al. [6], and to various unsupervised and semi-supervised problems.
Maximizing over Ω_k and Ω_0, and substituting the maximizing values for µ and α_k, we obtain Γ by minimizing ∑_{k=1}^{K} n_k log|Γ^T S_k Γ| + n log|Γ^T S_x^{-1} Γ| over the semi-orthogonal constraint Γ^T Γ = I_u.

Table 7: Computing time in seconds of (M1)-(M5). The reported numbers are averaged results and their standard errors (in parentheses) over 100 replications.
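The constrained minimization over Γ can be sketched numerically as follows. This is an illustrative projected-gradient method with a QR retraction onto the semi-orthogonal constraint, not the optimizer used in our R package; the weights w_k = n_k/n and the sample matrices S_k, S_x are supplied by the E-step:

```python
import numpy as np

def envelope_objective(G, S_list, w_list, Sx_inv):
    """f(G) = sum_k w_k log|G^T S_k G| + log|G^T Sx^{-1} G| (objective divided by n)."""
    f = sum(w * np.linalg.slogdet(G.T @ S @ G)[1]
            for S, w in zip(S_list, w_list))
    return f + np.linalg.slogdet(G.T @ Sx_inv @ G)[1]

def minimize_envelope(S_list, w_list, Sx_inv, u, steps=300, lr=0.1, seed=0):
    """Projected gradient descent with a QR retraction onto G^T G = I_u."""
    p = S_list[0].shape[0]
    G, _ = np.linalg.qr(np.random.default_rng(seed).normal(size=(p, u)))
    f = envelope_objective(G, S_list, w_list, Sx_inv)
    for _ in range(steps):
        # d/dG log|G^T A G| = 2 A G (G^T A G)^{-1}
        grad = sum(2.0 * w * S @ G @ np.linalg.inv(G.T @ S @ G)
                   for S, w in zip(S_list, w_list))
        grad += 2.0 * Sx_inv @ G @ np.linalg.inv(G.T @ Sx_inv @ G)
        step = lr
        while step > 1e-10:  # backtrack until the objective does not increase
            G_new, _ = np.linalg.qr(G - step * grad)
            f_new = envelope_objective(G_new, S_list, w_list, Sx_inv)
            if f_new <= f:
                G, f = G_new, f_new
                break
            step /= 2.0
    return G, f
```

The backtracking line search guarantees a monotonically non-increasing objective, mirroring the ascent property of the envelope-EM algorithm; more sophisticated manifold optimizers (e.g., on the Grassmannian) can be substituted for better efficiency.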