Variable Selection Methods for Model-based Clustering

Model-based clustering is a popular approach for clustering multivariate data which has seen applications in numerous fields. Nowadays, high-dimensional data are more and more common and the model-based clustering approach has adapted to deal with the increasing dimensionality. In particular, the development of variable selection techniques has received a lot of attention and research effort in recent years. Even for small size problems, variable selection has been advocated to facilitate the interpretation of the clustering results. This review provides a summary of the methods developed for variable selection in model-based clustering. Existing R packages implementing the different methods are indicated and illustrated in application to two data analysis examples.


Introduction
Model-based clustering is a well established and popular tool for clustering multivariate data. In this approach, clustering is formulated in a modeling framework and the data generating process is represented through a finite mixture of probability distributions. Often, all the variables at a user's disposal are employed in the modeling. Nonetheless, in many situations considering all the variables unnecessarily increases the model complexity. Moreover, some variables may not possess any clustering information and are of no use in the detection of the group structure. Rather, they could be detrimental to the clustering. Likewise, the case were all of the variables contain clustering information can also be problematic. Along with the increasing number of dimensions comes the curse of dimensionality (Bellman, 1957) and including superfluous variables in the model leads to identifiability problems and over-parameterization (Bouveyron and Brunet-Saumard, 2014a; Bartholomew, Knott and Moustaki, 2011). Therefore, resorting to variable selection techniques can facilitate model fitting, ease the interpretation of the results and lead to data classification of better quality. Even in situations of moderate or low dimensionality, reducing the set of variables employed in the clustering process can be beneficial (Fowlkes, Gnanadesikan and Kettenring, 1988).
This article gives a review of the available methods for variable selection in model-based clustering. Starting from early works on the topic, we will give a summary of the various approaches up to the most recent developments. The focus will be on variable selection when clustering multivariate continuous and categorical data, the two most common data types in practice. References to the existing R packages that implement the different methods are provided.
As illustrative examples, we will apply the variable selection methods on two illustrative datasets.
The exposition is structured as follows. Section 2 introduces the main features of the variable selection methods for mixture model clustering. Section 3 recalls the model-based clustering framework, with a brief description of the two principal models for multivariate continuous and categorical data: Gaussian mixture models and latent class analysis. In Section 4, classical and recent variable selection methods for Gaussian mixture models are reviewed. The section terminates with the list of available R packages and a data analysis example concerning historical mortality rates. Section 5 is dedicated to variable selection methods for latent class analysis. R packages and an application to voting data are presented in conclusion. The paper ends with a discussion in Section 6.

Methods of variable selection
When performing variable selection for clustering, the goal is to retain only the relevant variables. With this aim, it is crucial to define what it means for a variable to be, or not to be, "relevant". In supervised learning the matter has been the object of rigorous discussion for a long time; we mention the works of Yu and Liu (2004); Blum and Langley (1997); Kohavi and John (1997); Koller and Sahami (1996) and John, Kohavi and Pfleger (1994). Within the modelbased clustering approach, the question has been addressed recently in Ritter (2014). Model-based clustering places the clustering task into a formal modeling framework and the group structure is embedded in the group membership variable (McLachlan and Basford, 1988). In this case, the definition of relevance can be expressed in terms of probabilistic dependence (or independence) statements with respect to this variable (Ritter, 2014). Relevant variables contain the essential clustering information. In a model-based clustering context, their distribution directly depends on the group membership variable. On the other hand, irrelevant variables do not convey any beneficial information. These can be further divided into redundant and uninformative variables. Redundant variables provide information of similar quality to that already available in the relevant ones, therefore are not needed for a parsimonious modeling. In many situations, they contain similar information because they are correlated with the relevant ones. In terms of distributional representation, we may think of redundant variables as conditionally independent of the grouping variable given the relevant variables; they could be useful for clustering, but only if the relevant ones are not present. On the contrary, uninformative variables possess no discriminative information whatsoever. They correspond to noise and their distribution is completely independent of the group structure.
Different model-based clustering and variable selection strategies can be delineated by distributional assumptions on relevant and irrelevant variables. Two main assumptions are peculiar to the task of variable selection and mixture model clustering; these are: • Local independence assumption. The relevant variables are conditionally independent within the groups. • Global independence assumption. The irrelevant variables are independent of relevant clustering variables. Figure 1 gives a simple sketch of the two independence statements. In specifying a general strategy for model-based clustering and selection of variables, these assumptions can be combined together or employed separately. They have different implications on the clustering model and the model for the relations among the variables. In particular, the local independence assumption helps to simplify the modeling of the joint distribution of relevant variables and is especially useful in high-dimensional data settings. It is a standard assumption of the latent class analysis model (Clogg, 1988). For Gaussian mixture models, the statement corresponds to assuming components with diagonal covariance matrices. The global independence assumption implies that the joint distribution of the variables factors into the product of the mixture distribution of the relevant variables and the distribution of the irrelevant variables. The term "global" is used because the independence statement affects the distribution of all the variables, not only the clustering ones. This assumption simplifies the modeling of the relation between relevant and irrelevant variables. However, it limits the capability of taking into account the presence of redundant variables (Law, Figueiredo and Jain, 2004;Raftery and Dean, 2006). Many of the methods that are presented in this review make use of one of the two assumptions above, either implicitly or explicitly. Likewise, how the variable selection algorithm interacts with the model fitting process defines the overall approach to the problem. For a general learning task, the principal distinction is in whether the selection is carried out separately or jointly to the learning procedure (John, Kohavi and Pfleger, 1994;Dash and Liu, 1997;Dy and Brodley, 2004;Saeys, Inza and Larrañaga, 2007;Liu and Motoda, 2007). The first case corresponds to filter methods, where the selection is performed as a pre (or sometimes post) processing step. The second corresponds to wrapper methods, that combine learning and variable selection at the same time. In this case the selection procedure is "wrapped" around the learning algorithm. Filter approaches are easy to implement and computationally efficient. However, wrapper methods often provide superior results, despite being more involved (Blum and Langley, 1997;Kohavi and John, 1997;Guyon and Elisseeff, 2003).
In model-based clustering, filter methods perform the variable selection before (or after) the model has been estimated. The inferred classification is then used to evaluate the quality of the variables. In contrast, with a wrapper method model estimation and selection are conducted simultaneously. Variables selected via filter methods can miss important information as the selection is external to the model estimation. Indeed, wrapper methods have recently attracted most of the attention and are the most widespread. Their advantages lie in the fact that they can be naturally included in the model fitting, can lead to a better classification and can provide a better representation of the data generating process (Dy and Brodley, 2004;Law, Figueiredo and Jain, 2004).
Within the class of wrappers, the various methods for variable selection in mixture model clustering can be further distinguished according to the type of Fop,M. and Murphy,T. B./Variable Selection Methods 5 statistical approach used. Three major approaches can be found in the literature: • Bayesian approaches. In this class of methods, the problem of variable selection is addressed assuming the existence of a latent variable indicating if an observed variable characterizes a mixture distribution or not. Variable selection is conducted by making inference about the posterior distribution of such a latent variable. • Penalization approaches. Within this category, variable selection is performed by using a penalized log-likelihood approach. The penalization term is a function of the mixture parameters and acts to shrink the estimates towards an overall common value. Variables whose estimates take this common value across the different mixture components are considered irrelevant and are discarded. • Model selection approaches. Here the task of variable selection is re-formulated as a model selection problem. Different models are specified according to the role of the variables towards the clustering structure. Consequently, relevant variables are selected by comparing different models using some predefined criterion.
The above categorization is not exhaustive nor the definitions are intended to be mutually exclusive. Indeed, many of the methods that will be presented in the rest of the paper have some degree of overlap and a method belonging to one type of approach could be easily rephrased in terms of the other ones. These three general approaches are the predominant in the literature and the classification will be used for ease of exposition and a systematic presentation. In summary, the existing variable selection methods for model-based clustering are differentiated in relation to the distributional assumptions for relevant and irrelevant variables, the interaction between variable selection and model fitting, and the general statistical approach. We will give an overview of these methods after a short description of the model-based clustering framework.

Model-based clustering
Let X be the N × J data matrix, where each row x i = (x i1 , . . . , x ij , . . . , x iJ ) is the realization of a J-dimensional vector of random variables X = (X 1 , . . . , X j , . . . , X J ). Model-based clustering assumes that each observation arises from a finite mixture of G probability distributions, each representing a different cluster or group (Fraley and Raftery, 2002;Melnykov and Maitra, 2010;McNicholas, 2016). The general form of a finite mixture distribution is specified as follows: where the τ g are the mixing probabilities and Θ g is the parameter set corresponding to component g; Θ denotes the set of all parameters of the mixture. The component densities fully characterize the group structure of the data and each observation belongs to the corresponding cluster according to a latent cluster membership indicator variable z i = (z i1 , . . . , z ig , . . . , z iG ), such that z ig = 1 if x i arises from the gth subpopulation (McLachlan and Peel, 2000;McLachlan and Basford, 1988). For a fixed number of components, parameters are usually estimated using the EM algorithm (Dempster, Laird and Rubin, 1977;McLachlan and Krishnan, 2008;Bartholomew, Knott and Moustaki, 2011;O'Hagan, Murphy and Gormley, 2012). Moreover, generally model selection corresponds to the selection of the number of components G and to accomplish the task a plethora of methods have been suggested in the literature, the Bayesian Information Criterion (BIC, Schwarz, 1978) being the most popular one. Another popular approach for mixture model selection is the Integrated Complete-data Likelihood criterion (ICL, Biernacki, Celeux and Govaert, 2000), which gives more importance to models with well separated clusters. See McLachlan and Rathnayake (2014) for a detailed review of the various methods.
After parameters have been estimated, each observation is assigned to the corresponding cluster using the maximum a posteriori (MAP) rule (McLachlan and Peel, 2000;McNicholas, 2016). The posterior probabilities u ig = Pr(z ig = 1 | x i ) of observing cluster g given the data point i are estimated as follows: Then observation x i is assigned to cluster g if According to the nature of the data, different specifications for the component densities in (1) have been proposed. In the following sections we focus on the cases of continuous and categorical data, taking in consideration the two most popular distributions: Gaussian and Multinomial. However, various and more flexible distributional assumptions can be specified, enabling to take into account for skewness, heavy tails and different data types; for example, see Mc-Nicholas (2016) for a review on non-Gaussian components, McLachlan and Peel (1998) for mixtures of multivariate t-Student distributions, Lee and McLachlan (2013) and Lee and McLachlan (2016) for mixtures of skew-t and skew normal distributions, Karlis and Meligkotsidou (2007) and Rau et al. (2015) for clustering multivariate count data, McParland and Gormley (2016) for mixed data model-based clustering, Kosmidis and Karlis (2016) for the use of copulas in model-based clustering, DeSantis et al. (2008) for latent class analysis of ordinal data.

Gaussian mixture model
When clustering multivariate continuous data, a common approach is to model each component density by a multivariate Gaussian distribution. For a Gaussian Fop,M. and Murphy,T. B./Variable Selection Methods 7 mixture model (GMM) the mixture density in (1) becomes: where φ is the multivariate Gaussian density and µ g and Σ g are the mean and covariance parameters respectively; see Fraley and Raftery (2002), Melnykov and Maitra (2010) and McNicholas (2016) for reviews. To attain parsimony, several approaches involving re-parameterizations of the covariance matrix Σ g have been presented; for example Banfield and Raftery (1993); Celeux and Govaert (1995); Bouveyron, Girard and Schmid (2007); McNicholas and Murphy (2008); Biernacki and Lourme (2014). We refer to Bouveyron and Brunet-Saumard (2014a) for a review. Model based-clustering via GMMs can be performed in R (R Core Team, 2017) using the packages mclust , Rmixmod (Lebret et al., 2015), EMCluster (Chen and Maitra, 2015), mixtools (Benaglia et al., 2009) and flexmix (Leisch, 2004); also the EMMIX software (McLachlan et al., 1999) can be used for fitting mixtures of Gaussian distributions. Lastly, we note that in Python (Python Software Foundation, 2017), GMMs estimation can be executed via the packages scikit-learn 1 and PyMixmod 2 .

Latent class analysis model
For clustering multivariate categorical data, the common approach is to use the latent class analysis model (LCA, Bartholomew, Knott and Moustaki, 2011). Under this model, the mixture density in (1) is a mixture of Multinomial distributions as follows: where with π gjc representing the probability of occurrence of category c for variable X j in class g, and C j the number of categories of variable j. The factorization in C(x i ; π g ) is due to the local independence assumption, stating that the variables are independent within each latent class (Clogg, 1988). More details about the model can be found in Vermunt and Magdison (2002); Agresti (2002); Collins and Lanza (2010). For different values of G, not all the models can be fitted and constraints on the parameters need to be placed in order to ensure identifiability (Goodman, 1974); for examples see Formann (1985).
R packages implementing the LCA model for clustering categorical data are BayesLCA (White and Murphy, 2014), poLCA (Linzer and Lewis, 2011) and flexmix (Leisch, 2004); package e1071 (Meyer et al., 2017) contains function lca for fitting the LCA model on binary data.

Variable selection methods for Gaussian mixture models
In this section we provide an overview of the available methods for clustering and variable selection using Gaussian mixture models. Steinley and Brusco (2008) and Celeux et al. (2014) compare and evaluate the performances of some of the methods described in the subsequent sections. In particular, Steinley and Brusco (2008) perform an empirical comparison of different procedures for clustering and variable selection, also those not based on mixture models. While Celeux et al. (2014) compare the model selection approach of Maugis, Celeux and Martin-Magniette (2009a) (see Section 4.3) and the regularization approach of Witten and Tibshirani (2010) (see Section 4.2); the authors found that, in the case of correlated variables, the model selection approach was substantially more accurate in terms of both classification and variable selection than the regularization approach, and that both gave more accurate classifications than K-means without variable selection.

Bayesian approaches
Various methods have been developed within the Bayesian paradigm for simultaneous model-based clustering and variable selection. A common feature among them is the introduction of a variable ϕ, usually following a distribution p(ϕ). The variable splits X into two sets: X C , the set of variables defining a Gaussian mixture distribution, and X N C , the set of variables indicating a single multivariate Gaussian distribution. In its most general form, the distribution of the data, conditional on z, can be expressed as: where p(X C | z, G, ϕ, Θ) is a mixture distribution whose parameters are denoted by Θ, p(X N C | ϕ, Γ) is a single multivariate Gaussian distribution with parameters Γ, and Ω denotes the collection {Θ, Γ} of all parameters. Then, with the aim of variable selection and clustering, the focus is in drawing inference from the posterior distribution: In this context, Liu et al. (2003) propose the anchor mode model, where variable selection is performed by selecting the most informative principal components (or factors) of the data. In this approach, a preliminary dimension reduction through principal component analysis is performed and the first k 0 factors are retained. Then it is assumed that only a subset of these factors is informative for clustering and distributed according to a mixture of Gaussians, while the remaining components follow a simple Gaussian distribution. The subset consists of the first ϕ principal components, where ϕ is a random variable distributed according to the prior distribution p(ϕ). Inference on this number of relevant factors is conducted employing a Markov Chain Monte-Carlo (MCMC) scheme where the prior p(ϕ) is taken to be the Uniform distribution. The method is shown to perform well on high-dimensional gene expression data, however, selection is performed on a set of features derived from the original variables and in general the principal components with the larger eigenvalues will not necessarily contain the most useful information about the clustering (see Chang, 1983).
As an alternative to the "hard selection" approach (a variable is either selected or not), Law, Figueiredo and Jain (2004) suggest a Bayesian framework where the concept of feature saliency is introduced. Let ϕ = (ϕ 1 , . . . , ϕ j , . . . , ϕ J ) be a binary variable such that ϕ j = 1 if X j is relevant and ϕ j = 0 otherwise. Then the saliency of variable X j is the quantity ρ j = Pr(ϕ j = 1) and can be interpreted as the importance of the variable in characterizing the cluster structure of the data. Assuming conditional independence of the variables, it follows that the likelihood of a data-point can be expressed as: with clear use of the notation. Rather than ϕ, the interest here is in recovering the probabilities ρ j and a Dirichlet prior is placed on the corresponding vector.
To encourage the saliences of some variables to converge to zero, the authors adopt a minimum message length criterion (Wallace and Freeman, 1987) and utilize an EM algorithm for maximum a posteriori estimation. Within the same framework, Constantinopoulos, Titsias and Likas (2006) consider a variational learning method for estimation of the saliences. The same idea of a binary clustering-relevance indicator variable ϕ is employed in Tadesse, Sha and Vannucci (2005). The authors assume a prior on ϕ of the form: with η the hyper-parameter interpreted as the proportion of variables expected to discriminate the groups. A MCMC scheme is used for inference, and the vector ϕ is updated using a Metropolis search where a new candidate is generated from the previous state by adding, removing and swapping at random its entries. Posterior inference on ϕ is drawn after integrating out the parameters and considering the marginal posterior p(ϕ j = 1 | X). Then the best clustering variables can be identified as those with largest marginal posterior, p(ϕ j = 1 | X) > t, with a specified t. Alternatively, the selection can be performed by taking into account the complete vector ϕ with largest posterior probability among all visited vectors throughout the chain, thus considering the joint density of ϕ. The method is shown to perform well in clustering high-dimensional microarray data. Furthermore, in subsequent work, Kim, Tadesse and Vannucci (2006) extend the approach by formulating the clustering in terms of an infinite mixture of Gaussian distributions via Dirichlet process mixtures, while Swartz et al. (2008) expand it to the modeling of data with a known structure imposed from an experimental design.

Penalization approaches
In this context, a penalization term is introduced on the model parameters and variable selection is performed by inducing sparsity in the estimates. The aim is to maximize a penalized version of the log-likelihood under a Gaussian mixture model and discard those variables whose parameter estimates are shrunken to zero or to a common value across the mixture components. In its general form, this penalized log-likelihood is as follows: where the penalization term Q λ (Θ) is a function of the Gaussian densities parameters Θ and λ, a generic penalty parameter (here in the notation we denoted with Θ the collection of all Gaussian density parameters and with Θ g the subset corresponding to component g). Generally, the various methods are differentiated by the form of the function Q λ (·), having different implications on the selection of variables. Seminal work in this class of approaches is the method introduced by Pan and Shen (2007). The authors use a L 1 penalty function of the form: After centering the data, the method realizes variable selection by automatically shrinking several small estimates of µ gj towards zero. Indeed, if µ gj = 0 for all g, the component means for variable j are equal to the overall data mean and variable j does not contribute to the clustering. The authors show an application to the Golub et al. (1999) leukemia gene expression dataset, demonstrating the usefulness of the approach in "high dimensions -small sample size" settings.
A closely related approach is the one suggested by Bhattacharya and McNicholas (2014), where the penalizing function accounts for the size of the clusters and is given by: The parameter λ depends on the sample size and the authors derive a BIC-type model selection criterion for high-dimensional settings.
The L 1 penalty function treats each µ gj individually, not using the information that, across the mixture components, the parameters (µ 1j , . . . , µ gj , . . . , µ Gj ) corresponding to the same variable X j are grouped. This results in the fact that, if for a fixed variable j and some component g, we have µ gj = 0 while µ kj = 0 for all the remaining components, then the variable would not be excluded. Wang and Zhu (2008) suggest a solution to the problem by replacing the L 1 norm with the L ∞ norm. Thus the penalty function is given by: After re-parameterizing µ gj = α j β gj (α j ≥ 0), the authors consider also a hierarchical penalization function of the form: with λ 1 controlling the amount of shrinkage on the µ gj as a group for g = 1, . . . , G, and λ 2 controlling the shrinkage effect within variable j. This function has the advantage of being more flexible and inducing a less "hard" penalization than the L ∞ norm. Both penalty functions take into account the fact that component means corresponding to the same variable can be treated as grouped and tend to conduct a more effective variable selection.
The idea of grouped parameters is also accounted in Xie, Pan and Shen (2008a). Here the authors suggest the use of two planes of grouping: vertical and horizontal mean grouping. For the first, mean parameters afferent to the same variable are treated as a whole and the penalty function is: where µ j = (µ 1j , . . . , µ gj , . . . , µ Gj ) and ||·|| 2 denotes the L 2 norm. The same idea is exploited in the horizontal grouping, where prior knowledge that some variables work in groups is introduced in the penalization. Here the variables can be grouped in M groups indexed by m and each one of size H m . Then the function is given by: with µ gm the vector of means of component g for variables in group m. The two planes of penalization can be combined together and Xie, Pan and Shen (2008a) show their superior performance in comparison to the standard L 1 penalty. An approach allowing to identify which variables are discriminative for which specific pairs of clusters is the one proposed by Guo et al. (2010). They introduce a pairwise fusion penalty of the form: The penalty function shrinks the mean parameters toward each other and if µ gj = µ gh , variable X j is not useful in discriminating components g and h, but may be useful in discriminating the other components. A variable will be considered as non-informative and discarded only in the case where all the cluster means are shrunken to the same value.
A shared characteristic of the penalization methods presented to this point is the assumption that the data are generated by a Gaussian mixture distribution with a common diagonal covariance matrix. Consequently, the mixture components are differentiated only by their mean parameters and variables with cluster-specific means all equal can be discarded since non-informative to clustering. Nonetheless, in some situations, the assumption of a common isotropic covariance matrix might be too restrictive, as it implies that all the clusters are of spherical shape and have the same volume. For example, the Golub et al. (1999) data contain a sample of 38 patients known to have three types of leukemia, with different variability in the gene expressions. In this case, assuming a common isotropic covariance matrix would likely lead to the selection of a number of clusters larger than the actual number of types of leukemia in order to accommodate for the extra within-cluster variability.
Xie, Pan and Shen (2008b) move away from this assumption and present an approach for Gaussian mixture distributions with cluster-specific diagonal covariance matrices. They introduce two penalization terms of the form: After the data have been standardized to have mean 0 and variance 1 for all the variables, variable selection is performed by removing those variables having common mean 0 and common variance 1 across the clusters. The authors also expand the penalization to account for grouped variables in a similar fashion as in Xie, Pan and Shen (2008a). The method is applied to the Golub et al. (1999) dataset and it is shown to perform a more parsimonious selection than the approach of Pan and Shen (2007) and to detect clusters where gene expressions have different variances. Zhou, Pan and Shen (2009) further extend the previous method to the case of unconstrained covariance matrices. The authors suggest the following penalization: where W g is the precision matrix, W g = Σ −1 g , and ||W|| 1 = J j=1 J h=1 | w jh | . The term involving the mean parameters is used for variable selection, while that for the precision is a regularization for dealing with high-dimensional data without imposing a diagonal covariance matrix. In this framework, covariances between the variables are taken into account and clusters are also allowed to be elliptical and of different shapes. The authors apply their method on the Golub et al. (1999) data, improving the classification performance with respect to the previous approaches.
By introducing a penalization term in the mixture of factor analyzers model (see Bouveyron and Brunet-Saumard, 2014a, for example) dimension reduction and variable selection can be performed simultaneously. Xie, Pan and Shen (2010) introduce a penalization approach in the mixture of factor analyzers model where the penalty function is given by: with γ gjr the cluster-specific loading corresponding to variable j and latent dimension r. The penalization on the loadings serves as a grouped variable penalty similarly to Xie, Pan and Shen (2008a). The author apply the method for clustering gene expression data of lung cancer patients, uncovering clusters of subjects with distinct risks of mortality. Galimberti, Montanari and Viroli (2009) introduce a mixture of factor analyzers model with common factor loading matrix that projects the cluster means in a low dimensional space. They consider a penalized log-likelihood of the form: where V is the (orthogonal) J × R matrix of loadings (R < J) with entries γ jr and B is a diagonal covariance matrix. Therefore, variables whose estimated loadings are γ jr = 0 ∀ r will be non influential to the clustering and discarded.
A related method is suggested by Bouveyron and Brunet-Saumard (2014b), who propose to perform selection of the relevant variables by inducing sparsity in the loading matrix of the Fisher-EM algorithm (see (Bouveyron and Brunet, 2012;Bouveyron and Brunet-Saumard, 2014a) for details). The method assumes that the observed data lie in a discriminative latent space defined by latent features which are linear combinations of the original variables. As before, the density of each data point is expressed by: The main idea is to obtain a sparse estimate of V such that variables whose loadings are shrunk to zero will not be relevant in defining the discriminative latent subspace and hence not useful for clustering. The authors propose three different approaches. The simplest, although easy to implement and competitive with the others, aims at obtaining a sparse approximationṼ of the estimateV computed during an iteration of the Fisher-EM. To this purpose, the following minimization problem is posed: where v r is a column of V and ||·|| F denotes the Frobenius norm. The other two are more involved and estimate a matrix V that is directly sparse and defines a discriminative subspace; we point to the referenced work for details. For all the methods described above, maximization of the penalized likelihood for different specifications of Q λ (·) is usually carried out resorting to a form of penalized EM algorithm (Green, 1990). Moreover, model selection consists in the selection of the number of mixture components and of the optimal shrinkage parameter λ, and to accomplish the task a modified BIC is employed; see the referenced works for details.
A method that differs in some way from the general form of (4) is the sparse kmeans proposed by Witten and Tibshirani (2010). Let C = {C 1 , . . . , C g , . . . , C G } be a partition of the observations into disjoint subsets, and let w = (w 1 , . . . , w j , . . . , w J ) be a vector of weights for each variable. Clustering and variable selection is performed by solving the optimization problem: wherex j is the sample mean of variable j and s is a tuning parameter. The L 2 penalty will force the weights in the interval [0, 1], while the L 1 norm is used to shrink the values to 0. The weight w j is interpreted as the contribution of X j to the resulting clustering: a large value of w j indicates a variable useful for clustering, while w j = 0 means that variable j does not contribute to the clustering. Note that the above objective function is related to the log-likelihood of a Gaussian mixture model with a common isotropic covariance matrix of the form Σ g = σ 2 I, an implicit assumption of the method. Indeed, in this exposition we only considered the case associated to Gaussian mixture modeling and we adapted equation (10) of Witten and Tibshirani (2010) to the purpose, using the connection of k-means to the fitting of mixture of Gaussians with common spherical covariance matrix (see the cited work for additional details). Furthermore, the method needs the number of clusters G to be known in advance. However, the approach is particularly suitable for high-dimensional data and the authors describe a more general framework for sparse clustering. Related work on sparse k-means is in Sun, Wang and Fang (2012), where consistency properties are investigated.

Model selection approaches
Within this class of approaches, the problem of selecting the relevant clustering variables is recast as a model selection problem. Different models are specified by the role attributed to the variables in connection to their relation with the clustering variable z. Then these models are compared by means of a model selection criterion and relevant clustering variables are chosen accordingly to the best model. The framework was pioneered by the work of Raftery and Dean (2006), where three main roles are specified for the variables. The authors propose a procedure where X is partitioned into the following subsets of variables: • X C , the set of current clustering variables; • X P , the variable(s) proposed to be added or removed from the set of clustering variables; • X N C , the set of other variables not relevant for clustering.
Then the decision for inclusion or exclusion of X P is taken by comparing the models (Fig 2): In model M A , X P is useful for clustering and the joint distribution p(X C , X P | z) corresponds to a Gaussian mixture distribution; on the other hand, M B states that X P does not depend on the clustering z and the conditional distribution p(X P | X C ) corresponds to a linear regression. An important feature of the framework formulation is that in M B the irrelevant variables are not required to be independent of the clustering variables. This allows to discard redundant variables related to the clustering ones but not to the clustering itself. Another important characteristic of the model construction lies in the avoidance of the unrealistic assumption of the independence between clustering and irrelevant variables. The competing models are compared using the BIC approximation to their marginal likelihoods: where BIC clust (X C , X P ) is the BIC of a GMM in which X P adds useful information to the clustering, BIC no clust (X C ) is the BIC of the GMM on the current set of clustering variables and BIC reg (X P | X C ) is the BIC of the regression of X P on X C . Then if the difference BIC A − BIC B is greater than zero, X P is added to the set of clustering variables. To perform the selection, variables are added/removed and different models are compared using a stepwise algorithm.
The two models compared in Maugis, Celeux and Martin-Magniette (2009a). Raftery and Dean (2006), however, in the conditional distribution p(X P | X C ), X P can be related only to a subset X R ⊆ X C of the clustering variables. Therefore, in the regression only a subset X R of predictors are used to describe the dependency between X P and X C . In this way, the authors avoid the inclusion of unneeded parameters that would over-penalize the log-likelihood with the consequence of favoring models that declare some irrelevant variables as relevant when model comparison is performed using the BIC. The models compared in Maugis, Celeux and Martin-Magniette (2009a) are depicted in Fig. 3, with M A the same as previously stated and M C defined as follows: with p(X P | X R ⊆ X C ) the regression term of X P on the relevant predictors X R . For this model the BIC is given by: with BIC reg (X P | X R ⊆ X C ) the BIC of the regression of X P on X C after selection of the optimal set of predictors X R . Again, to decide if X P is relevant for clustering, the difference BIC A − BIC C is computed and X P is added to the set of clustering ones if this quantity is greater than zero. Model search is performed using a backward search and for selecting the relevant predictors in the regression the authors propose a standard stepwise procedure. The framework is subsequently expanded in Maugis, Celeux and Martin-Magniette (2009b), where the authors consider an additional role for the variables proposed to be added or removed. They explicitly account for the case where X P can be independent of the clustering variables X C . The assumption leads to further circumvent the over-penalization problem of the log-likelihood when parsimonious Gaussian mixture models are involved in the comparison and penalized likelihood criteria are used for model selection. The authors suggest a more flexible framework where three different models are compared, all represented in Fig. 4. Models M A and M C are as before, while model M D is specified as follows: with p(X P ) the density of a multivariate Gaussian distribution. Then, the corresponding BIC is given by: where BIC indep (X P ) is the BIC of a multivariate Gaussian model. For an efficient model search, the authors suggest to rewrite BIC D as: whereX R is allowed to be the empty set ∅ and thus BIC reg (X P |X R ⊆ X C ) = BIC indep (X P ) whenX R = ∅. As in Maugis, Celeux and Martin-Magniette (2009a), the difference BIC A − BIC * D is computed at each step of the algorithm and X P is added to X C if this quantity is positive. In the algorithm the variables are added and removed using a stepwise searching method and relevant predictors are selected by a backward stepwise algorithm. Maugis-Rabusseau, Martin-Magniette and Pelletier (2012) extend the variable selection framework to handle data with missing values without resorting to any imputation procedure.
Note that the mentioned methods not only perform the selection of the relevant clustering variables, but at the same time they select the optimal number of clusters and the best parsimonious Gaussian clustering model from the family of models of Celeux and Govaert (1995); we refer to the cited works for details. Lastly, it is worth mentioning two further extensions of the above modeling frameworks: Scrucca (2016) proposes the use of genetic algorithms for searching over the whole model space to overcome the sub-optimality of a stepwise search, while Galimberti, Manisi and Soffritti (2017) suggest a framework where the relevant variables can provide information about more than one clustering structure of interest.
In the above approaches, at each step of the variable selection algorithm, the likelihood of a GMM has to be optimized several times. To deal with this D   Fig 4: The three different models presented in Maugis, Celeux and Martin-Magniette (2009b).
computational issue, Marbac and Sedki (2017a) propose a procedure that relies on the ICL criterion and does not require multiple calls of the EM algorithm.
In the framework, a variable is declared as irrelevant to the clustering if its onedimensional marginal distributions are equal between the mixture components. The authors introduce a binary indicator variable ω = (ω 1 , . . . , ω j , . . . , ω J ) such that ω j = 1 if variable X j is relevant, 0 otherwise. In this context we would denote X C = {X j ∈ X : ω j = 1, ∀ j} and X N C = {X j ∈ X : ω j = 0, ∀ j}, and different models are specified by different configurations of the binary vector ω. These models are compared by means of a criterion based on the following integrated complete-data likelihood: where Ω denotes the collection composed of the parameters related to the mixture distribution for X C and the parameters related to the distribution for X N C . After assuming local and global independence among the variables and placing conjugate priors on the parameters, the above integral reduces to: where all the quantities have closed analytical expressions (see authors' paper for more details). For a non relevant variable, p(X j | ω j , z) = p(X j | ω j ) since this quantity does not depend on the data classification, while for a relevant one this quantity will assume different values over the mixture components. Consequently, variable selection is performed by finding the vector ω that maximizes the Maximum Integrated Complete-data Likelihood (MICL) criterion, given by: with z * ω = argmax z log p(X, z | ω). Maximization of MICL(ω) is carried out iteratively using an algorithm that alternates between two optimizations steps of the ICL: optimization on the classification z given the data and ω, and maximization on ω given the data and the classification z. As in Raftery and Dean (2006), Maugis, Celeux and Martin-Magniette (2009a) and Maugis, Celeux and Martin-Magniette (2009b), the algorithm also returns the optimal number of clusters and we refer to the paper for details. The approach is fast and scales well for problems with a large number of variables. However, the optimization on z can be computationally demanding for large sample sizes; the authors suggest that for sample sizes smaller than 10 4 this optimization is still doable. Furthermore, in a situation where variables are correlated and many redundant ones are present, the local and global independence assumptions could be too restrictive.

Other approaches
Other variable selection methods that do not univocally fit in the previous sections have been developed in the literature. Within a wrapper approach, Dy and Brodley (2004) propose to embed in the EM a forward selection algorithm for the maximization of two simple alternative criteria for variable selection. The first is based on a measure of separability and unimodality of the clusters, and consists in the quantity tr(S −1 W S B ), where S W is the within-cluster scatter matrix and S B is the between class scatter matrix. The second is based on the quantification of how well the model fits the data and the natural choice is the likelihood of a GMM itself.
A related method is the one of Andrews and McNicholas (2014). The authors suggest an hybrid filter-wrapper approach based on a variable within-group variance, given by: First, an initial estimate of W j is found by running a preliminary clustering step and the variables are listed in ascending order according to their values. The top variable in the list is automatically selected (as the one with the minimum W j ) and placed in the set of selected variables S. Subsequently, a variable j is with ρ rj the correlation between X j and X r , and c a coefficient used to weight the within group variation. The authors suggest to use the integer values from 1 to 5 and the procedure will tend to include more variables in the model as c increases. Lastly, clustering is performed by fitting a GMM on the final set S. Note that before running the procedure the data need to be standardized to have mean 0 and variance 1. Another method using the concept of separability and unimodality of the clusters as in Dy and Brodley (2004) is the one found in Lee and Li (2012). Here GMMs are used to estimate the density of the data and the variable selection is based on a cluster separability measure defined using ridgelines, one-dimensional parametric curves linking the modes of two clusters as function of their densities. Let y gk (a t ), be the set of points individuating the ridgeline between cluster g and k (a 0 = 0 < a 1 < · · · < a T = 1, t = 0, . . . , T ); the pairwise separability between these clusters is measured by: The aim is to find the set of variables that achieves the maximum aggregated separability across all the pairs of clusters and the authors suggest two alternative forward selection algorithms for the task. The method tends to select variables indicating well-separated clusters.
A simple filter approach used to pre-select the variables and reduce the dimensionality of the data is the one suggested by McLachlan, Bean and Peel (2002), within the context of clustering microarray expression data. In this approach, first a univariate mixture model is fitted to each variable in turn. Then the likelihood ratio statistic for the test of a single normal distribution versus a mixture of Gaussians is computed and those variables for which this statistic is significant are retained. The selected variables are subsequently clustered using a k-means procedure in order to find those centroids that represent subsets of variables. Model-based clustering is consequently performed on these representative centroids, which represent the data in a lower dimensional subspace.
The stepwise algorithm presented in Maugis, Celeux and Martin-Magniette (2009b) could be very slow in high-dimensional settings. To overcome the computational burden, Celeux et al. (2017) suggest an hybrid approach where a LASSO-like procedure and the model selection algorithm are employed together. The approach consists of two subsequent steps. First, the method of Zhou, Pan and Shen (2009) is applied to the data (see Section 4.2) and a ranking of the variables is obtained. For fixed G and a set of different combinations L λ1 × L λ2 of values (λ 1 , λ 2 ), the clustering score of each variable X j is defined as:
The larger the value of O G (j), the more a variable is likely to be relevant for the clustering. Then the variables are ranked in decreasing order according to this quantity; denote this rank as I G . Ideally with ν 1 , ν 2 , m 0 , M 0 and s 0 some hyperparameters. For values of ν 1 < 1, the prior shrinks all the component means of variable j towards the common value b 0j , with the effect that variables uninformative for the clustering are effectively fit by a single mixture component. Note that this approach could be interpreted as an hybrid form of Bayesian and penalized variable selection procedures.
Although not directly related to the Gaussian mixture modeling framework, it is worth to mention two other approaches that can be thought of a crossover between Bayesian and penalization methods. The first is the work from Hoff (2006), where it is proposed a Bayesian framework for finding subsets of variables that distinguish the clusters from each other. Thus, every cluster is characterized by its own set of discriminating variables that differentiates it from the rest of the clusters. In the method, each data point is modeled as: whereμ j is the overall data mean of variable j, ϕ gj is a binary variable indicating if X j is "active" on cluster g and ij is an error component such that ij ∼ N (0,σ 2 j ), withσ 2 j the overall data variance of X j . The model results in a multivariate Dirichlet process mixture model and the use of conjugate priors allows the implementation of a Gibbs sampling scheme for inference. Rather than performing a genuine variable selection, the method identifies the cluster-specific relevant variables by detecting the shifts in means and variances from the corresponding overall data quantities. The approach belongs to the class of subspace clustering methods and is related to the work of Friedman and Meulman (2004). The second is the one proposed by Nia and Davison (2015) and is closely related. In this case, the authors consider the following linear model for the data points: x gij =μ j + γ j ϕ gj µ gj + gij , where x gij is the data of clustering observation i measured on variable j in cluster g; γ j and ϕ gj are binary variables such that γ j ∼ Bernoulli(ν) and ϕ gj ∼ Bernoulli(η) respectively. Probability ν may be interpreted as the prior proportion of relevant variables, while η as the prior proportion of non-overlapping component means for a clustering variable. Two families of prior distributions are suggested for µ gj : Gaussian and asymmetric Laplace. In both cases, the marginal distribution of the observations is given in closed form and results in a spike and slab density in which the data is modeled by a mixture of two densities. When (γ j = 1, ϕ gj = 1), the variable is active and the data is modeled by the slab density which drives the clustering procedure when a variable is relevant for clustering. When γ j = 0 or ϕ gj = 0, the density of the data corresponds to the spike density, which reduces the effect of non-informative variables. Variables not important for any of the clusters can be discarded. The setting allows the use of Bayes factors for computing the importance of each variable and perform the selection, and clustering of the data is achieved via an agglomerative hierarchical procedure.

R packages and data example
The available R packages for variable selection for Gaussian mixture models are: sparcl (Witten and Tibshirani, 2013), clustvarsel (Scrucca and Raftery, 2015), VarSelLCM (Marbac and Sedki, 2017b), vscc (Andrews and McNicholas, 2013), SelvarMix , and bclust (Nia and Davison, 2012). Table 1 lists them, with information about the method and the type of approach. It is also worth to mention the package ClustOfVar (Chavent et al., 2012). Rather than performing variable selection and obtain a classification of the data points, this package aims to find clusters of variables linked to a central synthetic variable obtained from a principal components decomposition of the data; for this reason, it will not be considered in the subsequent analysis. Lastly, we also point to the C++ softwares SelvarClust 3 and SelvarClustIndep 4 , which implement the modeling framework of Raftery and Dean (2006) and Maugis, Celeux and Martin-Magniette (2009a,b). We will  (2014) bclust other/Bayesian Davison (2015, 2012) not consider these software in the following analysis, since the aim of this paper is in reviewing R packages. The adjusted Rand index (ARI, Hubert and Arabie, 1985) will be employed to assess and compare the clustering performance of the packages. Let us consider two different partitions of the N data points, one into G clusters and the other into K clusters; this index is defined as: where N gk is the number of observations falling in cluster g and cluster k, while N g. = k N gk and N .k = g N gk . The ARI corrects the comparison for the fact that two partitions could match merely by chance (Hubert and Arabie, 1985). The index has a maximum value of 1 and attains this value when there is perfect matching between the two partitions; on the other hand, it has expected value of 0 when the two partitions are completely independent. Therefore, the index is used to measure the agreement between two different classifications of the data. We apply the different variable selection methods to data from the Human Mortality Database (Human Mortality Database, 2017). The database contains life tables from several countries, spanning a period of time from the middle of the 18 th century to 2015. We consider central mortality rates for both genders over 10-year time intervals and 5-year age intervals. The central mortality rate n m x is interpreted as the average 10-year interval rate at which people die during the period between age x and x + n, normalized by the number of those living. The data contain 389 age patterns of mortality, each over a 10-year period time for 24 age intervals from 0 to 110+ ( Figure 5). The aim is to cluster the mortality schedules and individuate those age groups relevant for the clustering structure. There are 27 non overlapping 10-year periods, some present multiple times (especially recent ones), and the patterns of mortality tend to cluster according to the historical period (Clark and Sharrow, 2011). To assess the quality of the clustering, we will compare the estimated partition with the information regarding the 10-year period each schedule belongs to using the ARI. First we fit a Gaussian mixture model using the mclust package. The model selected is a 3-component mixture, with an ARI of 0.13 between the estimated classification and the historical period. Then we apply the variable selection methods listed in Table 1. Figure 6 displays the selected age groups for each method, alongside the chosen number of clusters and the ARI between the estimated classification and the historical period. Note that for sparcl the number of components needs to be set in advance and we fixed it to the number selected by mclust. Package bclust did not discard any variables, most likely due to the fact that adjacent age groups are highly correlated and the independence assumption of this method is unrealistic in this case. Also package vscc did not remove any of the variables. Package VarSelLCM retained the smallest subset of variables. However, the package obtained an ARI of zero and surprisingly selected only the oldest age intervals, those expected to have the least information, thus it seems to have detected a spurious solution. We ran the package multiple times and the other best solutions found were a model with 3 clusters and the same subset of variables, and a model with only one cluster and no variables discarded. The package seems to be sensitive to the initialization and, also in this case, this is an indication that the strong independence assumptions at the basis of the method implemented could be too restrictive. The other packages, sparcl, SelvarMix, and clustvarsel discarded age groups representing the old population. Among these, clustvarsel performed the most parsimonious selection, retaining only 11 age groups in the range from 0 to 49 years old, and with the highest ARI. The clusters are discriminated along the young and middle ages, as it would be expected since for elder ages the differences among death rates level off. Table 2 reports the cross-tabulation between the historical periods and the clustvarsel partition on the selected age intervals. The clusters have a meaningful interpretation in terms of the time dimension. Figure 7 shows the estimated cluster means on the age groups selected by clustvarsel. The mclust estimated cluster means are added for comparison. In particular, the mortality patterns of clusters 1, 3 and 4 estimated on the selected variables are close to the average patterns estimated on all the data. The extra cluster 2 seems to capture the mortality pattern related to the period corresponding to the two World wars. Moreover, cluster 5 has a mean rate similar to cluster 3, but higher death rate for age intervals above 20 years old and captures the extra-heterogeneity present in the more recent years. Table 3 reports the computing time of the packages considered in Table 1 on a Dell machine with Intel Core i7-3770 CPU @3.40GHz×8. The packages are listed from the fastest to the slowest. vscc is the fastest among the packages considered, but it did not discard any of the age intervals. Package sparcl is the second fastest. In the case of highly correlated variables, clustvarsel performs the most parsimonious selection and obtains the best classification, with an acceptable computing time. Note, however, that in using clustvarsel we considered only 4 out of the 14 covariance models available, as some of them may require a long computational time to be fitted and to consider a set of models consistent with SelvarMix. Therefore, aside from vscc, sparcl is the fastest and as such it is particularly suitable to perform variable selection in  high-dimensional settings where it is reasonable to assume local independence among the variables. Nevertheless, the package requires the number of clusters to be specified in advance, while clustvarsel and SelvarMix automatically select such number. bclust and VarSelLCM are the slowest in this example, and care may need to be taken when using these packages on highly correlated data, since they make use of independence assumptions that could be too restrictive. Lastly, we point that clustvarsel, SelvarMix and VarSelLCM allow for parallelization of the computations.

Variable selection methods for latent class analysis
After the description of the main variable selection methods for GMMs, in this section we present the different approaches for variable selection in latent class analysis.

Bayesian approaches
Only recently, attention has been posed on the use of the Bayesian framework for variable selection in latent class analysis. The overall setting is similar to the one provided in Section 4.1 and we refer there for its description. Moreover, generally the methods borrow ideas from Bayesian variable selection for Gaussian mixture models. Indeed, Silvestre, Cardoso and Figueiredo (2015) propose an adaptation of Law, Figueiredo and Jain (2004) method to categorical data clustering and variable selection. The mixture of Gaussian densities is replaced by the mixture in (3) and the authors use the same concept of feature saliency. Analogously, a Dirichlet distribution is assumed on the saliences and an EM algorithm for MAP estimation is employed.
White, Wyse and Murphy (2016) present an approach inspired by the work of Tadesse, Sha and Vannucci (2005). The authors propose the of use a collapsed Gibbs sampler algorithm in which all the mixture parameters are integrated out. A prior of the form p(ϕ | η) = j η ϕj (1 − η) 1−ϕj is assumed for the indicator variable ϕ. Then, assuming conjugate priors on all the parameters, it is possible to marginalize them out analytically, obtaining the posterior: with Ω denoting the collection of parameters of the latent class analysis model and the parameters of the distribution for the irrelevant variables. The above quantity has a closed form expression and allows to implement a simple MCMC scheme. Subsequently, inference is performed using a post-hoc procedure. Variable selection and selection of G is carried out jointly, by calculating the proportion of time the sampler spent in a certain number of groups and including a certain variable. The authors employ the method on different datasets and it is shown to outperform a more involved reversible jump MCMC algorithm for variable selection.

Penalization approaches
Despite of the large amount of work developed for penalized model-based clustering with GMMs (confront Section 4.2), little work has been done in the context of LCA. Additionally, more attention has been given to the aim of model identifiability and regularization, rather than to the purpose of variable selection per se.
Within this framework enters the work of Houseman, Coull and Betensky (2006). The authors propose a penalization approach for latent class regression analysis of high-dimensional genomic data where a set of categorical response variables is related to a set of numerical covariates. Here the class-conditional probabilities π g in (3) are expressed as function of a vector of covariates and a set of regression coefficients. Then a ridge or LASSO penalty is used in order to obtain sparse estimates for the vectors of regression coefficients. Consequently, by discarding those predictors with coefficients shrunken to zero, the clusters would be characterized by different sets of selected covariates used to model the categorical responses. However, variable selection is not performed on the variables directly involved in the clustering.
Wu (2013) presents a related approach, but this time focusing on the selection of variables involved in the clustering. In the method, the class conditional probabilities π gjc of observing category c for variable j within class g are reparameterized in terms of the logit transform as: Then, similarly to (4) in Section 4.2, the following penalized log-likelihood is considered: where α and β g are the collections of parameters α jc and β gjc respectively, and we made explicit the dependence on them of the Multinomial probability mass function. For a variable X j , the quantities β gjc measure the difference from the overall probabilities exp(α jc )/ Cj l=1 exp(α jl ) and its contribution to the clustering. Consequently, if G g=1 Cj c=1 | β gjc | = 0, variable X j is considered not influent to the classification and discarded. As for GMMs, the estimation in this setting is carried out by means of a penalized EM algorithm and the optimal λ and the number of clusters G are selected by BIC.

Model selection approaches
Model selection approaches for variable selection in LCA draw from the work developed for GMMs. In Dean and Raftery (2010) the authors suggest a framework similar to the one presented in Raftery and Dean (2006). As in section 4.3, the following partition of X is considered: • X C , the set of current clustering variables; • X P , the variable proposed to be added or removed from the set of clustering variables; • X N C , the set of variables not relevant for clustering.
Then the authors propose to compare the two models represented in Fig. 8. Model M A states that X P is a clustering variable and there is no edge between X C and X P due to the local independence assumption of LCA that implies p(X C , X P | z) = p(X C | z)p(X P | z). In model M B , X P is not useful for clustering (the missing edge between z and X P ) and is assumed to be independent from the set X C . Under these models, the joint distribution of X is expressed as: with p(X C , X P | z) the LCA model on on the clustering variables and X P , p(X P ) the Multinomial distribution and p(X N C | X C , X P ) the distribution for the non-informative variables. The two models are compared via the BIC approximations: B   Fig 8: The two models presented in Dean and Raftery (2010).
where BIC clust (X C , X P ) is the BIC of a LCA model in which X P is a relevant clustering variable, BIC no clust (X C ) is the BIC of the LCA model on the current clustering variables and BIC(X P ) is the BIC for a Multinomial distribution fit. Then X P is considered useful for clustering if the difference BIC A − BIC B is greater than zero. To search through the model space, the authors propose a forward headlong search algorithm (Badsberg, 1992) consisting of inclusion and removal steps. The algorithm has the advantage of being more computationally efficient than a backward search, but the disadvantage of being sensitive to the initialization of the set of clustering variables.
To deal with the problems of multimodality of the LCA log-likelihood and the sensitivity to the initialization, Bartolucci, Montanari and Pandolfi (2016) propose to add an extra step in the variable selection procedure of Dean and Raftery (2010). They consider a random check step aimed at initializing the estimation algorithm with a large number of random starting values, so as to prevent the problem of incurring in local optima. In an application to nursing home evaluation, the authors also perform a sensitivity analysis of the selected variables with respect to the initialization of the set of the clustering variables. The extra step is shown to be beneficial in increasing the chances of finding the global optimum.
A major drawback of Dean and Raftery (2010) framework is the independence assumption between the proposed variable and the set of clustering ones in model M B . In fact, because of this assumption, the model does not take into account that the proposed variable could be redundant for clustering given the set of already selected relevant variables. This results in the method not being capable of discarding variables related to the clustering ones, but not directly related to the group structure itself. To overcome the problem, Fop, Smart and Murphy (2017) propose the modeling framework depicted in Fig. 9. Following Maugis, Celeux and Martin-Magniette (2009a,b) (consult Section 4.3), in model M C the proposed variable is not relevant for clustering and it is assumed to be dependent on the clustering variables X C . Model M A is as before, while under Fig 9: The models proposed in Fop, Smart and Murphy (2017).
model M C the density of X is given by: where the conditional distribution p(X P | X R ⊆ X C ) is modeled using a multinomial logistic regression. Only a subset X R ⊆ X C of relevant predictors enter in the regression, avoiding the inclusion of extra parameters without necessarily increasing the likelihood. Moreover the subset X R is allowed to be the empty set, and in the case of X R = ∅ the Dean and Raftery (2010) method is automatically recovered. For M C the BIC is given by: with BIC reg (X P | X R ⊆ X C ) the BIC of the multinomial logistic regression after the selection of the set X R , accomplished using a simple backward stepwise search, similarly to Maugis, Celeux and Martin-Magniette (2009a,b). Models are compared by means of the difference BIC A − BIC C and X P is added to X C if this quantity is greater than zero. To perform the model search, the authors suggest a backward swap-stepwise selection algorithm where standard removal and inclusion steps are alternated to a swap step. In this step two different configurations of model M C are compared and they differ in the fact that one clustering variable is swapped with one of the non-clustering variables. The extra step allows the algorithm to not incur in local optima in the case of highly correlated variables. An application to the clustering of patients suffering low back pain shows that the method is capable of performing a parsimonious variable selection with a good classification performance and interpretable results. As in the case of GMMs, both Dean and Raftery (2010) and Fop, Smart and Murphy (2017) methods return the best clustering variables and the optimal number of latent classes; see the cited works for details. Toussile and Gassiat (2009) present an approach where the modeling problem of simultaneously selecting the number of latent classes and the relevant variables is considered in a density estimation framework. Let S be the set of clustering variables and let M (G,S) denote a model of probability distributions such that: with p(X C ) denoting a LCA model on the relevant variables and p(X N C ) the Multinomial distribution model for the non-clustering variables. The aim is to select the couple (G, S), which defines the model that generated the data. The selection procedure is implemented using an algorithm that first finds the optimal subset S for a range of possible values of G and then selects the optimal G for a fixed set of clustering variables. The authors employ penalized log-likelihood criteria for model selection and consistency results for BIC type criteria are derived. A further generalization and discussion is found in Bontemps and Toussile (2013), where they derive a criterion that minimizes a risk function based on the Kullback-Leibler divergence of the estimated density with respect to the true density. The general form of this penalized log-likelihood model selection criterion is given by: with * (G,S) the maximized log-likelihood under model M(G, S), D (G,S) indicating the dimension of the model and κ is a parameter depending on the sample size and the collection of the models. The authors suggest the use of slope heuristics (see Baudry, Maugis and Michel, 2012, for an overview) to calibrate the value of κ. Marbac and Sedki (2017c) extend the approach of Marbac and Sedki (2017a) (see Section 4.3) to clustering and variable selection of data of mixed type, with data on only categorical variables as a particular case. The method is based again on the MICL criterion and does not require multiple calls of the EM algorithm. In addition, the approach can manage situations where data have missing values. However, global and local independence need to be assumed to compute the MICL, and optimization is carried out over the variable indicator ω and the cluster membership indicator z.

Other approaches
As for GMMs, other methods have been suggested for variable selection in LCA. Zhang and Ip (2014) present a filter method where the absolute and relative importance of a variable towards the clustering structure is assessed by means of two measures. One is the expected posterior gradient, which is a measure of the relative change between the entropy of the prior distribution of the latent classes and the entropy of the posterior distribution of the latent classes after observing the data. This quantity is bounded between 0 and 2 and higher values indicate higher discriminative power of a variable. The other measure is the Kolmogorov variation of posterior distribution, which is based on the Kolmogorov distance between the class-conditional distributions. The measure is linked to the classification accuracy and the total variation of the posterior distribution, and, if there is a substantial reduction in this quantity without variable X j in the model, then X j can be considered a relevant clustering variable. Both measures are presented in the context of LCA for mixed type data, and the observation of only categorical data is a particular case.
Another filter method where the selection is carried out by evaluating the impact of a variable on the clustering solution is proposed in Bartolucci, Montanari and Pandolfi (2017). Here, the initial set of variables is assumed to provide an optimal clustering of the data and the objective is to select a minimum subset of variables that is sufficient to reproduce the inferred classification. To achieve this, the authors suggest a selection algorithm that removes the variables whose presence does not significantly impact the clustering. Suppose a LCA model has been fitted to the data and the estimated posterior probabilitiesû ig (consult Section 3) have been obtained. In the procedure, for each variable j, the proportion F −j of observations whose class assignment is modified when X j is removed with respect to the initial clustering is computed. By (3), the updated estimateû (−j) ig after removing X j is simply given by: Therefore: Then the variable with the minimum F −j is removed. If some variables have the same value of F −j , the variable to be removed is the one with the minimum Kullback-Leibler distance: The method is developed in connection to item selection for questionnaires and is fast and simple to implement. However, it requires the somewhat strong assumption that the inferred classification and the number of latent classes does not change when removing the variables. It is worth to mention an approach that lies outside the modeling framework of LCA and is closely related to the work of Hoff (2006). Hoff (2005) presents a method for subset clustering of binary sequences where each latent class is characterized by its own set of discriminating variables that differentiates it from the rest of the classes. Let ϕ gj be a binary variable indicating the relevance of variable j to cluster g. The author suggests to parameterize the class conditional probability of occurrence of X j as: whereπ gj is the probability that X j = 1 within class g and π j is the probability of observing variable X j in the data. Thus, if ϕ gj is active, the corresponding variable is relevant and the π gj differs from the overall data value. The framework involves a Pólya urn scheme for the parameters and the indicator variables and results in a Dirichlet process mixture model. Inference is performed using MCMC and allows to recover the subset of informative variables of each latent class.

R packages and data example
The R packages for variable selection for latent class analysis are ClustMMDD (Toussile, 2016), LCAvarsel  and VarSelLCM (Marbac and Sedki, 2017b). In particular, we note that VarSelLCM implements a more general framework for clustering and variable selection of data of mixed type (Marbac and Sedki, 2017c). Table 4 lists the packages, with information regarding the type and the method; all implement a model selection approach.
In this section we consider the uscongress data available in the UCI repository at the web page https://archive.ics.uci.edu/ml/datasets/congressional+voting+ records. The data contain votes of N = 435 members of the U.S. House of Representatives, 267 Democrats and 168 Republicans. Each record expresses the position of a member on 16 key votes regarding major issues selected by the Congressional Quarterly Almanac (1984); the votes are presented in Table 5. A vote can take three possible outcomes: y (yea), if in favor of a specific matter, n (nay), if against, or u (unknown), if the position is unknown. A graphical representation of the data is in Figure 10. The observations are classified into two groups corresponding to the main parties, but it is reasonable to expect the presence of more than two classes in the data because of internal subdivisions. Indeed, at the time the Democratic party was split into Northern and Southern Democrats, and a large portion of the second often voted in agreement with the Republican party (Congressional Quarterly Almanac, 1984).
First, we fit a LCA model on all the variables, determining the number of components using BIC. The selected model is a 4-class model and Table 6 contains a cross-tabulation of the estimated classification and the party membership. Class 1 and Class 2 are fairly polarized into Democrats and Republicans. Class 3 is predominantly characterized by Democrats, while Class 4 contains an equal amount of representatives from both sides.   Then we apply the variable selection procedures ClustMMDD-bic (Toussile and Gassiat, 2009), ClustMMDD (Bontemps and Toussile, 2013), LCAvarsel-ind (Dean and Raftery, 2010), LCAvarsel (Fop, Smart and Murphy, 2017) and VarSelLCM (Marbac and Sedki, 2017a,c). The variables selected by each method are displayed in Figure 11. Also the chosen number of latent classes and the ARI between the estimated classification and the party affiliation are reported. VarSelLCM does not discard any variable in this example. The method underlying the package makes use of both local and global independence assumptions, and they are too restrictive for this type of data. ClustMMDD and LCAvarsel-ind  select the same model and discard 2 variables. ClustMMDD discards 3 variables and selects a 3-component model. Note that the common discarded variables 2 and 10 are votes with close call results, and thus likely uninformative with respect to the two parties or any clustering structure. LCAvarsel performs a more parsimonious selection, choosing a model with 4 latent classes and retaining 10 variables, but attaining a lower ARI. Figure 12 displays the class-conditional probabilities of the outcomes for the votes selected by ClustMMDD-bic, ClustMMDD, LCAvarsel-ind and LCAvarsel; we did not include VarSelLCM, as it discarded none of the variables. We focus the attention on the LCAvarsel result, as it is the package performing the most parsimonious selection. Table 7 reports a tabulation of the estimated classification and the party affiliation, with an interpretation similar to Table 6. Figure 12 (d) suggests that Class 1 and 2 are denoted by very polarized outcomes and opposite voting positions. Class 3 includes members who likely expressed a vote not loyal to their party line. Class 4 is characterized by an higher probability of an unknown position regarding the selected key votes. Table 8 reports the computing time of the packages listed in Table 4 on a Dell machine with Intel Core i7-3770 CPU @3.40GHz×8. The methods are listed from the fastest to the slowest. LCAvarsel-ind is the fastest, followed by VarSelLCM, although the last did not discard any of the variables. LCAvarsel is slower, but still with an acceptable computing time compared to the two. Methods ClustMMDD and ClustMMDD-bic are the slowest in this example and have the same computing time, since they implement the same selection procedure. The algorithm at the basis of package ClustMMDD performs a more extensive search than the greedy ones implemented in LCAvarsel and VarSelLCM, however at a larger computational cost. Lastly, it is worth to notice that packages LCAvarsel and VarSelLCM can implement parallel computations. cused on variable selection methods for Gaussian mixture models and latent class analysis, the most common model-based clustering approaches. Illustrative data examples have been used to show some of the methods in action and we provided references to the available R packages implementing them. The datasets and the R code used for the analyses is available at the web page https://github.com/michaelfop. We conclude with some final remarks and comments.

Distributional assumptions
Assumptions concerning the dependence among relevant and irrelevant variables play a crucial role in the variable selection procedure. The global independence assumption simplifies the model for the joint distribution of irrelevant and relevant variables. Directly or indirectly, Bayesian and penalization approaches make use of this assumption, resulting in frameworks with a simpler association structure and that allow to discard uninformative variables. However, in the case where the variables are highly correlated, the global independence assumption can prevent to accomplish a parsimonious selection, in which also redundant variables are discarded (Law, Figueiredo and Jain, 2004;Tadesse, Sha and Vannucci, 2005;Raftery and Dean, 2006). On the other hand, model selection methods allow to depict a flexible and realistic structure for the relations among relevant, redundant and uninformative variables, but at the cost of a more complex model for the joint distribution of relevant and irrelevant variables. We remark that, when using a variable selection method, the tradeoff between model complexity and selection performance needs to be taken into account. The local independence assumption is commonly used in latent variable models, especially in the case of multivariate categorical data clustering. This assumption notably simplifies the model for the joint distribution of the relevant clustering variables (Bartholomew, Knott and Moustaki, 2011). However, it can hardly hold in practice and several approaches in the literature have been proposed to overcome it. For example, Gollini and Murphy (2014) propose a setting with continuous latent variables that allow to model the dependences among the observed categorical variables; Marbac, Biernacki and Vandewalle (2015) develop a framework where the categorical variables are grouped into blocks, each one following a specific distribution that takes into account the dependency between variables. Extending these frameworks to allow for variable selection would be interest of future research.
In this review, we mainly focused on Gaussian and Multinomial mixture models for clustering of continuous and categorical data. Variable selection for model-based clustering with other distributions and/or of data of different nature is still an open research area. The topic has only recently started to attract attention: Wallace et al. (2017) propose a framework for clustering and variable selection with the multivariate skew-Normal distribution, while Marbac and Sedki (2017c) define a method for data of mixed types. There is certainly a wide scope for further investigations and developments in this direction.

Computational aspects
Solving a variable selection problem is a task that requires a noticeable computational cost. For fixed number of mixture components, in general there are 2 J possible combinations of variables that could be considered as relevant clustering variables (Miller, 2002). The problem is worsened when the variable selection is concomitant to the problem of mixture components selection and model estimation. As already noted, filter approaches keep the two tasks separated, since the variable selection procedure is executed before or after estimation of the clustering model. Wrapper approaches combine model estimation, variable selection and number of components determination; hence, usually multiple models need to be estimated for different combinations of clustering variables and number of components. For these reasons, the firsts are usually faster, while the seconds are computationally expensive, but often give superior results (Guyon and Elisseeff, 2003). To speed up the computations, a filter method could be employed as a preliminary step to reduce the number of variables, then the wrapper method could be used on the reduced set of variables.
Regarding the statistical approach used to perform the variable selection, the various methods present different computational characteristics and issues. Bayesian methods provide a solid ground for uncertainty evaluation of the variable selection process. However, the MCMC schemes employed within this class of methods often require runs with a large number of iterations to explore the enormous space and ensure convergence (Tadesse, Sha and Vannucci, 2005, in-deed, all the works examined in this review considered a number of iterations in the order of 10 5 − 10 6 ). Unfortunately, the MCMC algorithms used do not allow parallelization of the computations in order to speed up the process. Model selection approaches usually implement greedy stepwise algorithms that have a general computational complexity of O(KJ 2 ), hence they become rapidly impractical as the number of variables and mixture components increases. If the number of variable is large, and is expected that only a small number of variables J 0 J are relevant, forward-type algorithms can be applied, thus reducing the complexity to O(KJJ 0 ). However, in this case the difficult issue of how to initialize the algorithm and the set of clustering variables arises.
Backward-type procedures do not require initialization of the clustering set, but can be very computationally demanding if the actual set of clustering variables is small. Headlong strategies could be employed to mitigate the problem (Badsberg, 1992). Nonetheless, the advantage of these stepwise methods is that they can be implemented in parallel, saving computational time. Penalization approaches are usually faster than Bayesian and model selection methods. In fact, the computational complexity of these approaches mainly depends on the form of the penalty function adopted and, in most cases, analytical solution to the optimization problem or efficient numerical procedures are already available. For this reason, they tend to scale particularly well to high-dimensional settings and have been proven to perform well also in the case of data with thousands of variables. Nevertheless, these methods do not allow for parallel computations.
Given the constant increase of the data dimensionality, the development of efficient algorithms allowing for scalable model-based clustering and variable selection is a relevant research topic that is likely to attract considerable attention in the future.

Missing data
Missing observations are a common issue in many data analysis problems. Among the works examined in this review, only few present methods for modelbased clustering and variable selection in the presence of missing data. Maugis-Rabusseau, Martin-Magniette and Pelletier (2012) extend the model selection method of Maugis, Celeux and Martin-Magniette (2009b) with the aim of avoiding a preliminary imputation procedure. The authors consider the case of missing values generated under the missing at random mechanism (Little and Rubin, 2002, MAR). Under this assumption, the missing responses are ignorable for likelihood-based inference. Hence, the framework and the model selection criterion can be stated in terms of only the observed data and the imputation process is avoided. Bartolucci, Montanari and Pandolfi (2016) use similar arguments for clustering categorical data via the latent class analysis model. The same MAR assumption is also considered in Marbac and Sedki (2017c), where the MICL criterion used for variable selection is computed considering only the observed entries. In application to data related to the quality-of-life of elderly patients hosted in nursing homes, Bartolucci, Montanari and Pandolfi (2017) move away from the MAR assumption and suggest to add an extra category corresponding to the missing responses, circumventing the assumption of ignorability of the missing data.
In general, in the literature multiple works exist for mixture model estimation with missing data, see for example: McLachlan and Peel (2000); Little and Rubin (2002); Hunt and Jorgensen (2003);Formann (2007) ;Chen, Prentice and Wang (2014). These approaches could be incorporated in a general framework for model-based clustering and variable selection for data with missing entries and may be interest of future developments.

Software
The last remark regards software availability. Despite the quantity of theory and methods developed, not many R packages for variable selection in model-based clustering are available. In particular, there is lack of packages implementing the various penalization methods for GMMs, especially useful for clustering highdimensional data. This is somewhat surprising given the practical importance of the topic.