Tandem clustering with invariant coordinate selection

For multivariate data, tandem clustering is a well-known technique aiming to improve cluster identification through initial dimension reduction. Nevertheless, the usual approach using principal component analysis (PCA) has been criticized for focusing solely on inertia so that the first components do not necessarily retain the structure of interest for clustering. To address this limitation, a new tandem clustering approach based on invariant coordinate selection (ICS) is proposed. By jointly diagonalizing two scatter matrices, ICS is designed to find structure in the data while providing affine invariant components. Certain theoretical results have been previously derived and guarantee that under some elliptical mixture models, the group structure can be highlighted on a subset of the first and/or last components. However, ICS has garnered minimal attention within the context of clustering. Two challenges associated with ICS include choosing the pair of scatter matrices and selecting the components to retain. For effective clustering purposes, it is demonstrated that the best scatter pairs consist of one scatter matrix capturing the within-cluster structure and another capturing the global structure. For the former, local shape or pairwise scatters are of great interest, as is the minimum covariance determinant (MCD) estimator based on a carefully chosen subset size that is smaller than usual. The performance of ICS as a dimension reduction method is evaluated in terms of preserving the cluster structure in the data. In an extensive simulation study and empirical applications with benchmark data sets, various combinations of scatter matrices as well as component selection criteria are compared in situations with and without outliers. Overall, the new approach of tandem clustering with ICS shows promising results and clearly outperforms the PCA-based approach.


Introduction
Clustering or cluster analysis is the task of finding groups of observations in a data set.Clustering problems arise in various application fields such as marketing for customer segmentation or in biostatistics for analysing gene expression data.In the present paper, our primary focus is on hard clustering, where observations are divided into subsets referred to as clusters.These clusters form a partition of the data set with each observation belonging exclusively to one cluster.As highlighted, for instance, in Hennig et al. (2015, p. 2), the definition of a cluster is neither unique nor precise.Consequently, numerous clustering algorithms have been developed and studied.Their usefulness and relevance depend on factors such as the context, the data type, and the objectives of the data analyst.Several books and surveys delve into this subject, e.g., Kaufman & Rousseeuw (1990), Xu & Wunsch (2005), Aggarwal & Reddy (2014), and Hennig et al. (2015).
The commonly accepted objective in clustering is to group similar observations together within the same cluster, while dissimilar ones should be assigned to different clusters.Among the many existing approaches, common clustering methods are either distance-based, model-based, or density-based.In the present paper, we focus on centroid-based clustering techniques which belong to the distance-based category, as well as clustering through Gaussian mixture models, which belong to the model-based category.The data in consideration consist of observations characterized by numerical variables or features.
When the cluster structure lies in a low-dimensional latent subspace, or when some variables are irrelevant for identifying the cluster structure, a preprocessing step like feature selection or dimension reduction can be beneficial.Tandem analysis, as termed by Arabie & Hubert (1994), is a well-known method involving principal component analysis (PCA) followed by clustering on the selected principal components.However, several authors including Chang (1983), Soete & Carroll (1994), Vichi & Kiers (2001), and Radojičić et al. (2021) warn against this approach, as it may mask the cluster structure of the data.The explanation is that PCA focuses on conserving inertia, linked to data variability, and does not necessarily extract the relevant information for clustering.Hence, many papers propose alternatives to tandem analysis (see, e.g., Ding & Li, 2007;Vichi & Saporta, 2009;Timmerman et al., 2013;van de Velden et al., 2019a;and references therein).
In the present paper, we propose a new tandem clustering approach in which we substitute PCA with invariant coordinate selection (ICS), as proposed by Tyler et al. (2009).ICS serves as a dimension reduction method akin to PCA, involving the computation and selection of linear combinations of the original features.While PCA is derived from the eigendecomposition of a single scatter matrix, ICS relies on the simultaneous diagonalization of two scatter matrices (see, e.g., Nordhausen & Ruiz-Gazen, 2022).Unlike PCA, ICS can extract components that are pertinent for clustering.More precisely, Tyler et al. (2009) show that for a location mixture of elliptical distributions and any pair of scatter matrices, ICS can identify Fisher's linear discriminant subspace without knowledge of the class labels.Additionally, ICS is affine invariant while PCA is merely orthogonally invariant.
The use of ICS as a preprocessing step for clustering is not new.ICS and its earlier versions have been recognized as a method for dimension reduction preceding group identification in Art et al. (1982), Caussinus & Ruiz (1990), Caussinus & Ruiz-Gazen (1993, 1995, 2007), Gnanadesikan et al. (1993), Tyler et al. (2009), Hennig (2009), Peña et al. (2010), Fekri & Ruiz-Gazen (2015), Alashwali & Kent (2016), and Fischer et al. (2017and Fischer et al. ( , 2020)).However, there is no in-depth study of ICS in combination with clustering, and the present paper aims to fill this gap.Following the study by Archimbaud et al. (2018) on ICS for outlier detection, we first highlight the relevance of ICS when the cluster structure lies in a low-dimensional subspace.We then compare several pairs of scatter matrices and discuss the selection of invariant coordinates.It is important to note that, akin to the standard PCA approach, ICS is not well-suited to high-dimensional data.In this article, we assume a scenario where the number of observations is large compared to the number of variables, and the number of variables is moderate.
In the context of clustering, ICS can be compared to linear discriminant analysis (LDA) but in an unsupervised situation where the groups are unknown.When the groups are known, and assuming similar cluster shapes, it becomes possible to compute within-group and between-group scatter matrices.LDA (see Mardia et al., 1982) is essentially a joint diagonalization of these specific scatter matrices, equivalent to the joint diagonalization of within-group and total covariance matrices.Keeping the analogy with LDA, in situations where the groups are unknown, we anticipate ICS yielding valuable results when preprocessing data before clustering if we jointly diagonalize a scatter matrix close to the within-group covariance matrix and another scatter close to the total covariance matrix.Regarding the within-group matrix, this paper further explores local scatter matrices (Hennig, 2009), weighted pairwise scatter matrices (Caussinus & Ruiz, 1990;Tyler et al., 2009), and minimum covariance determinant (Rousseeuw, 1985) scatter matrices based on a small percentage of observations.Interestingly, Kettenring (2006) claims that cluster analysis and LDA are "at the opposite ends of a spectrum".While no information is available about the number or the types of clusters in clustering, the number of groups and their content is known in LDA.Following this idea, ICS is positioned in the middle of the spectrum, making assumptions about the type of clusters (ideally Gaussian distributions with different means and the same covariance matrices), but lacking information about the number of groups and how the observations are distributed among the groups.Similar to LDA, some theoretical properties of ICS can be derived for mixtures of Gaussian distributions with different means and equal covariance matrices (see Tyler et al., 2009;Archimbaud et al., 2018), and we predominantly focus on this model throughout the paper.However, theoretical properties of ICS extend to other types of mixture types (see Archimbaud, 2018), and in practice, ICS proves useful beyond this specific framework, as demonstrated in artificial and empirical examples.
The paper is organized as follows.In Section 2, we examine a mixture of Gaussian distributions with different means and equal covariance matrix.We derive a property at the population level when the number of variables increases while the number of groups is kept fixed.In this scenario where the dimension of the affine subspace containing the cluster means decreases relative to the total dimension, we highlight the difficulty of identifying clusters when using the total or the within-group Mahalanobis distances.In contrast, the dimension reduction induced by ICS allows for solving the problem.Section 3 provides a detailed presentation of ICS, along with the definition of several scatter matrices relevant in a clustering context, as well as a description of component selection criteria.Section 4 presents results from an extensive simulation study for Gaussian mixtures with two, three, and five groups.We compare a variety of dimension reduction strategies using ICS or PCA followed by some centroidbased and model-based clustering algorithms.Section 5 illustrates the use of tandem clustering with ICS in three empirical examples, and Section 6 concludes the paper and discusses further perspectives.

Highlighting the relevance of ICS in case of low-dimensional cluster structure
Consider the following probability model introduced in Caussinus & Ruiz-Gazen (1993).Let X be a d-variate real random vector distributed according to the following mixture distribution: where the probability measure P is concentrated on a unknown l-dimensional linear manifold with l < d, and N (x, Σ W ) denotes the d-dimensional normal distribution with mean x and covariance Σ W .That is, the mixing probability P corresponds to the unknown structure of interest.An example of such a model is the case where P is a mixture of Dirac probability measures at points µ 1 , . . ., µ q .This example is a finite mixture of Gaussian distributions with different means and equal covariance matrices, with l equal to the dimension of the affine subspace which contains the mean vectors µ 1 , . . ., µ q .In this paper, we focus on the case where l is small compared to d. Considering the Mahalanobis distances, we present a result for the case of a Gaussian mixture distribution with different means and equal covariance matrices.The result is analogous to Proposition 1 in Archimbaud et al. (2018) but tailored to the clustering context rather than outlier detection.The number of groups q remains constant as the overall dimension d increases to infinity, which means that the intrinsic dimension l is small compared to the original dimension d in model ( 1).In such a context, ICS is particularly suitable for data reduction before clustering, as it allows to transform the data from the d-dimensional space into the l-dimensional subspace associated with the structure of interest.
The question of generalizing the kmeans algorithm to the use of the Mahalanobis distance instead of the Euclidean distance has been studied in the literature for a long time.Gnanadesikan et al. (1993) provide an example featuring three groups in three dimensions where the Euclidean distance proves inappropriate, in contrast to the Mahalanobis distance based on the withingroup dispersion.The latter accounts for the elliptical nature of the groups (i.e., correlations).Another advantage of using the Mahalanobis distance is its invariance under affine transformations of the data, eliminating the need for variable standardization before clustering (see Cerioli, 2005, for more details).However, the use of the Mahalanobis distance is not the most common choice when applying a generalized variant of kmeans.The primary challenge lies in adapting the kmeans algorithm to incorporate the within-group covariance matrix (see Cerioli, 2005;Lapidot, 2018;and references therein).In our proposed tandem approach, Euclidean distances computed on the ICS components represent Mahalanobis distances with respect to one of the scatter matrices on the original variables.Opting for a scatter matrix close to the within-group scatter matrix allows the consideration of elliptical clusters without modifying the kmeans algorithm.
Let X be a d-variate real random vector distributed according to a mixture of q Gaussian distributions with q < d distinct location parameters µ 1 , . . ., µ q , and the same positive definite covariance matrix Σ W : where ϵ 1 , . . ., ϵ q are mixture weights with ϵ 1 +• • •+ϵ q = 1.Let Σ denote the total covariance matrix of X.We now examine the squared pairwise Mahalanobis distances between two random vectors X and Y with the Gaussian mixture distribution defined in (2): These distances are affine invariant, meaning that r 2 (AX + b, AY + b) = r 2 (X, Y) holds for any full rank d × d matrix A and any d-dimensional vector b.
To discern groups, it is crucial that the distance between two independent random vectors X h and Y h ′ from two different groups h ̸ = h ′ is stochastically larger than the distance between two independent random vectors X h and Y h from the same group h.However, we can prove that asymptotically, the variance of the difference between these distances increases with the dimension d when the number of groups q is held constant, making cluster identification increasingly challenging.
Then under model (2), the expressions converge in distribution to standard Gaussian distributions as d goes to infinity with some constants c and c W > 0.Moreover, the expectations of the differences of the Mahalanobis distances remain bounded as d goes to infinity.
The proof is similar to the proof of Proposition 1 in Archimbaud et al. (2018) and is not detailed further.The key element of the proof lies in the fixed dimension l ≤ q − 1 of the subspace of interest (where q denotes the number of groups), while the total number of dimensions d grows to infinity.This implies that the number of dimensions d − l ≥ d − q + 1 in which there is no cluster structure is also increasing to infinity.In these dimensions, the expectation of the difference of distances remains bounded as d goes to infinity, whereas its variance inflates and masks the difference in expectation in the subspace of interest.If the relevant subspace of dimension l is known, the issue arising from the d − l irrelevant dimensions can be addressed by projecting the data set onto that subspace and computing distances based only on the relevant l dimensions.This is exactly what ICS is all about, offering the data-analyst the capability to identify a subspace revealing the clusters (in an unsupervised way), and to project the data onto it.Note that Proposition 1 describes a property at the distribution level for d → ∞, while this paper primarily focuses on the sample setting with d < n.At the sample level, it is still possible to satisfy d < n if d increases to infinity as long as n increases at a faster rate.
To illustrate the above discussion, Figure 1 highlights the utility of dimension reduction for kmeans clustering in two dimensions.The example features two groups: 850 observations generated from the distribution N d=2 (0, I d ) and 150 observations from N d=2 ((10, 0) ⊤ , I d ).The distinction between the groups is apparent in the first dimension, while the second dimension carries no structure.
We compare the outcomes of the standard kmeans algorithm applied to (i) the standardized data (variables Std X 1 and Std X 2 ), (ii) the first component of PCA (PC 1 ) using the sample correlation matrix, and (iii) the first component of ICS (IC 1 ) using the COV − COV 4 pair of scatter matrices (see Section 3.1).Figure 1 comprises 9 plots.The first row presents scatterplots of the standardized data (left), the PCA components (center), and the ICS components (right).While ICS successfully captures the dimension of interest on the first component, this is not the case for PCA.The middle row of Figure 1 visualizes pairwise Euclidean distances between the observations.For the standardized data (left), distinguishing between distances of observations within the same group and those from different groups is not feasible, which is consistent with the result of Proposition 1.For PCA and ICS, pairwise distances are computed on PC 1 (center) and IC 1 (right), respectively.Only for IC 1 do we observe a discernible gap between the two types of distances.The bottom row of Figure 1 presents scatterplots of the standardized data with observations colored based on different kmeans clusterings.The two groups are not correctly identified when kmeans is applied to the original standardized data (left) or PC 1 (center), while they are accurately recovered for IC 1 (right).

Invariant coordinate selection
In Section 3.1, the principle of ICS and various scatter matrices are detailed.Section 3.2 discusses ICS in a clustering context, while Section 3.3 introduces criteria for selecting which invariant coordinates to keep.

Scatter matrices and the principle of ICS
ICS was initially introduced as generalized PCA in Caussinus & Ruiz (1990), Caussinus & Ruiz-Gazen (1993, 1995, 2007), Caussinus et al. (2003), and received its current name in Tyler et al. (2009), where many of its properties were also derived.The main ingredients of ICS are two scatter matrices.In this context, a scatter matrix is any holds for all d-dimensional vectors b and all full rank d × d matrices A. Here, X denotes a d-variate random vector with cumulative distribution function (CDF) F X .For a sample X n = (x 1 , . . ., x n ) ⊤ , the scatter functional is applied to the empirical CDF F Xn and referred to as a scatter matrix.
The literature is abundant with suggestions for scatter matrices, with the most popular being the sample covariance matrix where xn = 1 n n i=1 x i is the sample mean.Other popular proposals originate from the robustness literature, such as the M-estimator of scatter based on the likelihood of the Cauchy distribution (Kent & Tyler, 1991), denoted by MLC(X n ).It can be computed using the simultaneous estimation equations One of the most widely-used robust scatter matrices is the minimum covariance determinant estimator (MCD) (Rousseeuw, 1985).For a tuning parameter α, the MCD selects out of the n observations those n α = ⌈αn⌉ observations x i1 , . . ., x in α for which the sample covariance matrix has the smallest determinant, i.e., where xα,n is the sample mean of the selected set of observations and c α is a consistency factor.In the robustness literature, α is typically chosen from the interval [0.5, 1].For a value of α close to 0.5, the MCD α has a high breakdown point but low efficiency.Therefore, it is often combined with a reweighting step to increase efficiency, in which case the the MCD α is sometimes referred to as the raw MCD.Here, we denote the reweighted version by RMCD α .For more details about MCD α and RMCD α , see for example Hubert & Debruyne (2010), Cator &Lopuhaä (2012), andHubert et al. (2018).
On the other hand, there are scatter matrices for which robustness is not the primary focus, such as the matrix of fourth moments Several other scatter matrices have been proposed specifically in the context of ICS.For a positive, decreasing weight function w and a tuning constant β > 0, a one-step M-estimator of scatter can be computed as follows: , where r 2 (x i ) is again the squared Mahalanobis distance.Closely related to SCOV β (X n ) is the estimator based on pairwise differences and therefore does not require a location estimator.Both estimators are studied in great detail in Caussinus & Ruiz-Gazen (1993, 1995).Ruiz-Gazen (1996) suggests in this context to consider also For these scatter matrices, we use the weight function w(x) = exp(−x/2), and we set β = 2 for TCOV and β = 0.2 for UCOV.
The last scatter matrix we introduce is the local shape matrix LCOV V 0,β suggested by Hennig (2009).For LCOV V 0 ,β , a scatter V 0 is computed first, and the distance matrix of the corresponding pairwise Mahalanobis distances is calculated.Then, for each observation, the sample covariance matrix is computed based on the n β = ⌈βn⌉ nearest neighbors.All n covariance matrices are subsequently averaged, for which they first need to be appropriately scaled to be comparable.We standardize them in such a way that they all have determinant one.Note that this is not a scatter in our original sense but only a shape matrix (Nordhausen & Tyler, 2015), which suffices for ICS.We use LCOV COV,0.1 , but one could also use LCOV MCD0.8,0.1 as in Hennig (2009).
In a clustering framework, TCOV and LCOV can be seen as estimators of the within-group covariance matrix (cf. Figure B.13).An alternative estimator proposed by Art et al. (1982) and Gnanadesikan et al. (1993) is based on the smallest within-group pairwise differences of observations.However, it will not be considered further, as the algorithm is iterative and involves a graphical aid to determine the number of pairwise differences to consider.
A general result is that all scatter matrices are proportional to each other at the population level (if they exist) for exchangeable sign-symmetric distributed random vectors, which includes all elliptical distributed random vectors (Nordhausen & Tyler, 2015).This implies that for such models, when whitening the data (e.g., Nordhausen et al., 2008) or performing principal component analysis (PCA), it does not matter which scatter matrix and location estimator are used, as the results differ only in scale.Similarly, Tyler (2010) showed that a comparison of two scatter matrices is only meaningful if the sample size n is larger than the dimension d, making this a requirement for ICS.
Given a sufficient sample size, in a clustering framework assuming for example a Gaussian mixture model, different scatter matrices will measure different population quantities.The idea behind ICS is to exploit these differences.An ICS matrix W (X n ) is a matrix that simultaneously diagonalizes two scatter matrices V 1 (X n ) and V 2 (X n ) in the following way: ICS can be interpreted as a transformation where X n is whitened with respect to V 1 , and then a subsequent PCA is performed with respect to V 2 to see what structure is still left after the whitening with V 1 .ICS is usually computed via a generalized eigenvalue decomposition (see Tyler et al., 2009).
The so-called invariant coordinates are then obtained as where 1 n denotes an n-variate vector full of ones, and T (X n ) denotes a location estimator, usually the one that goes along with V 1 .In case V 1 has no natural accompanying location, any location estimator can be used for the purpose of centering, or centering can be omitted when computing the ICs.The name of ICS is motivated by a property of the (centered) invariant coordinates, which are often also called invariant components or simply ICs.These ICs are invariant in the sense that for any linear transformation of X n , i.e., where A and b are as above and J is a d × d sign-change matrix, i.e., a diagonal matrix with ±1 on its diagonal.Thus, if ICS is performed on the same data but measured in different coordinates systems, the resulting ICs will differ at most by their signs.Note that in the following, we will drop the dependency of the matrices on X n in the notation when the context is clear.
Not only do the values λ 1 , . . ., λ d fix the order of the components, but they also have a concrete interpretation as a measure of kurtosis for the corresponding components in the sense of V 1 and V 2 , since they correspond to the ratio of two scale measures, i.e., where w is a d-dimensional vector.If V 1 = COV and V 2 = COV 4 , this measure relates to the classical kurtosis measure of Pearson (Nordhausen et al., 2011).Tyler et al. (2009) show then that the maximal and minimal kurtosis in the sense of V 1 and V 2 that can be obtained for X n w ⊤ is captured by λ 1 and λ d , respectively.

ICS as preprocessing step for clustering
While ICS has diverse applications, including its use in a transformationretransformation framework, outlier detection, and as a method for independent component analysis (Tyler et al., 2009;Nordhausen et al., 2011;Archimbaud et al., 2018;Nordhausen et al., 2008), our focus is on studying ICS within the context of clustering.
The most common framework of clustering are location mixture models.In scenarios where class labels are known, a natural approach would be to maximize the ratio of the between-group and within-group covariance matrices to obtain Fisher's linear discriminant subspace.However, ICS addresses situations where class labels are unknown.The objective is for two scatter matrices, which are in general not proportional to each other in this model, to capture this ratio.Indeed, Tyler et al. (2009) proved that in an elliptical mixture model where the q populations have proportional scatter matrices and different means, ICS will generally produce at most q − 1 values among λ 1 , . . ., λ d which differ from the remaining non-informative values, which are all equal and denoted λ 0 .In the context of ICS as a preprocessing step for clustering, the goal is to use only the invariant coordinates spanning Fisher's linear discriminant subspace and discard other coordinates.We seek the l × d matrix W DS containing rows of W corresponding to values of interest in λ 1 . . ., λ d , i.e., those with λ i ̸ = λ 0 , i ∈ {1, . . ., d}.The problem is that the values of λ 0 and l ≤ q − 1 depend on the distributions of the groups, their mixture weights, and the selected scatters V 1 and V 2 .Unlike PCA where the first few PCs are typically of interest, the interesting ICs can be the first, last, or a combination thereof.Assuming V 1 is more robust than V 2 , it is expected, heuristically speaking and based on the generalized kurtosis interpretation, that groups with different sizes will exhibit large V 1 -V 2 kurtosis values, while groups with similar group sizes will have small V 1 -V 2 kurtosis values.This implies that large and small clusters will be found in the different extremes (e.g., Peña et al., 2010).
It is well established that there is no general best scatter combination for ICS, but that some combinations are better suited for certain applications than others.For example, Peña et al. (2010) consider the combination COV − COV 4 , known as FOBI (Cardoso, 1989;Nordhausen & Virta, 2019), to reveal clusters.However, this combination is highly sensitive to extreme observations leading Archimbaud et al. (2018) to recommend it for outlier detection.This combination is also of interest due to being completely moment-based and therefore easy to study.For example, one can show that in a Gaussian mixture framework, the value of the non-interesting (Gaussian) components is λ 0 = 1.For a mixture of two Gaussians with the same covariance matrix, it can be shown that if the mixture weight for the smaller cluster is (3 + √ 3) −1 ≈ 21%, then the kurtosis of the mixture is the same as for a pure Gaussian component, independent of the difference in locations.Therefore, the generalized eigenvalue of such a component equals 1 (see, e.g., Tyler et al., 2009).
When using ICS as a preprocessing step for clustering, the general motivation is that one scatter matrix should measure within-group scatter and the other between-group scatter (or equivalently, total scatter).Therefore, it seems natural to use one scatter matrix to estimate local structure and another to estimate global structure.For instance, Hennig (2009) suggests LCOV for the former.Robust scatter matrices could also be an interesting option to capture the dispersion of a single cluster.Specifically, for MCD α , it might be worth considering values of α that are smaller than 0.5 for this purpose.
While different scatter combinations have been considered for ICS in a clustering framework, a thorough comparison is still lacking in the literature.To conclude this part, we emphasize that ICS is performed in an unsupervised manner without using class label information.If such information is available, one of the scatter matrices involved in ICS can use this information, converting it to a supervised scatter matrix, which leads to supervised invariant coordinate selection (SICS).Many supervised dimension reduction methods, such as linear discriminant analysis (LDA), sliced inverse regression (SIR), and sliced average variance estimation (SAVE) can be seen as special cases of SICS (Liski et al., 2014).

Selection of the invariant coordinates to retain
Another topic often ignored in the literature is the selection of the ICs that should be retained for clustering.When the number of groups q is known, theoretical properties of ICS suggest that the number of invariant coordinates to retain is typically less than or equal to q − 1.In what follows, we compare several criteria for the selection of invariant coordinates.Two of the criteria that we consider further are based on the assumption of a known q, while one criterion operates without such an assumption.In a specific data analysis, the most natural approach might be to look at the scatterplot matrix of the invariant coordinates and at the generalized eigenvalues to determine which l < d invariant coordinates carry relevant information for clustering.However, for a more formal approach, we discuss potential rules below.
Assume the data come from a Gaussian mixture model with equal covariance matrices and q < d mixture components.In this case, non-interesting coordinates follow a Gaussian distribution whereas the informative ones are non-Gaussian.This setup aligns with the framework of non-Gaussian component analysis (NGCA), aiming to identify a non-Gaussian subspace.Nordhausen et al. (2017Nordhausen et al. ( , 2022) ) show that COV − COV4 serves as an NGCA method recovering this subspace.They derive asymptotic and bootstrap tests for the number of non-Gaussian coordinates, which can be used in a successive testing strategy to obtain an estimate for l.The key concept is that all Gaussian coordinates have a generalized eigenvalue of 1, resulting in small variance among these generalized eigenvalues.Resampling-based estimators of l for this scatter combination are considered in Luo & Li (2016, 2021).These tests, incorporating a joint estimation strategy, were extended to a resampling framework for all scatter combinations in an NGCA setting in Radojičić & Nordhausen (2020).The challenge lies in the unknown value of the non-interesting (Gaussian) generalized eigenvalues, except for their shared generalized eigenvalue.The idea in Radojičić & Nordhausen (2020) is to go for a specific l 0 through all combinations of l 0 neighboring generalized eigenvalues to choose the ones with minimal variance, and to use bootstrapping to decide if assuming equality is reasonable.However, bootstrapping is quite computationally expensive in this context.
If the number of clusters q is assumed known, one rule could be to find the d − q + 1 generalized eigenvalues with the smallest variance and keep the components belonging to the non-included generalized eigenvalues.This rule is referred to as the var criterion.While initially outlined in an NGCA framework, this approach can also be applied in a more general framework of elliptical mixtures, where again the generalized eigenvalues of the purely elliptical part will be equal, and the generalized eigenvalues of components containing cluster information will differ.
A simpler criterion than the var criterion within the framework of an elliptical mixture model is to assume q ≤ d/2, which implies that the majority of the generalized eigenvalues should be equal.Then a rule could be to choose the q − 1 components whose generalized eigenvalues deviate the most from the median of all generalized eigenvalues.This approach is referred to as the med criterion.
In a Gaussian mixture model, a simple approach is to apply marginal tests for normality and select all non-normal components, which we call accordingly the normal criterion.This approach proved successful for outlier detection via ICS in Archimbaud et al. (2018).Note that since those authors work in an outlier detection framework, they look for non-Gaussian components only among the first components.In a clustering context, it is natural to consider the first and last components.
To summarize the section on ICS, it is essential to highlight that the key characteristic of ICS as a preprocessing step for clustering is that the components are related to generalized kurtosis measures.Extreme generalized kurtosis values serve as indicators of potential interesting structures for clustering.This motivation does not strictly rely on the assumption of an underlying elliptical location-mixture model but can be argued for other models with group structures.Nevertheless, the theory and many of the component selection methods discussed above rely on a Gaussian mixture model, one of the most popular clustering models.Additionally, ICS is most meaningful when the clustering information is embedded in a lower dimensional subspace, and the resulting dimension reduction is independent of the subsequent clustering method applied.
In the following simulation study, we apply some popular clustering methods after component selection using ICS, although, in general, any other clustering method could be used.This makes ICS a general tool for dimension reduction in a clustering context, distinguishing it from numerous subspace estimation methods tailored for specific clustering methods, see for instance Soete & Carroll (1994); Vichi & Kiers (2001); Bouveyron et al. (2019).

Simulation study
In order to investigate the performance of tandem clustering with ICS and PCA, we perform simulations.The simulation design is described in Section 4.1, and the main results are presented in Section 4.2.Additional findings are discussed in Section 4.3, while Section 4.4 provides a brief summary and discusses some limitations of the simulation study.

Simulation design
Focusing on a data generating process in which the cluster structure lies in a low-dimensional subspace, we compare the performance of ICS and PCA with respect to reducing the dimensionality of the data while keeping the relevant structure for clustering.We study the impact of the scatter pair and the component selection criterion in different cluster settings and in the presence of outliers.For assessing the performance of the methods, we compute η 2 = 1 − Λ, where Λ denotes Wilks' lambda.That is, we compute where E is the within-group sum of squares and cross-products matrix and T is the total sum of squares and cross-products matrix.The η 2 is a measure of discriminatory power commonly used in discriminant analysis (see, e.g., McLachlan, 1992; as well as Todorov, 2007, for a robust version of Wilks' lambda).It is in the interval [0, 1], with a higher value indicating better discriminatory power.
After dimension reduction, we apply clustering methods to the obtained components to confirm the results on the discriminatory power.Our aim is not to compare clustering methods, hence for the most part we limit our study to five well-known methods: partitioning around medoids (PAM) (Kaufman & Rousseeuw, 1990), kmeans (Hartigan & Wong, 1979), tkmeans (a trimmed version of kmeans that identifies a specified proportion of observations as outliers) (Cuesta-Albertos et al., 1997), as well as model-based clustering with Gaussian mixtures (Fraley & Raftery, 2002) without allowing for noise (mclust) and with allowing for noise (rmclust).These clustering methods take the number of clusters k as an input, which we set equal to the true number of clusters q to keep the simulations simple.As an evaluation measure, we compute the adjusted Rand index (ARI) (Hubert & Arabie, 1985), which has a maximum value of 1 in the case of perfect clustering, while a value of 0 is expected for a random cluster assignment.For comparison, we also apply the clustering methods to the simulated data without dimension reduction, in which case we standardize the data using the mean and standard deviation before kmeans, and we employ robust standardization using the median and median absolute deviation (MAD) before PAM and tkmeans.
We generate n = 1000 observations on d = 10 variables from the model in (2) for different numbers of clusters q ∈ {2, 3, 5} with Σ W = I d , µ 1 = 0, and µ h+1 = δe h , h = 1, . . ., q − 1, where δ = 10 and e h is a d-dimensional vector with one in the h-th coordinate and zero elsewhere.In this setting, the cluster structure lies in a low-dimensional subspace of dimension l = q − 1.We thereby consider 22 different combinations of the mixture weights ϵ 1 , . . ., ϵ q (see Appendix A for details).
In addition to a baseline setting without outliers, we consider settings in which 2% and 5% of the observations are replaced by outliers.We draw the outliers from a uniform distribution on a large hyperrectangle with a central cutout in the shape of a smaller hyperrectangle around the original data.The length of each side of the smaller hyperrectangle is given by the range of the original data in the respective variable, while the corresponding side of the larger hyperrectangle is twice as long.Note that we treat outliers as an additional group in the computation of the η 2 and the ARI.
Concerning the selection of components, we use four criteria for ICS (see Section 3.3 for details): the med criterion and the var criterion (which select k − 1 components), the normal criterion using the D'Agostino test of skewness (D'Agostino, 1970) at a 5% significance level (which is not restricted to selecting a certain number of components), and an oracle criterion selecting the k − 1 components (among the first and last components) that maximize the discriminatory power for the true clusters (as measured by the η 2 ).For PCA, we use two criteria: the 80% criterion (which keeps as many components as necessary to explain at least 80% of the variability in the data), and the k − 1 criterion (which keeps the first k − 1 components).
We simulate 100 data sets for each of the 66 different cluster and outlier settings.Overall, we compare 72 different dimension reduction strategies: 68 for ICS (17 scatter pairs and 4 criteria) and 4 for PCA (2 scatters and 2 criteria).

Main results
Due to the large number of dimension reduction strategies, we only show a selection of the most relevant results here.Section 4.2.1 presents the results for the best-performing methods in terms of discriminatory power of the selected components, while Section 4.2.2 discusses clustering performance of selected clustering methods after dimension reduction.
A detailed discussion on the selected dimension reduction methods and component selection criteria is given in Appendix B.1.For ICS, the best-performing scatter pairs are LCOV −COV, TCOV −COV, as well as the robust scatter pair TCOV − UCOV.Furthermore, we compare the med criterion with the oracle criterion for component selection.The former is a representative selection since there is not much difference in the selection criteria for those best-performing scatter pairs, while the latter is included for reference purposes as it uses information on the true clusters and can therefore not be computed in practice.For PCA, the best-performing method is RMCD 0.75 combined with the 80% criterion for selecting the components.

Discriminatory power of selected components
We first focus on the overall results across the investigated cluster settings.Figure 2 shows boxplots of the discriminatory power of the selected methods for different outlier settings.
With and without outliers, ICS with LCOV − COV and TCOV − COV clearly outperforms PCA with RMCD 0.75 in terms of discriminatory power.This is not surprising, as PCA is not designed to pick up the relevant structure for clustering.Although ICS with the robust scatter pair TCOV − UCOV also clearly outperforms PCA with RMCD 0.75 when there are no outliers or 5% outliers, a conclusion cannot be drawn in the setting with 2% outliers.While the median η 2 is much higher for TCOV−UCOV than for RMCD 0.75 , the former also exhibits many points with a very small η 2 in Figure 2.This behavior is investigated in more detail below in the context of different cluster settings.
As the amount of outliers increases, we observe that the discriminatory power decreases for all methods, together with a slight increase in variability.Nevertheless, the discriminatory power of ICS with LCOV − COV and TCOV − COV remains good in all settings.For PCA, the decrease in discriminatory power is much more pronounced despite using the robust scatter matrix RMCD 0.75 .This indicates that PCA as a preprocessing technique for clustering is much more affected by outliers than ICS with a suitable scatter pair.
Next, we take a look at specific cluster settings to gain further insights.Since we investigate a large number of cluster settings, we present boxplots for a representative selection of cluster settings in Figure 3.The main finding is that TCOV−COV may be preferred over LCOV−COV, as the former has lower variability for two perfectly balanced clusters (50-50).However, both scatter pairs yield very similar and excellent discriminatory power in the investigated cluster settings.Furthermore, note that the med criterion essentially yields the same discriminatory power as the oracle criterion for those two scatter pairs.
Figure 3 also provides more insights into the behavior of the robust scatter pair TCOV − UCOV.Specifically, we observe that this scatter pair leads to similar discriminatory power as LCOV − COV and TCOV − COV, except in the setting with q = 2 highly unbalanced clusters (95-5) and with 2% of outliers.In this specific situation, the discriminatory power considerably drops towards 0 and the variability increases.This is due to an artifact of our simulation design, where the outlyingness structure is likely to be constrained to a low-dimensional subspace if the number of outliers is small.Both TCOV and UCOV treat a small cluster as outliers, meaning that this scatter pair tends to ignore (some of) the relevant structure in such settings.There is of course a philosophical discussion as to where the line is between a group of outliers and a small cluster of observations, but such a discussion is beyond the scope of this paper.
For PCA with RMCD 0.75 , we find much larger variability in settings with two clusters compared to the other cluster settings.Moreover, in all investigated cluster settings, the discriminatory power of PCA with RMCD 0.75 is lower or at best comparable to that of ICS with TCOV − COV and LCOV − COV.

Clustering performance
In order to validate our findings with respect to discriminatory power, we apply five popular and relatively simple clustering methods.To keep computation time low, we apply the clustering methods only for a subset of cluster settings (see Appendix A for details).
Figure 4 displays overall results of the ARI of the selected methods.It contains boxplots across the selected cluster settings for the different clustering methods and outlier settings.Compared to the ARI on the original data without dimension reduction, ICS with LCOV − COV and TCOV − COV improves the performance of PAM, kmeans, tkmeans, and mclust.For PAM and kmeans, there is not much change in the median ARI as it is already near perfect for the observed data, but variability is reduced or there are fewer instances where clustering fails.For tkmeans, we observe a similar effect, taking into account that this method trims away some observations so that the resulting outlier component is reflected accordingly in a somewhat lower ARI.For mclust, ICS yields a substantial improvement in ARI when outliers are present.For rmclust, on the other hand, the ARI is almost perfect for the observed data, as expected in this simulation design with Gaussian clusters and outliers.Nevertheless, there are some instances where rmclust fails.Note that ICS with LCOV − COV or TCOV − COV and a simple clustering method such as kmeans lead to an ARI as high as for rmclust for any situation and never completely breaks down.
As the amount of outliers increases, the median ARI of PAM and kmeans decreases while the median ARI of tkmeans increases accordingly.This behavior is expected and can be observed on the original data and after ICS.Also note that in the setting with 5% outliers, tkmeans on the observed data performs almost perfectly in most instances (but fails in some instances).Keep in mind that for tkmeans, we used the default trimming proportion of 5%.In practice, the trimming proportion needs to be carefully tuned, and the interplay of the trimming proportion and the number of clusters is quite complex (García-Escudero et al., 2011).Hence, this result could be due to the trimming proportion being equal to the true amount of outliers.
Otherwise, PAM, kmeans, and tkmeans after ICS are not much affected by the outliers.This reflects that the discriminatory power of the selected components remains high, even though the aforementioned clustering methods fail in some instances for TCOV−UCOV.This phenomenon is more pronounced for mclust and rmclust, where all scatter pairs yield a considerable number of unsuccessful clustering instances.
PCA with RMCD 0.75 deteriorates the overall performance of all clustering methods compared to clustering the observed data, either by a substantially lower median or a higher variability in ARI.This demonstrates that PCA struggles to capture the relevant structure for clustering.
From the overall results across cluster settings, we can already see that the results for clustering performance reflect the results for discriminatory power very well.A detailed discussion of the clustering performance for the selected cluster settings is therefore moved to Appendix B.2.

Additional findings
As we have seen from the previous section that clustering performance nicely follows the results for the discriminatory power of the selected components, Sections 4.3.1 and 4.3.2focus on discriminatory power using the MCD α − COV and COV − COV 4 scatter pairs, respectively.Furthermore, Section 4.3.3provides a comparison of model-based clustering after ICS with methods that integrate dimension reduction into Gaussian mixture modeling.

Discriminatory power with MCD α − COV scatter pairs
The MCD is typically applied in the context of robust statistics, where a subset size based on the parameter α ≥ 0.5 is used to capture the covariance structure of the majority of the data.However, in the context of ICS as a preprocessing step for clustering, α ≤ 0.5 is better suited to capture the local within-cluster structure.Hence, we focus on the scatter pairs MCD 0.1 − COV, MCD 0.25 − COV, and MCD 0.5 − COV.Details on the selection of these scatter pairs can be found in Appendix B.1.
Figure 5 shows boxplots of the discriminatory power for the different outlier settings and a representative selection of cluster settings.The main finding is that the performance of MCD α − COV scatter pairs in different cluster settings depends on the parameter α.We first focus on the setting without outliers.For q = 2 clusters, for instance, a lower α performs better for balanced clusters but yields larger variability for unbalanced clusters (90-10, 95-5), while a larger α is preferable for unbalanced clusters but yields lower discriminatory power or larger variability for relatively balanced clusters (50-50, 55-45, 60-40).However, as the number of clusters increases, this becomes less of an issue.
Regarding the selection of components, the med criterion yields a performance similar to that of the oracle criterion, except that the aforementioned effect is more pronounced.Specifically, the med criterion fails or exhibits large variability for some cases (50-50 and 55-45 for α = 0.25 and α = 0.5, and additionally 60-40 for α = 0.5).With outliers, the discriminatory power somewhat drops but remains good overall.Otherwise, results remain similar to the setting without outliers.
Overall, the results reveal that ICS with MCD α − COV has potential in a clustering context, but further research is needed.First, since results in different cluster settings depend on the parameter α, it could be interesting to consider α as a tuning parameter to be selected, e.g., together with the number of clusters.Second, the results for the oracle criterion indicate that the scatter pair MCD α − COV succeeds in revealing the relevant structure for clustering, but component selection criteria that can be computed in practice fail in certain cluster settings.

Discriminatory power with the COV − COV 4 scatter pair
As one of the most widely used scatter pairs in ICS, we take a closer look at the performance of COV − COV 4 in a clustering context (see also Peña et al., 2010).Boxplots of the discriminatory power for different cluster and outlier settings are displayed in Figure 6.
We first discuss the setting without outliers.For q = 2 clusters, the results are as expected from the theory explained in Section 3.2.In that case, it is known that the relevant structure is contained in the first component for very unbalanced clusters and in the last component for (somewhat) balanced clusters, while neither the first nor the last component can separate the clusters if the smaller cluster contains about 21% of observations.Indeed, we find that dis- criminatory power is very high except for the cluster settings 75-25 and 80-20, where the method fails.For q = 3 clusters, the method still cannot recover the cluster with 20% of observations (20-50-30, 10-70-20) but the discriminatory power is quite good since the two remaining clusters are highlighted.For q = 5 clusters, discriminatory power is very high throughout even though there are various cluster settings with clusters of size 20%.This is a promising result that needs to be verified from a theoretical point of view in further research.Furthermore, the med criterion is close to the oracle criterion in terms of performance, except in settings where the method struggles either way.
With 2% of outliers, there is a considerable drop in performance for the oracle criterion.But more importantly, the med criterion fails almost completely.In particular, the discriminatory power is close to 0 for q = 2 clusters, while the discriminatory power is very low and variability is large for q = 3 clusters.In these settings, the med criterion fails to pick the components that carry the relevant structure.Surprisingly, the results of the med criterion improve again for 5% of outliers, being more closely aligned again with the results of the oracle criterion.In fact, with 2% of outliers and our sample size, the outlyingness structure is artificially contained in a low-dimensional space.COV − COV 4 is designed to discover such a structure and therefore highlights it on the first component.However, the clusters are visible on the last component, which is not kept with the med criteria based on only k − 1 components.When the percentage of outliers increases, the outlying observations span the entire space so that the selected component(s) again capture the low-dimensional cluster structure, explaining the good results with 5%.4.3.3.Clustering performance for Gaussian mixture modeling with dimension reduction As Sections 4.2.2 and Appendix B.2 already provide a detailed comparison of mclust and rmclust (among other methods) on the observed data as well as after ICS and PCA, we now focus on comparing mclust and rmclust after ICS to methods that integrate dimension reduction into Gaussian mixture modeling.Specifically, we consider the approach of McLachlan et al. (2003) that employs mixtures of factor analyzers (denoted by MFA) and the approach of Raftery & Dean (2006) that incorporates variable selection into Gaussian mixture modeling (denoted by clustvarsel ).Note that Scrucca (2010) also introduces a dimension reduction method for Gaussian mixture modeling, but this approach applies dimension reduction after clustering for visualization purposes.That is, clustering is performed on the observed data rather than in a lower-dimensional subspace, hence this approach is not considered further here.
Figure 7 compares MFA and clustvarsel with mclust and rmclust after ICS with TCOV − COV.The left column displays the ARI across the 10 investigated cluster settings.Without outliers, MFA and clustvarsel perform (almost) perfectly, but also mclust and rmclust after ICS with TCOV − COV yield excellent results.With increasing percentage of outliers, however, the performance of MFA and (to a slightly lesser extent) clustvarsel deteriorates dramatically, most notably manifesting in a large variability in the ARI.On the other hand, rmclust but also mclust after ICS with TCOV − COV remain much more stable in the presence of outliers.In addition to this increase in stability, mclust and rmclust after ICS with TCOV − COV are also orders of magnitude faster to compute, as shown in the right column of Figure 7.
On a final note, we emphasize that our simulation design places the relevant structure for clustering on a subset of the observed variables, which is in line with the assumptions of clustvarsel.When transforming the generated data so that the relevant structure lies in a latent lower-dimensional space, clustvarsel can no longer be expected to succeed in dimension reduction, whereas ICS is invariant to affine transformations.Nevertheless, we did not investigate this further to keep the paper at a reasonable length.

Discussion of the simulation study
The main findings from our simulations are that (i) ICS with TCOV − COV and LCOV − COV performs best for tandem clustering, (ii) the scatter pair MCD α − COV with a value of α ≤ 0.5 could be an interesting alternative when α is treated as a tuning parameter but further research is needed, and (iii) PCA is not a suitable preprocessing step for clustering purposes and yields large variability in clustering performance.
However, there are several limitations of our simulations.First, we only consider one value each for the number of observations n and the number of variables d.We compare a large number of scatter pairs for ICS with different component selection criteria, and our main aim is to study those for different numbers of clusters with various cluster sizes.As such, we had to make some concessions to other parameters to keep the paper concise.Second, we only consider a simple mean shift model for q clusters that are defined in a subspace of dimension q − 1.An important consideration for this choice is that it allows us to use simple clustering methods such as kmeans to validate our findings for the discriminatory power of the selected components.More complex models with, e.g., different covariance structures in different clusters or non-normal distributions are beyond the scope of this paper and are left for further research.
As a short glimpse into the performance of tandem ICS in a non-normal setting, we provide an illustrative example in Appendix B.3 with a simulated data set coming from the so-called barrow wheel distribution.This distribution goes beyond Gaussian mixtures, and it is mentioned in the discussion of Tyler et al. (2009), where the authors noticed that the theoretical properties of ICS are also valid for this distribution (see also Archimbaud, 2018, for more theoretical details).

Empirical applications
To further study the performance of tandem clustering with ICS, we consider three publicly available benchmark datasets.First, the crabs data set (Campbell & Mahon, 1974) contains morphological measurements on crabs of both sexes and from two color-different species.We take log-transformations of all variables, which is common for this data set (e.g.Archimbaud et al., 2023b).Second, the well-known iris data set (Anderson, 1936;Fisher, 1936) gathers measurements of flowers from three different species.Third, the Philips data set contains measurements of a certain part for television sets, collected in 1997 by Philips Mecoma (The Netherlands) and studied by Rousseeuw & Van Driessen (1999) and Hubert et al. (2008) for anomaly detection purposes.In this specific case, the engineers agreed on two clear abnormal phenomena regarding 15% and 11% of the data, which we treat as small clusters.An overview of the dimensions and cluster sizes of the three data sets is given in Table 1.
We focus on the clustering performance on these data sets using the same five clustering methods and the best-performing dimension reduction methods from Section 4, thereby selecting the components using the med, var, and normal Table 1: Overview of the three benchmark data sets, where n denotes the number of observations, d the number of variables, and q the number of true clusters.criteria for ICS, as well as the 80% criterion for PCA.The clustering performance is once more evaluated by the Adjusted Rand Index (ARI) (Hubert & Arabie, 1985).The results are shown in Figure 8, where reference lines indicate the ARI on the initial data without dimension reduction.Even though the data sets are relatively low-dimensional, ICS improves the identification of clusters in most cases, as reflected by an increase in the ARI.As in the simulation results, the med and var criteria lead to similar performance.By contrast, PCA yields roughly the same or a lower ARI in all cases except for mclust and rmclust for the Philips data.Furthermore, the Philips data set is an interesting case where mclust and rmclust yield a lower ARI on the observed data than simple methods such as PAM, kmeans, and tkmeans.Note that the ARI slightly deteriorates in some instances also for ICS, hence the selection of the scatter pair and the component selection criterion is crucial.For a better understanding, we take a deeper look at the components obtained via ICS in order to relate them to the clustering performance.We thereby focus on the performance of kmeans and mclust, as the results are qualitatively similar for kmeans, tkmeans and PAM, as well as for mclust and rmclust.For the crabs data set, the ARI on the initial data is 0.04 for kmeans and 0.37 for mclust.The improvement is substantial for ICS with the scatter pairs TCOV−COV and TCOV−UCOV, each yielding an ARI between 0.78 and 0.89 depending on the criterion used and the clustering method.The MCD α − COV scatter pairs, on the other hand, achieve almost the same performance when α is small but the ARI decreases when α increases.Note that LCOV − COV works well for kmeans but not for mclust.Another interesting fact is that the normal criterion fails to select any components for any of the scatter pairs.To investigate this further, Figure 9 displays the first two and last two components from ICS with the scatter pair TCOV − COV, which clearly shows that the first two components reveal the relevant structure for clustering.The plot also visualizes the nested nature of the q = 4 clusters in the crabs data: there are two species (blue and orange) and two sexes (male and female).ICS highlights the two balanced groups regarding the species on one component (here, the first) and the two balanced clusters regarding the sex on the another component (here, the second).However, for two balanced groups, the D'Agostino test of skewness (D'Agostino, 1970) does not reject the null hypothesis of normality, explaining why the normal criterion results in the selection of no components.The med criterion, on the other hand, is based on the selection of k − 1 = 3 components (where k = 4 denotes the number of clusters specified in the clus- tering algorithm).It selects the first two components, which carry the relevant structure, but also the last component.It is possible that clustering can be further improved by selecting only the first two components.
For the iris data set, the ARI on the initial data is already excellent for mclust with a value of 0.90.However, it is noteworthy that for kmeans, the ARI of 0.62 on the initial data is improved upon with the scatter pairs LCOV − COV, TCOV − COV, TCOV − UCOV, and MCD 0.25 − COV to achieve competitive ARI values between 0.87 and 0.92.For the scatter pairs involving TCOV, the normal criterion selects only the first component, whereas the var criterion keeps the first two components and the med criterion selects the first and the last components, with the former resulting in a much higher ARI. Figure 10 displays the four components from ICS with the scatter pair TCOV − COV, confirming that the relevant structure is indeed contained only in the first component.For LCOV − COV, the var criterion also selects the first two components and med criterion the first and last component, here resulting in an excellent ARI, while the normal criterion does not select any component.It should be noted that the iris data set was used as an example in Tyler et al. (2009), too.Those authors also find that the relevant structure is captured by the first component, but they conjecture that the results are similar "for almost any pair of scatter matrices that we may choose".Our results are an indication that the choice of scatter pair matters more for this data set than initially thought.
For the Philips data set, the ARI on the initial data is rather low with a value of 0.26 for kmeans, but even lower for mclust (0.06).The most striking result in this example is that the ARI is by far the highest after ICS with MCD 0.5 − COV with a value of 0.89 for kmeans, while the other scatter pairs yield only a moderate improvement in the ARI (which is also the case for mclust).Furthermore, the med, var, and normal criteria yield similar results in terms of ARI for all considered scatter pairs.This is surprising as the med and var criteria always select the first two components, whereas the normal criterion selects many more (the first four in the case of MCD 0.5 −COV).Figure 11 indeed confirms that the first two components are necessary to reveal the clusters for the MCD 0.5 − COV scatter pair.Note that for this specific dataset, the clusters were labeled by Philips engineers, after investigation and interpretation, based on Mahalanobis distances calculated via the MCD.This might explain why other scatter pairs do not succeed in the same way as MCD 0.5 − COV to reveal the structure.
Further results for the three considered data sets can be found in Appendix C, where we include methods that integrate dimension reduction into the kmeans algorithm (Soete & Carroll, 1994;Vichi & Kiers, 2001) or into Gaussian mixture modeling (McLachlan et al., 2003;Raftery & Dean, 2006).None of those approaches attain the performance of tandem clustering with ICS.

Conclusions and discussion
Overall, ICS outperforms PCA for tandem clustering.More specifically, the scatter pairs TCOV − COV and LCOV − COV yield the best performance, but also MCD α − COV with a carefully selected parameter α ≤ 0.5 is promising.
However, there is no scatter pair that is uniformly best in every situation.Regarding the component selection criteria, note that we restrict the med criterion to selecting k − 1 components, where k denotes the number of clusters to be specified in the clustering algorithm.The main reason for this restriction is to have a simple, automatic component selection rule for the simulations.In practice, an even smaller subspace may carry the relevant structure, hence the med criterion is not well adapted in this context and may select too many components.On the other hand, the normal criterion can estimate the dimension of the subspace of interest, but the D'Agostino test (D'Agostino, 1970) struggles when there are two balanced groups on one component, which can result in selecting none of the components, or in presence of outliers.Other normality tests may be more suitable.Further research on suitable component selection criteria is necessary, but in practice, researchers and practitioners may also select the components based on a scatterplot matrix of the components or a screeplot of the generalized eigenvalues.A more advanced procedure could be to consider the choice of the invariant coordinates as a tuning parameter of the subsequent clustering method.In this context, the best subspace could be chosen based on an internal validation index such as the silhouette coefficient (Kaufman & Rousseeuw, 1990) or those described in Hennig (2019).
In addition to improving the component selection criteria, the present work has many perspectives, including the treatment of mixed data (Caussinus & Ruiz-Gazen, 2006;van de Velden et al., 2019b) and functional data (Archimbaud et al., 2022), but also the improvement of the ICS calculation algorithm by allowing to take into account (nearly) ill-conditioned data (Archimbaud et al., 2023b), or by refining the research of the components using a projection pursuit algorithm as suggested in Dümbgen et al. (2023).Generalizing ICS to highdimensional data through some regularization approach is another interesting perspective in the context of clustering (see Bouveyron & Brunet-Saumard, 2014, for a review for model-based clustering).In the context of rating-scale data from surveys in the social and behavioral sciences, tandem clustering with ICS may further be useful for detecting different response styles among respondents (e.g.Schoonees et al., 2015).

Computational details
All computations are performed with R version 4.3.1 (R Core Team, 2023).Our proposed approach for tandem clustering with ICS is implemented in package ICSClust (Archimbaud et al., 2023a), which is publicly available from https://CRAN.R-project.org/package=ICSClust.It builds upon packages ICS (Nordhausen et al., 2008(Nordhausen et al., , 2023) ) for ICS, rrcov (Todorov & Filzmoser, 2009) for the MCD scatter matrix, cluster (Maechler et al., 2022) for PAM, stats (R Core Team, 2023) for kmeans, tclust (Fritz et al., 2012) for tkmeans, and mclust (Scrucca et al., 2023) for mclust and rmclust.Furthermore, we use packages EMMIXmfa (Rathnayake et al., 2019) for MFA, clustvarsel (Scrucca & Raftery, 2018) for clustvarsel, and clustrd (Markos et al., 2019) for reduced kmeans and factorial kmeans.Replication files for the simulations and the empirical applications are available from https://github.com/aalfons/TandemICS-Replication.to use the raw estimator MCD α or the reweighted estimator RMCD α .Using MCD α gives (almost) full control over the subset size, which can be desirable in a clustering context to capture the local covariance structure.While scatter pairs with RMCD α in general yield a higher median η 2 than scatter pairs with MCD α , small differences in the median are not that relevant as long as the η 2 is large enough.On the other hand, scatter pairs with RMCD α fail to pick up the relevant structure in more instances compared to their counterparts with MCD α , i.e., there are more points with small η 2 in Figure B.12, which is most visible for the oracle criterion.For a rather large subset size (e.g., α = 0.75), using RMCD α can be considered detrimental to performance according to our results.Nevertheless, for a small subset size (e.g., α = 0.1), using RMCD α rather than MCD α can be beneficial to prevent that the estimate is based on too few observations.This could be of particular relevance when n is small, Regarding the component selection criteria for ICS, keep in mind that in this paper, we only consider clustering methods that take the number of clusters k as an input.The med and var criteria, which are based on selecting k − 1 components, are therefore conceptually appealing and, across the different scatter pairs, lead to the best overall performance in discriminatory power.In Figure B.14 for the setting without outliers, we observe a clear improvement in clustering performance with ICS for PAM and kmeans.Specifically, TCOV − COV and TCOV − UCOV yield a nearly perfect ARI in all cluster settings.LCOV − COV performs similarly in most cluster settings, but fails on occasion for perfectly balanced clusters (50-50).For tkmeans, we find a substantial improvement only for certain cluster settings (50-50, 20-50-30).On the other hand, mclust and rmclust perform almost perfectly on the observed data.While ICS keeps this near-perfect performance in most settings, it can increase failures in some settings (in particular for rmclust).Finally, PCA with RMCD 0.75 is detrimental to clustering performance, yielding a lower ARI or larger variability than on the observed data in almost every setting.
From  and LCOV − COV clearly yields the best performance except for rmclust, where performance improves for some settings but deteriorates in others.The simple clustering methods kmeans and PAM after ICS even outperform rmclust on observed data in the settings with q = 5 clusters.Finally, PCA with RMCD 0.75 still has a detrimental effect on clustering performance.

Appendix B.3. Illustration for the barrow wheel distribution
In the simulation study of Section 4, we generate the data from a Gaussian mixture model due to its simplicity and popularity.To provide an illustration of tandem clustering using ICS in a non-elliptical framework, we simulate a data set from the barrow wheel distribution, which was introduced in Hampel et al. (1986) and used by Stahel & Mächler (2009) as an example in the discussion of Tyler et al. (2009).A d-variate random vector X follows a barrow wheel distribution if X ∼ O((1 − ϵ)N (0, diag(σ 2 1 , 1, . . ., 1)) + ϵH), with H so that h = (h 1 , h ′ 2 ) ′ ∼ H implies h 1 ∼ χ d−1 and h 2 ∼ N (0, σ 2 2 I d−1 ).The d × d matrix O can be used to rotate and rescale the vector.
Figure B.17 shows a simulated data set of size n = 1000 from the barrow wheel distribution with d = 3, σ 1 = 0.1, σ 2 = 0.2 and ϵ = 0.2.Table B.2 reports the ARI values of kmeans, mclust, and rmclust (applied directly to this data set and after ICS with TCOV − COV), as well as MFA and clustvarsel.The approaches based on a Gaussian mixture model (mclust, rmclust, MFA, and clustvarsel) fail to detect the clusters in the simulated data set with ARI values between 0.102 and 0.346, while kmeans performs even worse with an ARI close to 0. However, using tandem clustering with ICS and TCOV-COV recovers the cluster structure in the first coordinate, as illustrated in Figure B.18. Applying the clustering methods to the first invariant coordinate greatly improves the identification of the three clusters and yields ARI values ranging from 0.835 for kmeans to about 0.96 for mclust and rmclust.

Figure 1 :
Figure 1: Illustration of kmeans clustering on simulated data.The columns correspond to the standardized data (left), the PCA components (center), and the ICS components (right).The top row displays the corresponding scatterplots, the middle row visualizes pairwise Euclidean distances, and the bottom row shows scatterplots of the standardized data, with colors indicating the clusters obtained by k-means on the original data and on the reduced data from PCA and ICS.

Figure 2 :
Figure 2: Discriminatory power as measured by the η 2 of the best-performing dimension reduction methods for ICS and PCA, respectively, using the corresponding best component selection criteria.Boxplots display the results across the different cluster settings with 100 simulation runs each.Results for different outlier settings are shown in separate panels.

Figure 3 :
Figure 3: Discriminatory power as measured by the η 2 of the best-performing dimension reduction methods for ICS and PCA, respectively, using the corresponding best component selection criteria.Boxplots display the results for selected cluster settings over 100 simulation runs.Results for different outlier settings are shown in separate rows, and results for different methods are shown in separate columns.

Figure 4 :
Figure 4: Clustering performance as measured by the ARI after the best-performing dimension reduction methods for ICS and PCA, respectively, using the corresponding best component selection criteria.Boxplots display the results across the different cluster settings with 100 simulation runs each.Results for different clustering methods are shown in separate columns, and results for different outlier settings are shown in separate rows.

Figure 5 :
Figure 5: Discriminatory power as measured by the η 2 of ICS with MCDα − COV scatter pairs, using the best component selection criteria.Boxplots display the results for selected cluster settings over 100 simulation runs.Results for different outlier settings are shown in separate rows, and results for different methods are shown in separate columns.

Figure 6 :
Figure 6: Discriminatory power as measured by the η 2 of ICS with the COV − COV 4 scatter pair, using the best component selection criteria.Boxplots display the results for all investigated cluster settings over 100 simulation runs.Results for different outlier settings are shown in separate rows.

Figure 7 :
Figure 7: Clustering performance as measured by the ARI (left column) and computation time in seconds (right column) for Gaussian mixture modeling after ICS with TCOV − COV and the med criterion, as well as for methods that integrate dimension reduction into Gaussian mixture modeling.Boxplots display the results across the different cluster settings with 100 simulation runs each.Results for different outlier settings are shown in separate rows.

Figure 8 :
Figure 8: Adjusted Rand Index (ARI) in the empirical applications for different clustering methods after selected dimension reduction methods for ICS and PCA, respectively, using different component selection criteria.Results for different clustering methods are shown in separate rows, and results for different data sets are shown in separate columns.Black reference lines indicate the ARI obtained on initial data without dimension reduction.

Figure 9 :
Figure 9: First two and last two components from ICS with TCOV − COV on the crabs data set: scatterplot matrix with density estimates on the diagonal.The true clusters are indicated by different colors.

Figure 10 :
Figure 10: All four components from ICS with TCOV − COV on the iris data set: scatterplot matrix with density estimates on the diagonal.The true clusters are indicated by different colors.

Figure 11 :
Figure 11: First two and last two components from ICS with MCD 0.50 − COV on the Philips data set: scatterplot matrix with density estimates on the diagonal.The true clusters are indicated by different colors.

Figure
Figure B.12: Discriminatory power as measured by the η 2 of the investigated dimension reduction methods with different component selection criteria.Boxplots display the results across the different cluster settings with 100 simulation runs each.

Figure B. 13 :
Figure B.13: Shapes of different scatter matrices for a sample from a bivariate Gaussian mixture distribution with three balanced components.In the first row, the local scatters (together with COV) are shown in the left panel and the global scatters in the right panel.In the second row, MCDα is shown in the left panel and RMCDα in the right panel for different values of α (together with COV).

Figures
Figures B.14, B.15, and B.16  display the clustering results for the setting without outliers, with 2% outliers, and with 5% outliers, respectively.The figures show boxplots of the ARI of the five clustering methods in the selected cluster settings, applied after the best-performing dimension reduction methods for ICS and PCA.

Figure B. 14 :
Figure B.14: Clustering performance as measured by the ARI after the best-performing dimension reduction methods for ICS and PCA, respectively, using the corresponding best component selection criteria.Boxplots display the results for the different cluster settings over 100 simulation runs in the setting without outliers.Results for different clustering methods are shown in separate rows, and results for different dimension reduction methods are shown in separate columns.

Figure B. 15 :
Figure B.15: Clustering performance as measured by the ARI after the best-performing dimension reduction methods for ICS and PCA, respectively, using the corresponding best component selection criteria.Boxplots display the results for the different cluster settings over 100 simulation runs in the setting with 2% outliers.Results for different clustering methods are shown in separate rows, and results for different dimension reduction methods are shown in separate columns.
Figures B.15 and B.16, we see that the main findings remain valid in the settings with outliers.For all clustering methods, ICS with TCOV − COV

Figure B. 16 :
Figure B.16: Clustering performance as measured by the ARI after the best-performing dimension reduction methods for ICS and PCA, respectively, using the corresponding best component selection criteria.Boxplots display the results for the different cluster settings over 100 simulation runs in the setting with 5% outliers.Results for different clustering methods are shown in separate rows, and results for different dimension reduction methods are shown in separate columns.

Figure B. 18 :
Figure B.18: Scatter plot matrices of the invariant coordinates obtained by ICS with TCOV − COV on the simulated data set from the barrow wheel distribution.

Figure
Figure C.19: Adjusted Rand Index (ARI) in the empirical applications for different clustering methods on the observed data, as well as after ICS with the med criterion for TCOV − COV and MCD 0.5 − COV.Results for different data sets are shown in separate columns.