Quantile-based clustering

A new cluster analysis method, $K$-quantiles clustering, is introduced. $K$-quantiles clustering can be computed by a simple greedy algorithm in the style of the classical Lloyd's algorithm for $K$-means. It can be applied to large and high-dimensional datasets. It allows for within-cluster skewness and internal variable scaling based on within-cluster variation. Different versions allow for different levels of parsimony and computational efficiency. Although $K$-quantiles clustering is conceived as nonparametric, it can be connected to a fixed partition model of generalized asymmetric Laplace distributions. The consistency of $K$-quantiles clustering is proved, and it is shown that $K$-quantiles clusters correspond to well-separated mixture components in a nonparametric mixture. In a simulation, $K$-quantiles clustering is compared with a number of popular clustering methods, with good results. A high-dimensional microarray dataset is clustered by $K$-quantiles clustering.


Introduction
In this paper we introduce a new clustering method, quantile-based or K-quantiles clustering. The method is fast and simple and can deal with large datasets. A special feature of the method is that it takes into account potential skewness of the within-cluster distributions.
The popular K-means method [19] represents all clusters by their centroids (cluster means) and assigns all points to the closest centroid. Quantile-based clustering represents the clusters by optimally chosen quantiles. Points are assigned to the closest quantile (or rather, in multidimensional data, distances to quantiles are summed up over the variables), but the distance measuring "closeness" treats points asymmetrically depending on which side of the quantile they are on. This idea has been explored for supervised classification by [15], and here we present its application to clustering.
The algorithm for K-quantiles clustering works along the lines of Lloyd's classical K-means algorithm [25] and is in this way faster and simpler than many modern clustering methods, at the same time being more flexible than K-means.
There is some ambiguity in the literature about to what extent K-means is model-based. The K-means objective function can be motivated without reference to probability models; it formalizes optimal representation of all points in a cluster by the cluster centroid in the sense of least squares. It is therefore sometimes presented as an assumption-free method. But K-means can also be derived as the Maximum Likelihood (ML) estimator of a fixed partition model of spherical Gaussian clusters with equal within-cluster variances, which seems to be a quite severe assumption. Indeed K-means tends to produce spherical clusters, so it is hardly appropriate to call it assumption-free, although it is regularly applied to data that do not follow this model assumption. Whether this is appropriate does not depend so much on to what extent the model assumption is really fulfilled, but rather on whether the characteristics of K-means match the "shape" of clusters required in the application at hand. Different applications of cluster analysis ask for different kinds of clusters, and the user of cluster analysis needs to understand such characteristics of methods in order to choose an appropriate one for the application of interest [14].
In the same way, K-quantiles clustering can also be derived as the ML estimator for a fixed partition model of generalized asymmetric Laplace distributions. This is also helpful for the construction of K-quantiles clustering, because it implies how to penalize variables against each other when using different quantiles for different variables. It also allows for an in-built scaling of variables that takes skewness into account. However, the main rationale of K-quantiles clustering is not the estimation of asymmetric Laplace distributions, but rather to define a general clustering principle that is almost as simple as K-means but more flexible by taking within-cluster skewness into account. Throughout the paper, the number of clusters K is treated as fixed; the estimation of K is left to future work.
We review the principle of K-means clustering in Section 2. In Section 3, quantile-based clustering is motivated and defined. First, we motivate it in a discrepancy-based nonparametric manner. Then we link it to a fixed partition model of asymmetric Laplace distributions. Some attention is paid to the penalty term introduced by ML estimation in this model. A simple greedy algorithm is proposed, and various constrained versions of quantile-based clustering are introduced, which allow for more parsimony and less computational effort. Section 4 is devoted to consistency theory. Quantile-based clustering is proved to be consistent in a nonparametric setting for the canonical clustering functional defined on a distribution, and another theorem shows that this functional will yield clusters that correspond to mixture components in mixtures with strongly separated nonparametric components. Section 5 presents a simulation study that includes high-dimensional setups, in which K-quantiles clustering is compared with some popular clustering methods. In Section 6, K-quantiles clustering is applied to a real microarray dataset with more than 3000 variables. Section 7 concludes the paper.
Supplementary material with some more detailed results and information is available on arXiv [16].

K-means and distance-based probabilistic clustering
The aim of centroid-based clustering is to seek the best partition of n data vectors x̃_n = (x_1, . . ., x_n) ∈ (R^p)^n into K disjoint subsets characterized by cluster prototypes (centroids) ξ̃ = (ξ_1, . . ., ξ_K) (the tilde denotes a collection of vectors rather than a single one).
In classical K-means [19] the 'best' partition C = (C(1), . . ., C(n)) is obtained by minimizing over ξ̃ and C the variance function

V^{K-means}_{n,K}(ξ̃, C) = Σ_{i=1}^n ‖x_i − ξ_{C(i)}‖²,   (1)

where ‖·‖ denotes the L2 or Euclidean distance and C(i) ∈ {1, . . ., K} for i = 1, . . ., n. A classical estimation algorithm for minimizing V^{K-means}_{n,K} consists of two steps sequentially iterated until convergence [25]. In the first step, for fixed ξ̃ the best partition C is found by assigning each point to the nearest cluster center. Then in the second step, for fixed C, the centroids ξ_k (k = 1, . . ., K) are estimated. Since the sum of squared Euclidean distances in (1) is minimized by the mean, the centroids ξ_k (k = 1, . . ., K) are the within-cluster means.
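As a reference point for the quantile-based variant introduced below, the two-step iteration can be sketched in a few lines. This is a minimal illustration, not the implementation used in the paper; it is restricted to univariate data, and the names `kmeans1d` and `iters` are choices made here.

```python
import random

def kmeans1d(xs, K, iters=50, seed=0):
    """Minimal Lloyd-type iteration for univariate data (illustrative sketch)."""
    rng = random.Random(seed)
    centers = rng.sample(xs, K)  # random initial centroids
    C = [0] * len(xs)
    for _ in range(iters):
        # Step 1: assign each point to the nearest centroid (squared distance)
        C = [min(range(K), key=lambda k: (x - centers[k]) ** 2) for x in xs]
        # Step 2: recompute each centroid as the within-cluster mean
        for k in range(K):
            members = [x for x, c in zip(xs, C) if c == k]
            if members:
                centers[k] = sum(members) / len(members)
    return centers, C
```

On two well-separated groups the iteration recovers them after a couple of passes, regardless of which points are drawn as initial centroids.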
Although usually no probability assumption is mentioned when K-means is introduced, K-means can be derived as the Maximum Likelihood (ML) estimator of a fixed partition model of spherical Gaussian clusters with equal within-cluster variances. According to such a model, x_1, . . ., x_n are independently drawn from N(ξ_{C(i)}, σ²I_p), i = 1, . . ., n, where C(i) ∈ {1, . . ., K} are parameters giving the cluster memberships of the x_i; as opposed to a mixture model, in a fixed partition model these are not modelled as random. The log-likelihood of such a model is

ℓ(ξ̃, C, σ²) = −(np/2) log(2πσ²) − (1/(2σ²)) Σ_{i=1}^n ‖x_i − ξ_{C(i)}‖²,

which for any σ² is maximized by the ξ_1, . . ., ξ_K, C(1), . . ., C(n) that minimize V^{K-means}_{n,K}, in other words, by K-means.
More generally, starting from an arbitrary distance from a prototype, denoted by d(x, ξ), it is always possible to construct a probabilistic clustering model as proposed by [2] and [18]. The kernel of the distance-based density is the inverse of the exponential of the distance measure weighted by a positive concentration parameter λ:

f(x; ξ, λ) = ψ(ξ, λ)^{−1} exp{−λ d(x, ξ)},   (2)

where d(x, ξ) is a generic distance function from a location parameter ξ, λ > 0, and ψ(ξ, λ) is a normalization constant such that f(x; ξ, λ) is a proper density function.
Distance-based models have been used by several authors [see 26, 7, 6] and adapted for classification in a mixture-based perspective by [28] for ranking data and by [1] for textual data.
Note that, when d(x, ξ) is the squared L2 (Euclidean) distance from the expected value of X, ξ = E[X], the density (2) is the Gaussian distribution. When d(x, ξ) is the L1 distance, the density (2) coincides with the Laplace distribution. When d(x, ξ) is the cosine distance and the data are normalized to 1 according to the L2 norm, (2) becomes the von Mises-Fisher distribution [1].
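To make the correspondence concrete, the normalization constant ψ in (2) can be checked numerically for two of these special cases. This is a small sketch; the integration bounds and the helper name `psi_numeric` are choices made here, not from the paper.

```python
import math

def psi_numeric(d, lam, xi=0.0, lo=-40.0, hi=40.0, n=160001):
    """Riemann-sum approximation of psi(xi, lam) = integral of exp(-lam*d(x, xi)) dx."""
    h = (hi - lo) / (n - 1)
    return h * sum(math.exp(-lam * d(lo + i * h, xi)) for i in range(n))

lam = 1.5
# squared Euclidean distance -> Gaussian kernel: psi = sqrt(pi/lam)
psi_gauss = psi_numeric(lambda x, xi: (x - xi) ** 2, lam)
# L1 distance -> Laplace kernel: psi = 2/lam
psi_laplace = psi_numeric(lambda x, xi: abs(x - xi), lam)
```

Both numerical values agree with the closed-form normalizers of the Gaussian and Laplace families up to the discretization error.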

Quantile-based clustering
We now introduce a new clustering strategy based on the idea of assigning points to the closest quantile. Measuring "closeness" by the squared Euclidean distance is associated with the mean, in the sense that means optimize (1). Quantiles can also be characterized by minimizing a sum of discrepancies, although these discrepancies are not symmetric; they depend on which side of the quantile a point is. Using these discrepancies in "K-means style" leads to a simple clustering method that allows for within-cluster skewness.

Clustering based on the quantile discrepancy
Let X be a univariate random variable on R with cumulative distribution function F_X(x). Let θ ∈ (0, 1) be a percentile and denote by q(θ) the corresponding quantile, i.e., q(θ) = F_X^{−1}(θ) = inf{x : F_X(x) ≥ θ}. The quantile q(θ) is a (not necessarily unique) value of ξ that minimizes the following variability measure:

V_θ(ξ) = E[d_θ(X, ξ)],   (3)

where, for a single point x, we define the quantile discrepancy from ξ as the function

d_θ(x, ξ) = (θ + (1 − 2θ)𝟙(x < ξ)) |x − ξ|.   (4)

For θ = 0.5 this is, up to a constant factor, the L1 distance, but for θ ≠ 0.5 it is not symmetric and therefore not a distance. Not being based on squares, it shares with the L1 distance its better resistance against outliers compared with the L2 distance. By definition the quantile discrepancy has a univariate nature. When X is a multivariate random variable on R^p, the quantile discrepancy with respect to a generic vector of centroids ξ is defined as the sum of component-wise discrepancies:

d_θ̃(x, ξ) = Σ_{j=1}^p d_{θ_j}(x_j, ξ_j),   (5)

where the θ_j can be variable-wise or a single common percentile for all variables.
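This characterization can be checked numerically. The sketch below assumes the check-loss form of the discrepancy in (4); the names `qdisc`, `total_disc` and the toy sample are illustrative choices.

```python
def qdisc(x, xi, theta):
    """Quantile discrepancy, assuming the check-loss form of (4):
    d_theta(x, xi) = (theta + (1 - 2*theta) * 1{x < xi}) * |x - xi|."""
    return (theta + (1 - 2 * theta) * (x < xi)) * abs(x - xi)

def total_disc(xs, xi, theta):
    """Summed discrepancy of a sample from a candidate location xi."""
    return sum(qdisc(x, xi, theta) for x in xs)

xs = [1.0, 2.0, 4.0, 7.0, 11.0, 16.0, 22.0, 29.0, 37.0, 46.0]
theta = 0.25
# the empirical theta-quantile inf{x : F_n(x) >= theta} is the 3rd order statistic, 4.0
best = min(xs, key=lambda xi: total_disc(xs, xi, theta))
```

Minimizing the summed discrepancy over candidate locations recovers the empirical θ-quantile, mirroring the population statement above.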
The basic idea of quantile-based clustering is to use the quantile discrepancy instead of the squared L2 distance in K-means, i.e., minimizing

V_{n,K}(θ, ξ̃, C) = Σ_{i=1}^n Σ_{j=1}^p d_{θ_j}(x_{ij}, ξ_{C(i)j}),   (6)

where again ξ̃ = (ξ_1, . . ., ξ_K). θ is assumed here to be the same for all clusters. ξ_1, . . ., ξ_K define the locations of the clusters. We call them "barycenters" from now on.
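A direct transcription of this objective for a fixed set of barycenters looks as follows. This is a sketch: the function name `assign_and_loss` and the toy data are illustrative, and the check-loss form of the discrepancy is assumed.

```python
def qdisc(x, xi, theta):
    # assumed check-loss form of the quantile discrepancy (4)
    return (theta + (1 - 2 * theta) * (x < xi)) * abs(x - xi)

def assign_and_loss(X, bary, theta):
    """Assign each observation to the barycenter with the smallest summed
    variable-wise discrepancy and accumulate the objective V_{n,K} of (6)."""
    labels, V = [], 0.0
    for row in X:
        costs = [sum(qdisc(xj, bj, tj) for xj, bj, tj in zip(row, b, theta))
                 for b in bary]
        k = min(range(len(bary)), key=costs.__getitem__)
        labels.append(k)
        V += costs[k]
    return labels, V

X = [[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]]
bary = [[0.1, 0.05], [5.05, 4.95]]
labels, V = assign_and_loss(X, bary, (0.5, 0.5))
```

On this toy dataset the two groups of points are assigned to their respective barycenters.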
Proposition 1. For fixed θ and clustering C, (6) is minimized with respect to ξ̃ by the empirical within-cluster quantile vectors q_{nk}(θ) = (q_{nk1}(θ_1), . . ., q_{nkp}(θ_p)), k = 1, . . ., K, where, for j = 1, . . ., p, q_{nkj}(θ_j) is the empirical θ_j-quantile of variable j among the observations assigned to cluster k.

The proof of Proposition 1 is given in Appendix A.1. (6) quantifies the discrepancy between the observations in a cluster and their barycenter given θ, and is therefore appropriate for finding the cluster barycenters and clustering the points, but it will not work well for finding θ. The problem of finding the optimal θ will benefit from a model-based approach.

The fixed partition model and quantile-based clustering
Quantile-based clustering can be derived as ML estimator of a probabilistic model, similarly to K-means.
When θ = 0.5, ξ = q(1/2) is the median, and the quantile-based density obtained by plugging the quantile discrepancy (4) into (2),

f(x; ξ, λ, θ) = λθ(1 − θ) exp{−λ d_θ(x, ξ)},   (7)

is the Laplace distribution. When θ ≠ 0.5 the quantile-based density is a special case of the asymmetric Laplace distribution [22] with expectation ξ + (1 − 2θ)/(λθ(1 − θ)). Figure 1 shows some examples of its shape as θ varies. For x̃_n = (x_1, . . ., x_n) ∈ (R^p)^n we assume that the p variables are independent within clusters, and that the parameters θ and λ do not differ between clusters; the clusters are distinguished only by different barycenters ξ_k, k = 1, . . ., K. We aim at finding a compromise here between flexibility on one side and parsimony and computational simplicity on the other side. In the K-means model, variables are independent, all variables have the same within-cluster variances, and clusters only differ regarding their centers. For quantile-based clustering, we define different levels of flexibility, see Section 3.5. For the moment we focus on the most general case of the models considered there, which allows both θ and λ to vary between variables, allowing for different distributional shapes and scales. Allowing them to differ between clusters as well, and incorporating within-cluster dependence, would define a considerably more complex approach, both regarding the number of parameters and the computational burden. This is left for future research.
With parameter vector Θ = (θ, ξ̃, λ, C), the likelihood for a fixed partition asymmetric Laplace distribution model with independent variables is

L(Θ; x̃_n) = Π_{i=1}^n Π_{j=1}^p λ_j θ_j(1 − θ_j) exp{−λ_j d_{θ_j}(x_{ij}, ξ_{C(i)j})}.   (8)

Plugging in (8) and taking logs, the ML estimator minimizes

V_{n,K}(θ, ξ̃, λ, C) = Σ_{i=1}^n Σ_{j=1}^p λ_j d_{θ_j}(x_{ij}, ξ_{C(i)j}) − n Σ_{j=1}^p log(λ_j θ_j(1 − θ_j)),   (9)

which for given θ and λ = 1 leads to the same clustering as (6). Proposition 1 ensures that ξ_{11}, . . ., ξ_{Kp} are the variable-wise within-cluster θ-quantiles, because the minimization with respect to ξ̃ is independent of λ. For given (θ, ξ̃, λ), the ML estimator of the clustering C is, for i = 1, . . ., n,

C(i) = argmin_{k∈{1,...,K}} Σ_{j=1}^p λ_j d_{θ_j}(x_{ij}, ξ_{kj}).   (10)

We will therefore omit C in the parameter vector in the following. Here is the resulting definition.

Definition 1. Quantile-based (K-quantiles) clustering (with variable-wise θ and λ) is defined by

T_{n,K}(x̃_n) = argmin_{(θ, ξ̃, λ)} [ Σ_{i=1}^n Σ_{j=1}^p λ_j d_{θ_j}(x_{ij}, ξ_{C(i)j}) − n Σ_{j=1}^p log(λ_j θ_j(1 − θ_j)) ],   (11)

with C determined by (10); observations are clustered by (10) based on (θ, ξ̃, λ) = T_{n,K}(x̃_n).
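The penalized criterion can be written down directly. The sketch below is illustrative (function name and data are choices made here, and the check-loss form of the discrepancy is assumed); it shows that on symmetric data the penalty makes the median, θ = 0.5, preferable to an extreme quantile.

```python
import math

def qdisc(x, xi, theta):
    # assumed check-loss form of the quantile discrepancy (4)
    return (theta + (1 - 2 * theta) * (x < xi)) * abs(x - xi)

def penalized_loss(X, labels, bary, theta, lam):
    """Weighted discrepancies minus n * sum_j log(lam_j * theta_j * (1 - theta_j))."""
    n, p = len(X), len(theta)
    fit = sum(lam[j] * qdisc(X[i][j], bary[labels[i]][j], theta[j])
              for i in range(n) for j in range(p))
    pen = -n * sum(math.log(l * t * (1 - t)) for l, t in zip(lam, theta))
    return fit + pen

# one cluster of symmetric univariate data: theta = 0.5 (barycenter = median)
# beats theta = 0.9 (barycenter = 0.9-quantile) once the penalty is included
X = [[-2.0], [-1.0], [0.0], [1.0], [2.0]]
labels = [0] * 5
loss_med = penalized_loss(X, labels, [[0.0]], (0.5,), (1.0,))
loss_q90 = penalized_loss(X, labels, [[2.0]], (0.9,), (1.0,))
```

Without the penalty, the extreme quantile would have the smaller fit term; the penalty reverses the comparison.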

Notes on penalization and scaling
Comparing (6) and (11) shows that the logarithmized normalization constant −n Σ_{j=1}^p log(λ_j θ_j(1 − θ_j)) acts as a penalty term, penalizing θ_j too close to 0 or 1 and too small λ_j.
In order to illustrate why this is required (focusing on θ first), consider K = 1 and univariate data generated by a Gaussian distribution with some parameters μ and σ², and take ξ = q_n(θ), where q_n(θ) is the quantile computed on the sample of size n. The dashed red line of Figure 2 shows the shape of the dispersion Σ_{i=1}^n d_θ(x_i, q_n(θ)) on a large sample with n = 10,000 for a dense grid of values of the percentile θ between 0 and 1.
Since the data have been generated by a symmetric distribution, the optimal value of θ should actually be 1/2, corresponding to the median, and Figure 2 shows that the penalty is required to achieve this.
Figure 3 shows the penalized dispersion function for data generated by a symmetric distribution, by a positively skewed distribution and by a negatively skewed distribution.
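The effect can be reproduced numerically: evaluating a penalized dispersion on a symmetric sample over a grid of θ puts the minimizer near 0.5. This is a sketch with λ = 1; the sample size, grid, and function names are choices made here, and the check-loss form of the discrepancy is assumed.

```python
import math, random

def qdisc(x, xi, theta):
    # assumed check-loss form of the quantile discrepancy (4)
    return (theta + (1 - 2 * theta) * (x < xi)) * abs(x - xi)

def qn(xs, theta):
    """Empirical quantile inf{x : F_n(x) >= theta}."""
    s = sorted(xs)
    return s[min(i for i in range(len(s)) if (i + 1) / len(s) >= theta)]

def penalized_dispersion(xs, theta):
    xi = qn(xs, theta)
    return (sum(qdisc(x, xi, theta) for x in xs)
            - len(xs) * math.log(theta * (1 - theta)))

random.seed(1)
xs = [random.gauss(0.0, 1.0) for _ in range(2000)]
grid = [i / 100 for i in range(5, 96)]
best_theta = min(grid, key=lambda t: penalized_dispersion(xs, t))
```

For the Gaussian sample the penalized curve is minimized close to the median, in line with the behaviour described for Figure 2.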
The parameters λ_j allow for implicit rescaling of the variables and scale equivariance. They need to be "penalized" because without the penalty term, λ → 0 would just enforce V_{n,K} → 0. K-quantiles clustering is scale equivariant, which means that the clustering remains the same, and the parameters change appropriately, if the variables in the data are multiplied by different constants.

Proposition 2. For constants c_1, . . ., c_p > 0, let x*_{ij} = c_j x_{ij} for i = 1, . . ., n, j = 1, . . ., p. If T_{n,K}(x̃_n) = (θ, ξ̃, λ), then T_{n,K}(x̃*_n) = (θ, ξ̃*, λ*) with ξ*_{kj} = c_j ξ_{kj} and λ*_j = λ_j/c_j for k = 1, . . ., K, j = 1, . . ., p, and the corresponding clustering C is the same as for T_{n,K}(x̃_n).
The proof of Proposition 2 can be found in Appendix A.2. Note that the parameters λ_j rescale the variables based on variation within clusters (only the discrepancy between x_{ij} and the barycenter of the cluster to which x_i is assigned is taken into account; see also Proposition 4 below). This is more appropriate than achieving scale equivariance by standardizing the variables beforehand based on the variance or some other dispersion measure, as is sometimes done for K-means, see [10]. Such methods will estimate a large dispersion if along a variable the separation between clusters is large, which may lead to downweighting of variables that are in fact very informative for clustering.
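The mechanism behind Proposition 2 can be verified numerically: rescaling the variables by constants c_j, the barycenters by c_j and the λ_j by 1/c_j changes the criterion (9) only by the constant n Σ_j log c_j, so the minimizing clustering is unchanged. The sketch below is illustrative (data and names are choices made here; the check-loss discrepancy form is assumed).

```python
import math

def qdisc(x, xi, theta):
    # assumed check-loss form of the quantile discrepancy (4)
    return (theta + (1 - 2 * theta) * (x < xi)) * abs(x - xi)

def loss(X, labels, bary, theta, lam):
    """Penalized criterion (9) for a fixed clustering."""
    n, p = len(X), len(theta)
    fit = sum(lam[j] * qdisc(X[i][j], bary[labels[i]][j], theta[j])
              for i in range(n) for j in range(p))
    return fit - n * sum(math.log(l * t * (1 - t)) for l, t in zip(lam, theta))

X = [[0.0, 1.0], [0.4, 2.0], [3.0, 9.0], [3.5, 8.0]]
labels = [0, 0, 1, 1]
bary = [[0.2, 1.5], [3.2, 8.5]]
theta, lam = (0.3, 0.7), (1.0, 2.0)

c = (10.0, 0.5)  # per-variable scale factors c_j > 0
X_s = [[c[j] * row[j] for j in range(2)] for row in X]
bary_s = [[c[j] * b[j] for j in range(2)] for b in bary]
lam_s = tuple(lam[j] / c[j] for j in range(2))
shift = len(X) * sum(math.log(cj) for cj in c)  # n * sum_j log c_j
```

The rescaled parameterization reproduces the original criterion value up to the constant shift, so the argmin (and hence the clustering) is unaffected.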

A greedy search algorithm
Lloyd's classical K-means algorithm [25] is a greedy algorithm, and K-quantiles clustering can also be computed using a fast greedy algorithm. This is based on the following two propositions, which show that the θ and λ minimizing V_{n,K} can easily be found with all other parameters given. The propositions treat the case p = 1 w.l.o.g., because the variables can be treated separately for minimizing V_{n,K} with respect to these parameters.
Proofs of Propositions 3 and 4 are given in Appendix A.3 and A.4, respectively.
The greedy algorithm consists of an initialization step and a clustering step, which makes V_{n,K} smaller in each step and is repeated until convergence. Because there are only finitely many possible clusterings, the algorithm will reach convergence after a finite number of steps (as does Lloyd's algorithm). For big datasets, if convergence takes too long, one could fix a maximum number of iterations. However, often convergence is reached very quickly; also the constrained methods proposed in Section 3.5 are faster. The scheme of the algorithm is the following:
1. Initialization: For each variable j = 1, . . ., p, choose randomly a value θ_j, and take as initial barycenters ξ_{1j}, . . ., ξ_{Kj} the K empirical quantiles of variable j at equispaced probabilities.
2. Clustering step: Repeat the following until V_{n,K}(θ, ξ̃, λ) stops changing:
(a) Compute the clustering C(1), . . ., C(n) using (10).
(b) For j = 1, . . ., p, compute the new θ_j according to Proposition 3.
(c) For j = 1, . . ., p, compute the new λ_j according to Proposition 4.
(d) For k = 1, . . ., K, j = 1, . . ., p, compute the new barycenters ξ_{kj} = q_{n_k,kj}(θ_j), where n_k = #{i : C(i) = k}, and q_{n_k,kj}(θ_j) denotes the empirical θ_j-quantile among the observations of variable j currently assigned to cluster k.
Because of (10) and Propositions 3 and 4, V_{n,K} is made smaller in every step, so the algorithm is guaranteed to converge. As usual, the algorithm only finds a local optimum of V_{n,K}. Therefore it is recommended to run the algorithm with a number h of different initializations (we use h = 30 as default), and the best solution is chosen according to the minimum value of V_{n,K}(θ, ξ̃, λ). The algorithm is implemented in the R package QuClu available on CRAN. Note that the proposed initialization of λ could be made scale equivariant by, for example, setting λ_j = 1/s_j with s_j² being the sample variance of variable j, but this may come with the same issues as prior scaling for K-means, see Section 3.3.
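A compact end-to-end sketch of the greedy scheme can be given for the VU flavour of Section 3.5 (variable-wise θ_j without λ-scaling). This is illustrative only: the θ-update uses a grid search in place of the closed-form update of Proposition 3, the check-loss discrepancy form is assumed, and all names are chosen here rather than taken from the QuClu package.

```python
import math, random

def qdisc(x, xi, theta):
    # assumed check-loss form of the quantile discrepancy (4)
    return (theta + (1 - 2 * theta) * (x < xi)) * abs(x - xi)

def qn(xs, theta):
    """Empirical quantile inf{x : F_n(x) >= theta}."""
    s = sorted(xs)
    return s[min(i for i in range(len(s)) if (i + 1) / len(s) >= theta)]

def kquantiles_vu(X, K, n_iter=25, seed=0):
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    grid = [i / 20 for i in range(1, 20)]  # candidate theta values
    theta = [rng.choice(grid) for _ in range(p)]
    # initial barycenters: per-variable empirical quantiles at equispaced probs
    bary = [[qn([row[j] for row in X], (k + 1) / (K + 1)) for j in range(p)]
            for k in range(K)]
    labels = [0] * n
    for _ in range(n_iter):
        # (a) assignment step (10) with lambda_j = 1
        labels = [min(range(K), key=lambda k: sum(
            qdisc(row[j], bary[k][j], theta[j]) for j in range(p)))
            for row in X]
        # (b) update each theta_j by grid search over the penalized criterion
        for j in range(p):
            col = [(X[i][j], labels[i]) for i in range(n)]
            def crit(t):
                s = -n * math.log(t * (1 - t))
                for k in range(K):
                    mem = [x for x, lab in col if lab == k]
                    if mem:
                        q = qn(mem, t)
                        s += sum(qdisc(x, q, t) for x in mem)
                return s
            theta[j] = min(grid, key=crit)
        # (d) update barycenters as within-cluster theta_j-quantiles
        for k in range(K):
            mem_rows = [row for row, lab in zip(X, labels) if lab == k]
            if mem_rows:
                bary[k] = [qn([r[j] for r in mem_rows], theta[j]) for j in range(p)]
    return labels, theta, bary
```

On a small two-cluster example the iteration separates the groups as expected.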

Constrained versions
More parsimony and faster computation can be achieved by constraining the θ and λ-parameters.The algorithm described in Section 3.4 can easily be modified to accommodate these.
• Algorithm CU: Common θ and Unscaled variables. A common value of θ for all the variables is assumed, and variables are not implicitly scaled (the latter can make sense in applications in which the variables have comparable meanings and measurement units, if subject-matter knowledge suggests that variable importance is proportional to variation). In this case we minimize the empirical loss function

V^{CU}_{n,K}(θ, ξ̃) = Σ_{i=1}^n Σ_{j=1}^p d_θ(x_{ij}, ξ_{C(i)j}) − np log(θ(1 − θ)).

• Algorithm CS: Common θ and Scaled variables through λ_j. A common value of θ is taken but variables are scaled through λ_j. Then the empirical loss function to be minimized is

V^{CS}_{n,K}(θ, ξ̃, λ) = Σ_{i=1}^n Σ_{j=1}^p λ_j d_θ(x_{ij}, ξ_{C(i)j}) − n Σ_{j=1}^p log(λ_j θ(1 − θ)).

• Algorithm VU: Variable-wise θ_j and Unscaled variables. In this case we minimize

V^{VU}_{n,K}(θ, ξ̃) = Σ_{i=1}^n Σ_{j=1}^p d_{θ_j}(x_{ij}, ξ_{C(i)j}) − n Σ_{j=1}^p log(θ_j(1 − θ_j)).

• Algorithm VS: Variable-wise θ_j and Scaled variables through λ_j. This is the most flexible method, with V_{n,K} defined in (9).
Minimization problems CS and VS are scale equivariant (Proposition 2 holds with the same proof for CS as well), whereas CU and VU are not.
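The four constrained versions differ only in how a reduced parameter set is expanded into the variable-wise (θ_j, λ_j) of the general criterion. This bookkeeping can be made explicit (a sketch; the function name is chosen here, while the CU/CS/VU/VS labels follow the naming above):

```python
def expand(variant, p, theta, lam=None):
    """Expand constrained parameters to variable-wise (theta_j, lambda_j):
    CU: common theta, unscaled (lambda_j = 1);  CS: common theta, scaled;
    VU: variable-wise theta_j, unscaled;        VS: variable-wise theta_j, scaled."""
    if variant == "CU":
        return [theta] * p, [1.0] * p
    if variant == "CS":
        return [theta] * p, list(lam)
    if variant == "VU":
        return list(theta), [1.0] * p
    if variant == "VS":
        return list(theta), list(lam)
    raise ValueError("unknown variant: " + variant)
```

CU thus estimates a single θ, CS adds p scale parameters, VU estimates p percentiles, and VS estimates both, which makes the trade-off between parsimony and flexibility explicit.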

Consistency theory
In this section we show that the parameters estimated by K-quantiles clustering from data are consistent estimators of the K-quantiles clustering functional, i.e., the version computed on an underlying distribution rather than on data. The first result of this kind for clustering seems to be the work of [31] on K-means clustering, and our Theorem 1 for K-quantiles clustering uses some of Pollard's ideas. Such consistency results have been shown for a number of clustering approaches, see, e.g., [33, 4, 5], but are still lacking for many methods.
It is well known in the case of K-means that consistency for the K-means functional does not imply that the estimated parameters (i.e., the K mean vectors) are consistent for the parameters of the Gaussian fixed partition model for which K-means is the ML estimator (see [3]), and in the same way the result presented here does not imply that K-quantiles clustering is consistent for estimating the parameters of a fixed partition model of asymmetric Laplace distributions as introduced in Section 3.2. Nevertheless, the consistency result given here is essentially nonparametric, holds for very general distributions, and ensures the asymptotic stability of K-quantiles clustering, and the estimated parameters can be analyzed by considering the K-quantiles clustering functional. There is some literature on convergence rates and "performance guarantees" for K-means clustering (e.g., [24, 30]), but this relies on strong assumptions, and generalizing such results to K-quantiles clustering is beyond the scope of the present work. Instead, after the consistency result in Theorem 1, we show in Theorem 2 that the K-quantiles clustering functional defines clusters that are in line with "central sets" in a nonparametric mixture situation with sufficiently strong separation between mixture components. We are not aware of other results of this kind in the literature.
We consider the most flexible and general model defined above, with variable-wise θ and scaled variables, which is the most difficult one for proving consistency. Corresponding results for the less flexible models can be obtained more easily.
The proof relies heavily on showing that parameter estimators for large n do not leave a compact set, but (considering a single variable) λ → ∞ and θ → 0 or θ → 1 may happen together without constraints on the parameter space (leading to the exponential distribution in the limit, which in practice could actually be integrated in the approach), causing trouble with uniform convergence arguments. This can be avoided by either constraining θ_j ∈ [r, 1 − r], r > 0, or λ_j ≤ λ_+ < ∞ for j ∈ {1, . . ., p}. We will impose the latter constraint here, so that results hold without further constraint for the unscaled case, i.e., λ = 1.
The parameter space used here is

Θ_K = {(θ, ξ̃, λ) : θ ∈ (0, 1)^p, ξ̃ = (ξ_1, . . ., ξ_K) ∈ (R^p)^K, λ ∈ (0, λ_+]^p}.

We use the notation defined in (9) and (11); in case the argmin is not unique, any solution can be taken. We modify (9), multiplying by 1/n in order to use stochastic convergence of means to expectations:

V_n(θ, ξ̃, λ) = (1/n) Σ_{i=1}^n min_{k∈{1,...,K}} Σ_{j=1}^p λ_j d_{θ_j}(x_{ij}, ξ_{kj}) − Σ_{j=1}^p log(λ_j θ_j(1 − θ_j)).

For a given distribution P on R^p define

V_P(θ, ξ̃, λ) = ∫ min_{k∈{1,...,K}} Σ_{j=1}^p λ_j d_{θ_j}(x_j, ξ_{kj}) dP(x) − Σ_{j=1}^p log(λ_j θ_j(1 − θ_j)),

T_K(P) = argmin_{(θ,ξ̃,λ)∈Θ_K} V_P(θ, ξ̃, λ), and S_K as the minimum value of V_P over Θ_K. In order to avoid issues due to label switching of the clusters, we consider consistency of lists (θ, ξ̃, λ), where ξ̃ is the set of quantiles. Convergence and continuity are defined in terms of a distance d between two such lists (θ_1, ξ̃_1, λ_1), (θ_2, ξ̃_2, λ_2) that is the maximum of ‖θ_1 − θ_2‖, ‖λ_1 − λ_2‖, and the maximum over the Euclidean distances between any element of ξ̃_i and its closest element of ξ̃_j, i ≠ j, i, j = 1, 2 (known as the Hausdorff distance between ξ̃_1 and ξ̃_2).
The following assumptions will be required. A1 means that all involved integrals are finite; note that [31] requires ∫‖x‖² dP(x) < ∞ for K-means. A2 enforces stability (uniqueness of the optima T_k(P), k ≤ K); as [31] noted for K-means, it implies that S_K < S_{K−1} < . . . < S_1, because if S_k = S_{k−1} for some k, one could add any point to ξ̃_{k−1} to construct ξ̃_k that cannot have a worse value than S_k together with θ_{k−1}, λ_{k−1}.
Theorem 1. If x_1, x_2, . . . ∼ P i.i.d., and assumptions A1 and A2 hold, then, for n → ∞,

d(T_{n,K}(x̃_n), T_K(P)) → 0 almost surely.

The proof of Theorem 1 is given in Appendix A.5. The value of T_K(P) = (θ, ξ̃, λ) for given P implies a clustering of R^p by

C(x) = argmin_{k∈{1,...,K}} Σ_{j=1}^p λ_j d_{θ_j}(x_j, ξ_{kj}), x ∈ R^p.

The next result is about this implied clustering in case P_m, m ∈ N, is a sequence of mixture distributions with mixture components between which the separation becomes larger and larger with increasing m (we suspect that something like this is required to define clusters that can consistently be found by any method in such a nonparametric setting).
Here are some definitions and assumptions. Let G_1, . . ., G_K be distribution functions on R^p defining distributions Q_1, . . ., Q_K, parameterized in such a way that 0 is their "center" in some sense; it could be the mode, the mean, the multivariate median or a quantile; important is only that G_i is defined relative to 0. Let π_1, . . ., π_K > 0 be mixture proportions with Σ_{k=1}^K π_k = 1. Assumption A3 enforces the distance between central sets to become large enough for the statement to hold. A4 makes sure the involved expectations exist (note that a similar theorem could be proved for K-means, but this would require a bound on the mixture component-wise E‖x‖²). Define a sequence of mixture distributions P_m with components Q_1, . . ., Q_K whose centers move apart as m increases. The following theorem states that in this setup, when evaluating the K-quantiles clustering functional, eventually the different clusters include the full central sets of the different mixture components (and central sets can be of arbitrarily large though fixed radius), and in this sense the clustering corresponds to the mixture structure. The mixture components are allowed to overlap, although for m → ∞ the overlap becomes arbitrarily small.

Theorem 2. With the above definitions, assuming A3 and A4, for large enough m, the clusters of T_K(P_m) = (θ_m, ξ̃_m, λ_m) can be numbered in such a way that for k ∈ {1, . . ., K} the k-th cluster contains the full central set of the k-th mixture component.

The proof of Theorem 2 is given in Appendix A.6.

Simulation study
The performance of the K-quantiles clustering algorithms is evaluated in an extensive simulation study. We generate p-dimensional vectors from K populations, K = 2, 3, 5, according to five different scenarios:
1. In the first scenario, we consider symmetric Student-t distributed variables W_j (j = 1, . . ., p) with three degrees of freedom, and we simulate K location-shifted populations from W_j, each shift from the closest population being unitary [e.g., ...].
2. In the second scenario, we test the behaviour of the clustering algorithm on highly skewed data by generating identically distributed vectors W_j (j = 1, . . ., p) from a multivariate zero-centered Gaussian distribution, transforming them using the exponential function and shifting contiguous populations by 0.6 [e.g., ...].
3. In the third scenario, variables have different distributional shapes, generated by applying different transformations (i)-(v) to location-shifted populations, among them (v) a square root transformation on the shifted populations of (i) [e.g., X_j^{(k)} ∼ √|W_j + 1.4|, . . .].
4. In the fourth scenario, we simulate different distributional shapes and levels of skewness even for different classes within each variable. Within each class, data are generated according to beta distributions, X_j^{(k)} ∼ Beta(a, b), j = 1, . . ., p and k = 1, . . ., K, with parameters a and b in the interval (1, 10) randomly generated for each class within each variable. The absolute difference between the class expected values for each variable is bounded from above by 0.2 (this is done in order not to make the clustering task too easy; as cluster differences are aggregated over many dimensions, simulated clusters may be so strongly separated that every method can find them easily).
5. The fifth scenario is similar to the fourth one. Within each class, data are generated according to beta distributions, X_j^{(k)} ∼ Beta(a, b), j = 1, . . ., p and k = 1, . . ., K, with parameters a and b randomly chosen to be in the intervals (0, 1) and (1, 5), or (0, 1) and (1, 10), (1, 3) and (5, 10), or (1, 3) and (1, 3), so as to guarantee a higher level of skewness for some variables, for each class within each variable. The absolute difference between the class expected values for each variable is bounded from above by 0.1.
For each of the five scenarios and for each number of populations K = 2, 3, 5, we evaluate combinations of p = 50, 100, 500, n = 50, 100, 500, different percentages of relevant variables for the grouping structure, i.e., 100%, 50% and 10%, and independent or dependent variables (scenarios 1-3 only), for a total of 648 different settings. In the "dependent variables" case, a dependence structure is introduced by generating variables W_1, . . ., W_p from either a t or a Gaussian distribution with random correlation matrix based on the method proposed by [20], so that the correlation matrices are uniformly distributed over the space of positive definite correlation matrices, with each correlation marginally distributed as Beta(p/2, p/2) on the interval (−1, 1). The irrelevant noise variables are generated independently of each other from the base distribution of each scenario. For each setting we simulate 100 datasets and record the Adjusted Rand Index (ARI) [17] of the resulting classification compared with the true cluster memberships. The performance of the algorithm is also examined in the case of imbalanced clusters: for the settings with K = 3 and n = 500 the group sizes are set to n_1 = 50, n_2 = 150 and n_3 = 300.
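As an illustration of how such data can be produced, the second scenario (exponentials of zero-centered Gaussians, contiguous populations shifted by 0.6, independent variables) can be sketched as follows. The function name and the balanced design are choices made here, not the paper's simulation code.

```python
import math, random

def scenario2(n_per_class, K, p, shift=0.6, seed=0):
    """Skewed clusters: exponentials of zero-centered Gaussians,
    contiguous populations shifted by `shift` (independent variables)."""
    rng = random.Random(seed)
    X, y = [], []
    for k in range(K):
        for _ in range(n_per_class):
            X.append([math.exp(rng.gauss(0.0, 1.0)) + k * shift
                      for _ in range(p)])
            y.append(k)
    return X, y

X, y = scenario2(50, 3, 10)
```

The exponential transformation makes every variable strongly right-skewed, which is the feature this scenario is designed to stress.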
We compare the K-quantiles clustering algorithms' capability of recovering the original cluster memberships with those of seven other clustering methods: two model-based clustering approaches (mixtures of Gaussians, mixtures of factor analyzers) [27], the K-means algorithm [25], Partitioning Around Medoids [21], agglomerative hierarchical clustering with the unweighted pair group method (UPGMA), spectral clustering [29] and affinity propagation [8]. The inclusion of irrelevant variables may prompt the idea that clustering methods with variable selection should also be tried out; however, variable selection is usually defined on top of an existing clustering method without variable selection, see, e.g., [9]. Such ideas can be applied to K-quantiles clustering as well as to the competing methods, which we leave for future research. Mixtures of skew-t distributions were also considered; however, due to computational difficulties in high dimensions, solutions were available only 20% of the time, so we do not present results.
Details about the implementation and parameter tuning of these methods are given in the Appendix B; detailed tables of simulation results are in the supplementary material.
More specifically, we evaluate the accuracy of each clustering method as its ARI minus the ARI of the Common Unscaled (CU) K-quantiles clustering algorithm, divided by the average ARI in the given setting (for computing this average, the three percentages of relevant variables are aggregated in order to avoid blowing up small differences between uniformly small ARI values for only 10% relevant variables by small denominators; where methods do not deliver a solution, the ARI has been set to zero). This is done for the sake of enabling a simpler display of the many results, because it allows aggregating results for different K, n, p, dependence structures, and percentages of relevant variables by scaling all these results so that they become comparable. Raw ARI values are given in the supplementary material. The aggregated distributions of these rescaled results are displayed in the boxplots of Figure 4 for the balanced cluster settings. Results for imbalanced clusters are similar and can be found in Appendix B.1.
For all methods, the capability of recovering the original cluster memberships improves as the sample size increases. For K-quantiles clustering, this is particularly evident in the scenarios where the percentage of relevant variables is very low; in all the other cases results are generally very good, and no remarkable difference due to the sample size can be noticed. All methods, for fixed sample size and percentage of relevant variables, seem to perform better as p increases in almost all of the settings. As could be expected, clustering performances worsen as the number of irrelevant variables increases.
The K-quantiles methods perform very well in most situations compared to the other clustering approaches. In the scenarios with identical distributional shapes and symmetric variables, solutions from quantile-based clustering are mostly preferable to those from any other method. Not surprisingly, the common-θ quantile procedures, i.e., CU and CS, slightly outperform those with variable-wise θ_j.

Figure 4: The five panels show the distribution of the Adjusted Rand Index for (a) identically distributed symmetric variables, (b) asymmetric variables, (c) different distributions of variables, (d) different distributions of classes within variables in balanced and unbalanced populations and (e) different (skew) distributions of classes within variables in balanced and unbalanced populations.
In the settings with identical distributional shapes and asymmetric variables, the K-quantiles clustering methods outperform all other methods clearly and more or less uniformly; here, the procedures with variable-wise θ_j, i.e., VU and VS, seem to produce a slightly better clustering.
With different distributions of variables, the K-quantiles clustering methods again show very good overall results. Only occasionally, in some situations with just 10% relevant variables, can Gaussian mixtures, K-means and spectral clustering improve on K-quantiles.
In the fourth scenario with beta distributions differing between variables and classes within variables, the K-quantiles clusterings do not always outperform the other methods: while they generally produce good results, they often fall behind the accuracy of the K-means algorithm, spectral clustering, and also Gaussian mixtures. The reason is that although these distributions are skewed, their tails vanish outside the unit interval, and often the difference between means is the most distinct feature discriminating the clusters; therefore a squared loss function is suitable for finding them. This contrasts with the fact that K-quantiles beats these methods for symmetric but t-distributed data in the first scenario, despite K-means and Gaussian mixtures implicitly assuming symmetry, as opposed to K-quantiles; however, the squared loss function is more affected by outliers in these cases.
The results of the fourth scenario prompted us to set up the fifth one, with the parameters of the beta random variables chosen from different intervals so that there is more extreme skewness, and the information about the clustering is connected to distributional features other than the means. In this situation, the K-quantiles VS algorithm is the best. K-means and spectral clustering still yield fairly good results, although worse than the K-quantiles algorithms. Gaussian mixtures also still do well, probably because their flexible covariance matrices can adapt to these setups. Their median performance is about on a par with three of the four K-quantiles algorithms (results vary depending on whether p is large compared to n or not, see the supplementary material), but worse than the VS algorithm.
Generally, the capability of recovering the cluster memberships and the rankings of the methods do not change much with dependence, although performances are slightly better under independence. Similarly, the ranking of the methods does not strongly depend on the number of clusters, nor on the presence of imbalanced clusters.
The table in Appendix C provides some information on computing times. Currently our implementation of K-quantiles is coded in R; faster implementations are certainly possible. However, our experiments show that the growth in computation time with n is much slower than for PAM and the mixture model-based methods, so that for the largest data format that we tried (n = 50000, p = 100) our K-quantiles implementation is substantially faster than all mixture methods and PAM, beaten only by K-means (hierarchical clustering does not deliver a solution). This demonstrates that K-quantiles has the potential to be used with very large datasets.
The number of clusters K is treated as fixed here, and estimating K is left to future work. However, Figure 5 shows the average behaviour (over 100 replications) of the quantile discrepancy function V n for different numbers of clusters. Data come from K = 3 populations generated according to the second scenario, with an overall sample size n = 500 and 50 features. The discrepancy function decreases with increasing K; an elbow is noticeable at K = 3, which indicates that such curves may be of some use for choosing K. The supplementary material gives some information about the distribution of the estimated parameters in the third scenario and how this relates to skewness and variance of the cluster-wise distributions.
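The elbow in the discrepancy curve of Figure 5 can be located automatically. A simple hypothetical heuristic (not proposed in the paper) is to pick the K at which the curve has its largest kink, i.e. the maximal second-order difference; the curve values below are made up to mimic a discrepancy that flattens after K = 3:

```python
import numpy as np

# Hypothetical elbow heuristic for choosing K from a discrepancy curve:
# pick the K with the largest second-order difference (maximum "kink").
# The values of v below are illustrative, not simulation output.
ks = np.arange(1, 9)
v = np.array([10.0, 6.0, 3.0, 2.7, 2.5, 2.4, 2.3, 2.25])  # averaged discrepancy

second_diff = v[:-2] - 2 * v[1:-1] + v[2:]   # defined for K = 2, ..., 7
elbow_k = ks[1:-1][np.argmax(second_diff)]
print(elbow_k)  # -> 3 for this curve
```

More refined choices of K would combine such curves with validation indexes or stability assessments, as discussed in the Conclusion.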

Application to gene expression data
For illustration we apply the K-quantiles clustering algorithms to gene expression data from the leukaemia microarray study of [11]. The dataset contains the expression levels of 3051 genes for 38 leukaemia patients, obtained at the time of diagnosis. The study reports that 27 subjects have Acute Lymphoblastic Leukaemia (ALL), while 11 have Acute Myeloid Leukaemia (AML). The objective is to group the 38 patients so as to reflect the corresponding leukaemia diagnosis, employing the information coming from their gene expression levels. In general, for methods that are not scale invariant, results depend on the scale. We consider results for unscaled data and for data with all variables scaled to unit variance.
Data are taken from the R package plsgenomics and are analysed by the same clustering methods described in Section 5. The number of true clusters is taken as known and set equal to 2 for all methods. As different initialisations of Mclust delivered different results, we tried out several and chose the one with the largest likelihood. For K-means, five random starts were run. All other algorithms were run with their default settings.
Results for all methods are shown in Table 1. As can be seen, the K-quantiles clustering algorithm with variable-wise θj and variables scaled via λj (VS) is able to perfectly recover the original cluster memberships. When using the unscaled version, the performance of VU is still quite good and superior to the other solutions. Quantile methods with common θ, whether in the scaled or the unscaled version, are not able to detect the grouping structure identified by the diagnosis: this is probably due to the fact that the distribution of the expression levels differs strongly between genes (see Figure 6, where the densities of a random sample of gene expression levels are plotted).
The Mixture of Factor Analyzers could not return a solution, as it terminated with errors.
The mixture of Gaussians and K-means yield exactly the same clustering (up to label switching); their results are still good. As the number of variables is very large, Mclust could only estimate mixtures of Gaussians with spherical or diagonal covariance matrices, reducing the choice to 6 out of the 14 possible parsimonious models, namely: EII (spherical, equal volume), VII (spherical, unequal volume), EEI (diagonal, equal volume and shape), VEI (diagonal, varying volume, equal shape), EVI (diagonal, equal volume, varying shape) and VVI (diagonal, varying volume and shape).
Spectral clustering and affinity propagation provide overall good results, but worse than the mixture of Gaussians.
The second column of Table 1 reports the ARI of the clustering methods after standardizing the variables. Apparently feature scaling has removed part of the cluster-separation signal from the data, and the resulting clusterings are worse for most methods. This shows that global standardization does not always work well for finding an appropriate scaling.

Conclusion
K-quantiles clustering is a new clustering method based on representing clusters by quantiles of the within-cluster distributions. It can be interpreted as a Maximum Likelihood estimator for a fixed partition model of asymmetric Laplace distributions, but like K-means it is not in the first place meant to be associated with a specific model assumption; rather it provides an intuitive objective function that allows for within-cluster skewness and is easy to optimize locally using a Lloyd-type algorithm. In our simulations the method did well on a wide range of within-cluster models different from the asymmetric Laplace.
[13] encourages researchers to give potentially informal descriptions of what specific kinds of clusters a new clustering method is meant to find. The development of K-quantiles clustering was motivated in the first place by the potential of the quantile-based discrepancy to add flexibility to K-means, particularly regarding within-cluster skewness. The underlying model suggests that clusters can be distributions of which the marginals are unimodal and potentially skewed; Theorem 2 shows that sufficiently well separated subpopulations will be K-quantiles clusters even if they are not unimodal (as long as K is fixed and there are more than K modes, it is hardly possible to have only unimodal clusters). Similarly to K-means, the K-quantiles objective function sums up information over the different variables. This does not necessarily mean that variables have to be independent within clusters, but information about dependence is not used; the discrepancy is just an aggregation of variable-wise information. Clusters will not be rotation invariant, and information carried in the original variables will be lost when considering linear combinations such as principal components. The clusters are treated as being of the same shape, although with enough separation this does not stop the method from finding clusters with different shapes, see Theorem 2 and the simulations. The advantage of this is that a parsimonious parametrization allows the handling of high-dimensional data. An obvious generalization would be to allow the parameters θ and λ to vary between clusters, but this is likely to require considerably more computational effort. The use of unsquared distances gives outliers less influence on the cluster barycenters than in K-means or ML estimation for Gaussian mixtures.
The number of clusters K is fixed here, and estimating K is left to future work. Many methods for estimating the number of clusters are based on computing clusterings for a range of values of K and then using cluster validation indexes or stability assessments to pick the best K [12, 23]. Such an approach can be used for estimating K together with K-quantiles clustering in the same manner as with K-means or K-medoids. Similarly, principles for variable selection that exist for K-means and other clustering methods could be applied to K-quantiles.
The penalization of the quantile-defining probability θ and the scaling parameter λ, as derived here from a fixed partition model of asymmetric Laplace distributions, may also be helpful for quantile-based supervised classification as introduced in [15].

A.1. Proof of Proposition 1
The sum in (7) can be minimized for each component independently, see (5), and also separately for k = 1, . . ., K. Therefore consider w.l.o.g. p = 1. Then the right side of (7) can be written as and the score function is , so that ξ = qn(θ) (any point of the possible interval of quantiles).
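Proposition 1 invokes the standard fact that the asymmetric absolute ("check") loss is minimized by the empirical θ-quantile. A quick numerical verification (a sketch, not the paper's code; the grid search stands in for the algorithm's exact minimization):

```python
import numpy as np

def check_loss(x, xi, theta):
    """Asymmetric absolute loss: theta*(x-xi) above xi, (1-theta)*(xi-x) below."""
    u = x - xi
    return np.sum(np.where(u >= 0, theta * u, (theta - 1) * u))

rng = np.random.default_rng(1)
x = rng.normal(size=200)
theta = 0.3

# Minimize over a fine grid; the minimizer should sit at the empirical
# theta-quantile (up to the usual non-uniqueness of sample quantiles).
grid = np.linspace(x.min(), x.max(), 20001)
losses = [check_loss(x, xi, theta) for xi in grid]
xi_hat = grid[int(np.argmin(losses))]

print(xi_hat, np.quantile(x, theta))
```

The grid minimizer agrees with `np.quantile` up to the width of the flat interval of minimizers, consistent with the "interval of quantiles" qualifier in the proposition.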

A.3. Proof of Proposition 3
The proof proceeds by taking the first derivative of with respect to θ*, which gives: Equating the previous expression to zero and multiplying by −θ*(1 − θ*) yields the quadratic solution for θ.

A.4. Proof of Proposition 4
Similarly to Proposition 2, the proof proceeds by computing the score with respect to λ*:

A.5. Proof of Theorem 1
The principle of the proof is to show that Tn,K(xn) for large enough n has to lie in a compact set C. In this compact set, by the uniform law of large numbers, Vn,K(θ, ξ, λ, xn) will converge uniformly to VK(θ, ξ, λ, P), which in turn, together with continuity, will also force the minimizer to converge. The (θ, λ) optimizing Vn,K are forced to eventually lie in a compact set by the penalty term − log λθ(1 − θ). For ξ, the argument is inductive, similar to what was done in [31]. It is first shown that at least one of the optimizing ξk must lie in a compact set; then, assuming that this holds for K − 1 clusters but not for K, the Kth cluster can be shown to make an asymptotically negligible additional contribution to SK, so that SK = SK−1, in contradiction to A2.
Consider the first term of Vn,K(θn, ξn, λn, xn): Then, for large enough n this converges, a.s., to P( x > m)λ + B1, which can be made arbitrarily small by choosing m large enough. Therefore, for arbitrarily small δ > 0 and n large enough, In order to make use of this, a uniform convergence argument is needed. Recall (θn, ξ*n, λn) ∈ C. According to [32], Example 19.8 (sometimes referred to as the "uniform law of large numbers"), if F = {fθ : θ ∈ Θ} is a set of measurable functions with θ → fθ(x) continuous for all x and Θ compact, and (θ, ξ) → 0 regardless of whether ξ comes from above or from below. Therefore, for fixed x ∈ Rp and general K*, is continuous as a minimum of continuous functions. U(θ, ξ, λ, x) can be bounded by a P-integrable function: let ξ+ be an upper bound for the components |ξkj| (assumed to be in a compact set here). Then, In particular, sup Going back to (14), choose m large enough that SK < SK−1 − δ. By definition of the optimizers, and, for large enough n, a.s., but also, eventually, This implies that ξnk is eventually also captured in the compact set C. (15) now ensures uniform convergence of Vn,K to VK over C. The existence of an integrable envelope of U together with continuity of U implies the continuity of VK as a function of (θ, ξ, λ) ∈ C. This and A2 imply Tn,K(xn) → TK(P) a.s., because otherwise with probability > 0 a subsequence of Tn,K(xn) could converge to some (θ*, ξ*, λ*) ≠ TK(P) but ∈ C and with VK(θ*, ξ*, λ*, P) = VK(TK(P)), in contradiction to A2.

A.6. Proof of Theorem 2
The idea here is to show that if for arbitrarily large m a cluster in TK(Pm) can be found that has a nonempty intersection with at least two of the central sets { x − ρmk < }, SK would be larger than what could be achieved by putting all the cluster barycenters at the cluster centers, contradicting the optimality of TK(Pm). Write Similarly to the proof of Theorem 1, ∃θ− > 0, λ− > 0 so that for large enough m: min(θm1, . . ., θmp, 1 − θm1, . . ., 1 − θmp) ≥ θ−, min(λm1, . . ., λmp) ≥ λ−, because otherwise the penalty term −, so this cannot happen for the minimizer of VK. Therefore assume w.l.o.g. lim sup m Pm(Im11) = 0. If also lim sup m Pm(Im21) = 0, for this subsequence {γm(x) = 1} has no nonzero probability overlap with any mixture component's central set (all of which have probability ≥ δ because of (13)), and there are K − 1 clusters left to cover K central sets, in contradiction to (16). Therefore lim sup m Pm(Im21) = τ > 0. This means that ξm1 − ρm2 must be bounded, otherwise the argument leading to (17) applies again. Therefore ξm1 − ρm1 is unbounded. There must be another cluster, w.l.o.g. {γm(x) = 2}, so that Pm(Im12) = τ* > 0. For x ∈ Im11 ≠ ∅: This, together with ξm1 − ρm1 → ∞ at least for a subsequence, forces ξm2 − ρm1 → ∞ as well. But then, as above, with b, and again this is impossible for the minimizer of VK. Taken together, with any numbering of clusters, these cannot happen together, so all the central sets { x − ρmk < }, k ∈ {1, . . ., K}, must eventually be subsets of different clusters.

Different distributional shapes and levels of skewness even for different classes within each variable. Within each class, data are generated according to beta distributions with parameters a and b randomly chosen from the intervals (0, 1) and (1, 5), (1, 3) and (5, 10), or (1, 3) and (1, 3), for each class within each variable. The absolute difference between the class expected values is bounded from above by 0.1, and the chosen intervals guarantee a higher level of skewness for some variables.
• K = 3, three populations X(1), X(2) and X(3);
• K = 5, five populations X(1), X(2), X(3), X(4) and X(5).
For each of the five scenarios and for each set of K populations, K = 2, 3, 5, we evaluated combinations of p ∈ {50, 100, 500}, n ∈ {50, 100, 500}, different percentages of variables relevant for the grouping structure, i.e. 100%, 50% and 10%, and independent or dependent variables (limited to scenarios 1-3), for a total of 648 different settings. The dependence structure among the variables was modeled via the function rcorrmatrix from the R package clusterGeneration, so that the correlation matrices are uniformly distributed over the space of positive definite correlation matrices, with each correlation marginally distributed as Beta(p/2, p/2) on (−1, 1). The irrelevant noise variables were generated independently of each other from the base distribution of each scenario (for scenario 3 this means that all five base distributions were used in equal frequency; for scenarios 4 and 5 random parameters were used, as for the within-cluster distributions of the informative variables).
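A data-generating setup of this kind can be sketched as follows (a hypothetical illustration, not the paper's simulation code; the parameter intervals and dimensions below are made up): each informative variable gets class-specific beta parameters drawn at random, while noise variables share one distribution across all classes.

```python
import numpy as np

# Hypothetical sketch of scenario-4/5-style data generation. Informative
# variables get class-specific Beta(a, b) parameters; noise variables share
# one distribution. The intervals below are illustrative, not the paper's.
rng = np.random.default_rng(2)

def make_data(n_per_class=50, k=3, p_inform=10, p_noise=10):
    cols = []
    for _ in range(p_inform):
        # class-specific parameters -> the variable carries cluster information
        col = np.concatenate([
            rng.beta(rng.uniform(0.5, 2.0), rng.uniform(1.0, 5.0), size=n_per_class)
            for _ in range(k)
        ])
        cols.append(col)
    for _ in range(p_noise):
        a, b = rng.uniform(0.5, 2.0), rng.uniform(1.0, 5.0)
        cols.append(rng.beta(a, b, size=n_per_class * k))  # same law for all classes
    X = np.column_stack(cols)
    labels = np.repeat(np.arange(k), n_per_class)
    return X, labels

X, y = make_data()
print(X.shape, y.shape)
```

Constraining the class means to be close, as in the scenario description, would force clustering methods to exploit skewness rather than location differences.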
For the case of imbalanced classes, data were generated from the five scenarios considering K = 3 and n = 500; the cluster sizes are set to n 1 = 50, n 2 = 150 and n 3 = 300.
The number of clusters is taken as known (and equal to the number of populations the data are generated from) for every method. The clustering procedures that have been considered are the following:
• Affinity Propagation clustering, estimated by the function apclusterK from the R package apcluster, with similarities computed as negative squared distances.
For each setting, 100 simulations were run. The average Adjusted Rand Index values and the corresponding standard errors, multiplied by 100, are reported in the following tables; for each method and scenario the number of valid cases out of 100 runs is included as well.

B.1. Simulation results in the case of imbalanced classes
The performance of different clustering algorithms relative to the Common Unscaled K-quantiles clustering algorithm for imbalanced clusters (results for K = 3 and n = 500 only) is shown in Table 2. The labels along the horizontal axis refer to the different methods: CS, Common Scaled K-quantiles; VU, Variable-wise Unscaled K-quantiles; VS, Variable-wise Scaled K-quantiles; Mixt-N, mixture of Normal distributions; MFA, mixture of factor analyzers; KM, K-means clustering; PAM, Partitioning Around Medoids; h-avg, hierarchical clustering with average linkage (UPGMA); SpeCl, spectral clustering; AffPr, affinity propagation. The five panels show the distribution of the Adjusted Rand Index for (a) identically distributed symmetric variables, (b) asymmetric variables, (c) different distributions of variables, (d) different distributions of classes within variables and (e) different (skewed) distributions of classes within variables.

Fig 1. Examples of quantile-based densities for different values of θ.

Fig 2. Unpenalized (dashed red line) and penalized (black line) quantile dispersion for data generated from a Gaussian distribution.

Fig 3. In the second row of the panel, the penalized dispersion function is plotted against θ for data generated according to the density functions depicted in the first row: a symmetric distribution, a positively skewed distribution and a negatively skewed distribution.

Fig 4. Performance of different clustering algorithms relative to the Common Unscaled K-quantiles clustering algorithm for balanced clusters. The labels along the horizontal axis refer to the different methods: CS, Common Scaled K-quantiles; VU, Variable-wise Unscaled K-quantiles; VS, Variable-wise Scaled K-quantiles; Mixt-N, mixture of Normal distributions; MFA, mixture of factor analyzers; KM, K-means clustering; PAM, Partitioning Around Medoids; h-avg, hierarchical clustering with average linkage (UPGMA); SpeCl, spectral clustering; AffPr, affinity propagation. The five panels show the distribution of the Adjusted Rand Index for (a) identically distributed symmetric variables, (b) asymmetric variables, (c) different distributions of variables, (d) different distributions of classes within variables in balanced and unbalanced populations and (e) different (skewed) distributions of classes within variables in balanced and unbalanced populations.

Fig 5. Quantile discrepancy of the K-quantiles algorithm for different values of K, averaged over 100 replications. Error bars span ±1 standard error.

Fig 6. Leukaemia dataset: densities of five randomly selected variables (gene expression levels; fitted by R's density function with default settings). First row: all observations together. Second row: by true cluster.

• Common θ and Unscaled variables (CU) K-quantiles clustering algorithm;
• Variable-wise θj and Unscaled variables (VU) K-quantiles clustering algorithm;
• Common θ and Scaled variables (CS) K-quantiles clustering algorithm;
• Variable-wise θj and Scaled variables (VS) K-quantiles clustering algorithm;
• Mixture of Gaussian distributions, estimated with the default options of the function Mclust from the R package mclust;
• Mixture of skew-t distributions, estimated by the EmSkew function from the R package EMMIXskew with argument distr equal to mst, initialised by the K-means clustering algorithm;
• Mixture of Factor Analyzers, estimated by the fma function from the R package FactMixtAnalysis, fitting models with number of latent factors from 1 to 20;
• K-means clustering, run by the kmeans function from the stats R package with five random starts;
• Partitioning Around Medoids, run with the default options of the pam function from the cluster R package;
• Agglomerative hierarchical clustering with average linkage, run by the hclust function (method='average') of the stats R package;
• Spectral clustering, estimated with the default options of the function specc from the R package kernlab;
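All methods above are scored by the Adjusted Rand Index against the true partition. As a self-contained illustration of the score (a sketch, not the paper's R code), the ARI can be computed directly from the contingency table of two label vectors:

```python
import numpy as np
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """ARI of two flat label vectors, computed from their contingency table."""
    a_vals, a_inv = np.unique(labels_a, return_inverse=True)
    b_vals, b_inv = np.unique(labels_b, return_inverse=True)
    cont = np.zeros((len(a_vals), len(b_vals)), dtype=int)
    np.add.at(cont, (a_inv, b_inv), 1)  # cross-tabulate the two labelings

    n = len(labels_a)
    sum_cells = sum(comb(int(x), 2) for x in cont.ravel())
    sum_rows = sum(comb(int(x), 2) for x in cont.sum(axis=1))
    sum_cols = sum(comb(int(x), 2) for x in cont.sum(axis=0))
    expected = sum_rows * sum_cols / comb(n, 2)   # chance-level agreement
    max_index = (sum_rows + sum_cols) / 2
    return (sum_cells - expected) / (max_index - expected)

# Perfect recovery up to label switching scores 1.
print(adjusted_rand_index([0, 0, 1, 1, 2, 2], [5, 5, 3, 3, 9, 9]))
```

The chance correction is what makes the index comparable across different numbers and sizes of clusters, which matters for the imbalanced-cluster settings below.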

Table 1. Adjusted Rand Index multiplied by 100 for the leukaemia dataset, obtained on the original (unscaled) and on the scaled features respectively. As shown earlier, CS and VS are scale invariant; any potential changes in their results are due to random initialisation and not to scaling.

Table 2. Average computing times in seconds.