Modal clustering asymptotics with applications to bandwidth selection

Density-based clustering relies on the idea of linking groups to some specific features of the probability distribution underlying the data. The reference to a true, yet unknown, population structure makes it possible to frame the clustering problem in a standard inferential setting, where the concept of ideal population clustering is defined as the partition induced by the true density function. The nonparametric formulation of this approach, known as modal clustering, draws a correspondence between the groups and the domains of attraction of the density modes. Operationally, a nonparametric density estimate is required, and a proper selection of the amount of smoothing, which governs the shape of the density and hence possibly the modal structure, is crucial to identify the final partition. In this work, we address the issue of density estimation for modal clustering from an asymptotic perspective. A natural and easy-to-interpret metric to measure the distance between density-based partitions is discussed, its asymptotic approximation explored, and employed to study the problem of bandwidth selection for nonparametric modal clustering.


Introduction
Clustering is commonly referred to as the task of finding groups in a set of data points (see [24], [16] or [21]). While intuitively clear, this task is, in fact, far from being accurately defined. The density-based approach attempts to circumscribe this issue by framing the problem into a statistically rigorous setting where the observed data are assumed to be realizations of a random variable, and the clusters are defined with respect to some characteristic of its underlying probability distribution.
In this sense, a clustering procedure should not be limited to simply producing a partition of the observed data; instead, it should make it possible to obtain a whole-space clustering, that is, a partition of the whole sample space [2,3]. In any case, each methodology is characterized by the way in which the clusters are defined in terms of the true distribution, leading to the concept of ideal population clustering. By serving as a reference "ground truth" to aim at, this concept introduces a benchmark to evaluate the performance of data-based partitions.
The ideal population goal in density-based clustering can be defined in terms of two different paradigms: the model-based approach, where each cluster is associated to a parametric mixture component, and the modal one (see respectively [28] and [30] for some recent reviews). This paper focuses on the latter formulation, whose name stems from the notion of clusters as the "domains of attraction" of the modes of the true density underlying the data [40].
Therefore, in practice, density estimation assumes a key role in approximating the ideal population goal of modal clustering. While the modal formulation does not preclude using a parametric density estimate as a first step to perform a data-based modal clustering [4,36], a long-standing practice resorts to nonparametric estimators. Specifically, in this paper the focus lies on estimators based on kernel smoothing (see e.g. [6] and [45]).
Under-or over-smoothed estimates may lead to deceiving indications about the modal structure of the underlying density function, and this problem is usually quantified through some measure of the discrepancy between the estimate and the target density. In contrast, the aim of this work is to consider nonparametric density estimation as a tool for the final purpose of modal clustering, focusing on an appropriate metric comparing the partitions induced by the true and the estimated distribution.
Our main result provides an asymptotic approximation for the considered metric, which makes it possible to introduce new automatic bandwidth selection procedures specifically designed for nonparametric modal clustering. The accuracy of this approximation and the performance of the new methods in practice, with respect to the proposed error criterion, are extensively studied via simulations and compared with some plausible competitors.
The rest of the paper is structured as follows. Section 2 formally introduces the modal approach to cluster analysis with reference also to algorithmic details. In Section 3 the distance criterion to target density estimation for modal clustering is presented, along with the main asymptotic result and its consequences. Section 4 contains the setup and results of the numerical experiments. A generalization to the multidimensional setting is discussed in Section 5. Finally, some concluding remarks are stated in Section 6.

Background
The connection between groups and density features, established by the modal approach to cluster analysis, makes it possible to characterize the concept of ideal population clustering. Informally, a population cluster can be defined as the domain of attraction of a mode of the density [40]. An attempt to formalize this concept was made in [3] with the aid of Morse theory, a branch of differential topology focusing on the large-scale structure of an object via the analysis of the critical points of a function (see e.g. [27] for an introduction).
Let us consider a continuous d-variate random variable X, with probability density function f : R^d → R. Assume that f is a Morse function, i.e. a smooth enough function having nondegenerate critical points, and denote by M_1, ..., M_r the modes of f (i.e. its local maxima). For a given initial value x ∈ R^d, an integral curve of the negative density gradient −∇f is defined as the solution ν_x : R → R^d of the initial value problem

ν_x'(t) = −∇f(ν_x(t)),  ν_x(0) = x.

The set of points whose integral curve starts at a critical point x_0 (as t → −∞) goes under the name of unstable manifold of x_0, and is defined as

W^−(x_0) = { x ∈ R^d : lim_{t→−∞} ν_x(t) = x_0 }.

It has been shown [42] that the unstable manifolds of the critical points of f partition the whole space and that, up to a set of null measure, this partition is formed by the unstable manifolds of the modes. Equivalently, if the integral curves associated to the positive density gradient are considered, then a modal cluster is defined as the set of points whose integral curves converge (as t → +∞) to the same mode. The concept of modal clusters as the domains of attraction of the density modes stems naturally from this definition.

Operationally, a numerical algorithm is needed to find the eventual destination of an initial point, and most of the contributions in this direction build on the mean-shift algorithm [17], essentially a variant of the gradient ascent algorithm. The algorithm transforms an initial point x^{(0)} recursively, identifying a sequence (x^{(0)}, x^{(1)}, x^{(2)}, ...) according to an updating mechanism of the form

x^{(s+1)} = x^{(s)} + A ∇f(x^{(s)}) / f(x^{(s)}),

where A is a d × d positive definite matrix chosen to guarantee the convergence to a local maximum of f. A partition of the data is therefore obtained by simply grouping together the observations climbing to the same density mode via mean-shift updates.
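As a toy illustration of this mechanism, the recursion can be run on a known density. The sketch below uses a two-component Gaussian mixture with a scalar step in place of the matrix A; the mixture, step size and starting points are illustrative choices, not taken from the paper.

```python
import math

# Two-component Gaussian mixture: 0.5*N(-2,1) + 0.5*N(2,1) (illustrative)
def phi(x, mu):
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2.0 * math.pi)

def f(x):
    return 0.5 * phi(x, -2.0) + 0.5 * phi(x, 2.0)

def grad_f(x):
    # derivative of each component: -(x - mu) * phi(x, mu)
    return -0.5 * (x + 2.0) * phi(x, -2.0) - 0.5 * (x - 2.0) * phi(x, 2.0)

def mean_shift(x, step=0.3, tol=1e-9, max_iter=10000):
    # recursion x <- x + a * grad f(x) / f(x), a scalar step playing the role of A
    for _ in range(max_iter):
        x_new = x + step * grad_f(x) / f(x)
        if abs(x_new - x) < tol:
            break
        x = x_new
    return x_new

left, right = mean_shift(-0.5), mean_shift(0.5)
```

Starting points on opposite sides of the valley near zero are attracted to different modes, which is precisely the grouping rule described above.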
From a practical point of view the density f is unknown, therefore an estimate is needed. When working in a nonparametric framework a common choice is given by the kernel density estimator.
In the following we focus on the univariate case for ease of exposition and mathematical tractability, while the multivariate extension will be addressed in Section 5 below. Let X_1, ..., X_n be a sample of i.i.d. realizations of X. Then, the kernel density estimator is defined as

f̂_h(x) = (nh)^{-1} Σ_{i=1}^n K((x − X_i)/h),

where K is the kernel, usually a smooth, non-negative and symmetric function integrating to one, and h > 0 is the bandwidth, which controls the smoothness of the density estimate. In order to select the smoothing parameter, some measure of the distance between the estimated and the true density is needed. A common choice is the Integrated Squared Error, defined as

ISE(h) = ∫ ( f̂_h(x) − f(x) )² dx.

Depending on the observed data, the ISE is itself subject to a random variability that could hinder the problem of bandwidth selection (see [19]). Hence, its expected value,

MISE(h) = E[ISE(h)],   (2.1)

is alternatively considered as a non-stochastic error distance. The optimal bandwidth h_MISE is then defined as h_MISE = argmin_{h>0} MISE(h).
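A minimal numerical sketch of the estimator and of the ISE follows; the Gaussian kernel, the Riemann-sum integration and the sample size are illustrative choices.

```python
import math, random

def gauss_kernel(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def kde(x, data, h):
    # f_hat_h(x) = (nh)^{-1} sum_i K((x - X_i)/h)
    return sum(gauss_kernel((x - xi) / h) for xi in data) / (len(data) * h)

def ise(data, h, true_density, lo=-6.0, hi=6.0, m=2000):
    # midpoint Riemann-sum approximation of ISE(h)
    dx = (hi - lo) / m
    return sum((kde(lo + (i + 0.5) * dx, data, h)
                - true_density(lo + (i + 0.5) * dx)) ** 2
               for i in range(m)) * dx

random.seed(1)
data = [random.gauss(0.0, 1.0) for _ in range(200)]
f_true = lambda x: math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

# the density estimate integrates to (approximately) one
mass = sum(kde(-6.0 + (i + 0.5) * 0.006, data, 0.4) for i in range(2000)) * 0.006

# an oversmoothed estimate incurs a larger ISE than a moderate bandwidth here
err_small, err_big = ise(data, 0.4, f_true), ise(data, 3.0, f_true)
```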
Since minimization of the MISE does not lead to closed form solutions for the optimal bandwidth, its asymptotic counterpart -the AMISE -is often considered. Both the MISE and the AMISE depend on the true, unknown density function; for this reason several different approaches to estimate them have been proposed. Examples are the ones based on least squares cross validation, biased cross validation or plug-in bandwidth selectors. A comprehensive review of these methods is beyond the scope of this work and, for a complete exposition, readers can refer to [45] or to the more recent book by [6].
Density estimation for modal clustering

Asymptotic bandwidth selection for modal clustering
Bandwidth selectors based on the ISE or similar distances pursue the aim of obtaining an appropriate estimate of the density. However, the goal of modal clustering is markedly different from that of density estimation (see e.g. [12]). In fact, two densities that are close with respect to the ISE may result in quite different clusterings while, on the other hand, densities far apart from an ISE point of view could lead to the same partition of the space. A graphical illustration of this idea is provided by Figure 1. The inappropriateness of the ISE, or related distances, depends on its focus on the global characteristics of the density, while modal clustering strongly builds on specific and local features, more closely related to the density gradient or the high-density regions (see also [10]). Therefore, the choice of the amount of smoothing should be tailored specifically for clustering purposes.

So far, the aim of choosing an amount of smoothing for the specific task of highlighting clustering structures has been scarcely pursued in the literature. A related idea, although without particular reference to cluster analysis, has been developed by [35], who propose a plug-in type bandwidth selector appropriate for the estimation of highest density regions (see also [31] and [14]). Another related work, more focused on the clustering problem, is the one by [15], where the author suggests considering the self-coverage measure as a criterion for bandwidth selection. Alternatively, the potential adequacy of a bandwidth selected to properly estimate the density gradient has been pointed out informally by [5] and explored numerically by [8]. The theoretical motivation of this suggestion lies in the strong dependence of both the population modal clustering and the mean-shift updating mechanism on the density gradient.
The suggestion in [9] follows the same rationale, with the bandwidth selected as a modification of the normal reference rule for density gradient estimation.
To address the problem of bandwidth selection for modal clustering, an appropriate measure of distance should compare the data-based clustering induced by a kernel density estimate with the ideal population one. Stemming from [3], a natural choice is the distance in measure, where the considered measure here is the probability P induced by the density f. Formally, let C = {C_1, ..., C_r} and D = {D_1, ..., D_s} be two partitions with r ≤ s (i.e. possibly different numbers of groups). The distance in measure between C and D is defined as

d(C, D) = (1/2) min_{σ ∈ P_s} { Σ_{i=1}^r P(C_i Δ D_{σ(i)}) + Σ_{i=r+1}^s P(D_{σ(i)}) },   (3.1)

where C Δ D = (C \ D) ∪ (D \ C) is the symmetric difference between any two sets C and D, and P_s denotes the set of permutations of {1, 2, ..., s}. This distance finds an interpretation as the minimal probability mass that would need to be re-labeled to transform one clustering into the other (see Figure 2 for a graphical illustration). In this sense, the second term in (3.1) serves as a penalization for unmatched clusters in one of the clusterings. Practically, this distance conveys the idea that two partitions are similar not when they are physically close, but when the differently-labeled points do not represent a significant portion of the distribution.
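For distributions discretized over a finite number of cells, this definition can be evaluated by direct enumeration of the permutations; a small sketch follows (the partitions and cell weights are illustrative).

```python
from itertools import permutations

def distance_in_measure(labels_c, labels_d, weights, r, s):
    """Distance in measure between partitions C (r clusters) and D (s clusters),
    r <= s, for a distribution discretized into cells with given probabilities.
    labels_c[i], labels_d[i]: cluster index of cell i under each partition."""
    best = float("inf")
    for sigma in permutations(range(s)):
        total = 0.0
        for i in range(s):
            if i < r:
                # P(C_i symmetric-difference D_{sigma(i)})
                total += sum(w for lc, ld, w in zip(labels_c, labels_d, weights)
                             if (lc == i) != (ld == sigma[i]))
            else:
                # penalization for unmatched clusters of D
                total += sum(w for ld, w in zip(labels_d, weights)
                             if ld == sigma[i])
        best = min(best, total)
    return 0.5 * best

# four equiprobable cells; the partitions differ only in the label of cell 1
d = distance_in_measure([0, 0, 1, 1], [0, 1, 1, 1], [0.25] * 4, r=2, s=2)
```

Here d equals 0.25, the mass of the single relabeled cell, in line with the re-labeling interpretation above.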
It should be noted that the choice of this distance to evaluate the performance of a data-based clustering is not arbitrary. Indeed, many other possibilities are described in [29], but the conclusion of that study is that the distance in measure (called misclassification error there) is "the distance that comes closest to satisfying everyone". Furthermore, in [43] the distance in measure is considered as "the most convenient choice from a theoretical point of view".
As with the ISE-MISE duality, the distance in measure is a stochastic error distance, so for the purpose of bandwidth selection it seems more convenient to focus on the Expected Distance in Measure,

EDM(h) = E[ d(Ĉ_h, C_0) ],

where Ĉ_h is the data-based partition induced by f̂_h and C_0 represents the ideal population clustering. Once the appropriate error distance is defined, the optimal bandwidth is given by

h_EDM = argmin_{h>0} EDM(h).

As it happened with h_MISE, it does not seem possible to find an explicit expression for h_EDM.
Hence, our goal will be to obtain an asymptotic form for the EDM from which a simple approximation to h_EDM can be derived.
To this aim, consider a standard normal random variable Z, and define ψ(µ, σ²) = E|µ + σZ| for µ ∈ R and σ > 0. Since |µ + σZ| has a folded normal distribution [25], ψ(µ, σ²) can be explicitly expressed as

ψ(µ, σ²) = σ (2/π)^{1/2} e^{−µ²/(2σ²)} + µ { 1 − 2Φ(−µ/σ) },

where Φ denotes the distribution function of Z. This function ψ plays a key role in the asymptotic behavior of the expected distance in measure, as the next result shows (see Appendix A for a proof).
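Since the folded normal mean has a standard closed form, ψ can be evaluated exactly and cross-checked against a Monte Carlo approximation of E|µ + σZ|; in the sketch below the sample size, seed and test values are arbitrary.

```python
import math, random

def Phi(z):
    # standard normal distribution function via the error function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def psi(mu, sigma2):
    """E|mu + sigma*Z| for Z ~ N(0,1): the folded normal mean."""
    sigma = math.sqrt(sigma2)
    return (sigma * math.sqrt(2.0 / math.pi) * math.exp(-mu * mu / (2.0 * sigma2))
            + mu * (1.0 - 2.0 * Phi(-mu / sigma)))

random.seed(0)
mc = sum(abs(1.5 + 2.0 * random.gauss(0.0, 1.0)) for _ in range(200000)) / 200000
exact = psi(1.5, 4.0)  # mu = 1.5, sigma^2 = 4
```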
Theorem 1. Assume that f is a bounded Morse function with r ≥ 2 modes and local minima m_1 < · · · < m_{r−1}, three-times continuously differentiable around each m_j, that ∫ |x| f(x) dx < ∞, and that the kernel K is supported on (−1, 1), has four bounded derivatives and satisfies

∫ K(x) dx = 1,  ∫ x K(x) dx = 0,  µ_2(K) = ∫ x² K(x) dx ∈ (0, ∞).

Then, if h → 0 and nh³ → ∞ as n → ∞,

EDM(h) = AEDM(h) + o( h² + (nh³)^{-1/2} ),

where

AEDM(h) = Σ_{j=1}^{r-1} ψ( b_j h², a_j² (nh³)^{-1} ),   (3.4)

with b_j = µ_2(K) f(m_j) f^{(3)}(m_j) / ( 2 f^{(2)}(m_j) ) and a_j² = R(K^{(1)}) f(m_j)³ / f^{(2)}(m_j)², where R(g) = ∫ g(x)² dx and g^{(k)} refers to the k-th derivative of a function g(·).
The asymptotically optimal bandwidth h_AEDM is then defined as the value of h > 0 that minimizes AEDM(h). Due to the structure of ψ(·, 1), minimization of (3.4) is closely related to the problem of minimizing the L_1 distance in kernel density estimation and, in fact, reasoning as in [20] it is possible to show that h_AEDM is of order n^{-1/7}. Unfortunately, as it happened with h_EDM, it seems that h_AEDM does not admit an explicit representation either. Hence, to get further insight into the problem of optimal bandwidth selection for density clustering, it appears necessary to rely on a tight upper bound for AEDM(h).
To find such a bound it is useful to note that many properties of ψ(u, 1) are given in [13, Ch. 5], and can be translated to our function of interest by taking into account that ψ(µ, σ²) = σψ(µ/σ, 1). It follows that ψ(µ, σ²) is symmetric with respect to µ, nondecreasing for µ > 0, and convex. By taking into account that e^{−µ²/(2σ²)} and 1 − 2Φ(−µ/σ) are both bounded by 1, [13] also noted that

ψ(µ, σ²) ≤ (2/π)^{1/2} σ + |µ|   (3.5)

for all µ ∈ R, σ > 0. However, a tighter bound for small values of µ is given in the next lemma (see Appendix A for a proof).

Lemma 1. For all µ ∈ R and σ > 0,

ψ(µ, σ²) ≤ (2/π)^{1/2} σ + (2π)^{-1/2} µ² / σ.

The bound in Lemma 1 is tighter than (3.5) whenever |µ| ≤ (2π)^{1/2} σ, but the situation reverses for larger values of |µ|, so that neither of the two bounds is uniformly better (see Figure 3); hence we should keep track of both of them. They lead to upper bounds for the asymptotic EDM.
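Both bounds, and the crossover at |µ| = (2π)^{1/2}σ, can be verified numerically; a quick sketch (the value of σ is arbitrary):

```python
import math

def Phi(z):
    # standard normal distribution function via the error function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def psi(mu, sigma2):
    # E|mu + sigma*Z|, the folded normal mean
    sigma = math.sqrt(sigma2)
    return (sigma * math.sqrt(2.0 / math.pi) * math.exp(-mu * mu / (2.0 * sigma2))
            + mu * (1.0 - 2.0 * Phi(-mu / sigma)))

def bound1(mu, sigma2):
    # psi(mu, sigma^2) <= (2/pi)^{1/2} sigma + |mu|   (the bound in (3.5))
    return math.sqrt(2.0 / math.pi * sigma2) + abs(mu)

def bound2(mu, sigma2):
    # Lemma 1: psi(mu, sigma^2) <= (2/pi)^{1/2} sigma + (2 pi)^{-1/2} mu^2 / sigma
    sigma = math.sqrt(sigma2)
    return (math.sqrt(2.0 / math.pi) * sigma
            + mu * mu / (math.sqrt(2.0 * math.pi) * sigma))

sigma = 1.3  # arbitrary
cross = math.sqrt(2.0 * math.pi) * sigma  # the two bounds coincide here
```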
Corollary 1. Under the assumptions of Theorem 1, for ℓ = 1, 2,

AEDM(h) ≤ AB_ℓ(h),

where

AB_1(h) = b h² + (2/π)^{1/2} a (nh³)^{-1/2},
AB_2(h) = (2/π)^{1/2} a (nh³)^{-1/2} + (2π)^{-1/2} c n^{1/2} h^{11/2}.

Here, b = Σ_{j=1}^{r-1} |b_j|, a = Σ_{j=1}^{r-1} a_j and c = Σ_{j=1}^{r-1} b_j²/a_j, with b_j and a_j as in Theorem 1. The minimizers of AB_1(h) and AB_2(h) can be computed explicitly, and are given by

h_AB1 = ( 3a / ( 2 (2π)^{1/2} b ) )^{2/7} n^{-1/7},   (3.6)
h_AB2 = ( 6a / (11 c) )^{1/7} n^{-1/7}.   (3.7)
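To illustrate how such explicit minimizers arise, consider an upper bound of the generic form b h² + (2/π)^{1/2} a (nh³)^{-1/2}, where a, b > 0 collect the unknown constants; elementary calculus gives a minimizer proportional to n^{-1/7}, which can be checked against brute-force minimization. The exact form of the bound is assumed here for illustration, and the values of a, b and n are arbitrary.

```python
import math

def ab1(h, a, b, n):
    # bound of the form b*h^2 + sqrt(2/pi) * a * (n*h^3)^(-1/2)
    return b * h ** 2 + math.sqrt(2.0 / math.pi) * a / math.sqrt(n * h ** 3)

def h_ab1(a, b, n):
    # closed-form minimizer: (3a / (2*(2*pi)^(1/2)*b))^(2/7) * n^(-1/7)
    return ((3.0 * a / (2.0 * math.sqrt(2.0 * math.pi) * b)) ** (2.0 / 7.0)
            * n ** (-1.0 / 7.0))

def argmin_grid(fun, lo, hi, m=200000):
    # brute-force minimization over a fine grid
    best_h, best_v = lo, fun(lo)
    for i in range(1, m + 1):
        h = lo + (hi - lo) * i / m
        v = fun(h)
        if v < best_v:
            best_h, best_v = h, v
    return best_h

a, b, n = 0.8, 1.5, 1000  # illustrative values
h_num = argmin_grid(lambda h: ab1(h, a, b, n), 0.01, 2.0)
h_closed = h_ab1(a, b, n)
```

Multiplying n by 128 = 2^7 halves the closed-form minimizer, reflecting the n^{-1/7} rate.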

Some remarks
In this section we discuss in more depth some of the results derived in Section 3.1. The aim is to provide insight into the behavior of the approximations and bandwidth selectors, and to discuss possible competitors.
Remark 1. Theorem 1 provides an asymptotic expression for the EDM that is valid as long as the true density has two or more modes. When the true density is unimodal (r = 1), expression (3.4) reduces to an empty sum, so the AEDM becomes uninformative and a different strategy is needed (see Section 4).

Remark 2. A natural estimator of the density first derivative is the first derivative of the kernel density estimator. For this estimator it is possible to define the MISE as in (2.1), and to consider its minimizer h_MISE,1 and its asymptotic approximation h_AMISE,1 (see [39] and [7]). The bandwidths (3.6) and (3.7) share the same order as h_AMISE,1, whose expression is given by

h_AMISE,1 = ( 3 R(K^{(1)}) / ( µ_2(K)² R(f^{(2)}) n ) )^{1/7},   (3.8)

with R(g) = ∫ g(x)² dx. This consideration strengthens the intuition, outlined in Section 3.1, that (3.8) could be an adequate bandwidth choice for modal clustering purposes.

Remark 3. If the true density is symmetric around a local minimum m_j, then f^{(k)}(m_j) = 0 for any odd value of k. Therefore the bias-related term of the AEDM expression vanishes, leading to a monotonically decreasing behavior of the AEDM itself. A similar anomaly was observed in the related problem of mode estimation in [11]: if the true density is symmetric around its mode, then Chernoff's mode estimator is unbiased. Hence, in some special cases symmetry plays a certain role in the performance of these smoothing methodologies.
Remark 5. The derived bandwidths depend on some unknown quantities, such as the true density, its local minima and its second and third derivatives. In order to be of practical use, we shall resort to plug-in strategies; that is, data-based bandwidth selectors will be proposed in the next section by substituting the aforementioned unknown quantities with pilot estimates. This is the same procedure that is commonly adopted when considering the plug-in bandwidth selector ĥ_PI,1 for density gradient estimation (see [23] and [5]).
It should be noted that, from a practical point of view, this makes it possible to circumvent the issue of perfect symmetry around a minimum since, by using a nonparametric pilot estimate of the third derivative, it is highly unlikely to encounter such a situation in practice.

Numerical results
The idea of estimating the density for clustering purposes, via the minimization of the expected distance in measure (or its asymptotic counterpart), is explored in this section via simulations.
A total of B = 1000 samples for each of the sizes n ∈ {100, 1000, 10000} are generated from the univariate densities depicted in Figure 4 and whose parameters are reported in Appendix B.
The selected densities are designed to cover different modal structures, encompassing a range of possible behaviors from a clustering perspective.
The first goal of the study was to evaluate the quality of the asymptotic approximation of the EDM and the behavior of the two bounds derived in Corollary 1. Since an explicit expression for the EDM was not available, we obtained a Monte Carlo approximation based on the B = 1000 synthetic samples.
The plots displayed in Tables 1 to 5 show the behavior of the asymptotic approximations, with respect to the EDM, as a function of the bandwidth h. As expected, the approximations improve as the sample size increases. The two bounds show quite different behavior, with characteristics that reflect the theoretical properties pointed out in Section 3.2: the first bound is closer to the AEDM in uniform terms but, despite its diverging behavior for large h, the second bound is usually closer to the AEDM around the location of the minimizer h_AEDM. Occasionally (although rarely) the first step in the procedure above yielded a single mode, and then the AEDM was undefined. In those cases, and according to the rationale exposed in Remark 1, a sensible choice for h is the critical bandwidth proposed by [37], h_crit = inf{ h > 0 : f̂_h(·) has exactly one mode }, so in that case we set ĥ_AEDM = ĥ_AB1 = ĥ_AB2 = ĥ_crit.
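The critical bandwidth can be approximated by bisection on the number of modes of the kernel estimate; a sketch with a Gaussian kernel, for which the number of modes is known to be nonincreasing in h, on illustrative toy data:

```python
import math

data = [-2.2, -2.0, -1.8, 1.8, 2.0, 2.2]  # toy bimodal sample

def kde(x, h):
    return sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in data) / (
        len(data) * h * math.sqrt(2.0 * math.pi))

def n_modes(h, lo=-6.0, hi=6.0, m=1200):
    # count strict local maxima of the estimate on a fine grid
    y = [kde(lo + (hi - lo) * i / m, h) for i in range(m + 1)]
    return sum(1 for i in range(1, m) if y[i - 1] < y[i] > y[i + 1])

def h_crit(lo=0.05, hi=5.0, iters=50):
    # smallest h for which the estimate has exactly one mode
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if n_modes(mid) <= 1:
            hi = mid
        else:
            lo = mid
    return hi

hc = h_crit()
```

For data split into two tight clusters around ±2, the estimate is bimodal for small h and unimodal beyond a critical value near the half-distance between the clusters.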
Tables 1 to 5 also contain the Monte Carlo averages and standard deviations of the distances in measure obtained when performing modal clustering using the bandwidth selectors ĥ_AEDM, ĥ_AB1 and ĥ_AB2. For completeness, their performance is also compared to that of ĥ_PI,1, which so far probably represents their most sensible competitor in the clustering framework (see [8]).
In general, ĥ_AB1 and ĥ_AB2 led to more accurate clusterings than ĥ_AEDM, with a slight preference for ĥ_AB1. The gradient-based bandwidth ĥ_PI,1, in turn, not only produces competitive results, but its Monte Carlo average distance in measure appears lower than the one produced by the asymptotic EDM minimizers. In fact, a deeper look at the standard errors of the obtained distances shows that ĥ_AEDM, as well as ĥ_AB1 and ĥ_AB2, produce more variable results. The higher variability seems to be due to the sensitivity of the minimizers to the plugged-in pilot estimates, which strongly depend on local features of the density. Some further investigations, not fully reported here, suggest that the main source of this behavior is not the pilot estimate of the local minima but the pilot estimates of the density derivatives at the minimum points. Also, due to the use of different pilot bandwidths to estimate the unknown m_j, f^{(2)} and f^{(3)}, it may indeed occur that the estimate of f^{(2)}(m_j) takes negative values. On the other hand, while also relying on some plug-in estimates, the gradient-based bandwidth ĥ_PI,1 produces more robust clusterings, as the quantities to be estimated refer to global features of the density. As expected, this diverging behavior tends to vanish with increasing sample size, since the asymptotic approximations improve. As a confirmation, with n = 10000 all the considered bandwidths perform comparably.

Multidimensional generalization
The concepts discussed so far refer to the one-dimensional setting, where a mathematically rigorous treatment is feasible. The multidimensional generalization poses some difficulties, since obtaining an asymptotic approximation of the EDM appears far from trivial. Hence, in order to gain some insight into the problem of selecting the amount of smoothing for nonparametric clustering in more than one dimension, some numerical comparisons are performed assuming the true density to be known.
Denote by f : R^d → R the true density function and by

f̂_H(x) = n^{-1} Σ_{i=1}^n |H|^{-1/2} K( H^{-1/2} (x − X_i) )

its kernel estimate, based on a sample X_1, ..., X_n and indexed by a symmetric positive definite d × d bandwidth matrix H. The problem of bandwidth selection is considered by studying the EDM between the clustering induced by the kernel estimate, Ĉ_H, and the ideal population clustering C_0. These clusterings are not so easily identifiable as in the unidimensional setting, due to the arbitrary forms that the cluster boundaries may adopt; however, an approximation of the distance in measure d(Ĉ_H, C_0) can be computed by resorting to a discretization scheme as follows (see [8] for further details):

1. Take a grid over the sample space and rule the grid by considering hyper-rectangles centered at each grid point.
2. Obtain the cluster labels of the grid points under both clusterings, together with the probability mass of each hyper-rectangle.
3. Approximate the terms in (3.1) by summing the masses of the hyper-rectangles whose centers receive different labels under the two clusterings.

For the multidimensional simulation study, a total of B = 1000 samples for each of the sizes n ∈ {100, 1000} were generated from the bivariate densities whose parameters are reported in Appendix B. Using the synthetic samples from each density in the study, it was possible to obtain a Monte Carlo estimate of the (discretized version of the) EDM, which was then minimized over the classes of scalar, diagonal and unconstrained bandwidth matrices. The EDM was computed also for the MISE-optimal bandwidth for density gradient estimation over the same matrix classes. In both cases, the true density, as well as all the involved quantities, was assumed to be known. The EDM minimizers were determined numerically, by running the procedure over a grid of sensible values of the entries, while the optimal matrices for gradient estimation were determined as in [7].
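A minimal sketch of the multivariate estimator with a diagonal bandwidth matrix H = diag(h1², h2²) and a Gaussian product kernel follows; the data, bandwidths and integration grid are illustrative, and the check confirms that the estimate integrates to approximately one.

```python
import math, random

def kde2(x1, x2, data, h1, h2):
    # f_H(x) = n^{-1} sum_i |H|^{-1/2} K(H^{-1/2}(x - X_i)),
    # here with H = diag(h1^2, h2^2) and a Gaussian product kernel
    s = 0.0
    for z1, z2 in data:
        u1 = (x1 - z1) / h1
        u2 = (x2 - z2) / h2
        s += math.exp(-0.5 * (u1 * u1 + u2 * u2))
    return s / (len(data) * h1 * h2 * 2.0 * math.pi)

random.seed(2)
data = [(random.gauss(0.0, 1.0), random.gauss(0.0, 1.0)) for _ in range(100)]

# midpoint Riemann sum of the estimate over [-6, 6]^2
m, lo, hi = 80, -6.0, 6.0
dx = (hi - lo) / m
mass = sum(kde2(lo + (i + 0.5) * dx, lo + (j + 0.5) * dx, data, 0.5, 0.5)
           for i in range(m) for j in range(m)) * dx * dx
```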
The results are reported in Tables 6 and 7. Clustering based on the optimal bandwidth according to the EDM is very accurate in both of the considered examples, and improves considerably for increasing sample size. The use of more complex bandwidth parametrizations does not seem worthwhile for modal clustering, since results obtained with a full, unconstrained bandwidth matrix are comparable with those obtained with a scalar bandwidth, while the latter requires a substantially smaller computational effort.
In the multidimensional setting, the gradient bandwidth is quite competitive in terms of EDM, as in the univariate case. Again the comparable performance of unconstrained bandwidth matrices does not seem to justify the use of more complex parametrizations.

Conclusions
The modal clustering methodology provides a framework to perform cluster analysis with a clear and explicit population goal. It allows clusters of arbitrary shape and size, which can be captured by means of a nonparametric density estimator. In this context, the distance in measure represents a natural and easily interpretable error criterion. Therefore, in this paper we have presented an asymptotic study of this criterion for the case where density estimates of kernel type are employed to obtain a whole-space clustering via the mean shift algorithm.
Our asymptotic approximations are useful to gain insight into the fundamental problem of bandwidth selection for modal clustering and, at the same time, serve as the basis to propose practical data-based bandwidth choices specifically designed for clustering purposes.
The finite-sample performance of the new proposals was investigated in a thorough simulation study, and compared to the oracle bandwidths, i.e. the optimal choices when the true population is fully known. The gradient bandwidth, designed for the closely related problem of density gradient estimation, was also included as a natural competitor in the study.
The results of this simulation study have suggested that all the methods perform quite satisfactorily, and exhibit a very similar behavior for large sample sizes. For smaller samples, the performance of the gradient bandwidth was rather remarkable, since it obtained comparable or even better results than the new proposals, even without being specifically conceived for modal clustering.
This phenomenon resembles the conclusions obtained in [34] regarding the related problem of level set estimation. There, it was shown that the traditional bandwidth selectors for density estimation often outperformed more sophisticated methods designed for level set estimation purposes. The common pattern in both situations is that the optimal choices for the specific problems (level set estimation and modal clustering, respectively) depend on very subtle local features of the unknown density function, which are difficult to estimate, so that choices based on a more global, yet somehow related, perspective represent a sensible alternative.

A Proofs
Proof of Theorem 1. From Theorem 4.1 in [3] it follows that, with probability one, there exists n_0 ∈ N such that the kernel density estimator f̂_h has the same number of local minima as f for all n ≥ n_0. Let us denote by m_{h,1} < · · · < m_{h,r-1} the local minima of f̂_h. Then, the expected distance in measure between the data-based clustering Ĉ_h and the population clustering C_0 can be studied through the deviations m_{h,j} − m_j of the estimated local minima from the population ones.

Proof of Lemma 1. From ψ(µ, σ²) = σψ(µ/σ, 1), it suffices to show that ψ(u, 1) ≤ (2/π)^{1/2} + (2π)^{-1/2} u² for u ≥ 0. From the definition of ψ, this is equivalent to proving that α(u) ≤ 1, where α(u) = e^{−u²/2} + u ∫_0^u e^{−z²/2} dz − u²/2. Since α(0) = 1, it is enough to show that α is nonincreasing, and this immediately follows from α'(u) = ∫_0^u e^{−z²/2} dz − u ≤ 0, a consequence of e^{−z²/2} ≤ 1 for all z.

B Parameter settings
In the following the parameter settings of the densities selected for the simulations are presented.
Since all the densities are mixtures of Gaussian models, we adopt the usual notation where, for a given component k, π_k represents the k-th mixture weight, and µ_k and σ²_k (Σ_k for the bivariate models) denote the mean and the variance (covariance matrix).