A Comprehensive Approach to Mode Clustering

Mode clustering is a nonparametric method for clustering that defines clusters using the basins of attraction of a density estimator's modes. We provide several enhancements to mode clustering: (i) a soft variant of cluster assignment, (ii) a measure of connectivity between clusters, (iii) a technique for choosing the bandwidth, (iv) a method for denoising small clusters, and (v) an approach to visualizing the clusters. Combining all these enhancements gives us a complete procedure for clustering in multivariate problems. We also compare mode clustering to other clustering methods in several examples


Introduction
Mode clustering is a nonparametric clustering method [8,7,10,13,18,2] with three steps: (i) estimate the density function, (ii) find the modes of the estimator, and (iii) define clusters to be the basins of attraction of these modes. The partition defined by these basins of attraction is known as the Morse complex. There are several advantages to using mode clustering relative to other commonly-used methods: (1) There is a clear population quantity being estimated, namely, the Morse complex of the true density. (2) Computation is simple: the density can be estimated with a kernel density estimator, and the modes and basins of attraction can be found with the mean-shift algorithm. (3) There is a single tuning parameter to choose, namely, the bandwidth of the density estimator. (4) It is theoretically well understood since it depends only on density estimation and mode estimation [2,25,24].
Despite these advantages, there is room for improvement. First, mode clustering results in a hard assignment; there is no measure of uncertainty as to how well-clustered a data point is. Second, each cluster may connect to other clusters and there is no method for measuring the overlap between clusters. Third, it is not clear how to visualize the clusters when the dimension is greater than two. In this paper, we propose solutions to all these issues.
Related Work. Mode clustering is based on the mean-shift algorithm [13,8,10] which is a popular technique in image segmentation and clustering. [18,3] formally introduced mode clustering to the statistics literature. The idea of clustering based on high density regions had already been proposed in [14].

Review of Mode Clustering
Let p be the density function of a random vector X ∈ R d . Assume that p has k local maxima M = {m 1 , · · · , m k } and is a Morse function, meaning that the Hessian of p at each stationary point is non-degenerate. We do not assume that k is known. Given any x ∈ R d , there is a unique gradient ascent path, or flow, starting at x that eventually arrives one of the modes. We define the  clusters as the 'basins of attraction' of the modes, i.e., the sets of points whose ascent paths have the same mode. Now we give more detail.
The flow through x is a path π x : R → R d such that π x (0) = x and π x (t) = ∇p(π x (t)). (1) A standard result in Morse theory is that integral curves never intersect except at the modes, so the curves partition the space. We define the destination for the integral curve starting at x by Then dest(x) = m j for some mode m j . For each mode m j we define the basin of attraction of m j by C j = {x : dest(x) = m j }, j = 1, · · · , k.
In practice, p(x) is unknown and we need to estimate it. Let X 1 , . . . , X n be a random sample from p, and let K be a smooth, symmetric kernel. The kernel density estimator (KDE) with bandwidth h > 0 is defined by The modes M = { m 1 , . . . , m k } of p n and the destinations under p n of any point x, dest(x), are both easily found using the mean-shift algorithm [13,8,10]. The corresponding basins of attraction and estimated Morse complex are and the sample clusters are defined by Remark. We suggest selecting bandwidth targeted at estimating the gradient of p accurately. A slight modification of the suggestion in [6] is where S n,j is the sample deviation along j-th coordinate.

Soft Clustering
Mode clustering is a type of hard clustering, where each observation is assigned to one and only one cluster. Soft clustering methods [20,19,22,23] attempt to capture the uncertainty in this assignment. This is typically represented by an assignment vector for each point that is a probability distribution over the clusters. For example, whereas a hard-clustering method might assign a point x to cluster 2, a soft clustering might give x an assignment vector a(x) = (0.01, 0.8, 0.01, 0.08, 0.1), reflecting both the high confidence that x belongs to cluster 2 the nontrivial possibility that it belongs to cluster 5.
Soft clustering can capture two types of cluster uncertainty: population level (intrinsic difficulty) and sample level (variability). The population level uncertainty originates from the fact that even if p is known, some points are more strongly related to their modes than others. Specifically, for a point x near the boundaries between two clusters, say C 1 , C 2 , the associated soft assignment vector a(x) should have a 1 (x) ≈ a 2 (x). The sample level uncertainty comes from the fact that p has been estimated by p n . The soft assignment vector a(x) is designed to capture both types of uncertainty.

Soft Mode Clustering
We now present two soft mode clustering methods.
Method 1: Hitting Probability. Consider a diffusion (a continuous Markov process) that starts at x and that tends to move to high density regions. Let a HP (x) = (a HP 1 (x), . . . , a HP k (x)) where a HP j (x) is the conditional probability that mode j is the first mode reached by the diffusion, given that it reaches one of the modes. In this case a(x) is a probability vector and so is easy to interpret. Such diffusions have been used in statistics and machine learning for other purposes [21,9].
In more detail, let K h (x, y) = K ||x−y|| h and define the transition probability This defines a Markov process with q(y|x) the probability of jumping to y given that the process is at x. Fortunately, we do not actually have to run the diffusion to estimate a HP (x).
An approximation to the above diffusion process restricted to x, y in {X 1 , . . . , X n , m 1 , . . . , m k } is as follows. We define a Markov chain that has n + k states. The first k states are the estimated local modes m 1 , . . . , m k and are each absorbing states. That is, the Markov process stops when it hits any of the first k state. The other n states correspond to the data points X 1 , . . . , X n . The transition probability from each X i is given by for i, j = 1, . . . , n and l = 1, . . . , k. Thus, the transition matrix P is where I is the identity matrix and S is an n × k matrix with element S ij = P(X i → m j ) and T is an n × n matrix with element T ij = P(X i → X j ). Then by Markov chain theory, the absorbing probability from X i onto m j is given by A ij where A ij is the (i, j)-th element of the matrix We define the soft assignment vector by a HP j (X i ) = A ij .
Method 2: Density Distance. We can use the density to define a distance [5,26,11,4,16,1] which can then be used to construct a soft assignment vector. Given a density function p, the density distance between points (x, z) is where the infimum is over all smooth paths γ : [0, 1] → R d and γ(0) = x, γ(1) = z connecting x and z. When q = 1, the density distance is the same as the Euclidean distance. When q > 1, the density distance is a shortest path that follows high density regions to connect two points x, z.
Let m 1 , . . . , m k be local modes. For each x ∈ R d , we define the soft assignment vector a DD (x) = a DD 1 (x), . . . , a DD k (x) by .
A consistent estimate of d q (x, z) due to [16] is the following. Let V n = X n ∪ {x, z} and π be a path consisting of points The length of π, denoted as L q (π), is defined to be and a consistent estimator of d q (x, z) is Plugging (16) into (14) gives the soft assignment vector. There are two tuning parameters: q and β 0 . In practice, we pick q = 2 and where S n is defined in (8). This gives a scale-invariant soft assignment vector.

Measuring the Connectivity of Clusters
Here we define a measure of connectivity of clusters that can be used to provide geometric information and can also be used to help decide when clusters should be merged. We define the connectivity measure by where N i = n l=1 1(X l ∈ C i ) is the number of sample in cluster C i . The matrix Ω is useful as a dimension-free, summary-statistic to describe the degree of overlap/interaction among the clusters. If we think of the (hard) cluster assignments as class labels, then Ω is analogous to the mis-classification rate between class i and class j. The connectivity will be large when two clusters are close. We call Ω the matrix of connectivity measure or the connectivity matrix.
An application of Ω is to merge highly connected clusters. For instance, Figure 2 gives an example of a dataset with a ring-shaped cluster and a small spherical cluster at the center. We find the mode clusters and then we merge clusters with high connectivity. This gives clusters: the ring and the center cluster. Currently, merging is done based on a user-chosen threshold for Ω, in this case 0.16. Finding a data-driven method to decide when to merge clusters based on Ω is left to future work.
Another application of Ω is to connect, but not merge, clusters when visualizing the clustering result. This is useful for visualization as discussed in the next section.

Visualization
Here we present a method for visualizing the clusters that combines multidimensional scaling (MDS) with our connectivity measure for clusters. Given points X 1 , . . . , X n ∈ R d , we use MDS to project the data into R 2 . MDS finds Z 1 , . . . , Z n ∈ R 2 to minimize i<j (d( d is the Euclidean metric; see [17,27,15].
Our approach consists of two stages. At the first stage, we apply MDS on the modes and plot the result in R 2 . At the second stage, we apply MDS to points within each cluster along with the associated mode. Then we place the points around the projected modes. We scale the MDS result at the first stage by a factor ρ 0 so that each cluster is separated from each other. Figure 3 gives an example.
Recall that M = { m 1 , . . . , m k } is the set of estimated local modes and X j is the set of data points belonging to mode m j . At the first stage, we perform MDS on M so that where m † j ∈ R 2 for j = 1, . . . , k. We plot { m † 1 , . . . , m † k }.
At the second stage, we consider each cluster individually. Assume we are working on the jth cluster and m j , X j are the corresponding local mode and cluster points. We denote X j = {X j1 , . . . , X jNj }, where N j is the sample size for cluster j. Then we apply MDS to the collection of points {m j , X j1 , X j2 , . . . , X jNj }: where m * j , X * j1 , X * j2 , . . . , X * jNj ∈ R 2 . Then we center the points at m † j and place X * j1 , X * j2 , . . . , X * jNj around m † j . That is, we make a translation to the set {m * j , X * j1 , X * j2 , . . . , X * jNj } so that m * j matches the location of m † j . Then we plot the translated points X * j1 , X * j2 , . . . , X * jNj . We repeat the above process for each cluster to visualize the high dimensional clustering.
Note that in practice, the above process may cause unwanted overlap among clusters. Thus, one can scale { m † 1 , . . . , m † k } by a factor ρ 0 > 1 to remove the overlap.

Experiments
We ran two experiments in this section.
We use the Gaussian kernel with smoothing bandwidth h = 0.085 from (8). We display the soft clustering for each data point by three colors (red, green and blue). The intensity for each color shows the component of the soft assignment vector. We also provide the ternary plot that contains all the information on soft assignment vector for each data point. Each point on the ternary plot corresponds to one observation and the location of that point on the ternary plot is the the soft assignment vector for that observation. Note that for each point, the sum of all components of the soft assignment vector is 1 so that every vector can be represented by a unique point on the ternary plot. Figure 4 shows the result for applying the density distance (with q = 2 and β 0 = 28.87 by (17)). Figure 5 gives the analysis based on the hitting probability.
Eample 2: Olive Oil data. We apply our method to the Olive oil data introduced in [12]. This data set consists of 8 chemical measurements (features) for each observation and the total sample size is n = 572. Each observation is an olive oil produced in one of 3 regions in Italy and these regions are further divided into 9 areas.
Since these measurements are in different units, we normalize and standardize each measurement. We apply the normal reference rule (8) to select the smoothing bandwidth h = 0.587. We also merged tiny clusters (with sample size less than 5) into nearby larger clusters.
There are 7 remaining clusters. We visualize these clusters and the measure the connectivity (using the hitting probability). Instead of further merging cluster according to the connectivity measures, we keep these 7 clusters and connect pair of clusters if the connectivity measure is larger than 0.07. Finally, we color each point according to the region. (The information about the region was not used in the clustering.) The result is given in figure 6.
As can be seen, most clusters contain one dominating production area. Even in cases where one cluster contains multiple types of olive oil, the connectivity matrix captures this phenomena. For instance, cluster 2 and 3 both contain Calabria, Sicily and South-Apulia. We do observe a connection between cluster 2 and 3 in Figure 6 and a higher connectivity measure in the matrix for connectivity measures. When a group of oil produced in the same area is separated into two clusters, we observe an edge between these two clusters. This shows that the connectivity measure conveys more information on the hidden interaction between clusters.  Figure 6: Clustering result for the Olive oil data (d=8). Color denotes produce area. Note that we add edges to those pairs of clusters with connectivity measure > 0.07 (colored by red in the matrix). The width of edge reflects the degree of connection.

Conclusion
In this paper, we presented enhancements to mode clustering methods, including several approaches to soft mode clustering, a measure of cluster connectivity, and new visualization methods for highdimensional data. The cluster connectivity and visualization methods apply to other clustering methods as well. We considered several examples; a complete theoretical analysis of the proposed methods will be treated in a future paper.