Data ultrametricity and clusterability

The increasing need to cluster massive datasets and the high cost of running clustering algorithms pose difficult problems for users. In this context it is important to determine whether a data set is clusterable, that is, whether it can be partitioned efficiently into well-differentiated groups containing similar objects. We approach data clusterability from an ultrametric-based perspective. We propose a novel approach to determining the ultrametricity of a dataset via a special type of matrix product, which allows us to evaluate the clusterability of the dataset. Furthermore, we show that applying our technique to a dissimilarity space generates the subdominant ultrametric of the dissimilarity.


Introduction
Clustering is the prototypical unsupervised learning activity, which consists in identifying cohesive and well-differentiated groups of records in data. A data set is clusterable if such groups exist; however, due to the variety in data distributions and the inadequate formalization of certain basic notions of clustering, determining data clusterability before applying specific clustering algorithms is a difficult task.
Evaluating data clusterability before the application of clustering algorithms can be very helpful because clustering algorithms are expensive. However, many such evaluations are impractical because they are NP-hard, as shown in [4]. Other notions define data as clusterable when the minimum between-cluster separation is greater than the maximum intra-cluster distance [13], or when each element is closer to all elements in its cluster than to all other data [7].
Several approaches exist for assessing data clusterability. The main hypothesis of [1] is that clusterability can be inferred from a one-dimensional view of pairwise distances between objects. Namely, clusterability is linked to the multimodality of the histogram of inter-object dissimilarities. The basic assumption is that "the presence of multiple modes in the set of pairwise dissimilarities indicates that the original data is clusterable." Multimodality is evaluated using the Dip and Silverman statistical multimodality tests, an approach that is computationally efficient.
Alternative approaches to data clusterability are linked to the feasibility of producing a clustering; a corollary of this assumption is that "data that are hard to cluster do not have a meaningful clustering structure" [12]. Other approaches to clusterability are based on clustering quality measures and on loss function optimization [4,9,3,8,7,11].
We propose a novel approach that relates data clusterability to the extent to which the dissimilarity defined on the data set relates to a special ultrametric defined on the set.
The paper is structured as follows. In Section 2 we introduce dissimilarities and ultrametrics, which play a central role in our definition of clusterability. A special matrix product on matrices with non-negative elements that allows an efficient computation of the subdominant ultrametric is also introduced. In Section 3 a measure of clusterability based on the iterative properties of the dissimilarity matrix is defined. We provide experimental evidence of the effectiveness of the proposed measure through several experiments on small artificial data sets in Section 4. Finally, we present our conclusions and future plans in Section 5.

When $(S, d)$ is an ultrametric space, two spheres having the same radius $r$ in $(S, d)$ are either disjoint or coincide [18]. Therefore, the collection of closed spheres of radius $r$ in $S$, $C_r = \{B[x, r] \mid x \in S\}$, is a partition of $S$; we refer to this partition as an $r$-spheric clustering of $(S, d)$.

Dissimilarities, Ultrametrics, and Matrices
In an ultrametric space $(S, d)$ every triangle is isosceles. Indeed, let $T = (x, y, z)$ be a triplet of points in $S$ and let $d(x, y)$ be the least distance between the points of $T$. Since $d(x, z) \le \max\{d(x, y), d(y, z)\} = d(y, z)$ and $d(y, z) \le \max\{d(y, x), d(x, z)\} = d(x, z)$, it follows that $d(x, z) = d(y, z)$, so $T$ is isosceles; the two longest sides of this triangle are equal.
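The isosceles property can be checked numerically. The following is a minimal Python sketch, assuming a dissimilarity is stored as a square list of lists; the helper names `is_ultrametric` and `is_isosceles` and the 4-point matrix `D` are illustrative assumptions, not taken from the paper.

```python
from itertools import combinations


def is_ultrametric(dist):
    """Check d(i, j) <= max(d(i, k), d(k, j)) for every triple of indices."""
    n = len(dist)
    return all(
        dist[i][j] <= max(dist[i][k], dist[k][j])
        for i in range(n)
        for j in range(n)
        for k in range(n)
    )


def is_isosceles(dist, i, j, k):
    """True if the two longest sides of the triangle (i, j, k) are equal."""
    sides = sorted([dist[i][j], dist[i][k], dist[j][k]])
    return sides[1] == sides[2]


# Hypothetical 4-point ultrametric, chosen only for illustration.
D = [
    [0, 1, 3, 3],
    [1, 0, 3, 3],
    [3, 3, 0, 2],
    [3, 3, 2, 0],
]

all_isosceles = all(
    is_isosceles(D, i, j, k) for i, j, k in combinations(range(len(D)), 3)
)
print(is_ultrametric(D), all_isosceles)
```

The check is cubic in the number of points, so it is meant for small examples only.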
It is interesting to note that every $r$-spheric clustering in an ultrametric space is a perfect clustering [5]. This means that all of its in-cluster distances are smaller than all of its between-cluster distances. Indeed, if $x, y$ belong to the same cluster $B[u, r]$, then $d(x, y) \le r$. If $x \in B[u, r]$ and $y \in B[v, r]$, where $B[u, r] \cap B[v, r] = \emptyset$, then $d(x, v) > r$ and $d(y, v) \le r$, and this implies $d(x, y) = d(x, v) > r$ because the triangle $(x, y, v)$ is isosceles and $d(y, v)$ is not the longest side of this triangle.

The closed spheres of this space are: [...] for $4 \le r < 10$, $\{x_1, x_2, x_3, x_4, x_5\}$ for $10 \le r < 16$, and $S$ for $r = 16$.

Based on the properties of spheric clusterings mentioned above, meaningful such clusterings can be produced in time linear in the number of objects. For the ultrametric space mentioned in Example 2.1, the closed spheres of radius 6 produce the clustering [...].

If a dissimilarity defined on a data set is close to an ultrametric, it is natural to assume that the data set is clusterable. We assess the closeness between a dissimilarity $d$ and a special ultrametric known as the subdominant ultrametric of $d$ using a matrix approach.
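The $r$-spheric clustering discussed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the 4-point ultrametric `D` are hypothetical, the input is assumed to be an ultrametric matrix, and this simple version scans all points once per cluster rather than achieving the linear-time bound mentioned in the text.

```python
def spheric_clustering(dist, r):
    """Partition indices 0..n-1 into closed spheres B[x, r].

    Assumes `dist` is an ultrametric matrix, so every point of a sphere
    can serve as its center and distinct spheres are disjoint.
    """
    n = len(dist)
    assigned = [False] * n
    clusters = []
    for x in range(n):
        if assigned[x]:
            continue
        # x is not covered yet, so it becomes the center of a new sphere.
        sphere = [y for y in range(n) if dist[x][y] <= r]
        for y in sphere:
            assigned[y] = True
        clusters.append(sphere)
    return clusters


def is_perfect(dist, clusters):
    """True if every in-cluster distance is below every between-cluster one."""
    label = {y: c for c, members in enumerate(clusters) for y in members}
    n = len(dist)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    inside = [dist[i][j] for i, j in pairs if label[i] == label[j]]
    between = [dist[i][j] for i, j in pairs if label[i] != label[j]]
    return not inside or not between or max(inside) < min(between)


# Hypothetical 4-point ultrametric, chosen only for illustration.
D = [
    [0, 1, 3, 3],
    [1, 0, 3, 3],
    [3, 3, 0, 2],
    [3, 3, 2, 0],
]

clusters = spheric_clustering(D, 2)
print(clusters, is_perfect(D, clusters))
```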
Let $S$ be a set. Define a partial order "$\le$" on the set of definite dissimilarities $D_S$ by $d \le d'$ if $d(x, y) \le d'(x, y)$ for every $x, y \in S$. It is easy to verify that $(D_S, \le)$ is a poset.
The set $U_S$ of ultrametrics on $S$ is a subset of $D_S$.
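Theorem 2.2. Let $\{d_i \in U_S \mid i \in I\}$ be a collection of ultrametrics on the set $S$. Then, the mapping $d : S \times S \to \mathbb{R}_{\ge 0}$ defined as $d(x, y) = \sup\{d_i(x, y) \mid i \in I\}$ for $x, y \in S$ is an ultrametric on $S$.

Proof. We need to verify only that $d$ satisfies the ultrametric inequality $d(x, y) \le \max\{d(x, z), d(z, y)\}$ for $x, y, z \in S$. Since each mapping $d_i$ is an ultrametric, for $x, y, z \in S$ we have $d_i(x, y) \le \max\{d_i(x, z), d_i(z, y)\} \le \max\{d(x, z), d(z, y)\}$ for every $i \in I$. Taking the supremum over $i \in I$ yields $d(x, y) \le \max\{d(x, z), d(z, y)\}$, hence $d$ is an ultrametric on $S$.

Proof (of Theorem 2.3). The set $U_d$ is nonempty because the zero dissimilarity $d_0$ given by $d_0(x, y) = 0$ for every $x, y \in S$ is an ultrametric and $d_0 \le d$. Since the set $\{e(x, y) \mid e \in U_d\}$ has $d(x, y)$ as an upper bound, it is possible to define the mapping $e_1 : S^2 \to \mathbb{R}_{\ge 0}$ as $e_1(x, y) = \sup\{e(x, y) \mid e \in U_d\}$ for $x, y \in S$. It is clear that $e \le e_1$ for every ultrametric $e \in U_d$. We claim that $e_1$ is an ultrametric on $S$. This follows from Theorem 2.2, because $e_1$ is the supremum of a collection of ultrametrics; since $e_1 \le d$, $e_1$ is the largest element of $U_d$.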
The ultrametric defined by Theorem 2.3 is known as the maximal subdominant ultrametric for the dissimilarity d.
The situation is not symmetric with respect to the infimum of a set of ultrametrics because, in general, the infimum of a set of ultrametrics is not necessarily an ultrametric.
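A small numerical example may make this asymmetry concrete. The sketch below, written in Python with hypothetical 3-point ultrametrics `d1` and `d2` chosen only for illustration, shows that their pointwise supremum is an ultrametric while their pointwise infimum is not.

```python
def is_ultrametric(dist):
    """Check d(i, j) <= max(d(i, k), d(k, j)) for every triple of indices."""
    n = len(dist)
    return all(
        dist[i][j] <= max(dist[i][k], dist[k][j])
        for i in range(n)
        for j in range(n)
        for k in range(n)
    )


def pointwise(op, a, b):
    """Apply `op` (for example min or max) entry by entry to two matrices."""
    return [[op(x, y) for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


# Two hypothetical ultrametrics on the same three points.
d1 = [[0, 1, 2], [1, 0, 2], [2, 2, 0]]
d2 = [[0, 2, 2], [2, 0, 1], [2, 1, 0]]

d_sup = pointwise(max, d1, d2)  # every off-diagonal entry equals 2
d_inf = pointwise(min, d1, d2)  # sides 1, 1, 2: violates the ultrametric inequality

print(is_ultrametric(d_sup), is_ultrametric(d_inf))
```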
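Let $P$ be the set $P = \{x \in \mathbb{R} \mid x \ge 0\} \cup \{\infty\}$. The usual operations defined on $\mathbb{R}$ can be extended to $P$ by defining [...]. Let $P^{m \times n}$ be the set of $m \times n$ matrices over $P$. If $A, B \in P^{m \times n}$ we have [...]. If $A \in P^{m \times n}$ and $B \in P^{n \times p}$, the matrix product $C = AB \in P^{m \times p}$ is defined as
$$c_{ij} = \min\{\max\{a_{ik}, b_{kj}\} \mid 1 \le k \le n\}$$
for $1 \le i \le m$ and $1 \le j \le p$.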
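The product above replaces the sum and product of ordinary matrix multiplication with min and max. A minimal Python sketch follows; the function name `minmax_product` and the 3×3 matrix `A` are assumptions made for illustration, and matrices are stored as lists of lists with `math.inf` standing for $\infty$.

```python
import math


def minmax_product(A, B):
    """Return C with C[i][j] = min over k of max(A[i][k], B[k][j])."""
    n = len(B)
    if any(len(row) != n for row in A):
        raise ValueError("inner dimensions do not match")
    p = len(B[0])
    return [
        [min(max(row[k], B[k][j]) for k in range(n)) for j in range(p)]
        for row in A
    ]


# Hypothetical dissimilarity matrix on three points (not from the paper).
A = [
    [0, 2, 7],
    [2, 0, 3],
    [7, 3, 0],
]

A2 = minmax_product(A, A)
print(A2)
```

In this example the entry for the pair (0, 2) drops from 7 to 3, because going through point 1 uses steps of size 2 and 3 only.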
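If $E_n \in P^{n \times n}$ is the matrix defined by
$$(E_n)_{ij} = \begin{cases} 0 & \text{if } i = j, \\ \infty & \text{otherwise,} \end{cases}$$
that is, the matrix whose main diagonal elements are 0 and whose other elements equal $\infty$, then $AE_n = A$ for every $A \in P^{m \times n}$ and $E_n A = A$ for every $A \in P^{n \times p}$. The matrix multiplication defined above is associative, hence $P^{n \times n}$ is a semigroup with the identity $E_n$. The powers of $A$ are inductively defined as $A^0 = E_n$ and $A^{k+1} = A^k A$ for $k \ge 0$.

For $A, B \in P^{m \times n}$ we define $A \le B$ as $a_{ij} \le b_{ij}$ for $1 \le i \le m$ and $1 \le j \le n$. Note that a matrix $A \in P^{n \times n}$ satisfies $A \le E_n$ if and only if all its diagonal elements are 0. It is immediate that for $A, B \in P^{m \times n}$ and $C \in P^{n \times p}$, $A \le B$ implies $AC \le BC$; similarly, if $C \in P^{p \times m}$, then $A \le B$ implies $CA \le CB$.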
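The role of $E_n$ as an identity and the behaviour of the powers can be checked on a small example. The Python sketch below assumes the min-max product (minimum over $k$ of the maximum of $a_{ik}$ and $b_{kj}$); the helper names `identity`, `power`, and `leq`, and the matrix `A`, are hypothetical.

```python
import math


def minmax_product(A, B):
    """Return C with C[i][j] = min over k of max(A[i][k], B[k][j])."""
    n, p = len(B), len(B[0])
    return [
        [min(max(row[k], B[k][j]) for k in range(n)) for j in range(p)]
        for row in A
    ]


def identity(n):
    """E_n: zeros on the main diagonal, infinity everywhere else."""
    return [[0 if i == j else math.inf for j in range(n)] for i in range(n)]


def power(A, k):
    """k-th power of a square matrix A, with power(A, 0) equal to E_n."""
    result = identity(len(A))
    for _ in range(k):
        result = minmax_product(result, A)
    return result


def leq(A, B):
    """Entrywise comparison A <= B."""
    return all(a <= b for ra, rb in zip(A, B) for a, b in zip(ra, rb))


# Hypothetical dissimilarity matrix (zero diagonal, symmetric).
A = [[0, 2, 7], [2, 0, 3], [7, 3, 0]]
E = identity(3)

print(minmax_product(A, E) == A, minmax_product(E, A) == A)
print(leq(A, E), leq(power(A, 2), A))
```

Because `A` has a zero diagonal, `A <= E` holds, and multiplying both sides by `A` gives a square that is entrywise no larger than `A`.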
Let $L(A)$ be the finite set of elements in $P$ that occur in the matrix $A \in P^{n \times n}$. Since the entries of any power $A^q$ of $A$ with $q \ge 1$ are also included in $L(A)$, the sequence $A, A^2, \ldots, A^q, \ldots$ is ultimately periodic because it contains a finite number of distinct matrices.
Let $k(A)$ be the least integer $k$ such that $A^k = A^{k+d}$ for some $d > 0$. The sequence of powers of $A$ has the form
$$A, A^2, \ldots, A^{k(A)-1}, A^{k(A)}, \ldots, A^{k(A)+d-1}, A^{k(A)}, \ldots, A^{k(A)+d-1}, \ldots,$$
where $d$ is the least positive integer such that $A^{k(A)} = A^{k(A)+d}$. This integer is denoted by $d(A)$.
The set $\{A^{k(A)}, \ldots, A^{k(A)+d-1}\}$ is a cyclic group with respect to the multiplication.
If $(S, d)$ is a dissimilarity space, where $S = \{x_1, \ldots, x_n\}$, the matrix of this space is the matrix $A \in P^{n \times n}$ defined by $a_{ij} = d(x_i, x_j)$ for $1 \le i, j \le n$. Clearly, $A$ is a symmetric matrix and all its diagonal elements are 0, that is, $A \le E_n$.
If, in addition, we have $a_{ij} \le a_{ik} + a_{kj}$ for $1 \le i, j, k \le n$, then $A$ is a metric matrix. If this condition is replaced by the stronger condition $a_{ij} \le \max\{a_{ik}, a_{kj}\}$ for $1 \le i, j, k \le n$, then $A$ is an ultrametric matrix. Thus, for an ultrametric matrix we have $a_{ij} \le \min\{\max\{a_{ik}, a_{kj}\} \mid 1 \le k \le n\}$, that is, $A \le A^2$. [...] and $A^m$ is an ultrametric matrix.
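Proof. Since $A \le E_n$, we have $A^{k+1} = A^k A \le A^k E_n = A^k$ for every $k \ge 1$, so the sequence of powers of $A$ is decreasing. The existence of the number $m$ with the property mentioned in the theorem is then immediate, since there exists only a finite number of $n \times n$ matrices whose elements belong to $L(A)$. Since $A^m = A^{2m}$, each entry of $A^m$ satisfies $(A^m)_{ij} = \min\{\max\{(A^m)_{ik}, (A^m)_{kj}\} \mid 1 \le k \le n\}$, and it follows that $A^m$ is an ultrametric matrix.
For a matrix $A \in P^{n \times n}$ let $m(A)$ be the least number $m$ such that $A^m = A^{m+1}$. We refer to $m(A)$ as the stabilization power of the matrix $A$. The matrix $A^{m(A)}$ is denoted by $A^*$.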
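The stabilization power can be computed by repeated multiplication until two consecutive powers coincide. The Python sketch below assumes the min-max product and a matrix with zero diagonal (so that the powers form a decreasing sequence and the loop terminates); the function name `stabilization` and both example matrices are hypothetical.

```python
def minmax_product(A, B):
    """Return C with C[i][j] = min over k of max(A[i][k], B[k][j])."""
    n, p = len(B), len(B[0])
    return [
        [min(max(row[k], B[k][j]) for k in range(n)) for j in range(p)]
        for row in A
    ]


def stabilization(A):
    """Return (m(A), A*) for a square matrix A with zero diagonal."""
    if any(A[i][i] != 0 for i in range(len(A))):
        raise ValueError("expected a zero diagonal, that is, A <= E_n")
    m, current = 1, A
    while True:
        nxt = minmax_product(current, A)  # this is the (m + 1)-th power
        if nxt == current:
            return m, current
        m, current = m + 1, nxt


# Hypothetical dissimilarity matrix that is not ultrametric.
A = [[0, 2, 7], [2, 0, 3], [7, 3, 0]]
# Hypothetical ultrametric matrix: it already equals its own square.
U = [[0, 1, 3, 3], [1, 0, 3, 3], [3, 3, 0, 2], [3, 3, 2, 0]]

print(stabilization(A))
print(stabilization(U)[0])
```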
The previous considerations suggest defining the ultrametricity of a matrix $A \in P^{n \times n}$ with $A \le E_n$ as $u(A) = \frac{n}{m(A)}$. Since $m(A) \le n$, it follows that $u(A) \ge 1$. If $m(A) = 1$, $A$ is ultrametric itself and $u(A) = n$.

Theorem 2.5. Let $(S, d)$ be a dissimilarity space, where $S = \{x_1, \ldots, x_n\}$, having the dissimilarity matrix $A \in P^{n \times n}$. If $m$ is the least number such that $A^m = A^{m+1}$, then the mapping $\delta : S \times S \to P$ defined by $\delta(x_i, x_j) = (A^m)_{ij}$ is the subdominant ultrametric for the dissimilarity $d$.
Proof. As we observed, $A^m$ is an ultrametric matrix, so $\delta$ is an ultrametric on $S$. Since $A^m \le A$, it follows that $d(x_i, x_j) \ge \delta(x_i, x_j)$ for all $x_i, x_j \in S$.
Suppose that $C \in P^{n \times n}$ is an ultrametric matrix such that $A \ge C$, which implies $A^m \ge C^m \ge C$. Thus, $A^m$ dominates any ultrametric that is dominated by $d$. Consequently, the dissimilarity defined by $A^m$ is the subdominant ultrametric for $d$.
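The statement of Theorem 2.5 can be illustrated on a small matrix. The following Python sketch, which assumes the min-max product and uses hypothetical names (`subdominant`, `A`, `C`), computes $A^*$, checks that it is an ultrametric dominated by $A$, checks that one particular ultrametric below $A$ is also below $A^*$, and evaluates $u(A) = n / m(A)$.

```python
def minmax_product(A, B):
    """Return C with C[i][j] = min over k of max(A[i][k], B[k][j])."""
    n, p = len(B), len(B[0])
    return [
        [min(max(row[k], B[k][j]) for k in range(n)) for j in range(p)]
        for row in A
    ]


def subdominant(A):
    """Return (m(A), A*) for a dissimilarity matrix A (zero diagonal)."""
    m, current = 1, A
    while True:
        nxt = minmax_product(current, A)
        if nxt == current:
            return m, current
        m, current = m + 1, nxt


def is_ultrametric(M):
    n = len(M)
    return all(
        M[i][j] <= max(M[i][k], M[k][j])
        for i in range(n) for j in range(n) for k in range(n)
    )


def leq(A, B):
    return all(a <= b for ra, rb in zip(A, B) for a, b in zip(ra, rb))


# Hypothetical dissimilarity matrix and one ultrametric C below it.
A = [[0, 2, 7], [2, 0, 3], [7, 3, 0]]
C = [[0, 2, 2], [2, 0, 2], [2, 2, 0]]

m, delta = subdominant(A)
u = len(A) / m  # ultrametricity u(A) = n / m(A)
print(m, u, delta)
print(is_ultrametric(delta), leq(delta, A), leq(C, delta))
```

A single matrix `C` does not prove maximality, of course; it only shows the inequality of the proof at work on one instance.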
The subdominant ultrametric of a dissimilarity is usually studied in the framework of weighted graphs [14].
A weighted graph is a triple $G = (V, E, w)$, where $V$ is the set of vertices of $G$, $E$ is a set of two-element subsets of $V$ called edges, and $w : E \to P$ is the weight of the edges. If $e \in E$, then $e = \{u, v\}$, where $u, v$ are distinct vertices in $V$. The weight is extended to all two-element subsets of $V$ as [...].

Our hypothesis is supported by previous results obtained in [1], where the clusterability of 9 databases was statistically examined using the Dip and Silverman tests of unimodality. The approach used in [1] starts with the hypothesis that the presence of multiple modes in the one-dimensional set of pairwise distances indicates that the original data set is clusterable. Multimodality is assessed using the tests mentioned above. The time required by this evaluation is quadratic in the number of objects.
The first four data sets, iris, swiss, faithful, and rivers, were deemed to be clusterable; the last five were evaluated as not clusterable. Tests published in [6] have produced low p-values for the first four datasets, which is an indication of clusterability. The last five data sets (among them USArrests, attitude, cars, and trees) produce much larger p-values, which indicate a lack of clusterability. Table 1 shows that all data sets deemed clusterable by the unimodality statistical test have values of the clusterability index that exceed 5.
In our approach the clusterability of a data set $D$ is expressed primarily through the "stabilization power" $m(A_D)$ of the dissimilarity matrix $A_D$; in addition, the histogram of the dissimilarity values is less differentiated when the data is not clusterable.

Experimental Evidence on Small Artificial Data Sets
Another series of experiments involved small datasets, all having the same number of points in $\mathbb{R}^2$, arranged in lattices. The points have integer coordinates and the distance between points is the Manhattan distance.
By shifting the data points to different locations, we create several distinct structured clusterings that consist of rectangular clusters.
Figures 2 and 3 show an example of a series of datasets with a total of 36 data points. Initially, the data set has 4 rectangular clusters containing 9 data points each, with a gap of 3 distance units between the clusters. The ultrametricity of the dataset and, therefore, its clusterability is affected by the number of clusters, the size of the clusters, and the inter-cluster distances. Figure 3 shows that $m(A)$ reaches its highest value and, therefore, the clusterability is the lowest, when there is only one cluster in the dataset (see the third row of Figure 3). If points are uniformly distributed, as is the case in the third row of Figure 3, the clustering structure disappears and clust(D) has the lowest value.
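The lattice experiment can be imitated with a short Python sketch. The coordinates below are our own reading of "4 clusters of 9 points with a gap of 3 units" and of a single 6×6 lattice, not the exact data behind Figures 2 and 3; the min-max product and all function names are assumptions made for illustration.

```python
def minmax_product(A, B):
    """Return C with C[i][j] = min over k of max(A[i][k], B[k][j])."""
    n, p = len(B), len(B[0])
    return [
        [min(max(row[k], B[k][j]) for k in range(n)) for j in range(p)]
        for row in A
    ]


def stabilization_power(A):
    """Least m with A^m equal to A^(m+1), for a zero-diagonal matrix."""
    m, current = 1, A
    while True:
        nxt = minmax_product(current, A)
        if nxt == current:
            return m
        m, current = m + 1, nxt


def manhattan_matrix(points):
    """Dissimilarity matrix of a list of integer points under L1 distance."""
    return [
        [abs(px - qx) + abs(py - qy) for (qx, qy) in points]
        for (px, py) in points
    ]


# Assumed layout: four 3x3 blocks, neighbouring blocks 3 units apart.
block_coords = [0, 1, 2, 5, 6, 7]
clustered = [(x, y) for x in block_coords for y in block_coords]
# Assumed layout: one 6x6 lattice with no gaps.
uniform = [(x, y) for x in range(6) for y in range(6)]

m_clustered = stabilization_power(manhattan_matrix(clustered))
m_uniform = stabilization_power(manhattan_matrix(uniform))
print(m_clustered, m_uniform)
```

With these layouts the single lattice needs more multiplications to stabilize than the four-block layout, which agrees with the behaviour described for the one-cluster case.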
Histograms are used by some authors [10,2] to identify the degree of clusterability. Note, however, that in the case of the data shown in Figures 2 and 3, the histograms of the original dissimilarity of the dataset do not offer guidance on the clusterability (second column of Figures 2 and 3). By applying the "min-max" power operation to the original matrix, we get an ultrametric matrix. The new histogram of the ultrametric shows a clear difference for each dataset. In the third column of Figures 2 and 3, the histogram of the ultrametric matrix for each dataset shows a decrease in the number of distinct distances after the "power" operation.
If the dataset has no clustering structure, the histogram of the ultrametric distance has only one bar.
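The effect of the "min-max" power operation on the histogram can be reproduced on toy data. In the Python sketch below, the two point sets (a 3×3 lattice and two 2×2 blocks separated by a gap of 3 units) and all function names are hypothetical; the histogram counts each unordered pair of distinct points once.

```python
from collections import Counter


def minmax_product(A, B):
    """Return C with C[i][j] = min over k of max(A[i][k], B[k][j])."""
    n, p = len(B), len(B[0])
    return [
        [min(max(row[k], B[k][j]) for k in range(n)) for j in range(p)]
        for row in A
    ]


def stabilized(A):
    """Multiply by A until the power no longer changes; return that power."""
    current = A
    while True:
        nxt = minmax_product(current, A)
        if nxt == current:
            return current
        current = nxt


def manhattan_matrix(points):
    return [
        [abs(px - qx) + abs(py - qy) for (qx, qy) in points]
        for (px, py) in points
    ]


def histogram(M):
    """Count the values above the main diagonal of a symmetric matrix."""
    n = len(M)
    return Counter(M[i][j] for i in range(n) for j in range(i + 1, n))


uniform = [(x, y) for x in range(3) for y in range(3)]
two_blocks = [(x, y) for x in (0, 1, 4, 5) for y in (0, 1)]

for points in (uniform, two_blocks):
    A = manhattan_matrix(points)
    print(dict(histogram(A)), dict(histogram(stabilized(A))))
```

For the lattice without gaps the ultrametric histogram collapses to a single bar, while the two-block layout keeps one bar for in-cluster pairs and one for between-cluster pairs.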
The number of peaks $p$ of the histogram indicates the minimum number of clusters $k$ in the ultrametric space specified by the matrix $A^*$ using the equality [...].

If a data set contains a large number of small clusters, these clusters can be regarded as outliers and the clusterability of the data set is reduced. This is the case in the third row of Figure 4, which shows a particular case of 9 clusters with 36 data points. Since each cluster is too small to be considered a real cluster, all of them together are merely regarded as a one-cluster dataset with 9 points.

A dissimilarity on a set $S$ is a mapping $d : S \times S \to \mathbb{R}$ such that (i) $d(x, y) \ge 0$ and $d(x, y) = 0$ if and only if $x = y$; (ii) $d(x, y) = d(y, x)$. A dissimilarity on $S$ that satisfies the triangular inequality $d(x, y) \le d(x, z) + d(z, y)$ for every $x, y, z \in S$ is a metric. If, instead, the stronger inequality $d(x, y) \le \max\{d(x, z), d(z, y)\}$ is satisfied, $d$ is said to be an ultrametric and the pair $(S, d)$ is an ultrametric space. A closed sphere in $(S, d)$ is a set $B[x, r]$ defined by $B[x, r] = \{y \in S \mid d(x, y) \le r\}$.

Example 2.1.
Let $S = \{x_i \mid 1 \le i \le 8\}$ and let $(S, d)$ be the ultrametric space, where the ultrametric $d$ is defined by the following table:

Theorem 2.3.
Let $d$ be a dissimilarity on a set $S$ and let $U_d$ be the set of ultrametrics $U_d = \{e \in U_S \mid e \le d\}$. The set $U_d$ has a largest element in the poset $(U_S, \le)$.

Figure 1: The process of distance equalization for successive powers of the incidence matrix. The matrix $A_D^3$ is ultrametric.

Table 1: All clusterable datasets have values greater than 5 for their clusterability; all non-clusterable datasets have values no larger than 5.