Hierarchical Clustering with OWA-based Linkages, the Lance–Williams Formula, and Dendrogram Inversions

Agglomerative hierarchical clustering based on Ordered Weighted Averaging (OWA) operators not only generalises the single, complete, and average linkages, but also includes intercluster distances based on a few nearest or farthest neighbours, trimmed and winsorised means of pair-wise point similarities, amongst many others. We explore the relationships between the famous Lance–Williams update formula and the extended OWA-based linkages with weights generated via infinite coefficient sequences. Furthermore, we provide some conditions for the weight generators to guarantee the resulting dendrograms to be free from unaesthetic inversions.


Introduction
Cluster analysis (e.g., [18]) is a machine learning technique where we discover interesting or otherwise useful partitions of a given dataset in a purely unsupervised way.
Hierarchical agglomerative clustering algorithms (e.g., [16]) allow for partitioning the datasets for which merely a pairwise distance function (e.g., a metric) is defined. Most importantly, the number of clusters is not set in advance: a whole hierarchy of nested partitions can be generated with ease and then depicted on a tree-like diagram called a dendrogram.
Hierarchical agglomerative clustering revolves around one simple idea: in each step, we merge the pair of closest clusters. To measure the proximity between two point sets, the intercluster distance is defined as an extension of the point-pairwise distance called a linkage function.
For instance, in the single linkage approach, the distance between a cluster pair is given by the distance between the closest pair of points, one from the first cluster, the other from the second one. In complete linkage, on the other hand, we take the farthest-away pair. In average linkage, we compute the arithmetic mean of all the pairwise distances.
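The merge loop described above can be sketched in a few lines of Python. This is a naive illustration only, not an efficient implementation: the names `pairwise`, `LINKAGES`, and `agglomerate` are hypothetical, the sample points are made up, and every intercluster distance is recomputed from scratch in each step.

```python
import math
from itertools import combinations, product

def pairwise(A, B):
    """All point-to-point Euclidean distances between clusters A and B."""
    return [math.dist(a, b) for a, b in product(A, B)]

# Three classical linkages, each mapping two clusters to a single number.
LINKAGES = {
    "single": lambda A, B: min(pairwise(A, B)),
    "complete": lambda A, B: max(pairwise(A, B)),
    "average": lambda A, B: sum(pairwise(A, B)) / (len(A) * len(B)),
}

def agglomerate(points, linkage="single"):
    """Naive agglomerative clustering (illustrative sketch).

    Returns the list of merges as (cluster_a, cluster_b, height) tuples.
    """
    d = LINKAGES[linkage]
    clusters = [(p,) for p in points]  # start with singletons
    merges = []
    while len(clusters) > 1:
        # find the closest pair of clusters under the chosen linkage
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: d(clusters[ij[0]], clusters[ij[1]]))
        a, b = clusters[i], clusters[j]
        merges.append((a, b, d(a, b)))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(a + b)
    return merges

# Made-up sample: four collinear points.
points = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0), (7.0, 0.0)]
heights = {name: [h for _, _, h in agglomerate(points, name)]
           for name in LINKAGES}
```

On this sample, all three linkages perform the same first two merges (at heights 1 and 2) and differ only in the height of the final merge: 4 for single, 7 for complete, and 5.5 for average linkage.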
In Section 3, we recall two wide classes of linkage functions that generalise the three aforementioned cases. The first group consists of linkages generated by the well-known Lance-Williams formula [13,15]. The second class considers convex combinations of ordered pairwise distances between clusters, i.e., the OWA operators (ordered weighted averages [19]). Such OWA-based linkages were introduced by Yager in [20] (see also [17] where they were re-invented) and include many linkage generators that are not covered by the classical Lance-Williams setting.
From the practical side, their usefulness has been thoroughly evaluated in [4], where some further tweaks were also proposed to increase the quality of the generated results, e.g., by including the Genie correction for cluster size inequality [7].
However, OWA-based linkages have not been studied thoroughly from the theoretical perspective. This paper aims to fill this gap.
In particular, after recalling the basic definitions in Section 2 and Section 3, in Section 4 we characterise the relationship between the Lance-Williams linkage update formula and OWA linkages with two different weight generators. Then, in Section 5, we give some conditions for the linkages to yield clusterings which can be represented aesthetically on dendrograms (without the so-called inversions). Section 6 concludes the paper.

Hierarchical Agglomerative Clustering
Let $X \in \mathbb{R}^{n \times d}$ be the input data set comprised of $n$ points in $\mathbb{R}^d$, where $x_i = (x_{i,1}, \dots, x_{i,d})$ for each $i$. From now on, we assume that $\|x_i - x_j\|$ is the Euclidean distance between two given points. However, let us note that most of the results presented below hold for any set of objects equipped with a semimetric.
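Classical choices of the linkage function $d$ include, see, e.g., [16]:
• the single linkage: $d_S(A, B) = \min_{a \in A,\, b \in B} \|a - b\|$,
• the complete linkage: $d_C(A, B) = \max_{a \in A,\, b \in B} \|a - b\|$,
• the average linkage (UPGMA; Unweighted Pair Group Method with Arithmetic Mean): $d_A(A, B) = \frac{1}{|A||B|} \sum_{a \in A} \sum_{b \in B} \|a - b\|$,
• the Ward linkage, based on a cluster-size-weighted distance between $\mu_A$ and $\mu_B$, where $\mu_A$, $\mu_B$ are the respective clusters' centroids (componentwise arithmetic means),
• the weighted average linkage (WPGMA; Weighted Pair Group Method with Arithmetic Mean): $d(A, B) = \frac{1}{2}\left(d(A', B) + d(A'', B)\right)$, assuming that $A = A' \cup A''$ in one of the previous iterations,
• the median linkage (WPGMC; Weighted Pair Group Method Centroid), given by the distance between the clusters' representative points, where the representative of $A$ is the midpoint of the representatives of $A'$ and $A''$, assuming that $A = A' \cup A''$ in one of the previous iterations.
Some of the above cases can be generalised through the Lance-Williams formula [13,15] or the OWA-based linkages [20], which we will discuss in Section 3.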

Linkage Classes
Let us recall two noteworthy linkage classes, based respectively on the Lance-Williams formula and the OWA operator-based intercluster distances.

Lance-Williams Linkages
Lance and Williams proposed in [13] an iterative formula that generalises many common linkages and allows for a fast update of the intercluster distances after each cluster pair merge.
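In its usual textbook form, the update reads $d(A \cup B, C) = \alpha_A\, d(A, C) + \alpha_B\, d(B, C) + \beta\, d(A, B) + \gamma\, |d(A, C) - d(B, C)|$. The following sketch, with hypothetical function names and made-up input distances, evaluates this expression with the standard coefficient choices for the single, complete, and average linkages; it is meant as an illustration of the mechanism rather than a reproduction of the notation used in [13].

```python
def lance_williams(d_ac, d_bc, d_ab, alpha_a, alpha_b, beta, gamma):
    """Distance between the merged cluster A ∪ B and another cluster C,
    computed only from the three distances known before the merge."""
    return (alpha_a * d_ac + alpha_b * d_bc
            + beta * d_ab + gamma * abs(d_ac - d_bc))

def coefficients(linkage, n_a, n_b):
    """Textbook coefficients (alpha_a, alpha_b, beta, gamma);
    n_a and n_b are the sizes of clusters A and B."""
    if linkage == "single":    # (x + y)/2 - |x - y|/2 == min(x, y)
        return 0.5, 0.5, 0.0, -0.5
    if linkage == "complete":  # (x + y)/2 + |x - y|/2 == max(x, y)
        return 0.5, 0.5, 0.0, 0.5
    if linkage == "average":   # size-weighted mean of the two distances
        total = n_a + n_b
        return n_a / total, n_b / total, 0.0, 0.0
    raise ValueError(f"unknown linkage: {linkage}")

# Made-up example: |A| = 2, |B| = 3, with the distances below.
d_ac, d_bc, d_ab = 4.0, 6.0, 1.0
updated = {
    name: lance_williams(d_ac, d_bc, d_ab, *coefficients(name, 2, 3))
    for name in ("single", "complete", "average")
}
```

Here the single and complete updates recover min(4, 6) and max(4, 6), while the average update is the size-weighted mean (2·4 + 3·6)/5. No pairwise point distances are needed, which is what makes the update fast.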

OWA-based linkages
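OWA operators, i.e., convex combinations of order statistics, were introduced in the aggregation and decision making context by Yager in [19]. Let us introduce their version that acts on element sequences of any length:
Definition 1. An extended [3,14] OWA operator is defined as:
$$\mathrm{OWA}_\triangle(d_1, \dots, d_n) = \sum_{i=1}^{n} c_{i,n}\, d_{(i)},$$
where $d_{(1)} \ge d_{(2)} \ge \dots \ge d_{(n)}$ are the arguments sorted nonincreasingly, and where a given weighting triangle is $\triangle = (c_{i,n})_{n \ge 1,\; i = 1, \dots, n}$ with $c_{i,n} \ge 0$ and $\sum_{i=1}^{n} c_{i,n} = 1$ for every $n$.
By definition, $\mathrm{OWA}_\triangle$ is symmetric, idempotent, and nondecreasing in each variable (and hence also internal); compare [1,5,10].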
Definition 2. For a given weighting triangle $\triangle$, the $\mathrm{OWA}_\triangle$-based linkage is defined as:
$$d_\triangle(A, B) = \mathrm{OWA}_\triangle\left(\{\|a - b\| : a \in A,\, b \in B\}\right)$$
for any point sets $A$ and $B$, i.e., it is the OWA operator applied on all the $n = |A||B|$ pairwise distances between the two sets.
Remark 3. In particular, for weights fulfilling:
• $c_{i,n} = 1/n$: we obtain the average linkage,
• $c_{n,n} = 1$ and $c_{i,n} = 0$ for $i < n$: we get the single linkage,
• $c_{1,n} = 1$ and $c_{i,n} = 0$ for $i > 1$: we obtain the complete linkage.
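The two definitions and the three special cases of Remark 3 can be illustrated with a short Python sketch. It assumes that the arguments are sorted nonincreasingly before weighting (which is what makes the weight on the first position correspond to the complete linkage); the function names and the sample clusters are hypothetical.

```python
import math
from itertools import product

def owa(values, weights):
    """OWA: weighted sum of the values sorted in nonincreasing order."""
    ordered = sorted(values, reverse=True)
    assert len(ordered) == len(weights)
    assert abs(sum(weights) - 1.0) < 1e-9
    return sum(c * v for c, v in zip(weights, ordered))

def owa_linkage(A, B, triangle):
    """OWA-based linkage; triangle(n) returns the row (c_{1,n}, ..., c_{n,n})."""
    dists = [math.dist(a, b) for a, b in product(A, B)]
    return owa(dists, triangle(len(dists)))

# Rows of the weighting triangles from Remark 3.
def average_row(n):
    return [1.0 / n] * n

def single_row(n):
    return [0.0] * (n - 1) + [1.0]   # all weight on the smallest distance

def complete_row(n):
    return [1.0] + [0.0] * (n - 1)   # all weight on the largest distance

# Made-up clusters on a line; the pairwise distances are 5, 7, 4, 6.
A = [(0.0, 0.0), (1.0, 0.0)]
B = [(5.0, 0.0), (7.0, 0.0)]
d_single = owa_linkage(A, B, single_row)
d_complete = owa_linkage(A, B, complete_row)
d_average = owa_linkage(A, B, average_row)
```

Passing the triangle as a function of n keeps the sketch agnostic to how the weights are generated, so any other row generator can be plugged in without touching `owa_linkage`.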
Note that numerous new linkages that do not fit the Lance-Williams formula are now possible. This includes, e.g., those that correspond to the median and any other quantiles, trimmed and winsorised means, the arithmetic mean of a few first smallest observations, a fuzzified/smoothed minimum, and so forth; see [4] for many example classes.
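As a rough illustration of two such cases, the sketch below builds weighting-triangle rows for the arithmetic mean of the k smallest distances and for a symmetric trimmed mean. The parametrisation (the names `k_smallest_mean_row` and `trimmed_mean_row`, and the rule for how many observations are trimmed) is an assumption made here for illustration and need not match the definitions in [4].

```python
def owa(values, weights):
    """OWA: weighted sum of the values sorted in nonincreasing order."""
    ordered = sorted(values, reverse=True)
    return sum(c * v for c, v in zip(weights, ordered))

def k_smallest_mean_row(k, n):
    """Equal weights on the k smallest of n values (the last positions,
    since values are sorted nonincreasingly)."""
    m = min(k, n)
    return [0.0] * (n - m) + [1.0 / m] * m

def trimmed_mean_row(p, n):
    """Drop about a fraction p of the values at each end, average the rest.
    At least one value is always kept."""
    t = min(int(p * n), (n - 1) // 2)
    kept = n - 2 * t
    return [0.0] * t + [1.0 / kept] * kept + [0.0] * t

# Made-up pairwise distances between two clusters.
dists = [7.0, 6.0, 5.0, 4.0]
two_nearest = owa(dists, k_smallest_mean_row(2, len(dists)))  # mean of 4 and 5
trimmed = owa(dists, trimmed_mean_row(0.25, len(dists)))      # mean of 5 and 6
```

Both rows are nonnegative and sum to one, so they are valid OWA weights; neither coincides with the single, complete, or average linkage rows.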
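There are a few generic ways to generate the weighting triangles known in the literature; see [11,19] and [2,9]. In this paper, we will be interested in studying, for a coefficient sequence $w = (w_1, w_2, \dots)$ with $w_1 = 1$ and $w_2, w_3, \dots \ge 0$:
• $\mathrm{OWA}_w$, given by the weights $c_{i,n} = w_i / \sum_{j=1}^{n} w_j$,
• $\mathrm{OWA}'_w$, given by the weights $c_{i,n} = w_{n-i+1} / \sum_{j=1}^{n} w_j$.
The assumption that $w_1 = 1$ is with no loss in generality.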
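A minimal sketch of generating triangle rows from a coefficient sequence follows. It assumes that both schemes take the first n coefficients, normalise them to sum to one, and assign them to the sorted arguments either in the original or in the reversed order; this reading, as well as the function names, is an assumption made for illustration.

```python
def triangle_row(w, n, reverse=False):
    """Row n of the weighting triangle generated by the sequence w.

    w is a function returning the i-th coefficient (1-based), which lets
    us model an infinite sequence; w(1) is assumed to be 1.
    """
    head = [w(i) for i in range(1, n + 1)]
    total = sum(head)  # positive, because w(1) = 1 and the rest are >= 0
    row = [v / total for v in head]
    return row[::-1] if reverse else row

def ones(i):
    return 1.0

def first_only(i):
    return 1.0 if i == 1 else 0.0

def geometric(i):
    return 0.5 ** (i - 1)

average_row = triangle_row(ones, 4)                      # same in both orders
complete_row = triangle_row(first_only, 3)               # weight on the largest value
single_row = triangle_row(first_only, 3, reverse=True)   # weight on the smallest value
smooth_row = triangle_row(geometric, 3)                  # proportional to 1, 1/2, 1/4
```

Under this reading, the constant sequence yields the average linkage in both schemes, whereas the sequence (1, 0, 0, …) yields the complete linkage in the forward scheme and the single linkage in the reversed one.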

OWA linkages and the Lance-Williams formula
In this section we characterise which OWA-based linkages (defined via $\mathrm{OWA}_w$ and $\mathrm{OWA}'_w$) can be expressed by the Lance-Williams formula, and vice versa. It turns out that under the two assumed weight generation schemes, these only include the single, average, and complete linkages.
Theorem 4. Assume that $d$ is generated by the Lance-Williams formula, i.e., for any $A$, $B$, $C$, it holds:
$$d(A \cup B, C) = \alpha_A\, d(A, C) + \alpha_B\, d(B, C) + \beta\, d(A, B) + \gamma\, |d(A, C) - d(B, C)|.$$
Then there exists $w = (w_1, w_2, \dots)$ with $w_1 = 1$ and $w_2, w_3, \dots \ge 0$ such that for every $A$, $B$, it holds:
$$d(A, B) = \mathrm{OWA}_w\left(\{\|a - b\| : a \in A,\, b \in B\}\right)$$
if and only if $d$ is either the complete linkage or the average linkage.
Proof. That these conditions are sufficient is evident. Hence, assume that is generated by the Lance-Williams formula and that: , where is the distance between an -th pair of points in × and is the distance between an -th pair of points in × . Due to the symmetry of OWAs, it does not matter how we enumerate the pairs, hence we assume
Assuming $w_1 = 1$ and = 2, this yields: With $w_1 = 1$ and studying further > 2, we get each time that = −1 2 . But from the previous equation we have hence 1 + 2 = 2 + 2 2 and thus $w_2 = 1$, which implies that necessarily for all $i$ it must hold $w_i = 1$, which corresponds to the average linkage. As we have already considered the complete linkage case separately, the proof is complete.
Theorem 5. Assume that $d$ is generated by the Lance-Williams formula, just like above. Then there exists $w = (w_1, w_2, \dots)$ with $w_1 = 1$ and $w_2, w_3, \dots \ge 0$ such that for every $A$, $B$, we have:
$$d(A, B) = \mathrm{OWA}'_w\left(\{\|a - b\| : a \in A,\, b \in B\}\right)$$
if and only if $d$ is either the single linkage or the average linkage.
Proof. Sufficiency of the above is obvious. The reasoning required to show the necessary part is very similar to the one we have presented in the proof of Theorem 4.
In other words, it is the distance (as given by the chosen linkage $d$) between the two clusters merged at the $i$-th step.
See [15] for a proof. Hence, single, complete, average, weighted average, and Ward linkages yield increasing $h_i$s.
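For contrast, a linkage outside this list can produce an inversion even on three points. The sketch below uses the centroid linkage (the one shown in Figure 1), taken here as the Euclidean distance between cluster centroids; the point coordinates and helper names are made up for illustration.

```python
import math

def centroid(cluster):
    """Componentwise arithmetic mean of the points in a cluster."""
    n = len(cluster)
    dim = len(cluster[0])
    return tuple(sum(p[k] for p in cluster) / n for k in range(dim))

def centroid_linkage(A, B):
    """Euclidean distance between the centroids of A and B."""
    return math.dist(centroid(A), centroid(B))

# Three singleton clusters forming a nearly equilateral triangle.
A, B, C = [(0.0, 0.0)], [(2.0, 0.0)], [(1.0, 1.8)]

# Step 1: A and B are the closest pair (distance 2 versus about 2.06).
h1 = min(centroid_linkage(A, B),
         centroid_linkage(A, C),
         centroid_linkage(B, C))

# Step 2: the merged cluster A ∪ B has centroid (1, 0), which is
# closer to C than A and B were to each other.
h2 = centroid_linkage(A + B, C)

inversion = h2 < h1  # the second merge happens at a lower height
```

The second merge height (1.8) falls below the first one (2.0), so the corresponding dendrogram would have to draw the later merge beneath the earlier one.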
Let us move on to the OWA linkage case. We note that even very simple weighting triangles can fail to produce increasing OWA-based $h_i$s.
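Focusing on the first type of extended OWA, we shall therefore be interested in identifying coefficient vectors $w$ such that:
$$\mathrm{OWA}_w(\mathbf{x}, \mathbf{y}) \ge \min\{\mathrm{OWA}_w(\mathbf{x}), \mathrm{OWA}_w(\mathbf{y})\} \qquad (7)$$
for all $\mathbf{x}$, $\mathbf{y}$ of any cardinalities, where $(\mathbf{x}, \mathbf{y})$ denotes vector concatenation.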
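Inequality (7) is easy to probe numerically. The sketch below assumes that the first type of extended OWA normalises the first n coefficients of the vector and applies them to the arguments sorted nonincreasingly; the function names, the coefficient vectors, and the input values are illustrative choices, not examples taken from the paper.

```python
def owa_w(values, w):
    """OWA generated by the coefficient vector w (assumed scheme: the
    first len(values) coefficients, normalised to sum to one)."""
    ordered = sorted(values, reverse=True)
    head = w[:len(ordered)]
    return sum(c * v for c, v in zip(head, ordered)) / sum(head)

def satisfies_7(x, y, w):
    """Check inequality (7) for one particular pair of vectors x, y."""
    return owa_w(x + y, w) >= min(owa_w(x, w), owa_w(y, w))

x, y = [4.0], [4.0, 0.0]

# w = (1, 0, 1): OWA(x) = 4 and OWA(y) = 4, but the concatenation
# (4, 4, 0) gets weights (1/2, 0, 1/2), giving 2 < 4.
violated = not satisfies_7(x, y, [1.0, 0.0, 1.0])

# The average-like (1, 1, 1) and complete-like (1, 0, 0) vectors
# pass the check on the same inputs.
ok_average = satisfies_7(x, y, [1.0, 1.0, 1.0])
ok_complete = satisfies_7(x, y, [1.0, 0.0, 0.0])
```

A single failing pair is enough to rule a coefficient vector out, whereas passing a finite number of checks proves nothing by itself; this is why general conditions on the coefficients are of interest.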
The first result we present toward this will help us establish some necessary conditions.
Theorem 11. Let $w = (w_1, w_2, \dots)$ be a coefficient vector with $w_1 = 1$ and $w_2, w_3, \dots \ge 0$. Then for any , and then for all , such that: , we must have: Proof. For any and , with and For any ⩽ , ⩽ , consider = ̄ and = ̄ . As a necessary condition, we must have: , the above is equivalent to stating that: Particular cases of the above allow us to establish the following.
While these two conditions are necessary, the following example illustrates that they are not sufficient.
We therefore look towards establishing a sufficient condition, which we formulate as follows.
The proof is based on several lemmas presented below.
Proof. It is sufficient to show this for $\mathbf{x} \in \{\mathbf{x} \in [0, 1]^n : x_1 \ge x_2 \ge \dots \ge x_n\}$ using symmetry and homogeneity of OWA functions. The graph of OWA on that subdomain is a fragment of a plane, and the inequalities at all the vertices are necessary and sufficient for one graph dominating the other.
Note that the above relationship can also be stated in terms of the cumulative sum of OWA weights, and related to the idea of stochastic dominance, i.e., for it holds that: Let us introduce notation + = ( +1 , +2 , … , + +1 ).
Sketch of the proof. Apply Lemma 15 with = +0 and = + and note that OWA + ( ) = . The trivial cases of the sums in (12) being 0 are considered separately.
Another piece of notation, let ÕWA ( , ) = 1 − ), whose difference to OWA is that no sorting step after concatenation takes place (it is assumed that , are sorted separately). Hence, due to idempotency of the OWA and since we are taking a convex combination, we have:
We can now formulate the proof of the above theorem.
Proof of Theorem 14. Assume without loss in generality that in (7), OWA ( ) = OWA ( ) = ̄ , as any increase in, say, will lead to an increase on the left but not on the right, as well as 1 ⩾ 1 . Then ⩽ OWA ( , ) (decreasing weights starting from the 2nd one).
Note that 1 will not end up in the first position in the sorted list ( , ), hence even if 1 < +1 , it does not negate the last inequality.
Note that, as long as there are no zeroes in the denominators, we can write our sufficient condition as: for all > 1, > .This amounts to all effective weighting vectors obtained from the first arguments of being stochastically dominated by all weighting vectors of the same length starting from a different index.
Note two differences to the necessary conditions in Theorem 11. First, there is a condition attached to (11). Second, we have < . Thus, the condition in (14) is unfortunately stronger than the necessary conditions previously established. We can weaken the sufficient conditions (14) by modifying Theorem 14 to: for some < (owing to the fact that 1 will be positioned ahead of some , = , … , in the sorted list ( , )).
On the other hand, we observe that a fairly simple rule that will ensure satisfaction of (14) is for ratios between sequential values to be increasing, i.e., $w_i / w_{i+1} \le w_{i+1} / w_{i+2}$. However, this is stronger than the sufficient condition. While Examples 18 and 19 above adhere to this rule, the following example shows that decreasing ratios also cannot be framed as a necessary requirement.
Thus, how far the necessary conditions can be tightened and the sufficient ones loosened is still an open problem.

Conclusion and Future Work
In [4], the practical usefulness of OWA-based clustering was evaluated thoroughly on numerous benchmark datasets from the suite described in [6]. It was noted that adding the Genie correction for cluster size inequality [7] leads to high-quality partitions, especially based on linkages that rely on a few closest point pairs (e.g., the single linkage and the fuzzified/smoothed minimum).
In this paper we have presented some previously missing theoretical results concerning the OWA-based linkages.First, we have shown that the OWA-based linkages and the Lance-Williams formula only have three instances in common: the single, average, and complete linkages.Both classes enable a very fast (linear-time) update between iterations and thus are of potential practical interest.
Then, we gave some necessary and some sufficient conditions for the coefficient generating schemes to guarantee that the resulting dendrograms are free from unaesthetic inversions (note that the mentioned Genie correction might additionally introduce inversions by itself). How to tighten these condition sets into "if and only if" statements is still an open problem; follow-up research is welcome.
Furthermore, in [4], a generalised, three-stage OWA linkage scheme was introduced. There are also generalisations of the Lance-Williams formula, e.g., [8]. The relationships between them should also be inspected.

Figure 1: Dendrograms for complete (left-hand side) and centroid (right-hand side) linkage-based clustering of the same sample of the iris dataset, as plotted by the stats:::plot.hclust function in R; note the inversions