Normalised clustering accuracy: An asymmetric external cluster validity measure

There is no, nor will there ever be, a single best clustering algorithm. Nevertheless, we would still like to be able to distinguish between methods that work well on certain task types and those that systematically underperform. Clustering algorithms are traditionally evaluated using either internal or external validity measures. Internal measures quantify different aspects of the obtained partitions, e.g., the average degree of cluster compactness or point separability. However, their validity is questionable because the clusterings they endorse can sometimes be meaningless. External measures, on the other hand, compare the algorithms' outputs to fixed ground truth groupings provided by experts. In this paper, we argue that the commonly used classical partition similarity scores, such as the normalised mutual information, Fowlkes-Mallows, or adjusted Rand index, lack some desirable properties. In particular, they do not identify worst-case scenarios correctly, nor are they easily interpretable. As a consequence, the evaluation of clustering algorithms on diverse benchmark datasets can be difficult. To remedy these issues, we propose and analyse a new measure: a version of the optimal set-matching accuracy, which is normalised, monotonic with respect to a certain similarity relation, scale-invariant, and corrected for the imbalancedness of cluster sizes (but neither symmetric nor adjusted for chance).


Introduction
Clustering is an unsupervised learning technique that aims at identifying semantically useful partitions of a given dataset (Hennig, 2015; von Luxburg et al., 2012). Up to this date, many clustering algorithms have been proposed, but the problem of how to evaluate the overall quality of the outputs they generate is still open for discussion (Xiong & Li, 2014; Tavakkol et al., 2022; Ullmann et al., 2022; van Mechelen et al., 2023). We know that there will never be a single "best" method (Ackerman et al., 2021; Strobl & Leisch, 2022). Nevertheless, at the very least, we would like to be able to identify the algorithms that are somewhat sensible on certain classes of datasets, or filter out those that consistently yield disappointing results.
Internal validity measures, such as the Caliński-Harabasz, Dunn, or Silhouette index (Caliński & Harabasz, 1974; Dunn, 1974; Rousseeuw, 1987), are often used to quantify how well a given partition reflects the structure of the underlying unknown data distribution, for instance, the degree of compactness or separability (e.g., Milligan and Cooper, 1985; Maulik and Bandyopadhyay, 2002; Halkidi et al., 2001; Arbelaitz et al., 2013; Xu et al., 2020). However, they have been recently criticised by Gagolewski et al. (2021), who noted that some popular measures often promote clusterings that are not meaningful, e.g., they return cluster memberships that resemble noise or should rather be employed as outlier detectors (see Fig. 1 therein).
External validity measures, on the other hand, operate under the assumption that benchmark datasets are equipped with expert-given labels; this is true for many recently introduced test batteries (Graves & Pedrycz, 2010; Thrun & Ultsch, 2020; Dua & Graff, 2022; Fränti & Sieranoja, 2018; Gagolewski, 2022). Moreover, they presume that it would be best if an algorithm returned a clustering that is as similar to the reference one as possible. In the literature, it has become customary to use partition similarity scores as external measures, e.g., the adjusted Rand, normalised mutual information, or pair sets indices (Wagner & Wagner, 2006; Horta & Campello, 2015; Rezaei & Fränti, 2016). However, in this paper, we argue that simpler objects can actually be more suitable.
Let X_1, …, X_k be a reference (ground truth, indicated by experts) partition of a set X of n objects into k nonempty and pairwise disjoint clusters. Moreover, let X̂_1, …, X̂_k be a predicted (generated by a clustering algorithm) partition of X into k disjoint clusters. Commonly, the knowledge about which clusters correspond to one another is summarised by a k × k confusion matrix C = (c_{i,j}) whose entry in the i-th row and the j-th column gives the number of elements in the i-th true cluster that an algorithm allocated to the j-th predicted cluster, i.e., c_{i,j} = #(X_i ∩ X̂_j).
In this paper, we are interested in studying real-valued functions aiming at quantifying how similar a predicted partition is to the reference one; compare Fig. 1. For a measure to be useful in the task at hand, it should meet a number of desirable properties (postulates). For the purpose of this introduction, we will now state them only descriptively; the formalism will follow in Sect. 3:

• [MON] The more similar the partitions are, the higher the score should be (in our case, we will consider monotonicity with respect to the diagonal max-dominance relation that we define below).
• [B1] Only two identical partitions yield the highest possible similarity score, which we conventionally assume to be equal to 1.
• [PER] The score does not change if we relabel the cluster IDs (i.e., swapping X_i ↔ X_j or X̂_i ↔ X̂_j for any i ≠ j; this comports with a rearrangement of rows or columns in the corresponding confusion matrix).
The third property stems from the fact that a partition is a set of clusters, and sets (equivalence classes in the equivalence relation "two points belong to the same cluster") are unordered by definition. From this perspective, "Gaussian mixture" and "Gaussian mixture relabelled" in Fig. 1 represent identical partitions. Overall, clustering is an unsupervised learning problem; therefore, we cannot expect an algorithm to guess the order of reference labels.

Remark 1
Traditional approaches to defining classes of partition similarity scores usually require the fulfilment of the above [MON], [B1], and [PER]. However, additionally (see, e.g., Meilă, 2005; Wagner and Wagner, 2006; Xiong and Li, 2014; Rezaei and Fränti, 2016; Arinik et al., 2021), they assume that:

• [SYM] The score is the same if we swap the roles of the two partitions.

Fig. 1 A reference (ground truth; k = 3) partition of an example dataset (WUT/x2; n = 120; see Sect. 4) and some predicted clusterings that we would like to relate to it. We also report the confusion matrices and the values of a few external cluster validity measures defined in Sects. 2 and 3. The Gaussian mixture algorithm misclassifies only five points (ca 4%), but the indices' values are quite different. In our case, the non-normalised measures (here: CA, FM, and R) do not distinguish between the cases of two undesirable partition types: assigning most points into a single cluster (as returned by single linkage) and memberships assigned at random. Note that there can be many possible reference partitions for a given dataset (Dasgupta & Ng, 2009; Gagolewski, 2022): the clusterings returned both by the Gaussian mixture method and K-Means can be considered meaningful (at least, in the current author's opinion). However, we are only relating a predicted partition to one reference clustering at a time.
In other words, neither of the two partitions is treated specially. However, in the context of our paper, we consider X_1, …, X_k as a fixed point of reference for X̂_1, …, X̂_k. Therefore, requiring this condition is not necessary.
Further postulates are related to the worst-case scenarios. When evaluating clustering algorithms on many diverse datasets, we report aggregated similarity scores. It may thus be desirable to have the lower index bound not dependent on, amongst others, the number of clusters k, which may vary across benchmark instances. In other words, we may want to have the indices on the same scale: a common choice is the unit interval. Thus, in Sect. 3, we will discuss the following properties.
• [B0] The smallest possible value of the index is 0.
• [U0] Perfectly uniform assignment of points to predicted clusters, i.e., when c_{i,j} = c_{i,i} for all i, j, results in the similarity score of 0.
• [O0] Assigning all the points to a single cluster gives the similarity score of 0.
We will also relate the above to the adjustment for chance, where the expected value of an index is 0 given two independent partitions having the same marginal frequencies (property [E0]).Nevertheless, we will note that, in general, it contradicts the three above properties.
Moreover, we will also be interested in the following types of scale invariances.

• [SU] The similarity score should remain the same when we double, triple, etc., the number of points in each subset X_i ∩ X̂_j, without changing the structure of the discovered clusters (the output for sC should be the same as the one for C for any s > 0).

• [SC] The similarity score should be the same if we multiply the number of points in one reference cluster. In particular, it should not be affected by the imbalancedness of the true cluster sizes, whether some clusters are relatively small or large (the output for C should be the same as the one for diag(s_1, …, s_k) C for any s_1, …, s_k > 0).
The paper is structured as follows. In the next section, we introduce three basic classes of partition similarity scores. In Sect. 3, we formalise the above properties, study which indices fulfil them, and discuss their various modifications. In particular, we derive a new score called normalised clustering accuracy, which is given by:

NCA(C) = max_{σ: {1,…,k} → {1,…,k} bijective} ( (1/k) Σ_{j=1}^{k} c_{σ(j),j}/c_{σ(j),•} − 1/k ) / (1 − 1/k),

where c_{i,•} is the number of elements in the i-th true cluster. NCA is the averaged percentage of correctly classified points in each cluster above the perfectly uniform cluster membership assignment. NCA relies on finding the permutation (relabelling) that gives the optimal matching of cluster IDs between the reference and the predicted set, which can be computed in O(k³) time. We will prove that it is normalised to the unit interval, monotonic with respect to a particular similarity relation, and corrected for cluster sizes' imbalancedness.
Further, in Sect. 4, we analyse the degree of association between pairs of indices on an example benchmark dataset battery and inspect how the choice of a similarity measure affects the rankings of a few well-known clustering algorithms. Section 5 sketches possible extensions of the introduced measure to the case of clusterings of different cardinalities.
The implementations of the discussed measures are included in the open-source genieclust (Gagolewski, 2021) package for Python and R.

External Cluster Validity Measures
Let C = (c_{i,j}) be a confusion (matching) matrix of size k × k, whose rows correspond to the reference clusters, and columns summarise the predicted cluster memberships. For brevity, we denote the total sum by n = Σ_{i=1}^{k} Σ_{j=1}^{k} c_{i,j}, and the row- and columnwise sums by c_{i,•} = Σ_{j=1}^{k} c_{i,j} and c_{•,j} = Σ_{i=1}^{k} c_{i,j}. In the classical (crisp) clustering setting, all c_{i,j}s are nonnegative integers. In such a case, c_{i,j} denotes the number of points from the i-th reference cluster that an algorithm assigns to the j-th predicted group. Moreover, then n gives the total number of points, c_{i,•} is the cardinality of the i-th reference cluster, and c_{•,j} is the size of the j-th predicted one. Here, the clusterings can be represented by means of two label vectors y = (y_u) and ŷ = (ŷ_u) of length n, where y_u, ŷ_u ∈ {1, …, k} denote the true and predicted memberships of the u-th point. We thus have c_{i,j} = #{u : y_u = i and ŷ_u = j}, c_{i,•} = #{u : y_u = i}, and c_{•,j} = #{u : ŷ_u = j}.
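To make the notation concrete, the confusion matrix and its marginals can be computed from the two label vectors in a few lines. This is a minimal NumPy sketch (the helper name `confusion_matrix` is ours, not part of any package discussed here); the example label vectors are the ones used later in the text:

```python
import numpy as np

def confusion_matrix(y, y_hat, k):
    """c_{i,j} = #{u : y_u = i and y_hat_u = j} (1-based labels)."""
    C = np.zeros((k, k), dtype=int)
    for t, p in zip(y, y_hat):
        C[t - 1, p - 1] += 1
    return C

y     = [1, 2, 2, 1, 2, 3, 1, 1, 1]
y_hat = [3, 1, 1, 3, 1, 2, 3, 3, 3]
C = confusion_matrix(y, y_hat, 3)
n = C.sum()                  # total number of points
row_sums = C.sum(axis=1)     # c_{i,.}: true cluster sizes
col_sums = C.sum(axis=0)     # c_{.,j}: predicted cluster sizes
```

Note that the row sums give the reference cluster cardinalities and the column sums the predicted ones, exactly as in the definitions above.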
Let us review some of the most seminal classes of crisp partition similarity scores to which we will relate our proposal.More measures are discussed by, amongst others, Xiong and Li (2014); Wagner and Wagner (2006); Rezaei and Fränti (2016); Arinik et al. (2021).

Counting Concordant and Discordant Point Pairs
The first class of indices is based on counting point pairs that are concordant (i.e., grouped together in both partitions, or separated in both of them) and those that are discordant (i.e., grouped together in one partition but separated in the other); see the paper by Hubert and Arabie (1985) for discussion.
In particular, the Rand index (Rand, 1971) is defined as the classification accuracy on the set of point pairs, i.e., the proportion of concordant pairs amongst all pairs. Moreover, the Fowlkes-Mallows index (Fowlkes & Mallows, 1983) is the geometric mean between precision and recall computed over the pairs grouped together. A detailed overview of the behaviour of the indices based on counting object pairs is presented by Warrens and van der Hoef (2022); Horta and Campello (2015); Lei et al. (2017).
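The pair counts can be obtained directly from the confusion matrix via binomial coefficients of its entries and marginal sums. The following sketch uses the standard textbook formulas (the helper name `rand_fm` is ours):

```python
import numpy as np
from math import comb, sqrt

def rand_fm(C):
    """Rand and Fowlkes-Mallows indices from a confusion matrix of integers."""
    C = np.asarray(C)
    n = int(C.sum())
    yy = sum(comb(int(c), 2) for c in C.ravel())      # pairs grouped together in both partitions
    a = sum(comb(int(s), 2) for s in C.sum(axis=1))   # pairs together in the reference
    b = sum(comb(int(s), 2) for s in C.sum(axis=0))   # pairs together in the prediction
    nn = comb(n, 2) - a - b + yy                      # pairs separated in both partitions
    R = (yy + nn) / comb(n, 2)                        # proportion of concordant pairs
    FM = yy / sqrt(a * b)                             # geometric mean of precision and recall
    return R, FM
```

For a perfect match (each predicted cluster coinciding with one reference cluster), both indices equal 1.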
In the sequel, we will consider several variants of the above indices, e.g., their normalised and adjusted-for-chance versions.

Information-Theoretic Measures
Another group of partition similarity scores consists of the so-called information-theoretic measures. As a point of departure for further derivations, let us recall the mutual information score (Horibe, 1985):

MI(C) = Σ_{i=1}^{k} Σ_{j=1}^{k} (c_{i,j}/n) log( (c_{i,j}/n) / ((c_{i,•}/n)(c_{•,j}/n)) ),

with the convention 0 log x = 0 for any x ≥ 0. Two variants of this index will be presented below; for further discussion, see the papers by Vinh et al. (2010); van der Hoef and Warrens (2019).
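In terms of the confusion matrix, MI treats C/n as a joint probability distribution and compares it against the product of its marginals. A minimal NumPy sketch (with the 0 log 0 := 0 convention handled via `nansum`; the helper name is ours):

```python
import numpy as np

def mutual_information(C):
    """Mutual information of the joint distribution encoded by a confusion matrix."""
    C = np.asarray(C, dtype=float)
    p = C / C.sum()                       # joint distribution c_{i,j}/n
    pi = p.sum(axis=1, keepdims=True)     # reference marginals c_{i,.}/n
    pj = p.sum(axis=0, keepdims=True)     # prediction marginals c_{.,j}/n
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = p * np.log(p / (pi * pj))
    return float(np.nansum(terms))        # the 0 log 0 := 0 convention
```

For a diagonal matrix diag(s, …, s) of size k, this yields log k, which illustrates the lack-of-normalisation remark made later in the text.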

Accuracy-like Set-matching Measures
Let us note that the Rand and the Fowlkes-Mallows scores use 1/C(n, 2) as the unit of information. Sometimes, we might prefer working on the 1/n-based scale. However, it would be a mistake to rely on the standard accuracy as known from the evaluation of classification models:

Ä(C) = (1/n) Σ_{i=1}^{k} c_{i,i},

which is the proportion of correctly classified points. This measure should not be used in clustering because clusters are defined up to a permutation of the sets' IDs (compare the desired property [PER]).
In our context, the predicted clusters need to be matched with the true ones somehow. For instance, the confusion matrix corresponding to the label vectors y = (1, 2, 2, 1, 2, 3, 1, 1, 1) and ŷ = (3, 1, 1, 3, 1, 2, 3, 3, 3),

C = ( (0, 0, 5), (3, 0, 0), (0, 1, 0) ),

represents a perfect match. Hence, in this case, we could use (c_{2,1} + c_{3,2} + c_{1,3})/n as a measure of accuracy. Consequently, we need an algorithmic way to translate between the predicted and reference cluster IDs. The simplest choice involves greedy pairing over the rows:

P(C) = (1/n) Σ_{i=1}^{k} max_j c_{i,j},

or over the columns:

P^T(C) = (1/n) Σ_{j=1}^{k} max_i c_{i,j}.

These measures are sometimes referred to as purity. Unfortunately, there is no guarantee that every cluster will receive a match.
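The two greedy purity variants translate to one-liners over the row- or column-wise maxima. A sketch using the commonly encountered definitions (helper names are ours); note how `purity` can leave a predicted cluster without any match:

```python
import numpy as np

def purity(C):
    """Greedily match each reference cluster with its largest overlap (row-wise)."""
    C = np.asarray(C)
    return C.max(axis=1).sum() / C.sum()

def purity_T(C):
    """The transposed variant: match each predicted cluster (column-wise)."""
    C = np.asarray(C)
    return C.max(axis=0).sum() / C.sum()

# Both reference clusters "claim" the same predicted cluster here,
# so the second predicted cluster receives no match at all:
C = [[5, 0],
     [5, 0]]
```

For this C, `purity(C)` equals 1 even though the partition is far from perfect, whereas `purity_T(C)` equals 0.5; this asymmetry illustrates why a one-to-one matching is preferable.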
If we want to remedy this issue, we should seek a one-to-one correspondence between the cluster IDs, σ, which is a solution to the following optimisation problem:

maximise Σ_{i=1}^{k} c_{i,σ(i)} over σ ∈ S_k,

where S_k is the set of all permutations of the set {1, …, k}, i.e., bijections from {1, …, k} to itself. The above guarantees that each column is paired with one and only one row in the confusion matrix. Such optimal pairing leads to, what we call here, the pivoted accuracy (Meilă and Heckerman, 2001, Eq. (13); Steinley, 2004, Eq. (10); Charon et al., 2006; Chacón, 2021):

A(C) = max_{σ ∈ S_k} (1/n) Σ_{i=1}^{k} c_{i,σ(i)}.

It relies on the best matching of the labels in the reference partition to the labels in the predicted grouping so as to maximise the standard accuracy. At first glance, it is a very attractive measure because of its interpretability (a feature whose importance was already noted by Goodman and Kruskal (1979)). However, soon, we will note that its values for the worst possible clusterings depend on the number of clusters k.
Remark 2 Eq. 11 can be expressed as the following 0-1 integer linear programming problem: maximise Σ_{i,j} b_{i,j} c_{i,j}, where B = (b_{i,j}) is a binary matrix with one and only one value 1 in every row and in every column, i.e., b_{i,•} = 1 and b_{•,j} = 1 for all i, j. It is called the maximal linear sum assignment problem or maximum bipartite graph matching. It can be solved using, e.g., the so-called Hungarian algorithm, which requires O(k³) time (Crouse, 2016).
Remark 3 Note that optimal matching is not the same as the greedy recursive pairing, where we pick the largest element, and then continue the same procedure in the yet-to-be-selected parts of the matrix; the two procedures can yield different matchings.

In what follows, we will discuss many modifications of the aforementioned indices.
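Both strategies are easy to compare in code. The optimal matching can be delegated to `scipy.optimize.linear_sum_assignment` (an O(k³) assignment solver); the greedy variant repeatedly picks the globally largest remaining entry. The example matrix below is our own illustration, not the one from the paper:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def pivoted_accuracy(C):
    """A(C): accuracy under the optimal one-to-one matching of cluster IDs."""
    C = np.asarray(C)
    rows, cols = linear_sum_assignment(C, maximize=True)
    return C[rows, cols].sum() / C.sum()

def greedy_accuracy(C):
    """Repeatedly pick the globally largest entry, excluding its row and column."""
    C = np.asarray(C, dtype=float)
    n = C.sum()
    M = C.copy()
    total = 0.0
    for _ in range(C.shape[0]):
        i, j = np.unravel_index(np.argmax(M), M.shape)
        total += M[i, j]
        M[i, :] = -np.inf
        M[:, j] = -np.inf
    return total / n

C = [[4, 3, 0],
     [3, 0, 0],
     [0, 0, 1]]
```

Here, greedy pairing grabs the entry 4 first and attains only 5/11, whilst the optimal assignment (matching rows 1, 2, 3 with columns 2, 1, 3) attains 7/11.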

Desirable Properties and Features of Indices
Let C_{k×k} denote the set of admissible confusion matrices of size k × k such that if C = (c_{i,j}) ∈ C_{k×k}, then c_{i,j} ≥ 0 and c_{i,•} > 0 for all i, j. Note that the case where the confusion matrix is non-square will be mentioned separately later (Sect. 5).
Even though we assume that our reference partitions are crisp, in general, we do not want to restrict ourselves to classical (crisp) clustering only. Thus, whether we allow the c_{i,j}s to be arbitrary nonnegative real numbers (not just integer ones) will depend on the index. This will not only simplify the further analysis, but also allow us to accommodate, amongst others, the case of weighted (fuzzy) predicted clusterings, where point memberships are described by probability vectors; compare, e.g., the papers by Hüllermeier et al. (2012, 2021); Horta and Campello (2015). We will see that, under the scale invariance property [SU] discussed below, this will come without loss of generality.
Additionally, we assume that none of the reference clusters is empty.However, we actually allow a clustering algorithm to return a partition of lower cardinality than k, which is represented by the case of some c •, j 's being equal to 0.
In Sect. 1, we outlined a few desirable properties of cluster validity measures in a rather informal manner. Let us formalise them now so that we can introduce various adjustments to the indices. For brevity, all the definitions below should be read as "We say that an index I meets Property [X], whenever for any k ≥ 2 and…". A summary will be given in Table 2.

Permutation Invariance
For any σ ∈ S_k, denote by P_σ = (p_{i,j}) the corresponding permutation matrix, i.e., a k × k binary matrix with p_{i,j} = 1 if and only if i = σ(j). We stated that clusters are defined up to a permutation of cluster IDs. Therefore, rearranging rows or columns of a confusion matrix should not affect the index value.

Definition 1 (Property [PER]) For all C ∈ C_{k×k} and every permutation σ ∈ S_k, we have I(P_σ C) = I(C P_σ) = I(C).

As the standard classification accuracy Ä (Eq. 7) is the only index amongst the ones considered in this paper that does not fulfil this property, we will no longer be considering it a viable cluster validity measure candidate.

Symmetry
In traditional partition similarity scores, both partitions should be treated equally. This is reflected by the symmetry property.

Definition 2 (Property [SYM]) For all C ∈ C_{k×k}, we have I(C^T) = I(C).

Disregarding from now on the two versions of purity, P and P^T, which are semantically problematic, all the aforementioned scores enjoy this condition. However, we have already argued that in the context of external cluster validation, [SYM] can be omitted. For a given benchmark problem, the reference partition is fixed. Thus, we treat it differently from the predicted ones, because the latter vary across the algorithms under scrutiny.

Scale Invariance
Another property we might find useful is that scaling the confusion matrix should not change the similarity assessment. Scaling can be interpreted as adding points to (or removing them from) the detected clusters without disturbing the discovered structure, i.e., while maintaining the proportions of cluster sizes. In other words, if we have 50% of the points correctly classified, whether this was achieved for n = 100 or n = 10,000 should not matter.

Definition 3 (Property [SU]) For all C ∈ C_{k×k} and every s > 0, we have I(sC) = I(C).
Amongst the indices studied so far, only those based on counting concordant/discordant point pairs do not enjoy this property.
Example 1 For instance, for C given by Eq. 14, we have FM(C) ≈ 0.35297 but FM(3C) ≈ 0.35727, and R(C) ≈ 0.56928 but R(3C) ≈ 0.57023. The difference becomes even more significant when we consider non-integer confusion matrices. Assuming that we generalise the binomial coefficient C(x, 2) for arbitrary reals as x(x − 1)/2, we get, e.g., FM((1/900)C) ≈ −1.08054 and R((1/900)C) ≈ 1.21464. This is why, in the context of the R and FM scores, the study of their properties will be limited to integer matrices only.
However, we can consider R∞(C) = lim_{s→∞} R(sC) and FM∞(C) = lim_{s→∞} FM(sC). Noting that in Eqs. 3 and 5, if the total sum of C is n, then the sum of sC is sn, these limits arise by replacing each C(x, 2) with x²/2 in Eqs. 2 and 4. Therefore, the price for the fulfilment of the scale invariance property is the loss of the original interpretation related to the counting of concordant pairs.

Upper Bound
From now on, let C_{k×k}^{s_1,…,s_k} ⊆ C_{k×k} denote the set of confusion matrices C = (c_{i,j}) whose i-th row sum c_{i,•} is equal to s_i for all i = 1, …, k.
In order for an index value to be interpretable, we need to calibrate it so that we know which scores indicate low- and high-quality outcomes. In the literature, it is customary to assign the value of 1 when there is a perfect match, and preferably only then. In clustering problems, it is the case when each predicted set can be mapped to precisely one reference cluster, and vice versa. This is represented by confusion matrices with 0s everywhere but on the diagonal or (by [PER]) their permuted versions; compare, e.g., Eq. 8.
The Rand and Fowlkes-Mallows indices enjoy [B1] when restricted to the case of integer matrices. Mutual information is not normalised; e.g., for a matrix of the form I = diag(s, …, s), we have MI(I) = log k for all s > 0. The other indices discussed so far fulfil this property.

Example 2
The MI score can be rescaled so as to meet [B1]. For instance, the normalised mutual information score, denoted by I/D₂ by Kvalseth (1987) and NMI_sum by Vinh et al. (2010), divides MI by the arithmetic mean of the two marginal entropies. It fulfils, amongst others, [SU] and [B1].

Adjustment for Chance
In the context of comparing partitions, some statistical literature finds it desirable if two groupings generated independently at random are assigned a similarity score of 0 on average (Hubert & Arabie, 1985;Vinh et al., 2010).
For instance, under the hypergeometric model for randomness discussed by Fowlkes and Mallows (1983) (for possible alternatives, see Steinley (2004) and Gates and Ahn (2017)), the two partitions are assumed to be picked independently at random subject to having the original n, k, and counts of objects in each cluster, i.e., c_{1,•}, …, c_{k,•} and c_{•,1}, …, c_{•,k}. Given this assumption, the p-th falling-factorial moment is E[(c_{i,j})_p] = (c_{i,•})_p (c_{•,j})_p / (n)_p, where (x)_p = x(x − 1)⋯(x − p + 1); in particular, E[c_{i,j}] = c_{i,•} c_{•,j}/n. We can thus introduce the following property.
Definition 5 (Property [E0]) For all s_1, …, s_k > 0, t_1, …, t_k > 0, and the random matrix C ∈ C_{k×k} generated from the hypergeometric model with c_{i,•} = s_i and c_{•,j} = t_j for all i, j, we have E[I(C)] = 0.
Given any index I, its adjusted-for-chance version can be constructed by taking:

AI(C) = (I(C) − E[I(C)]) / (Ī − E[I(C)]),

where Ī is the maximal index value (e.g., 1 for indices fulfilling [B1]) and E[I(C)] is the expected index value under the assumed randomness model. This way, the maximal value of AI is 1 and its expectation is 0 when two partitions are unrelated.

Example 3
Based on the above adjustment scheme, denoting the expected Rand index by E[R(C)], we get the adjusted Rand index proposed by Hubert and Arabie (1985):

AR(C) = (R(C) − E[R(C)]) / (1 − E[R(C)]).
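In the integer case, AR can be computed directly from the pair counts and their hypergeometric expectations; the widely used Hubert-Arabie closed form reads as follows (a standard sketch with our own helper name, not the paper's implementation):

```python
import numpy as np
from math import comb

def adjusted_rand(C):
    """Adjusted Rand index (Hubert-Arabie form) from an integer confusion matrix."""
    C = np.asarray(C)
    n = int(C.sum())
    s = sum(comb(int(c), 2) for c in C.ravel())       # pairs grouped together in both partitions
    a = sum(comb(int(r), 2) for r in C.sum(axis=1))   # pairs together in the reference
    b = sum(comb(int(c), 2) for c in C.sum(axis=0))   # pairs together in the prediction
    expected = a * b / comb(n, 2)                     # E[s] under the hypergeometric model
    return (s - expected) / ((a + b) / 2 - expected)
```

A perfect match yields 1; a uniform assignment yields a (slightly) negative value, in line with the discussion of lower bounds further below.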
The FM index can be adjusted in a similar manner, leading to the AFM index. Note that Eqs. 20 and 21 are very similar: the difference is that in the former, we have the arithmetic mean in the denominator, whilst in the latter, we see the geometric mean of two terms. In practice, the two indices behave very similarly (see Sect. 4).
Example 4 An adjusted version of the NMI score was studied by Vinh et al. (2010) (who denoted it by AMI_sum). It is obtained by plugging the expected value of MI under the same model into the above adjustment scheme; we will not be expanding the resulting formula further because of its complexity.
None of the three aforementioned adjusted indices are scale-invariant.By construction, they are defined only for integer confusion matrices.

Lower Bounds and Normalisation of Indices
An index that is adjusted for chance has the expected value of 0 for partitions generated at random from an assumed model of randomness. In order for that to be possible, it must take negative values in cases "worse than average when picked at random". From the current paper's perspective, the former case is not as undesirable as the latter one, where the predicted cluster memberships in each true cluster are assigned uniformly. The adjusted Rand index indicates this correctly: it yields AR(C) = 0 and AR(U) ≈ −0.019. However, we argue that a negative index value might be difficult to interpret, especially if its lower bound depends on the scale of the confusion matrix.
Instead of looking from a statistical viewpoint, we can take an algebraic perspective, where bounding the index from below, e.g., by 0, could be more informative (Charon et al., 2006; Chacón & Rastrojo, 2023). This is beneficial in the case where we run a clustering algorithm on many benchmark datasets and thus obtain numerous similarity scores that should be aggregated into a single number (e.g., by considering sample quantiles or the arithmetic mean). In this way, we bring all the indices to the same range, e.g., [0, 1], which depends on neither k nor n. If the minimum is difficult to obtain or is uninformative, the value of 0 should be attained by confusion matrices that we identify as undesirable outcomes.
We proclaim that there are two worst-case outcomes of a clustering algorithm when its results are compared with a fixed reference partition. The first scenario, represented by the matrix U given by Eq. 24, is where the elements in each row of the confusion matrix are equal to each other, i.e., c_{i,j} = s_i/k for all i, j. It corresponds to a clustering method that assigns the cluster memberships uniformly, in an uninformed manner.
The second undesirable case, given by Eq. 25, is where all columns but one are equal to 0. It represents predicted partitions where all the points are allocated to the same (one) cluster, assuming given true cluster sizes s_1, …, s_k > 0.
These two cases lead us to the following desirable properties.

Definition 6 (Property [U0]) For all s_1, …, s_k > 0, we have I(U) = 0 for the matrix U given by Eq. 24.

Definition 7 (Property [O0]) For all s_1, …, s_k > 0, we have I(O) = 0 for the matrix O given by Eq. 25.
In Table 1, we have summarised the index values for the matrices given by Eqs. 24 and 25. The derivations are quite elementary; hence, they were omitted. Based on these results, we can infer that, out of the indices considered so far, only MI and NMI fulfil [U0]. On the other hand, [O0] is true for MI, NMI, AR, AFM, and AMI.
From Table 1, we see that for matrices like U (Eq. 24), AR, AFM, and AMI actually take negative values; Chacón and Rastrojo (2023) give the formula for the minimum of AR. Furthermore, R∞, FM∞, and A are bounded from below by 1/k; see Appendix A.2 for a proof in the case of the A index. Thus, that the lowest possible value of an index is equal to 0 must be introduced as a separate property. Out of the indices considered so far, only MI and NMI fulfil [B0]. Note that even if an index fulfils both [U0] and [O0], there is no guarantee that it is bounded from below by 0.

Example 6
Consider the limiting version NR of AR (a formula equivalent to the one proposed by Morey and Agresti (1984)), as well as the limiting version NFM of AFM. (In Table 1, the exact formulae denoted by standalone "< 0"s were omitted due to their complexity; they are all negative and approach 0 as n → ∞.) At the cost of [E0], we have gained [SU], [U0], and [O0] (compare Table 1). However, [B0] does not hold; for instance, for the matrix given by Eq. 28, we have NR = NFM = −1/15.
If we are able to identify the minimum of an index I over all matrices with given row sums, denoted by I_min, we can introduce its normalised version by taking:

NI(C) = (I(C) − I_min) / (Ī − I_min),

where again Ī is the maximal index value. This way, we obtain an index whose minimum value is 0 and maximum is 1.
Example 7 For all s_1, …, s_k > 0, we have min_{C ∈ C_{k×k}^{s_1,…,s_k}} A(C) = 1/k; see Appendix A.2. We can thus introduce the normalised variant of the pivoted accuracy:

NA(C) = (A(C) − 1/k) / (1 − 1/k).

NA is an example of an index that fulfils [U0] and [B0], but not [O0] (compare Table 1).
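NA is thus a two-line extension of the pivoted accuracy (a sketch assuming NumPy and SciPy; the helper name is ours):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def normalised_pivoted_accuracy(C):
    """NA(C) = (A(C) - 1/k) / (1 - 1/k), with A the optimally matched accuracy."""
    C = np.asarray(C)
    k = C.shape[0]
    rows, cols = linear_sum_assignment(C, maximize=True)
    A = C[rows, cols].sum() / C.sum()
    return (A - 1.0 / k) / (1.0 - 1.0 / k)
```

For a perfectly uniform assignment, A = 1/k and hence NA = 0, which illustrates [U0].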

Invariance to Cluster Sizes
Here is a more restrictive version of scale invariance: we posit that increasing the number of points in any reference cluster and assigning the new points to the predicted sets without disturbing the proportions of allocations should not change a given cluster validity index; compare the paper by Rezaei and Fränti (2016), who studied a similar postulate. For instance, if 50% of the points are correctly classified in the first reference cluster, then it should not matter whether its cardinality is, say, c_{1,•} = 100 or c_{1,•} = 10,000.
Definition 9 (Property [SC]) For all C ∈ C_{k×k} and every s_1, …, s_k > 0, we have I(diag(s_1, …, s_k) C) = I(C).

Note that diag(s_1, …, s_k) C is a version of the original matrix whose i-th row is multiplied by s_i.
If an index satisfies [SC], we will say that it is invariant to inequalities in the cluster sizes.

Example 9 Using the same transformation, but this time applied to the pivoted accuracy given by Eq. 12, we can also introduce the clustering accuracy:

CA(C) = max_{σ ∈ S_k} (1/k) Σ_{j=1}^{k} c_{σ(j),j} / c_{σ(j),•}.

It is the average of the proportions of correctly classified points in each true cluster.
Example 10 As min_{C ∈ C_{k×k}^{s_1,…,s_k}} CA(C) = 1/k (see Appendix A.2), we can finally arrive at the new normalised clustering accuracy that we announced in the introduction, which is the normalised version of CA:

NCA(C) = (CA(C) − 1/k) / (1 − 1/k).

It fulfils, amongst others, [SC] and [B0].
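Putting the pieces together, computing NCA amounts to normalising the rows of the confusion matrix, finding the optimal matching, and rescaling. A compact sketch assuming NumPy and SciPy (the reference implementation lives in genieclust, as noted in Sect. 1; the helper name below is ours):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def nca(C):
    """NCA(C) = (CA(C) - 1/k) / (1 - 1/k)."""
    C = np.asarray(C, dtype=float)
    k = C.shape[0]
    T = C / C.sum(axis=1, keepdims=True)          # divide each row by c_{i,.} -> [SC]
    rows, cols = linear_sum_assignment(T, maximize=True)
    CA = T[rows, cols].mean()                     # optimally matched average per-cluster accuracy
    return (CA - 1.0 / k) / (1.0 - 1.0 / k)
```

Rescaling any row of C leaves T, and hence NCA, unchanged ([SC]); both the perfectly uniform assignment and the single-cluster output yield NCA = 0 ([U0], [O0]).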
Remark 4 On a side note, a different type of scaling was considered by Rezaei and Fränti (2016, Sec. 6.1). It is based on the Braun-Blanquet similarity coefficient (Braun-Blanquet, 1932) and leads to an index that does not guarantee the fulfilment of our [SC], but preserves symmetry. The authors suggested a normalisation based on the sorted marginal sums, where c_{(i),•} is the i-th largest row sum and c_{•,(i)} is the i-th largest column sum, which leads to the index NBA. It fulfils [U0] and [O0], but fails to meet [B0]; e.g., for the matrix given by Eq. 28, we get NBA = −1/3. Interestingly, the authors overcome this limitation by simple clipping, defining the pair sets index as PSI(C) = max{NBA(C), 0}.

Monotonicity
It is not rare in the literature to seek functions f : Z → R which preserve a certain partial order on a set Z, i.e., to expect that f(z) ≤ f(z′) whenever z ⊑ z′. For instance, generalised means (e.g., the arithmetic mean and the median) are nondecreasing in each variable (Bullen, 2003, p. xxvi; Grabisch et al., 2009, Chap. 4), and economic inequality indices (e.g., the Gini or Bonferroni ones) preserve the majorisation relation (are Schur-convex, satisfy the principle of progressive transfers; Arnold, 2015, Chap. 4).
In the literature, the monotonicity property of clustering similarity indices is often studied in the context of merging or splitting clusters; see the overview by Arinik et al. (2021). Inspired by an empirical sensitivity analysis by Rezaei and Fränti (2016, Sec. 7.3), we would now like to propose one of the possible ways to formalise the property descriptively formulated as "the more similar the predicted partition is to the reference one, the higher the score should be".
Let us define a class of confusion matrices where the maximal elements in each row lie on the main diagonal.
Definition 10 We call a matrix C ∈ C_{k×k} diagonally max-dominant, whenever for all i, j, we have c_{i,i} ≥ c_{i,j}.
In our context, such matrices represent the case where there is no doubt as to which predicted cluster corresponds to which reference one.We proclaim that, then, correcting the belongingness of a misclassified point (moving it from cluster j to cluster i, when it indeed belongs to i), should, at the very least, not result in a decrease of an external cluster validity measure.
Let the partial order ⊑_DMD on C_{k×k} be defined in such a way that C ⊑_DMD C′ if and only if C and C′ have identical row sums (c_{i,•} = c′_{i,•} for i = 1, …, k), are both diagonally max-dominant, and c_{i,i} ≤ c′_{i,i} for all i. When I satisfies [MON], then for all diagonally max-dominant C ∈ C_{k×k}, every i ≠ j, and c_{i,j} ≥ t > 0, we have I(C) ≤ I(C′), where C′ is a version of C except that c′_{i,i} = c_{i,i} + t and c′_{i,j} = c_{i,j} − t.
We can define strict monotonicity similarly, by replacing ≤ with < and ⊑ with ⊏ (i.e., ⊑ and not =), respectively, in the above definition.
For a diagonally max-dominant matrix C, the optimal matching is realised by the identity permutation, i.e., max_{σ ∈ S_k} Σ_i c_{i,σ(i)} = Σ_i c_{i,i}, because each term satisfies c_{i,σ(i)} ≤ c_{i,i}. Therefore, it is easily seen that A, NA, CA, and NCA are naturally strictly monotonic. Moreover, improving the memberships in the i-th true cluster always results in the same change of the index: they increase linearly (the increment depends on c_{i,•} for CA and NCA).

Example 11 For any integer s_1, …, s_k ≥ 1, let C ∈ C_{k×k} be a randomly generated matrix whose rows are drawn using a procedure that guarantees that C is diagonally max-dominant, has positive integer elements, and that its consecutive row sums are s_1, …, s_k.
We generate 10,000 random matrices like C with k = 4 and the row sums c_{1,•} = c_{2,•} = c_{3,•} = 100 and c_{4,•} = 700, and then improve the membership of only one point, obtaining C′ with c′_{1,1} = c_{1,1} + 1 and c′_{1,2} = c_{1,2} − 1. Figure 2 presents a box-and-whisker plot depicting the empirical distribution of the 10,000 differences I(C′) − I(C) for each index I. If any increment is negative, and this is the case for all the indices but A, CA, NA, and NCA, then we conclude that [MON] does not hold. Moreover, we note how unpredictably the indices respond to such a simple adjustment of the confusion matrix.
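This experiment can be emulated in a few lines. The sketch below is our own code: the paper's exact row-generating procedure is not reproduced here, so a simple multinomial recipe is substituted, and SciPy's `linear_sum_assignment` stands in for the optimal set matching. It verifies that the set-matching accuracy A responds to such a one-point correction with a constant increment of 1/n:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)

def accuracy(C):
    """Optimal set-matching accuracy A(C) = max_sigma sum_i c[i, sigma(i)] / n."""
    r, c = linear_sum_assignment(C, maximize=True)
    return C[r, c].sum() / C.sum()

def random_dmd_row(k, i, s):
    """Draw a random positive integer row with sum s whose maximum sits at
    position i (a simple stand-in for the paper's generator)."""
    row = rng.multinomial(s - k, np.full(k, 1.0 / k)) + 1  # positive entries
    j = int(np.argmax(row))
    row[i], row[j] = row[j], row[i]                        # move the max to the diagonal
    return row

diffs = []
for _ in range(1000):
    C = np.array([random_dmd_row(4, i, s) for i, s in enumerate([100, 100, 100, 700])])
    Cp = C.copy()
    Cp[0, 0] += 1
    Cp[0, 1] -= 1                      # correct one misclassified point
    diffs.append(accuracy(Cp) - accuracy(C))

# For diagonally max-dominant matrices, A increases by exactly 1/n = 1/1000:
print(min(diffs) > 0, np.allclose(diffs, 1.0 / 1000.0))
```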

Discussion
Table 2 summarises which index enjoys each of the above properties. Overall, the proposed measure, the normalised clustering accuracy (NCA), fulfils all properties except [SYM] and [E0].
As we have already noted, losing [SYM] is the price we pay for enforcing the invariance to cluster sizes [SC] using the employed normalisation.
Moreover, a nontrivial index cannot have an expected value of 0 for random partitions ([E0]) if it is bounded from below by 0 ([B0]). As far as matrices following the hypergeometric model are concerned, Fig. 3 depicts how the expected values change as a function of k in the case of clusters of equal sizes. Overall, for all indices except R, R′, and MI, we observe a decreasing trend.
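The hypergeometric model is easy to simulate: fix the true labels and draw the predicted ones as a random permutation thereof. A minimal Monte Carlo sketch (our code, assuming NumPy/SciPy; function names are ours) that estimates the expected set-matching accuracy for equal-sized clusters follows; the decreasing trend in k is visible already for small k:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(1)

def accuracy(C):
    """Optimal set-matching accuracy A(C)."""
    r, c = linear_sum_assignment(C, maximize=True)
    return C[r, c].sum() / C.sum()

def expected_accuracy(k, n_per_cluster, reps=2000):
    """Monte Carlo estimate of E[A] when the predicted labels are a random
    permutation of the true ones (the hypergeometric model, equal sizes)."""
    y = np.repeat(np.arange(k), n_per_cluster)
    vals = []
    for _ in range(reps):
        yp = rng.permutation(y)
        C = np.zeros((k, k), dtype=int)
        np.add.at(C, (y, yp), 1)       # build the confusion matrix
        vals.append(accuracy(C))
    return float(np.mean(vals))

for k in (2, 3, 5):
    print(k, round(expected_accuracy(k, 100), 3))
```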
Example 12 To show that most of the properties are actually mild and do not contradict one another, let us consider the index Z given by: Z fulfils all the properties considered except [E0] (however, it is only weakly, not strictly, monotone).
Let us now gain some insight into how to interpret concrete index values. We have already said that clustering is not classification: clusters are defined up to a permutation of set IDs. The price we pay for normalisation is the loss of one "degree of freedom" in the underlying formula; this is what makes the normalised measures somewhat less intuitive. From this perspective, NA and NCA have the most appealing interpretation, because they can be equivalently rewritten as classification rates (averages of the proportions of correctly classified points in each cluster) above the random cluster membership assignment and with the optimal matching of cluster IDs, with the difference being that NA relates the current partition to the one where all true clusters are of the same size.
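Under this interpretation, NCA for equal partition cardinalities can be sketched in a few lines: row-normalise the confusion matrix, find the optimal matching of cluster IDs, and rescale so that the uniform assignment scores 0 and a perfect match scores 1. The function name is ours, and SciPy's Hungarian-algorithm routine is used for the matching; rows are assumed to represent the reference clusters:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def nca(C):
    """Normalised clustering accuracy: the average per-cluster recall under
    the optimal matching of cluster IDs, rescaled so that the uniform
    membership assignment scores 0 and a perfect match scores 1."""
    C = np.asarray(C, dtype=float)
    k = C.shape[0]
    R = C / C.sum(axis=1, keepdims=True)    # row-normalise: per-cluster proportions
    r, c = linear_sum_assignment(R, maximize=True)
    return (R[r, c].sum() - 1.0) / (k - 1.0)

I3 = np.eye(3) * 50
print(nca(I3))                # perfect match: 1.0
U3 = np.full((3, 3), 10)
print(nca(U3))                # uniform assignment: approximately 0.0
```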
Example 14 Let us study the behaviour of the normalised indices when we transition, one point at a time, from the case of all points allocated to a single predicted cluster (O^{k×k}_{s_1,…,s_k}), through all points correctly classified (the identity-like matrix), to the uniform assignment (U^{k×k}_{s_1,…,s_k}).
Overall, we obtained 74 · 10 = 740 readings of each of the similarity scores. We do not take MI into account, because it does not meet [B1].
Let us first relate the indices to each other pairwise (95 cases of perfect agreement between the true and predicted labels were removed, as they trivially correspond to all scores being equal to 1). There is a very high degree of correlation between specific pairs (Pearson's r > 0.999): R and R′; FM and FM′; AFM, NFM′, AR, and NR′; NCFM and NCR′; AMI and NMI. Therefore, from now on, we only consider the last index from each group.
Some non-normalised indices correlate quite strongly with their normalised counterparts. The following index pairs have ρ > 0.95: R and NR′ (ρ ≈ 0.970); MI and NMI (ρ ≈ 0.963); CA and NCA (ρ ≈ 0.960); A and NA (ρ ≈ 0.951). Nevertheless, we note that the true partitions in our benchmark set have diverse cardinalities (ranging from 2 to 50), and the lower bounds of the non-normalised indices are a function of k.
Focusing on the normalised indices, the scatter plot matrix in Fig. 6 shows how different indices relate to one another.
Let us note that 29 of the 74 true label vectors define clusters of equal sizes. Overall, for 39 label vectors, the Gini index of the true cluster sizes is less than 0.05 (the average Gini index is 0.189). Therefore, in our study, the correction for cluster sizes has no or very little effect in many cases. This is why NR′ and NCR′, NMI and NCMI, as well as NA and NCA correlate highly with one another. We also note a high degree of rank correlation between: NCR′ and NCA; NCR′ and NCMI; NR′ and NA; NR′ and NMI; NBA and NA. Hence, we can try to express one index from each pair as a monotone function of the other without too big an error.
We may also be interested in determining how the different external validity measures allow us to compare the overall quality of the 10 algorithms. Table 3 gives the rankings based on the median scores. Dasgupta and Ng (2009) and Gagolewski (2022) noted that one dataset can have many possible equally valid clusterings. Thus, in what follows, in the case of datasets with more than one reference labelling, for each algorithm, the best score is taken into account. Let us stress that the purpose of this experiment is not to discover good algorithms, but to assess the effect of the index choice on the rankings. Only two methods, Gaussian mixtures and spectral clustering, are consistently ranked as the best ones.
If we employ Kendall's τ, a correlation measure based on counting concordant and discordant pairs of objects, to assess the similarity of the rankings, we discover that the rankings generated by NMI and NCMI are the least correlated with those produced by the other indices. For instance, they both rank the single linkage method third, whereas most other indices consider it the worst. The most similar pair is NA and NBA. Also, NCR′ correlates highly with NR′, NBA, NA, and NCA.

Partitions of Different Cardinalities
Many indices can be extended to the case of confusion matrices whose number of rows, k, differs from the number of columns, k′. For instance, in the definitions of R, FM, MI, and their derivatives, we can replace the summation Σ_{i=1}^{k} Σ_{j=1}^{k} with Σ_{i=1}^{k} Σ_{j=1}^{k′}. The normalised clustering accuracy can be generalised in a similar manner. We leave the exploration of the extensions of the indices to the case k′ ≠ k and their properties as a topic for further research, because it deserves a more exhaustive treatment, for which no room is left in the current contribution.
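One simple, penalising way to handle k′ ≠ k is to treat missing predicted clusters as empty and to leave surplus predicted clusters unmatched (and thus uncounted). The sketch below (our code and function name, assuming SciPy's assignment solver and reference clusters in the rows) follows that recipe:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def nca_general(C):
    """A sketch of a penalising extension of NCA to k != k' (k true clusters
    in rows, k' predicted clusters in columns): missing predicted clusters
    are treated as empty, surplus ones remain unmatched and uncounted."""
    C = np.asarray(C, dtype=float)
    k, kp = C.shape
    if kp < k:                                   # pad with empty predicted clusters
        C = np.hstack([C, np.zeros((k, k - kp))])
    R = C / C.sum(axis=1, keepdims=True)         # per-cluster proportions
    r, c = linear_sum_assignment(R, maximize=True)
    return (R[r, c].sum() - 1.0) / (k - 1.0)

# Two true clusters, four predicted ones; the spurious columns are penalised:
C = np.array([[40, 10, 0, 0],
              [ 5, 45, 0, 0]])
print(round(nca_general(C), 6))   # -> 0.7
```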

Conclusion
We reasoned that using symmetric partition similarity scores to compare predicted partitions with fixed reference ones might not be ideal. Thus, we proposed a new measure called normalised clustering accuracy, which uses optimal cluster matching and is corrected for cluster sizes. We showed that it enjoys a number of desirable properties, including scale invariance, boundedness, and monotonicity. As far as interpretability is concerned, e.g., NCA = 0.5 means that, above the perfectly uniform cluster membership assignment, on average 50% of the points in each cluster are correctly discovered.
As a topic for further research, we would like to use some of the proposed properties as a basis for characterising certain indices. We would also like to study other desirable features discussed in the literature (Hennig, 2015; Wagner & Wagner, 2006; Warrens & van der Hoef, 2022; Rezaei & Fränti, 2016; Luna-Romera et al., 2019; Gates et al., 2019; Arinik et al., 2021; Xiang, 2012). It will also be interesting to derive the formulae for the expected values of max_σ Σ_i c_{i,σ(i)} and max_σ Σ_i c_{i,σ(i)}/c_{i,•} in the hypergeometric and other random models.
Furthermore, we will extend our notes from Sect. 5 to the case where the predicted clustering can be finer- or coarser-grained than the reference one (partitions of different cardinalities), including whole cluster hierarchies. We will also formulate similar properties that are more tailored to comparing fuzzy/soft/probabilistic (e.g., overlapping) or other types of partitions (Horta & Campello, 2015; Campagner et al., 2023; Andrews et al., 2022; D'Ambrosio et al., 2021; Hüllermeier et al., 2012), also in the case where the reference clustering is not crisp (compare Flynt et al., 2019).
Moreover, we may want to study similar measures adjusted to the problem of semi-supervised learning, i.e., where some cluster memberships are known to the algorithm a priori.

The inequality (Σ_{i=1}^{k} c_{i,j})² ≤ k Σ_{i=1}^{k} c²_{i,j}, which holds for all j, is implied by the Cauchy–Schwarz inequality, which states that ⟨u, v⟩ ≤ ‖u‖ ‖v‖ with u = (c_{1,j}, …, c_{k,j}) and v = (1, …, 1). As NCFM and NCR′ fulfil [U0] and [O0] (compare Table 1), this implies that they also meet [B0].

A.2 Proof that A Is Bounded from Below by 1/k
We want to prove that, for all C ∈ C^{k×k}, we have A(C) = max_{σ∈S_k} (1/n) Σ_{j=1}^{k} c_{σ(j),j} ≥ 1/k.
Let σ_1 ∈ S_k be a permutation maximising Σ_{j=1}^{k} c_{σ(j),j}, and let σ_2, …, σ_k be chosen so that, for each j, {σ_1(j), σ_2(j), …, σ_k(j)} = {1, …, k} (e.g., the cyclic shifts of σ_1). As σ_1 is the maximal permutation, we have Σ_{j=1}^{k} c_{σ_1(j),j} ≥ Σ_{j=1}^{k} c_{σ_i(j),j} for all i. Hence, k Σ_{j=1}^{k} c_{σ_1(j),j} ≥ Σ_{i=1}^{k} Σ_{j=1}^{k} c_{σ_i(j),j} = Σ_{i=1}^{k} Σ_{j=1}^{k} c_{i,j} = n, QED. Note that this also implies that CA is bounded from below by 1/k and that NCA is bounded from below by 0. These are actually minima, as they are attained at U^{k×k}_{s_1,…,s_k}; see Table 1.
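The bound, and the fact that it is attained at the uniform matrix, can be spot-checked numerically. A small sketch (our code, assuming SciPy's assignment routine computes the maximal permutation):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(42)

def accuracy(C):
    """Optimal set-matching accuracy A(C)."""
    r, c = linear_sum_assignment(C, maximize=True)
    return C[r, c].sum() / C.sum()

# Spot-check the bound A(C) >= 1/k on random positive integer matrices:
k = 4
assert all(accuracy(rng.integers(0, 100, (k, k)) + 1) >= 1 / k
           for _ in range(1000))

# The uniform assignment attains the minimum 1/k:
print(accuracy(np.full((k, k), 25)))   # -> 0.25
```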

Definition 2 (Property [SYM]) For all C ∈ C^{k×k}, we have I(C) = I(C^T).

Example 5 Consider the following two matrices with the same corresponding row sums:

Fig. 3 (Example 13) Expected values (based on 1,000 Monte Carlo samples) of the indices as functions of k for n = 100k², reference partitions of equal cardinalities, and predicted labels assigned at random (the hypergeometric model). Note the logarithmic scale on the y-axis. AMI, AR, and AFM are adjusted for chance (have expectation 0) and hence were omitted. Recall that, under c_{1,•} = ⋯ = c_{k,•}, NCFM = NFM′, NCR′ = NR′, NCMI = NMI, NCA = NA, and CA = A. For the indices fulfilling [U0] (i.e., those in the right subfigure), we additionally observe that, for all k, their expected values decrease towards 0.

Fig. 4 Behaviour of selected indices when moving from all points in a single cluster, through the perfect match, to the uniform assignment, one point at a time (k = 2, clusters of equal sizes; hence, NA = NCA, NR′ = NCR′, etc.). Only N(C)A yields a predictable, linear response. Note that N(C)R′ and N(C)FM′ are very similar: their curves almost overlap.

Fig. 6 Scatter plot matrix for selected normalised indices (we noted that some index groups are very highly correlated: AFM, NFM′, AR, and NR′; NCFM and NCR′; AMI and NMI). In the lower right and top left corners, Pearson's r and Spearman's ρ correlation coefficients are reported, respectively. Circa 53% of the true partitions in our data sample consist of clusters of almost equal sizes; hence, the correction for [SC] (e.g., transforming NA to obtain NCA) has little impact in these cases.

The generalised index is computed under the assumption that c_{i,j} = 0 for j > k′. If k′ > k, some predicted clusters are not matched with true ones; hence, they are not counted, which decreases the computed similarity. If k′ < k, we treat the missing k − k′ predicted clusters as empty: the number of points matched there is zero. Overall, this version of the index penalises a clustering algorithm that outputs a grouping of cardinality k′ different from the desired k. Yet, we might also be interested in the case where k′ = k is not what we actually ask all the benchmarked algorithms for. In such a case, we can consider:

NCA′(C) = max over onto maps σ: {1, …, k′} → {1, …, k} of (1/(k − 1)) (Σ_{j=1}^{k′} c_{σ(j),j}/c_{σ(j),•} − 1) for k′ ≥ k, and
NCA′(C) = max over onto maps σ: {1, …, k} → {1, …, k′} of (1/(k − 1)) (Σ_{i=1}^{k} c_{i,σ(i)}/c_{i,•} − 1) otherwise.

… Σ_{u≠v} c_{u,j} c_{v,j}. This also easily implies that NR′(C) = NFM′(C) = 1 if and only if C = P_σ S for some σ ∈ S_k and S = diag(s_1, …, s_k). The same holds for NCR′ and NCFM. Hence, they meet property [B1]. To show that NCFM and NCR′ are nonnegative, it suffices to prove the claim under c_{1,•} = ⋯ = c_{k,•} = 1, i.e., for matrices adjusted for cluster sizes. As n = k, the inequality can be rewritten accordingly.

Table 1
Values of the indices for the confusion matrices given by Eqs. 24 (uniform assignment; see property [U0]) and 25 (all points assigned to a single cluster; see [O0]), where S_2 = Σ_{i=1}^{k} s_i². The R, FM, AR, AFM, and AMI indices assume that the input matrices have integer entries.

Table 2
Properties of the indices

Table 3
Rankings of the 10 clustering algorithms based on median values (in round brackets) of different external cluster validity measures across the 64 benchmark datasets