On k-means iterations and Gaussian clusters



Introduction
Cluster analysis is a key tool in data science, as it can be used to reveal the class structure of a data set without requiring labelled samples. Hence, it has been applied to many different problems in areas such as cybersecurity, streaming, Internet-of-Things, anomaly detection, and others (see for instance [3][4][5][6], and references therein).
Clustering algorithms can be divided into different categories, such as partitional, hierarchical, and density-based. The latter refers to algorithms following the idea that a cluster is a high-density region of data points (i.e. objects or entities), and that clusters are separated by contiguous regions of low density. Hierarchical algorithms produce a clustering, usually represented as a tree-like partition of the given data points, as well as information regarding the relationships among clusters. Such tree-like relationships can be visualised using a dendrogram. Further information regarding hierarchical and density-based algorithms can be found in many sources (see for instance [7][8][9], and references therein). Here, we focus on partitional clustering and, more specifically, on the k-means algorithm [10,11], which is arguably the most popular partitional clustering algorithm [1,2].
Given a data set X containing n data points x_i ∈ R^m, k-means produces a partition S = {S_1, S_2, …, S_k} of X, such that each cluster S_l ∈ S is non-empty, and each x_i ∈ X is included in exactly one subset of S. In any clustering algorithm, the general aim is that data points in the same cluster (i.e. a subset of S) should be similar, and data points in different clusters should be dissimilar. In the case of k-means, the (dis)similarity between data points is measured using the squared Euclidean distance. That is, given two data points x_i, x_j ∈ X, the distance between them is calculated as follows:

d(x_i, x_j) = ∑_{v=1}^{m} (x_iv − x_jv)²,  (1)

where x_iv and x_jv are, respectively, the values of x_i and x_j corresponding to feature (i.e. variable) v, and m is the number of features. An optimal k-means clustering is that which minimises the value of the following objective function:

W(S, Z) = ∑_{l=1}^{k} ∑_{x_i ∈ S_l} d(x_i, z_l),  (2)

where z_l is the centroid of cluster S_l ∈ S, computed as the component-wise mean over all x_i ∈ S_l. In order to minimise (2), k-means follows three simple steps:

1. Select k data points in X at random, and copy their values to the k initial centroids z_1, z_2, …, z_k.
2. Assign each x_i ∈ X to the cluster represented by its nearest centroid.
3. Update each centroid z_l to the component-wise mean over all data points in S_l. If any of the centroids changed its value, go to Step 2.
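These three steps can be sketched as follows (our own illustration, not the implementation used in the paper; the function name and the toy data are ours). The sketch also returns the number of iterations τ, the quantity studied throughout this paper:

```python
import numpy as np

def kmeans(X, k, rng, max_iter=1000):
    """Lloyd's k-means: returns labels, centroids and the number of
    iterations (tau) the algorithm took to converge."""
    # Step 1: copy k random data points into the initial centroids.
    Z = X[rng.choice(len(X), size=k, replace=False)].copy()
    for tau in range(1, max_iter + 1):
        # Step 2: assign each x_i to its nearest centroid
        # (squared Euclidean distance).
        dist = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        # Step 3: update each centroid to the component-wise mean of the
        # points assigned to it (an empty cluster keeps its old centroid).
        Z_new = np.array([X[labels == l].mean(axis=0)
                          if np.any(labels == l) else Z[l] for l in range(k)])
        if np.allclose(Z_new, Z):   # no centroid changed: converged
            return labels, Z, tau
        Z = Z_new
    return labels, Z, max_iter

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.2, (50, 2)),    # tight cluster near 0
               rng.normal(10.0, 0.2, (50, 2))])  # tight cluster near 10
labels, Z, tau = kmeans(X, 2, rng)
```

Note that the loop counter doubles as τ: one iteration is one assignment pass plus one centroid update.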
Let (ω_t), t = 1, …, τ, be a sequence of values such that ω_t holds the value of (2) at iteration t. We have that ω_t ⩾ 0 (i.e. this sequence has a lower bound) and ω_{t+1} < ω_t (i.e. it decreases monotonically). Also, since X is finite, there exists a finite number of different partitions of X. Hence, this sequence (and k-means) not only converges, but it does so in a finite number of iterations. This number of iterations (τ) has, to our knowledge, always been discarded in the data analysis.
The time complexity of k-means is O(knmτ), where k is the number of clusters, n is the number of data points, m is the number of features, and τ is the number of iterations in the internal loop of the algorithm. Often the maximum number of iterations in the internal loop of k-means is limited by a constant (e.g. τ = 100). However, if the number of iterations τ is not limited, and depends on the speed of convergence of the objective function (2) to a certain local or global minimum, then the known upper bound on the running time of k-means is O(n^{km}) (i.e. it is in general exponential in the number of data points when km = Ω(n/log n)) [12], whereas the improved lower bound determined by Vattani is 2^{Ω(n)} [13]. These results suggest that in the worst case k-means requires exponentially many iterations even in the plane. However, in practice, the number of iterations k-means takes to converge tends to be roughly linear in the number of data points.
The main contribution of this paper is to show that the number of k-means iterations, τ, is in fact very informative. In our experiments, we use data sets containing Gaussian clusters under different parameter configurations, as well as data sets containing uniformly random values.
In our four sets of experiments, we show that: (i) with certain data set configurations, τ has a negative correlation with the quality of the recovered clustering (see Section 3); (ii) the lower the covariance within Gaussian clusters, the lower the average value of τ in small data sets, and the opposite holds in larger data sets (see Section 4); (iii) τ can help identify noise features (here, an irrelevant feature composed of uniformly random values) in small data sets containing Gaussian clusters (see Section 5); (iv) there is a relationship between τ and the number of clusters in data sets containing Gaussian clusters (see Section 6).

Data sets
In this paper, we perform a number of experiments using k-means (for details, see Section 1) and some other clustering and feature selection algorithms (see Sections 5.1 and 5.2). Generally speaking, we consider three different types of data sets, which we present below.
We began by creating 18 basic data configurations, containing 50 synthetic data sets each. We follow the standard n x m - k naming scheme. For instance, any data set in the configuration 1000x20-3 contains 1,000 data points, each described over 20 features, with the data points distributed over three clusters. There are three versions of each of these data configurations, leading to a total of 3 × 18 = 54. Each version contains spherical Gaussian clusters with diagonal covariance matrices of 0.5, 1.0, and 1.5, respectively. A within-cluster covariance of 0.5 leads to tighter clusters, while a within-cluster covariance of 1.5 leads to clusters of higher spread (and by consequence higher overlap), see Fig. 1. In all cases, we generated each centroid component independently from a Gaussian distribution N(0, 1), and the data points were uniformly distributed over the clusters. In total, we created 54 × 50 = 2,700 data sets.
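A generator of this kind of data might be sketched as follows (our reading of the protocol above, not the authors' exact generator; the function names are ours, and we interpret a within-cluster covariance of c as a diagonal covariance matrix c·I, i.e. a per-feature standard deviation of √c):

```python
import numpy as np

def make_gaussian_dataset(n, m, k, cov=0.5, rng=None):
    """One data set under the n x m - k standard: k spherical Gaussian
    clusters, centroid components drawn from N(0, 1), data points spread
    uniformly over the clusters, within-cluster covariance `cov`."""
    rng = rng or np.random.default_rng()
    centroids = rng.normal(0.0, 1.0, size=(k, m))
    labels = rng.integers(0, k, size=n)          # uniform over clusters
    X = centroids[labels] + rng.normal(0.0, np.sqrt(cov), size=(n, m))
    return X, labels

def add_noise_feature(X, rng=None):
    """Append one noise feature, i.e. uniformly random values ("+1 NF")."""
    rng = rng or np.random.default_rng()
    lo, hi = X.min(), X.max()
    return np.hstack([X, rng.uniform(lo, hi, size=(len(X), 1))])

rng = np.random.default_rng(1)
X, y = make_gaussian_dataset(1000, 20, 3, cov=0.5, rng=rng)
Xn = add_noise_feature(X, rng=rng)               # the "+1 NF" version
```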
For comparison, we also generated data sets under the n x m - k standard containing solely noise features. Here, a noise feature is a feature containing solely uniformly random values. Given these have no within-cluster covariance, there are 18 such configurations with 50 data sets each. The results of our experiments for these data sets appear under "Random data" in Table 2 (more details are given in Section 4). For our experiments with feature selection (see Section 5), we considered the 2,700 data sets containing Gaussian clusters, adding a noise feature to each of them (hence the "+1 NF" in their names). This way, if we request a feature selection algorithm to identify a single irrelevant (noise) feature, we know which one should be selected. Fig. 1 shows some informative examples of the impact of a single noise feature on data structures.

Fig. 1. Examples of data sets under the 1000x10-5 configuration (1,000 data points, 10 features, and 5 clusters) with different within-cluster covariances, and their versions with one added noise feature. We plot the data sets over their first and second principal components.

Iterations and cluster recovery
In this section, we explore the relationship between the number of iterations k-means takes to converge (τ) and the quality of the clusters recovered. Here, we measure clustering quality using the popular Adjusted Rand Index (ARI) [14]. This is a corrected-for-chance version of the Rand Index [15]. An ARI close to zero implies the clustering is (nearly) as poor as a uniformly random partition. Our choice of evaluation measure is supported by the literature, which shows ARI to be superior to other measures when evaluating clustering quality [16]. In this set of experiments, we carried out k-means 100 times on each of the 50 data sets considered for each of the parameter configurations we experimented with. For each k-means run, we calculated the ARI of the produced clustering, and saved the corresponding number of iterations τ. That is, for each data set we had 100 values of ARI and 100 values of τ. With these, we were able to calculate the correlation between the values of ARI and τ for each single data set.
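The experiment can be sketched as follows (a simplified reconstruction with smaller data and fewer restarts than the paper uses; the helper names are ours, and the ARI implementation follows the standard contingency-table formula):

```python
import numpy as np

def kmeans(X, k, rng, max_iter=1000):
    """Lloyd's k-means; returns the labels and the iteration count tau."""
    Z = X[rng.choice(len(X), size=k, replace=False)].copy()
    for tau in range(1, max_iter + 1):
        labels = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
        Z_new = np.array([X[labels == l].mean(axis=0)
                          if np.any(labels == l) else Z[l] for l in range(k)])
        if np.allclose(Z_new, Z):
            return labels, tau
        Z = Z_new
    return labels, max_iter

def adjusted_rand_index(a, b):
    """Standard ARI computed from the contingency table of two labelings."""
    comb2 = lambda x: x * (x - 1) / 2.0
    ai = np.unique(a, return_inverse=True)[1]
    bi = np.unique(b, return_inverse=True)[1]
    C = np.zeros((ai.max() + 1, bi.max() + 1))
    np.add.at(C, (ai, bi), 1)                    # contingency counts
    sum_ij = comb2(C).sum()
    sum_a, sum_b = comb2(C.sum(1)).sum(), comb2(C.sum(0)).sum()
    expected = sum_a * sum_b / comb2(len(a))
    return (sum_ij - expected) / ((sum_a + sum_b) / 2.0 - expected)

rng = np.random.default_rng(3)
k, m = 3, 5
truth = rng.integers(0, k, size=300)
X = rng.normal(0, 1, (k, m))[truth] + rng.normal(0, 1.0, (300, m))

aris, taus = [], []
for _ in range(30):                  # 30 random starts (100 in the paper)
    labels, tau = kmeans(X, k, rng)
    aris.append(adjusted_rand_index(truth, labels))
    taus.append(tau)

# Pearson correlation between ARI and tau over the restarts
corr = (float(np.corrcoef(aris, taus)[0, 1])
        if np.std(aris) > 0 and np.std(taus) > 0 else 0.0)
```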
Table 1 reports the average correlations over the 50 data sets for each configuration considered, and the related standard deviations. Our experiments show that in some cases there is a considerable negative correlation. For instance, in the case of 100000x100-3 on data sets whose within-cluster covariance is 0.5, we have the most noticeable average correlation of −0.93. One can observe a general pattern that the negative correlation is stronger on larger data sets (i.e. those with 10,000 data points or more). In these data sets the weakest negative correlation is −0.55, which is still rather meaningful. The correlations in smaller data sets (i.e. those with 1,000 data points) are not always as indicative, but we can see a clear pattern. The negative correlation in smaller data sets is stronger for the configurations with a higher number of features, a lower number of clusters, and a lower within-cluster covariance. This result follows intuition, as such data sets would have a small number of well-formed tight clusters, leading to a higher chance of well-placed initial centroids in k-means and by consequence a lower τ. Table 1 also shows that adding a single noise feature to a data set can have drastic consequences for the correlation we analyse. For instance, 1000x10-5 data sets present a correlation of −0.5 for the configuration with clusters whose within-cluster covariance equals 0.5. However, if one adds a single irrelevant feature to such data sets (i.e. configuration 1000x10-5+1NF), then the corresponding correlation drops to 0.07. The same can be observed in other data sets, such as 20000x30-10 and its added-noise version 20000x30-10+1NF.

Fig. 2 shows the distribution of ARIs with respect to τ (or, if the reader prefers, that of τ with respect to ARI) for some of the data sets we experiment with. To generate these results, we ran k-means 100 times on each of the 50 data sets for a given configuration. We recorded all values of ARI in relation to the ground truth, and the related values of τ. Each point in Fig. 2 represents a pair (ARI, τ), and there are 100 × 50 = 5,000 points in each subfigure. Figs. 2(a)-(c) present the results for 1000x20-3 with within-cluster covariances of 0.5, 1.0, and 1.5, respectively. These results show the effect that an increased within-cluster covariance has on the correlation between ARI and τ. These figures also illustrate that the pairs (ARI, τ) form two well-separated clusters, one close to a perfect ARI of 1, and another with ARI close to 0.45. This supports our view that the strong negative correlation we observed (see Table 1) is indeed related to how good the initial centroids of k-means are in these small data sets. With good initial centroids the algorithm converges quickly, and because of the nature of the data, this convergence tends to a good clustering solution. If the initial centroids are not good (e.g. all three initial centroids belong to the same ground-truth cluster), then τ is higher and the convergence tends to a poorer clustering solution.
Figs. 2(d)-(f) illustrate the correlation results for the 1000x20-3+1NF configuration (i.e. the same data sets with an added noise feature). For these noisy data, the correlation is indeed poorer compared to our previous results. Fig. 2 also shows our results for the 1000x10-5 data configuration for a visual comparison.
Our main conclusion in this section is that there are data sets for which the number of iterations, τ, negatively correlates with ARI. This property concerns particularly two classes of data sets: (i) those that are large (i.e. with at least 10,000 data points); (ii) smaller data sets (i.e. those with 1,000 data points) with a higher number of features, a lower number of clusters, and a lower within-cluster covariance. However, this negative correlation can be lost if the clusters have a higher spread (higher within-cluster covariance) or if some noise is added to the data set, even if this takes the form of a single noise feature.

Iterations and within-cluster covariance
The k-means algorithm is popular, but it is not without weaknesses. One such weakness is that k-means will produce a clustering for any data set, even if the data set itself has no cluster structure. This issue raises two questions that we aim to answer in this section: (i) if we were to apply k-means to two data sets of equal size (with n objects and m features), one containing Gaussian clusters and the other composed solely of uniformly random values, would the τ values be approximately the same?; (ii) for data sets containing Gaussian clusters, will the within-cluster variance have an impact on τ?

To answer both questions, we ran k-means 100 times on each of the 50 data sets we generated for each considered parameter configuration. For each run, we saved the number of iterations k-means took to converge (τ), and calculated the average of these values (τ̄) as well as their standard deviation. Table 2 reports the results of this set of experiments. Here, we can observe some interesting patterns. For instance, τ̄ is much higher for data sets containing uniformly random values than for data sets containing Gaussian clusters. This suggests that k-means takes, on average, more iterations to converge on data sets that lack structure. The presence of areas of low density is not a sufficient condition to indicate cluster structure, but it is a necessary one. Hence, a data set containing a cluster structure will have areas of high density and areas of low density. With more areas of low density, we have a higher probability of a centroid moving to (or starting at) an area with fewer neighbours. Hence, the number of centroid updates is likely to be lower than if the data set were to contain solely uniformly random values. When we analyse the results for data sets containing Gaussian clusters (see Table 2), we can observe another interesting pattern. In small data sets (those with 1,000 data points), the lower the within-cluster variance, the lower the value of τ̄ (and the related standard deviation, in the majority of cases). This seems well aligned with intuition, as the clusters are tighter. However, in larger data sets (those with at least 10,000 data points) the pattern is the opposite. That is, the higher the within-cluster variance, the lower the value of τ̄. It is tempting to think this happens because a within-cluster variance of 1.5 leads k-means to converge quickly but to wrong clusterings in large data sets. This cannot be true, as Table 1 shows that large data sets have a stronger inverse correlation between τ and cluster quality. Hence, a significant proportion of these low-τ convergences are to correct clusterings. With these results, we can now move to a couple of interesting applications.
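The comparison behind question (i) can be sketched as follows (our illustration, with a smaller data set and fewer restarts than in the paper; the helper names are ours):

```python
import numpy as np

def kmeans_tau(X, k, rng, max_iter=1000):
    """Run k-means once; return only the number of iterations tau."""
    Z = X[rng.choice(len(X), size=k, replace=False)].copy()
    for tau in range(1, max_iter + 1):
        labels = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
        Z_new = np.array([X[labels == l].mean(axis=0)
                          if np.any(labels == l) else Z[l] for l in range(k)])
        if np.allclose(Z_new, Z):
            return tau
        Z = Z_new
    return max_iter

def average_tau(X, k, rng, restarts=30):
    return float(np.mean([kmeans_tau(X, k, rng) for _ in range(restarts)]))

rng = np.random.default_rng(4)
k, n, m = 3, 500, 5
# Gaussian clusters (within-cluster covariance 0.5)...
truth = rng.integers(0, k, size=n)
X_gauss = rng.normal(0, 1, (k, m))[truth] + rng.normal(0, np.sqrt(0.5), (n, m))
# ...versus a data set of the same size containing solely uniform noise
X_rand = rng.uniform(X_gauss.min(), X_gauss.max(), (n, m))

tau_gauss = average_tau(X_gauss, k, rng)
tau_rand = average_tau(X_rand, k, rng)
```

On such a run, one would expect tau_rand to exceed tau_gauss on average, in line with Table 2, although on a single small sample this is not guaranteed.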

Iterations and feature selection
Feature selection is a major area of research in data science. A feature is one of the m components shared by all data points x_i ∈ X. The general idea behind any feature selection algorithm is to identify whether a given feature is relevant, and to remove it from all x_i ∈ X should this not be the case. The literature on feature selection is rather vast, but tends to focus on supervised methods (see for instance [17,18], and references therein). Here, we are particularly interested in unsupervised feature selection, that is, algorithms capable of assigning a degree of relevance to each feature without relying on labelled samples. In this section, we deal solely with data sets containing Gaussian clusters (to ensure the presence of relevant features). Our two main objectives are to show that there are data set configurations in which: (i) τ can be used on its own to identify a noise feature, producing competitive results; (ii) it is possible to use the information present in τ to improve the results of feature selection algorithms. We tackle these issues by experimenting on different data set configurations to which we added a single noise feature (composed of uniformly random values). We then ask each algorithm considered to identify a single irrelevant feature, and calculate the proportion of times it identified the added noise feature as irrelevant.

Background on feature selection
In this section, we recall the main properties of a few feature selection algorithms, including those we believe to be the most popular. These algorithms allow the user to specify how many features should be selected for data analysis.
Feature selection using feature similarity (FSFS) [19] is, arguably, the most popular unsupervised feature selection algorithm. It aims at identifying a set of maximally independent features by calculating pairwise feature similarities using the maximum information compression index and applying k-nearest neighbours (k-NN) [20]. Given two features v_1 and v_2, this index is defined as follows:

λ_2(v_1, v_2) = ½ [ (σ²_v1 + σ²_v2) − √( (σ²_v1 + σ²_v2)² − 4 σ²_v1 σ²_v2 (1 − ρ(v_1, v_2)²) ) ],  (3)

where ρ(v_1, v_2) is the Pearson correlation coefficient between v_1 and v_2, and σ²_vj represents the variance of a feature v_j, with 1 ⩽ j ⩽ m. The value of λ_2 is inversely proportional to the dependency between v_1 and v_2, with the greatest lower bound of zero. Another interesting point, which is rather useful to us, is that this algorithm takes the number of features to be removed as a parameter.
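In our reading, λ_2 is the smaller eigenvalue of the 2×2 covariance matrix of the feature pair, which gives a compact implementation (a sketch; the function name and the toy features are ours):

```python
import numpy as np

def max_info_compression(v1, v2):
    """Maximum information compression index lambda_2 of the feature
    pair (v1, v2): the smaller eigenvalue of their covariance matrix.
    It is zero when the two features are linearly dependent."""
    cov = np.cov(np.vstack([v1, v2]))
    return float(np.linalg.eigvalsh(cov)[0])  # eigvalsh sorts ascending

rng = np.random.default_rng(5)
a = rng.normal(size=500)
b = 2.0 * a + 1.0            # linearly dependent on a: lambda_2 = 0
c = rng.normal(size=500)     # independent of a: lambda_2 is large
```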

Table 2
The average number of iterations k-means takes to converge (τ̄) and the related standard deviations (sd). There are 50 data sets for each parameter configuration, and k-means was carried out 100 times per data set (i.e. 100 random starts per data set were performed). The "Random data" columns report the results obtained for data sets containing solely uniformly random features.
In outline, FSFS proceeds iteratively: for each feature v in the current feature set V, it computes the dissimilarity between v and its kth nearest neighbour in V using (3); the feature with the most similar neighbourhood is retained and its k nearest neighbours are discarded from V. The procedure then returns to the computation step until no features remain to be examined, at which point the set of selected features, V, is returned.
The intelligent Minkowski weighted k-means (IMWK) [21,3] calculates, independently, the degree of relevance of each feature at each cluster (w_lv). This algorithm applies a weighted version of the Minkowski distance:

d_p(x_i, z_l) = ∑_{v=1}^{m} w_lv^p |x_iv − z_lv|^p,  (4)

and follows the intuitive idea that a given feature may have different degrees of relevance at different clusters. This is modelled using the within-cluster dispersion of each feature, given by D_lv = ∑_{x_i ∈ S_l} |x_iv − z_lv|^p, where z_lv is the vth component of the centroid of cluster S_l. After calculating all dispersions, the degree of relevance (weight) of each feature at each cluster is given by:

w_lv = [ ∑_{u=1}^{m} (D_lv / D_lu)^{1/(p−1)} ]^{−1}.  (5)

As per the above, the lower the within-cluster dispersion of a feature, the higher its weight [22]. Hence, uniformly distributed features receive a lower weight than those concentrated around their centroids. We can then set the feature with the lowest average weight over all clusters as irrelevant.

Fig. 2. Correlation between ARI and the number of k-means iterations under the 1000x10-5 and 1000x20-3 parameter configurations, with different within-cluster covariances, and their versions with one added irrelevant feature. The data were plotted over their first and second principal components.
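The dispersions D_lv and the weight formula (5) can be sketched as follows, assuming a partition and its centroids are already available (the helper name and the toy data are ours):

```python
import numpy as np

def imwk_weights(X, labels, Z, p=2.0):
    """Within-cluster dispersions D_lv and feature weights w_lv."""
    k, m = Z.shape
    D = np.zeros((k, m))
    for l in range(k):
        # D_lv = sum over x_i in S_l of |x_iv - z_lv|^p
        D[l] = (np.abs(X[labels == l] - Z[l]) ** p).sum(axis=0)
    D += 1e-12                               # guard against division by zero
    W = np.zeros((k, m))
    for l in range(k):
        for v in range(m):
            # w_lv = [ sum_u (D_lv / D_lu)^(1/(p-1)) ]^(-1)
            W[l, v] = 1.0 / ((D[l, v] / D[l]) ** (1.0 / (p - 1.0))).sum()
    return D, W

rng = np.random.default_rng(6)
labels = np.repeat([0, 1], 100)
# feature 0 is informative (tight around each centroid);
# feature 1 is uniform noise over the whole range
X = np.column_stack([np.where(labels == 0,
                              rng.normal(0.0, 0.2, 200),
                              rng.normal(5.0, 0.2, 200)),
                     rng.uniform(0.0, 5.0, 200)])
Z = np.array([X[labels == l].mean(axis=0) for l in (0, 1)])
D, W = imwk_weights(X, labels, Z)
```

For p = 2, the weights of each cluster sum to one, and the noisy feature's large dispersion earns it the lower weight, as described above.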
The Minkowski centre of a feature v at a cluster S_l with an exponent p is the value μ that minimises ∑_{x_i ∈ S_l} |x_iv − μ|^p. This algorithm has a parameter, p, used as the Minkowski and weight exponent. In our experiments, we set p equal to two, leading to the squared Euclidean distance. This parameter could be optimised (see [23]), but given our objectives we see no need to do so.

Multi-cluster feature selection (MCFS) [24] is an interesting and popular algorithm that, unlike others, takes into consideration possible correlations between features. It does so by making use of developments in spectral analysis (in particular, manifold learning) and L1-regularised models for subset selection (for details, see [25][26][27][28]), preserving the multi-cluster structure of the data set. MCFS requires three parameters, one of which is the number of features to be selected. The other two parameters are the number of eigenfunctions used, and the number of neighbours for a k-NN graph. The original MCFS authors suggest default values of five for both of these parameters, but unfortunately this setting produced poor results. With this in mind, we decided to search for the best values for these two parameters (between one and five) for each run of our experiments using the existing labels (the best pair of parameters is that which leads to the best cluster recovery when removing one feature). This clearly biases our MCFS experiments, but does not obstruct our objectives.
Further details regarding the steps of MCFS can be found in the original paper [24]. The above algorithms provide a good baseline for the analysis of our experiments.

Feature selection using k-means iterations
In order to meet our objectives for this section (stated under Section 5), we introduce two unsupervised feature selection methods. The first of them is based solely on the average number of iterations k-means takes to converge (that is, τ̄). We assume that the data set X has a structure containing Gaussian clusters, which is being concealed by a noise feature. Moreover, we know that the average τ is lower on data sets with a cluster structure than on data sets containing solely noise features (see Section 4). Hence, the feature identified as irrelevant in X is the one whose removal leads to the lowest average number of iterations.

Feature selection via k-means iterations (FSKI)
1. For each v = 1, 2, …, m:
(a) Set X′ ← X, and remove from each x_i ∈ X′ the feature v.
(b) Run k-means on X′ 100 times, saving the average number of iterations, τ̄_v, it takes to converge.
2. Return the feature v′ corresponding to the lowest value of τ̄_v′.
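FSKI can be sketched as follows (our illustration; the compact k-means helper returns only the iteration count, and fewer restarts are used than the 100 above):

```python
import numpy as np

def kmeans_tau(X, k, rng, max_iter=1000):
    """Run k-means once; return only the number of iterations tau."""
    Z = X[rng.choice(len(X), size=k, replace=False)].copy()
    for tau in range(1, max_iter + 1):
        labels = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
        Z_new = np.array([X[labels == l].mean(axis=0)
                          if np.any(labels == l) else Z[l] for l in range(k)])
        if np.allclose(Z_new, Z):
            return tau
        Z = Z_new
    return max_iter

def fski(X, k, rng, restarts=50):
    """Return the index of the feature whose removal minimises the
    average number of k-means iterations (the irrelevant feature)."""
    avg_tau = []
    for v in range(X.shape[1]):
        Xv = np.delete(X, v, axis=1)                  # Step 1(a)
        avg_tau.append(np.mean([kmeans_tau(Xv, k, rng)  # Step 1(b)
                                for _ in range(restarts)]))
    return int(np.argmin(avg_tau)), avg_tau           # Step 2

rng = np.random.default_rng(7)
# feature 0: two tight, well-separated Gaussian clusters; feature 1: noise
labels = np.repeat([0, 1], 100)
X = np.column_stack([np.where(labels == 0,
                              rng.normal(0.0, 0.2, 200),
                              rng.normal(10.0, 0.2, 200)),
                     rng.uniform(0.0, 10.0, 200)])
irrelevant, avg_tau = fski(X, 2, rng)
```

Removing the noise feature leaves a clean cluster structure (low τ̄_v), whereas removing the informative feature leaves pure noise (high τ̄_v), so the noise feature is the one returned.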
The second algorithm we introduce aims at showing that it is possible to use τ to increase the feature selection capabilities of existing algorithms.In our example, τ is used to re-scale a data set before applying a feature selection algorithm.
IMWK with iteration-based rescaling (IMWKR)
1. For each v = 1, 2, …, m:
(a) Set X′ ← X, and remove from each x_i ∈ X′ the feature v.
(b) Run k-means on X′ 100 times, and set τ̄_v to be the average number of iterations k-means takes to converge.
2. For each v = 1, 2, …, m, set r_v to be inversely proportional to τ̄_v.
3. Rescale the values r_v to a common range, that is, normalise each r_v.
4. For each v = 1, 2, …, m, multiply the vth component of every x_i ∈ X by r_v.
5. Apply IMWK to X.
In the above, Step 2 ensures that if feature v leads to a low average number of k-means iterations (τ̄_v), its r_v will be high. This increases the re-scaled values of feature v in X (Step 4) and, by consequence, the dispersion of v. Hence, IMWK (Step 5) is more likely to give a lower weight to v. Thus, the probability of v being chosen as irrelevant increases.
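The rescaling in Steps 2 to 4 can be sketched as follows, taking the Step 1 averages as given; the concrete τ̄ values and the particular choice r_v = 1/τ̄_v are our own illustration (Step 2 only requires that a low τ̄_v yield a high r_v):

```python
import numpy as np

# Hypothetical Step 1 output: average iteration counts per removed feature.
# A low avg tau means removing v restores structure, i.e. v looks like noise.
avg_tau = np.array([30.2, 12.4, 29.1])      # feature 1 behaves like noise

r = 1.0 / avg_tau                           # Step 2: low tau -> high r_v
r = r / r.sum()                             # Step 3: normalise each r_v

X = np.ones((5, 3))                         # toy data set with 3 features
X_rescaled = X * r                          # Step 4: rescale each feature
# Step 5 would now apply IMWK to X_rescaled; the noise feature's inflated
# relative scale increases its dispersion and lowers its IMWK weight.
```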

Experiments and discussion
In this section, we present the results of our experiments applying the algorithms described in Sections 5.1 and 5.2 to data sets containing Gaussian clusters to which we added one noise feature (an irrelevant feature composed of uniformly random values). Table 3 and Fig. 3 report the proportion of times each competing algorithm correctly identified the added noise feature as being the irrelevant one. In our experiments, we ran the non-deterministic algorithms (that is, FSKI and IMWKR) 100 times on each data set. We present only the results for small data sets (i.e. those with 1,000 data points) because the others presented mixed results without a clear pattern one way or the other.
The first thing we can observe by analysing the results presented in Table 3 and Fig. 3 is that FSKI is rather competitive on small data sets when compared to the popular feature selection methods. On the small data sets containing Gaussian clusters with 0.5 within-cluster covariance (i.e. tight clusters), FSKI outperformed FSFS for eight of the nine considered data configurations. On the same data sets, FSKI performed at least as well as MCFS for seven of the nine parameter configurations. This is an impressive performance, particularly given that FSKI is a simple method that uses nothing but the number of iterations k-means takes to converge (τ). We can also see that IMWKR (a version of IMWK that uses τ to improve its performance) is the best overall performer, with a noticeably low standard deviation.
Regarding the experiments on small data sets containing Gaussian clusters with a within-cluster covariance of 1.0, the results related to τ (FSKI and IMWKR) are still promising. FSKI performed better than FSFS for six of the nine parameter configurations. This time FSFS performed better on data sets with a low number of features (five). FSKI performed at least as well as MCFS for seven of the nine parameter configurations. Overall, the best algorithm was still IMWKR; this becomes clear after observing the results obtained for data sets with at least 10 features (i.e. the best overall results were obtained for six out of nine configurations).
The small data sets with a within-cluster covariance of 1.5 contain clusters that are much more likely to overlap. Hence, it is fair to expect a decrease in the performance of τ-based methods. However, FSKI still outperformed FSFS for five and MCFS for six of the nine parameter configurations. In terms of overall performance, FSKI performed at least as well as all the other methods for five parameter configurations, and IMWKR for four of them. Generally speaking, the τ-based methods seem to perform particularly well on small data sets when the number of features is higher.

Iterations and the number of clusters
Identifying the number of clusters in a data set is one of the major problems faced in clustering. Some clustering algorithms, such as k-means, require this number to be known by the user beforehand. Others attempt to identify the number of clusters as part of the clustering process, or as a pre-clustering step (see for instance [29][30][31][32], and references therein). Identifying the number of clusters in a data set is a difficult task, and there is no standard method that works in all cases. This difficulty probably stems from the fact that there is no agreed definition of what a cluster is, and that a precise definition may depend on the context and the clustering aims (for a full discussion, see [33]).

Table 3
The proportion of times the correct noise feature has been identified by each algorithm (and standard deviation, when appropriate). There are 50 data sets for each of the nine parameter configurations below. A single uniformly random feature, which had to be identified by each algorithm, was added to each data set considered.
Here, we take the view that clusters are approximately spherical and compact, and can be approximated well by Gaussian distributions.This is not an arbitrary view but rather one based on the algorithm we are analysing here, i.e. k-means.In order to identify other types of clusters, one should probably apply a different clustering algorithm.
In this section, we aim at verifying whether there is a relationship between the average number of iterations k-means takes to converge (τ̄) and the number of clusters k in X. To meet our objective, we ran k-means 100 times on each of our data sets (there are 50 data sets for each parameter configuration), with and without supplying k-means with the correct number of clusters. In this study, each of our data sets has a correct k ∈ {3, 5, 10}, so we experiment with these numbers. That is, for a parameter configuration such as 1000x5-10 (correct k = 10), we carried out k-means supplying it with the following numbers of clusters: k ∈ {3, 5, 10}; in this example, k = 3 and k = 5 are the incorrect numbers of clusters.
Table 4 presents the results of our experiments. An interesting pattern can be observed here. Given any parameter configuration n x m - k (e.g. 1000x5-3, 1000x10-5, etc.), its correct number of clusters is usually that producing the lowest τ̄ in comparison to the data sets with the same n, m, and within-cluster covariance, but a different k. For example, the correct k for 1000x10-5 (that is, k = 5) has a lower τ̄ than those of the configurations 1000x10-3 and 1000x10-10 (under the same within-cluster covariance). Table 4 shows this is true for 41 out of 54 cases.
Let us analyse the above a bit further. In the case of small data sets (i.e. those with 1,000 data points), the pattern we state holds for 24 out of 27 cases. In the three cases where the pattern did not hold, the differences were rather small. For instance, in the column k = 10 for data sets with a within-cluster covariance of 1.0, we can see the pattern incorrectly suggests 1000x5-5 has 10 clusters (it has five), given that τ̄ = 26.01 is the lowest value for the rows 1000x5-3, 1000x5-5, and 1000x5-10. However, the correct configuration (see the row for 1000x5-10) has τ̄ = 27.09, which is just slightly higher. In the case of larger data sets (i.e. those with 10,000 data points or more), our pattern identified the correct number of clusters in 17 out of 27 cases, but this is mostly because of poor performance on data

Table 4
Average number of iterations (τ̄) per value of k supplied to k-means. Shaded cells indicate the correct k for a given parameter configuration. The experiments in this table were conducted on synthetic data sets containing Gaussian clusters with within-cluster covariances of 0.5, 1.0, and 1.5.
R. Cordeiro de Amorim and V. Makarenkov

sets containing clusters with a within-cluster variance of 0.5. If we were to ignore these, the pattern would hold true for 14 out of 18 cases. Taking all of the results into account, we can see there is a relationship between τ̄ and the correct number of clusters of X. We find this to be a particularly interesting result, because we calculate τ̄ using solely k-means and the given data set X.
Internal cluster validity indices, that is, indices requiring nothing external to X and claiming to be related to how good a clustering is (for an extensive review and comparison, see [34]), are often used to determine the number of clusters. However, the way these are applied is rather different. Given a data set X, one usually runs k-means on X with different values of k, and then applies one such index to the obtained clusterings in order to determine the best one (and by consequence the best value of k). When we want to use the number of k-means iterations τ as a criterion, the approach is different. To find the best number of clusters for X, one should generate data sets as close to X as possible but with different numbers of clusters; the most appropriate number of clusters for X is that which leads to the lowest value of τ̄.
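For reference, the simpler Table 4 protocol (supplying k-means with different candidate values of k and comparing the resulting τ̄) can be sketched as follows (our illustration; the helper names and the synthetic data are ours):

```python
import numpy as np

def kmeans_tau(X, k, rng, max_iter=1000):
    """Run k-means once; return only the number of iterations tau."""
    Z = X[rng.choice(len(X), size=k, replace=False)].copy()
    for tau in range(1, max_iter + 1):
        labels = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
        Z_new = np.array([X[labels == l].mean(axis=0)
                          if np.any(labels == l) else Z[l] for l in range(k)])
        if np.allclose(Z_new, Z):
            return tau
        Z = Z_new
    return max_iter

def tau_per_k(X, candidates, rng, restarts=30):
    """Average tau for each candidate number of clusters supplied to k-means."""
    return {k: float(np.mean([kmeans_tau(X, k, rng) for _ in range(restarts)]))
            for k in candidates}

rng = np.random.default_rng(8)
true_k, n, m = 3, 600, 10
truth = rng.integers(0, true_k, size=n)
# well-separated Gaussian clusters (centroids scaled up for clarity)
X = 3.0 * rng.normal(0, 1, (true_k, m))[truth] + rng.normal(0, np.sqrt(0.5), (n, m))

avg = tau_per_k(X, [2, 3, 5], rng)
best_k = min(avg, key=avg.get)      # candidate k with the lowest average tau
```

Following the pattern observed in Table 4, the correct k would usually, though not always, be the candidate producing the lowest τ̄.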

Conclusion
K-means remains arguably the most popular clustering algorithm used in scientific and industrial applications [35]. In this paper, we focused on τ̄, that is, the average number of iterations k-means takes to converge on a given data set X. This value is always produced when applying k-means but, to the best of our knowledge, it has never been used in the data analysis process. Here, we showed that τ is in fact a very informative parameter of k-means.
First, we discovered that in some cases there is a strong negative correlation between τ and the quality of the recovered clusters when running k-means multiple times on the same data set. This trend is particularly noticeable in large data sets and, in the case of smaller data sets, in those with a high number of features, a low number of clusters, and Gaussian clusters with a low within-cluster covariance. However, our experiments also showed that this correlation can be lost if the clusters have a higher spread (higher within-cluster covariance, particularly in small data sets) or if a single noise feature is added to them (see Section 3). We also found that τ is lower on X if the latter contains Gaussian clusters, as opposed to uniformly random values. In fact, we went even further and showed that on small data sets containing Gaussian clusters, the lower the within-cluster covariance (i.e. the tighter the clusters are), the lower τ is (see Section 4). Interestingly, we also showed that the pattern is the opposite for larger data sets. In any case, the structure of X (or the lack of it) has an impact on τ. Moreover, we also investigated two interesting applications of τ. First, τ can be used to help identify an irrelevant feature (e.g. a feature composed of uniformly random values) on small data sets containing Gaussian clusters. We showed this by experimenting with two new methods: (i) removing from X the feature that increases τ the most; (ii) using τ as the basis of a feature rescaling procedure in the data pre-processing stage, improving an existing feature selection algorithm. We compared these two methods with some popular unsupervised feature selection methods (see Section 5).
Second, we showed that there is a close relationship between τ and the number of clusters in data sets containing Gaussian clusters (see Section 6). We see our work in this paper as potentially leading to improvements in unsupervised feature selection, and in the identification of the number of clusters in data sets. In terms of future work, we plan to investigate whether τ could have other practical applications.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Fig. 3. Boxplot diagrams summarising our results for the proportion of times the correct noise feature was identified by each of the five competing algorithms. A single uniformly random noise feature, which had to be identified by each algorithm, was added to each data set considered.

Table 1
Correlation (ρ) between the ARI values (representing clustering quality) and the numbers of k-means iterations, calculated for each considered parameter configuration. The average values over 50 data sets per configuration are reported.
It is intuitive to think that the structure of a data set (or the lack of it) has an impact on τ. We first demonstrate that this is indeed the case, and later show how this property can be used in practice (see Sections 5 and 6).

Multi-cluster feature selection (MCFS)
1. Produce a k-nearest neighbours graph.
2. Solve a generalised eigen-problem and obtain the k top eigenvectors with respect to the smallest eigenvalues.
3. Solve k L1-regularised regression problems, obtaining k sparse coefficient vectors.
4. For each feature v = {1, 2, …, m}, compute its MCFS score.
5. Return the m′ features with the highest MCFS scores, where m′ is a user-defined parameter and m′ < m.