An Approximate Maximin-Directed Random Sampling for Clustering Applications



INTRODUCTION
Social networking giants like Facebook and Twitter boast billions of users, generating hundreds of gigabytes of content every minute. Retail establishments continuously amass extensive customer data, while platforms like YouTube, with over 1 billion unique users, churn out 100 hours of video content every hour. To illustrate the sheer magnitude, YouTube's content ID service scans an astounding 400 years' worth of video content each day [1,2]. Scientists and researchers refer to data at this scale as "Big Data". In the face of this deluge of data, robust tools for knowledge discovery become imperative. Data mining techniques have firmly established themselves as indispensable instruments for this purpose. Among these techniques, clustering stands out as a method whereby data is partitioned into groups, ensuring that objects within each group share more similarity with one another than with objects in other groups [1].
Suppose N objects are represented as feature vectors $X_N = \{x_1, \ldots, x_N\} \subset \Re^p$. Classic cluster analysis for this kind of static data is discussed in many texts and numerous articles [3][4][5][6][7][8][9][10][11]. If the number of samples precludes clustering the data directly, there are two popular ways to approach the problem. First, we may split the data into chunks, process the chunks independently, and aggregate the results [12,13].
A second popular approach is to sample the data, cluster the sample, and then extend the results to the rest of the data set non-iteratively by labeling the remaining points with the nearest prototype method [14]. The question addressed in this paper is: what method of sampling produces the "best" samples to use in this context? Certainly (true) Random Sampling (RS) is the best-known method. Progressive sampling using various termination criteria is advocated in [15][16][17]. The specification of the MMDRS algorithm requires a bit of notation.
Assume c is an integer such that 1<c<N. The set $M_{hcN} = \{U \in \Re^{c \times N} : u_{ik} \in \{0, 1\} \ \forall i, k; \ \sum_{i=1}^{c} u_{ik} = 1 \ \forall k; \ \sum_{k=1}^{N} u_{ik} > 0 \ \forall i\}$ contains all of the crisp c-partitions of N objects, represented as $c \times N$ matrices. Equivalently, each membership matrix U can be represented as $X_N = \cup_{i=1}^{c} X_i$; $X_i \cap X_j = \emptyset \ \forall i \neq j$, where $\{X_i\}$ are the crisp subsets comprising the c clusters. We write $U \leftrightarrow \{X_i\}$. The MMDRS partition of XN is $U_{MM} \in M_{hc'N}$, where c' is the desired number of samples to be selected by maximin (MM) sampling.
A third approach for sampling is based on a three-step process: (i) determination of c' Maximin (MM) prototypes $X_{MM} = \{x_{m_1}, \ldots, x_{m_{c'}}\} \subset X_N$; (ii) construction of the nearest prototype partition UMM of XN; and (iii) drawing a specified number of samples from each of the subsets in UMM. This third method is not true random sampling; rather, it is random sampling constrained by drawing samples from specified locations. Since this RS scheme is directed by the MM samples, we call it the Maximin-Directed Random Sampling (MMDRS) method, which was first discussed in [18]. Since then, this method or some derivative of it has been used frequently in the literature of cluster analysis for static data. One of the challenges with MMDRS is that it is computationally expensive. Therefore, to improve this aspect of MMDRS, we introduce a new approximate MMDRS (AMMDRS) sampling scheme. The goal of AMMDRS is to be faster and more applicable to big data applications.
This article makes the following contributions. First, we introduce the new AMMDRS scheme. Then we conduct numerical experiments to compare the quality of samples produced by the three sampling methods: RS, MMDRS, and AMMDRS. Ultimately, we demonstrate that our approach yields sample quality comparable to MMDRS at a lower computational cost. The remainder of the paper is organized as follows. Section 2 reviews the MM and MMDRS algorithms. Section 3 presents the new AMM scheme, building on the foundations of the original MMDRS method. Section 4 discusses what a "best" sample means in the context of cluster analysis. Section 5 describes the datasets used in the analysis and the metrics that gauge sample quality. Section 6 reports our findings, and Section 7 gives our conclusions.

THE MM AND MMDRS ALGORITHMS
The concept of MM sampling was initially introduced in [19], where it is characterized as a method for initializing a set of c prototypes, also known as cluster centers, for clustering purposes. Casey and Nagy [20] gave an overview of the MM algorithm for setting up initial prototypes, which we refer to as the MM principle.
[MM Principle]. The initial sample in the batch serves as our first cluster center. From there, we calculate the distances of the other samples from this initial center. The sample farthest away becomes our second center. For every other sample, we consider the shorter of the two distances from these centers. The sample with the largest of these minimum distances is then selected next. Subsequent centers are selected to ensure maximum separation from those already chosen. This ensures that our initial cluster centers are spread widely across the sample space, a property that is intuitively appealing.
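The MM principle can be illustrated with a short sketch. The following Python code is a minimal illustration and not the authors' implementation: the function name `maximin_sample`, the use of the Euclidean metric, and the toy data are assumptions made for this example. Ties are broken by taking the lowest index.

```python
import math

def maximin_sample(points, c_prime, first=0):
    """Return the indices of c_prime maximin (MM) samples (hypothetical helper)."""
    n = len(points)
    chosen = [first]  # the first center is the first sample in the batch
    # min_dist[k] = distance from point k to its nearest chosen center
    min_dist = [math.dist(points[first], p) for p in points]
    while len(chosen) < c_prime:
        # next center: the point whose nearest chosen center is farthest away
        nxt = max(range(n), key=min_dist.__getitem__)
        chosen.append(nxt)
        # refresh the nearest-center distances with the new center
        for k in range(n):
            d = math.dist(points[nxt], points[k])
            if d < min_dist[k]:
                min_dist[k] = d
    return chosen

# Toy data (assumed): two tight pairs and one isolated point.
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0)]
centers = maximin_sample(points, 3)
```

Each new center costs one pass over all N points, which is the expense that motivates an approximate variant for large data.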
Hathaway et al. [18] appended two steps to this sampling scheme. First, the crisp nearest prototype rule (NPR) partition is computed using the MM samples as prototypes. Second, each of the subsets in this partition is sampled randomly a number of times proportional to the number of points in the subset. This produces a small subset of the larger parent set for approximate clustering and tendency assessment. The resultant sample is called a Maximin Directed Random Sample (MMDRS). The complete pseudocode for the MMDRS algorithm is depicted in Table 1. The literature contains at least six ways to initialize MM sampling in Line 3. A recent study of this issue [21] determined that, on average, the original and fastest scheme (line 3) is as reliable as the other five methods, so that is the initialization we use.
The primary requirement for good samples in the present context is that the cluster proportions in the c' samples from XN be representative of the corresponding proportions for the subsets in XN. If the data are unlabeled, there is no way to ascertain whether any sampling scheme satisfies this requirement. But if the data are labeled, we can determine how well the samples match the distribution of the labeled subsets in XN. This intuitive objective informs our definition of what constitutes a best set of samples. Our expectation is that the DRS methods that begin with MM sampling will produce better samples of labeled data than simple RS in terms of matching proportions of sample and parent (in this article we call XN the parent of samples of it made by the three methods). Three minor results about MMDRS sampling provide weak guarantees that fuel our expectations. To describe the results, we need Dunn's index [22], discussed next.
Consider two non-empty subsets $S, T \subset \Re^p$ and an arbitrary metric $d: \Re^p \times \Re^p \mapsto \Re^+$. The diameter of S is $\Delta(S) = \max_{x, y \in S} \{d(x, y)\}$. Similarly, the set distance between S and T is $\delta(S, T) = \min_{x \in S,\, y \in T} \{d(x, y)\}$. For any partition $U \in M_{hcN} \leftrightarrow \{X_i\}$, the separation index of U, widely known as Dunn's index (DI, [22]), is
$$DI(U; X) = \frac{\min_{1 \le i \le c} \ \min_{1 \le j \le c,\, j \ne i} \delta(X_i, X_j)}{\max_{1 \le k \le c} \Delta(X_k)}.$$
Dunn characterized a partition U as compact and separated (CS) relative to the metric d under the following condition: for all subsets s, q, and r with q≠r, any pair of points x and y from $X_s$ are closer to each other (based on d) than any pair u and v, where u is from $X_q$ and v is from $X_r$. Dunn established that a set X possesses a CS partition with respect to d if and only if $\max_{U \in M_{hcN}} \{DI(U; X)\} > 1$. The following results tie this characteristic of Dunn's index to the MMDRS samples obtained from XN by Algorithm 1.
[Proposition MM]. If XN can be partitioned into c CS clusters and c'≥c, then lines 1-9 of the MMDRS Algorithm will select at least one object from each of the c clusters.
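To make the definitions of diameter, set distance, and Dunn's index concrete, the following Python sketch evaluates DI for a small crisp partition. It is illustrative only: the helper names, the Euclidean metric, and the toy clusters are assumptions, and the code assumes at least two clusters with at least one cluster containing two distinct points (otherwise the largest diameter is zero).

```python
import math
from itertools import combinations

def diameter(S):
    """Largest pairwise distance within S (0.0 for a singleton)."""
    return max((math.dist(x, y) for x, y in combinations(S, 2)), default=0.0)

def set_distance(S, T):
    """Smallest distance between a point of S and a point of T."""
    return min(math.dist(x, y) for x in S for y in T)

def dunn_index(clusters):
    """Minimum between-cluster distance divided by maximum cluster diameter."""
    separation = min(set_distance(S, T) for S, T in combinations(clusters, 2))
    largest_diameter = max(diameter(S) for S in clusters)
    return separation / largest_diameter

# Toy partition (assumed): two clusters of diameter 1 that are 4 apart.
clusters = [[(0.0, 0.0), (1.0, 0.0)], [(5.0, 0.0), (6.0, 0.0)]]
di = dunn_index(clusters)  # 4 / 1, so this partition is compact and separated
```

A value above 1 for some partition indicates that the data admit a CS partition in Dunn's sense.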
Proposition MM tells us that when the input data have c CS clusters, lines 1-9 of Algorithm 1 will extract at least one sample from each cluster. Please observe that Proposition MM applies to the seeds (the prototypes) that are used to build the MMDRS partition.
[Proposition MMDRS-1]. If XN can be partitioned into c compact and separated (CS) clusters, and c'=c, then $U_{MM}$ is the CS partition of XN.
Suppose XN can be partitioned into c CS clusters for c'≥c, and suppose that $n_s |S_t|/N$ is an integer for all t. Then the proportion of objects in the MMDRS sample from subset t equals the proportion of objects in the parent population for t=1 to c. Proof. Proposition 2, Hathaway et al. [18].
These three results have limited utility because the majority of input datasets lack the CS property, and even when they do possess it, it is usually impossible to verify that this is the case. On the other hand, these results do provide some reassurance about the MMDRS procedure, in the sense that at least in some cases, Algorithm 1 obtains samples that represent all c clusters in the data. Consequently, we expect the MMDRS samples to provide fairly representative proportions of the distribution of the input data.
As a final note, we remark that the actual MM samples drawn by lines 1-9 are not part of the sample output, but can easily be included in the output if this is desired. Our experience is that including the MM samples makes little difference to the quality of the output sample in terms of representing the distribution of the input data.
In summary, MMDRS is effective at generating representative samples from a dataset XN when the cluster proportions in the c' samples derived from XN align closely with the proportions found within the subsets of XN. The generated samples can be used as input to any clustering algorithm to find structure in the data without iteratively accessing the whole data set. This makes it feasible to run most clustering algorithms on very large datasets, which is impossible without sampling. However, one drawback of MMDRS is that it needs to span all the data, which makes it challenging and time-consuming for large datasets. Therefore, reducing the time complexity of this approach is essential for big data applications.
Related methods appear in [24,25], but since they do not use directed random sampling as a second step, these methods will not be considered here. Table 2 describes our approximate version of MM sampling:
Lines 1-10 extract the c' AMM samples from XN. The first AMM sample, selected in Line 3 of Algorithm 2, is the first sample in the data. For each additional MM sample, the data is shuffled and split into T chunks. Each successive MM sample is chosen from the new chunk (Xw) instead of the whole input data set (XN). This process is repeated until c' samples are obtained. The DRS procedure (lines 10-20 of Algorithm 1) is then used to find ns AMMDRS samples. To summarize, the AMM procedure simply replaces the input data set XN by a chunk Xw at each iteration in the MM part of the MMDRS algorithm. This reduces the computation time for the MM part of the sampling procedure. Now we turn to some ways to measure sampling quality, where the samples are explicitly constructed to support cluster analysis.
AMMDRS gains its primary advantage in line 6, where the data is randomly partitioned into multiple chunks. AMM then operates on one of these chunks at a time, obviating the need to access the entire dataset for each new sample. This lowers the time complexity by reducing the volume of data processed per iteration from N (the size of the data) to N/T, where T is the number of chunks employed by AMMDRS.
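The chunked selection described above can be sketched as follows. This Python code is a rough illustration under stated assumptions, not a transcription of Algorithm 2: the function name `approx_maximin_sample` is hypothetical, the Euclidean metric is assumed, "shuffle and take one of T chunks" is modeled by drawing a random subset of size N/T, and already selected points are skipped so that no index is returned twice.

```python
import math
import random

def approx_maximin_sample(points, c_prime, T, seed=0):
    """Approximate MM sampling: each new sample is chosen from a random chunk."""
    n = len(points)
    if not 1 <= c_prime <= n:
        raise ValueError("c_prime must lie between 1 and the number of points")
    rng = random.Random(seed)
    chunk_size = max(1, n // T)
    chosen = [0]  # the first AMM sample is the first point in the data
    chosen_set = {0}
    while len(chosen) < c_prime:
        # a random subset of size n // T plays the role of the chunk Xw
        chunk = [k for k in rng.sample(range(n), chunk_size) if k not in chosen_set]
        if not chunk:
            continue  # chunk held only previously chosen points; draw again
        # maximin rule restricted to the chunk
        nxt = max(chunk, key=lambda k: min(math.dist(points[k], points[j])
                                           for j in chosen))
        chosen.append(nxt)
        chosen_set.add(nxt)
    return chosen

# Toy data (assumed): a 10 x 10 grid of points, T = 4 chunks.
grid = [(float(i % 10), float(i // 10)) for i in range(100)]
amm_indices = approx_maximin_sample(grid, 5, T=4)
```

Each iteration here touches roughly N/T points instead of N, at the price of possibly missing the true maximin point when it falls outside the chunk.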

SAMPLE QUALITY
In our experiments, the datasets are labeled, which means they possess ground-truth c'-partitions $U \in M_{hc'N}$ of XN. Let $n_i$ denote the number of points in subset i, so that the total number of points is $N = \sum_{i=1}^{c'} n_i$. From this, we define the proportion vector of XN in $\Re^{c'}$ as $V_N = (n_1/N, \ldots, n_{c'}/N)^T$. Algorithm 1 or Algorithm 2 extracts the MMDRS samples XMMDRS or the AMMDRS samples XAMMDRS, respectively, from the input data. Let $n'_t$ and $n''_t$ denote the number of samples drawn from the t-th subset, 1≤t≤c', by these two algorithms. For these samples we have the corresponding sample proportion vectors in $\Re^{c'}$: $V_{MMDRS} = (n'_1/n', \ldots, n'_{c'}/n')^T$ and $V_{AMMDRS} = (n''_1/n'', \ldots, n''_{c'}/n'')^T$, where $n' = \sum_{t} n'_t$ and $n'' = \sum_{t} n''_t$ are the total sample counts. Our objective is to evaluate how closely VMMDRS and VAMMDRS align with VN. Given that these samples are derived from labeled data, it is feasible to create histograms that contrast the counts of points within each labeled subset with those in the samples. This visual approach offers an assessment of how closely the proportions in the original dataset match those in the sample, independent of both N and p. Especially for smaller values of c, a visual comparison can provide a fairly precise gauge of this alignment.
There are multiple methods to analytically compare VMMDRS or VAMMDRS with VN. One straightforward approach is to calculate the distances d(VN, VMMDRS) and d(VN, VAMMDRS), using a suitable metric on $\Re^{c'} \times \Re^{c'}$. A distance of zero signifies a perfect match between the proportions in the parent dataset and the sample. Second, the similarity between the two distributions (VMMDRS or VAMMDRS to VN) can be calculated via different methods. The Kolmogorov-Smirnov (KS) test is a statistical test used to compare a sample distribution with a reference probability distribution, or to compare two sample distributions [26]. It is a non-parametric test, which means it does not make any assumptions about the shape or parameters of the distributions being compared. It can determine whether two independent samples are drawn from the same population or from different populations. Therefore, KS is used to test the null hypothesis that (VN, VMMDRS) or (VN, VAMMDRS) come from the same distribution. The returned p-value is used to interpret the results. For our experiments, we choose a default significance level of α=0.05. Consequently, if p>α=0.05, we retain the hypothesis that the sample originates from the same distribution as the parent data. In such cases, we say that the sample has passed the KS test. It is worth mentioning that in our experiments, the number of "samples" for the KS test equals c', the total count of labeled subsets. Given that the KS test tends to be less precise for smaller sample sizes, it might not offer highly informative outcomes in our context. We will consider a sample to "cover" the input data if every labeled subset is represented at least once.
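The proportion vectors, the distance-based comparison, and the coverage criterion described above can be sketched in a few lines of Python. This is an illustrative example only: the helper names and the toy labels are assumptions, the Euclidean metric is used for d, and the KS test is omitted because it would normally be delegated to a statistics library.

```python
import math
from collections import Counter

def proportion_vector(labels, classes):
    """Fraction of the labels falling in each class, in the order of classes."""
    counts = Counter(labels)
    total = len(labels)
    return [counts[c] / total for c in classes]

def euclidean_distance(u, v):
    """Euclidean distance between two proportion vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def covers(sample_labels, classes):
    """True if every labeled subset is represented at least once in the sample."""
    return set(classes) <= set(sample_labels)

# Toy labels (assumed): a parent set of 100 points and a sample of 10 points.
classes = [0, 1, 2]
parent_labels = [0] * 50 + [1] * 30 + [2] * 20
sample_labels = [0] * 5 + [1] * 3 + [2] * 2

v_parent = proportion_vector(parent_labels, classes)
v_sample = proportion_vector(sample_labels, classes)
ed = euclidean_distance(v_parent, v_sample)  # zero: proportions match exactly
sample_covers_parent = covers(sample_labels, classes)
```

Because both vectors are proportions, the distance does not depend on the sizes of the parent set or the sample, only on how the labeled subsets are represented.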

NUMERICAL EXPERIMENTS
We conducted all experiments on a system equipped with an Intel Core i7-8700K CPU and 64 GB of RAM, using MATLAB for implementation. The value of T used in line 6 of Algorithm 2 was 10. The horizontal axis on all of the histograms is the cluster number in the labeled data. So, for example, the horizontal axis for the X15 histograms has 15 ticks at k=1 to 15 corresponding to the 15 labeled subsets in the data. The vertical axis on all of the histograms is the ratio of the number of data points (ni) in subset i (or sample thereof) to the number of input points (N). Table 3 lists the four datasets used in our experiments: X15 [27], X31 [28], X6 [29], and the Wisconsin Diagnostic Breast Cancer (WDBC) dataset [30]. While each of these datasets underwent identical analysis, space constraints prevent us from showing all the figures in this article. A comprehensive collection of graphs can be obtained upon request from the second author.
X15, as seen in Figure 1, has visibly distinct clusters drawn from Gaussian distributions with varied means and covariance matrices. Each cluster has between 300 and 350 points. Figure 2 presents six histograms for the dataset X15 when c'=20. The input data's histogram is positioned on the upper left, while the random sample is on the upper right. Each histogram is labeled with two values: ED denotes the value of d(VN, VMM(*)), where d is the Euclidean distance; p is the p-value of the 2-sample KS test (as provided by MATLAB), which is compared against the significance level α=0.05. A p-value less than the significance level prompts us to reject the null hypothesis that both samples come from the same distribution. Conversely, we accept the two samples as being from the same distribution if p>0.05. The values of Euclidean distance in Figure 2 show that Random Sampling produces a much higher value of ED (and hence, a lower quality match to the input distribution) than all four of the MM-based methods. Comparing MM to AMM, we see that MM does slightly (but only slightly) better for the c' samples. After applying DRS to the two sets (MM and AMM), the ED values are an order of magnitude smaller, and AMMDRS does slightly better than MMDRS. Visually, the two DRS sets are much closer to the input distribution than the RS, MM and AMM sets, confirming that the DRS portion of these two algorithms really improves the quality of the samples drawn. The KS test accepts all 5 samples, but clearly prefers the two DRS methods (equal p-values of 0.8899) to the MM and AMM samples (p≈0.060). The p-value for RS (0.307) lies in between these two pairs of values, which agrees with the visual assessment that RS matches the distribution of X15 better than both MM methods, but not as well as both DRS methods.
The dataset X31, illustrated in Figure 3, comprises 31 Gaussian clusters of 100 points each. As a result, the histogram representing the input data exhibits a uniform profile, each bin containing 1/31≈0.0323 of the points, as seen in the upper left view of Figure 4, which shows the histograms and statistics (ED and KS test) for the five sampling methods at c'=50. The two DRS methods yield visually superior samples, and the ED for these two samples favors MMDRS, albeit slightly. The RS is visually inferior to the other four methods. The p-values for all 5 samples are quite small; the statistical implication of this is to reject the null hypothesis that any of these samples matches the input distribution at significance level α=0.05.
In the scatterplot of the dataset X6, on the right, a dense cluster of magenta points nestles within a sparser blue subset. Notably, the lower left section of the scatterplot presents a distinctive pair of clusters reminiscent of a "fried egg". This configuration consists of a yellow center (the "yolk") encased by a cyan perimeter (the "egg white"). The sizes of the six clusters are 50, 92, 38, 45, 158, and 16.

Figure 6. MMDRS and AMMDRS samples of X6: c'=10, ns=100
In Figure 6, notice first that RS produces a much better visual match to the input data than either MM or AMM, but when DRS is added to the sampling procedure, the visual match of both DRS schemes is slightly better than RS. The ED values agree: RS is better than MM or AMM, but not as good as MMDRS or AMMDRS. All 5 samples pass the KS test, i.e., the test accepts the match between the sample and parent distributions. In our final experiment, we used the Wisconsin Diagnostic Breast Cancer dataset. Figure 7 contains the results. This data set is an odd one, because it has feature vectors in 30 dimensions (p=30), but only N=569 samples. All 5 samples yield the same p-value for the KS test, so it is not a useful discriminator for sample quality. Visually, the MM, AMM, and RS samples are poor matches to the input data, while the two DRS samples look the same and are a better match to the actual data. The ED values for the two DRS methods are lower than the MM values and the RS value. From the ED values, we conclude that for this experiment, MMDRS was the best and AMMDRS was next best.
Table 4 shows the CPU time used to compute samples for data set X31. The time required to compute AMM samples is about 1/7 of the time required for MM samples because AMM works on a subset of size N/T of the original dataset, which has N samples. The smaller the subset size (the larger the T value), the less time is required to compute the AMM samples. But the cost of large T values is the risk of missing parts of the data that are not represented in the chosen subset. Since AMMDRS relies on AMM, it is slightly faster than MMDRS, as can be seen in Table 4.

CONCLUSIONS
In this manuscript, we introduced an Approximate MMDRS (AMMDRS) algorithm designed to generate faithful and representative samples from large datasets. This approach enables the application of traditional clustering algorithms without processing the entire dataset, a critical advantage in scenarios where accessing the complete dataset is computationally challenging or impossible due to resource constraints. The significance of this research lies in its potential to make data-driven decision-making more accessible and practical, particularly in situations where working with big data sets is otherwise infeasible. Consequently, this manuscript contributes to the growing body of knowledge aimed at bridging the gap between data analysis and real-world applications, further underscoring the importance of efficient and accurate sampling techniques for handling big data challenges.
The experiments presented here suggest that the approximate MM method is faster than MM, without a significant loss in sampling accuracy. This is especially important for big data applications where processing the entire dataset is not feasible. Table 4 shows that simple (undirected) random sampling is faster than either of the MM-based DRS methods because no time is expended in building the NPR partition. This will be true for any input data set. But in terms of sample quality for cluster analysis, both of the DRS methods produce samples that provide a more faithful representation of the distribution of the input structure than simple random sampling in the experiments reported here. We have used several datasets with different numbers of clusters and samples for our experiments, and our experience with these methods suggests that as the size of the input data grows, AMMDRS will eventually be superior to MMDRS because of the computational complexity of MMDRS, which needs to access the whole data. We will test this conjecture with a more extensive empirical study in the future.

REFERENCES
... the CS partition of XN. Proof. Theorem 1, Hathaway et al. [23]. Proposition MMDRS-1 tells us that when the input data have c CS clusters and we choose c'=c, lines 10-19 of Algorithm 1 find the CS clusters. The number of samples drawn from the t-th subset in Line 16 of the MMDRS algorithm is $n_t = \lceil n_s (|S_t|/N) \rceil$, $1 \le t \le c'$. The factor $|S_t|/N$ scales the desired number of samples $n_s$ by the proportion of points in the t-th row of UMM. Because of the ceiling function, the overall number of samples is approximate, $\sum_{t=1}^{c'} n_t \approx n_s$. The number and the proportions drawn will be exact under the extra condition that the quantities $n_s |S_t|/N$ are all integers, so that the ceiling function has no effect and $\sum_{t=1}^{c'} n_t = n_s$.
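The allocation rule above, combined with the nearest prototype partition, can be sketched as follows. This Python code is a hedged illustration rather than the authors' Algorithm 1: the function name `directed_random_sample` is hypothetical, the Euclidean metric is assumed, sampling within each subset is done without replacement (an assumption), and the code requires that n_s does not exceed the number of points.

```python
import math
import random

def directed_random_sample(points, prototypes, n_s, seed=0):
    """Draw about n_s point indices, directed by the nearest prototype partition."""
    rng = random.Random(seed)
    n = len(points)
    # nearest prototype rule: assign every point index to its closest prototype
    subsets = [[] for _ in prototypes]
    for k, x in enumerate(points):
        t = min(range(len(prototypes)), key=lambda i: math.dist(x, prototypes[i]))
        subsets[t].append(k)
    sample = []
    for S_t in subsets:
        # n_t = ceil(n_s * |S_t| / N), computed with integers to avoid rounding error
        n_t = -((-n_s * len(S_t)) // n)
        sample.extend(rng.sample(S_t, n_t))
    return sample

# Toy data (assumed): 6 points near the first prototype, 4 near the second.
toy_points = ([(0.0, 0.1 * i) for i in range(6)]
              + [(10.0, 10.0 + 0.1 * i) for i in range(4)])
toy_prototypes = [toy_points[0], toy_points[6]]
drs_sample = directed_random_sample(toy_points, toy_prototypes, n_s=5)
```

With these toy values, 5 × 6/10 = 3 and 5 × 4/10 = 2 are integers, so the ceiling has no effect and exactly 5 indices are drawn, 3 from the first subset and 2 from the second.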

Table 1. The MMDRS algorithm. Inputs: c' = number of MM samples; ns = desired number of MMDRS samples.
Table 1 presents the pseudocode in two sections: MM sampling and DRS sampling. Lines 1-9 extract the c' MM samples from XN. Ties in Line 6 are broken arbitrarily. Lines 10-19 build the elements of the crisp partition $U_{MM} \in M_{hc'N}$ of XN. The matrix UMM appearing in lines 10, 12 and 20 is commented out since it is not needed to secure the desired MMDRS samples output in line 20. We show it to instruct readers on how the partition is used to direct the random sampling, which we hope lends some transparency to the DRS scheme. You may recognize UMM as the "k-means" or nearest prototype rule (NPR) partition of XN built by applying Lloyd's algorithm [1] to the input data with k=c' using the c' MM samples as cluster centers.

Table 4. Computational times of the proposed sampling methods on the X31 dataset