Fast and Eager k-Medoids Clustering: O(k) Runtime Improvement of the PAM, CLARA, and CLARANS Algorithms

Clustering non-Euclidean data is difficult, and one of the most used algorithms besides hierarchical clustering is the popular algorithm Partitioning Around Medoids (PAM), also simply referred to as k-medoids clustering. In Euclidean geometry the mean-as used in k-means-is a good estimator for the cluster center, but this does not exist for arbitrary dissimilarities. PAM uses the medoid instead, the object with the smallest dissimilarity to all others in the cluster. This notion of centrality can be used with any (dis-)similarity, and thus is of high relevance to many domains and applications. A key issue with PAM is its high run time cost. We propose modifications to the PAM algorithm that achieve an O(k)-fold speedup in the second ("SWAP") phase of the algorithm, but will still find the same results as the original PAM algorithm. If we relax the choice of swaps performed (while retaining comparable quality), we can further accelerate the algorithm by eagerly performing additional swaps in each iteration. With the substantially faster SWAP, we can now explore faster initialization strategies, because (i) the classic ("BUILD") initialization now becomes the bottleneck, and (ii) our swap is fast enough to compensate for worse starting conditions. We also show how the CLARA and CLARANS algorithms benefit from the proposed modifications. While we do not study the parallelization of our approach in this work, it can easily be combined with earlier approaches to use PAM and CLARA on big data (some of which use PAM as a subroutine, hence can immediately benefit from these improvements), where the performance with high k becomes increasingly important. In experiments on real data with k=100,200, we observed a 458x respectively 1191x speedup compared to the original PAM SWAP algorithm, making PAM applicable to larger data sets, and in particular to higher k.


Introduction
Clustering is a common unsupervised machine learning task, in which the data set has to be automatically partitioned into "clusters", such that objects within the same cluster are more similar, while objects in different clusters are more different.There is not (and likely never will be) a generally accepted definition of a cluster (Bonner, 1964), because "clusters are, in large part, in the eye of the beholder" (Estivill-Castro, 2002), meaning that every user may have different enough needs and intentions to want a different algorithm and notion of cluster.And therefore, over many years of research, hundreds of clustering algorithms and evaluation measures have been proposed, each with their merits and drawbacks.Nevertheless, a few seminal methods such as hierarchical clustering, k-means, PAM (Kaufman andRousseeuw, 1987, 1990c), and DBSCAN (Ester et al., 1996) have received repeated and widespread use.One may be tempted to think that these classic methods have all been well researched and understood, but there are still many scientific publications trying to explain these algorithms better (e.g., Schubert et al., 2017), trying to parallelize and scale them to larger data sets (e.g., Lijffijt et al., 2015;Yang and Lian, 2014), trying to better understand similarities and relationships among the published methods (e.g., Schubert et al., 2018), or proposing further improvements -and so does this paper for the widely used PAM algorithm, also often referred to as k-medoids clustering.
In hierarchical agglomerative clustering (HAC), each object is initially its own cluster.The two closest clusters are then merged repeatedly to build a cluster tree called dendrogram.HAC is a very flexible method: it can be used with any distance or (dis-)similarity, and it allows for different rules of aggregating the object distances into cluster distances, such as the minimum ("single linkage"), average, or maximum ("complete linkage").Single linkage directly corresponds to the minimum spanning tree of the distance graph.While the dendrogram is a powerful visualization for small data sets, extracting flat partitions from hierarchical clustering is not trivial, and thus users often turn to simpler methods.
A classic method taught in textbooks is k-means (for an overview of the complicated history of k-means, refer to Bock, 2007), where the data is modeled using k cluster means, that are iteratively refined by assigning all objects to the nearest mean, then recomputing the mean of each cluster.This converges to a local optimum because the mean is the least squares estimator of location, and both steps reduce the same quantity, a measure known as sum-of-squared errors: In k-medoids, the data is modeled similarly, using k representative objects m i called medoids (chosen from the data set; defined below) that serve as "prototypes" for the clusters instead of means in order to allow using arbitrary other dissimilarities and arbitrary input domains (not restricted to vector spaces), using the absolute error criterion ("total deviation", TD) as objective: which is the sum of dissimilarities of each point x c ∈ C i to the medoid m i of its cluster.
If we use squared Euclidean as distance function (i.e., d(x, m) = ||x − m|| 2 2 ), we almost In comparison to the original conference version, the improved version presented here modifies the algorithm in a way that allows to prove the speedup factor of O(k) compared to the original PAM algorithm by completely eliminating the nested loop of length k.We furthermore study a new variant using eager swapping that further improves runtime.We also include a brief recap on the history of the PAM algorithm, updated and more extensive benchmarks on additional data sets, and cover additional related work.
1.1.On the History of PAM In the early eighties many clustering methods were restricted to dealing with metric data, i.e., coordinates of geometric points.The PAM algorithm was developed at that time (but only published later).PAM was part of a project to construct clustering methods that could deal with arbitrary dissimilarity matrices (subjective judgments, confusion matrices, . . . ) that did not even have to satisfy the triangle inequality.Kaufman and Rousseeuw (1987) proposed the name medoid for an object with lowest total dissimilarity to the other objects of its cluster, in order to distinguish it from the (geometric) median for metric data (for a survey see Fritz et al., 2012), defined by minimizing the sum of Euclidean distances over all geometric points, not only the data points.Of course PAM can handle metric data as well by first computing a dissimlarity matrix from them, e.g., using Euclidean or Manhattan distance.The program DAISY (Kaufman and Rousseeuw, 1990b) also computed dissimilarity measures for data with non-numerical variables.
Originally PAM was run on the first generation of IBM PC's that only had two floppy disks of 360KB each (the left one containing the DOS operating system), 64kB of internal memory, and no hard drive.The Fortran code of PAM could only be compiled after splitting it in pieces.At that time the main limitation of PAM was not so much its computation time but mainly the O(n 2 ) memory required, allowing to analyze datasets with up to about n = 150 cases only.This restriction was then circumvented by the program CLARA (Kaufman and Rousseeuw, 1986).
As for the naming of the k-medoid algorithm, it was first thought to combine the initials of K-Medoid into KIM, after the daughter of one of the authors.But this only matched two out of three letters.Thinking a bit longer led to the descriptive name Partitioning Around Medoids with abbreviation PAM, then the well-known name of a character played by Victoria Principal in the television series Dallas.In fact all algorithms in the subsequent book (Kaufman and Rousseeuw, 1990b) were given female abbreviations, with for instance DAISY inspired by after HAL's song in Kubrick's film 2001: A Space Odyssey.
Around the same time, a family of related problems received a lot of attention in a different domain, trying to find the optimal matching of consumer locations and potential facility locations.But of course finding related work was, at that time, much harder than it is today with electronic access, Wikipedia, and full text search engines.Today, we can more easily find such connections across different domains.

Related Location-Allocation Problems
Closely related approaches and problems can be found in other domains such as in operations research and management science with different names such as "p-medians" (not to be confused with geometric medians).The k-medoids clustering problem can be seen as a symmetric and discrete special case of p-medians, and the uncapacitated facility location problem (UFLP), where the number of facilities to be opened, k respectively p, is constant; where all facilities have the same opening costs and where all customers have equal demand.A survey and annotated bibliography of the p-median problem and some of its variations in facility location can be found in Reese (2006).
In such domains, authors such as Teitz and Bart (1968) and Maranzana (1963) have considered various heuristics for this version of the problem.Parts of the PAM algorithm can be found in this literature by the name "greedy" for the BUILD initialization of PAM; "interchange" or "vertex substitution" for the SWAP part of PAM; and "alternate" for the k-means-style iteration technique also discussed in data mining literature (e.g., Hastie et al. 2001;Park and Jun 2009).Several variations have been suggested, such as the "fast interchange" heuristic of Whitaker (1983).Beasley (1985) used a Cray-1S supercomputer to find the exact solution for such problems with up to 900 instances with a branch-and-bound approach.For some of the ORlib problems, the exact solutions could still not be determined within 600 seconds at that time.Because of these similarities, our algorithms may be of interest for researchers from these domains too, as it should be possible to add support for varying demand and asymmetric problems (possibly even for capacitated facility location), and retain the O(k) speedup over the standard local search heuristic popular in these domains, at least for the initialization of more complex search methods.

Partitioning Around Medoids (PAM) and its Variants
The "Program PAM" (Kaufman andRousseeuw, 1987, 1990c) consists of two algorithms, BUILD to choose an initial clustering, and SWAP to improve the clustering towards a local optimum (finding the global optimum of the k-medoids problem is, unfortunately, NP-hard as shown by Kariv and Hakimi, 1979).The algorithms require a dissimilarity matrix (for example computed using the routine DAISY of Kaufman and Rousseeuw, 1990b), which requires O(n 2 ) memory and typically for many popular distance functions in d dimensional data O(n 2 d) time to compute (but potentially much more for expensive distances such as earth mover's distance also known as Wasserstein metric).Computing the distance matrix often is already a bottleneck in many cases.
Algorithm 1: PAM BUILD: Find initial cluster centers.In order to find a good initial clustering (rather than relying on a random sampling strategy as commonly used with k-means), BUILD chooses k times the point which yields the smallest distance sum TD (this means first choosing the point with the smallest distance to all others; afterwards always adding the point that reduces TD most).We give a pseudocode in Algorithm 1, where we use ∆TD as symbol for the change in TD (which should be negative to be beneficial), and add an asterisk * for the best values found so far.A subtle but important detail used in BUILD (independently suggested by Whitaker 1983 in the "fast greedy" approach) is to cache the distance to the nearest medoid in line 6, then use this in the loop in line 13, and update it when a new medoid has been chosen in line 18.This avoids an additional nested loop over k inside the computation of ∆TD, and reduces the runtime of the naive implementation from O(n 2 k 2 ) to O(n 2 k) time.Nevertheless, this remains a fairly expensive algorithm.The motivation here was to find a good starting point, in order to require fewer iterations of the refinement procedure.In the experiments, we will also study whether randomized initialization approaches are an interesting alternative.
The second part of PAM, which is the main focus of this paper, was named SWAP.It improves the clustering by considering all possible simple changes to the set of k medoids, which effectively means replacing (swapping) some medoid with some non-medoid, which gives k • (n − k) candidate swaps.If it reduces TD, the best such change is then applied, in the spirit of a steepest-descent method, and this process is repeated until no further improvements are found.The process then has reached a local (but not necessarily the global) optimum where no solution that agress on k−1 medoids is better.Because the possible search space of medoids is finite (although large: n k ), a steepest-descent method must converge with a finite number of iterations.
We give a pseudocode of this in Algorithm 2. In line 1 we compute-and cachethe necessary data to compute the ∆(x o , m i , x c ) function (Equation 5, explained in Sec-Algorithm 2: PAM SWAP: Iterative improvement.tion 3) in line 8 efficiently.These values then need to be updated after performing a swap in line 13.With the cached values, the run time of the main loop of this algorithm is O(k(n−k) 2 ) for each iteration.(A similar optimization can also be found in Whitaker, 1983.)While the authors of PAM assumed that only few iterations will be needed (if the algorithm is already initialized well, using the BUILD algorithm above), we do see an increasing number of iterations with increasing amounts of data and increasing k and cannot give a non-trivial bound for the maximum number of iterations necessary, but usually we observe fewer than k iterations (as also observed by Whitaker, 1983).
Both the pseudocode for BUILD and SWAP given here omit the details of managing the cached distances.For each object, we need to store the index of the nearest medoid nearest(o) and its distance d nearest (o), and also the distance to the second nearest medoid d second (o) (for brevity, we will also use d n (o) and d s (o) in some places, because of space restrictions).In line 13 (or better during line 12, when executing the best swap), we need to carefully update these cached values (if we additionally store the index of the second nearest center, we may be able to avoid some more distance computations if the new medoid becomes the nearest or second nearest after a swap).

Variants and Extensions of PAM
The algorithm CLARA (Kaufman andRousseeuw, 1986, 1990a) repeatedly applies PAM on a subsample with n n objects, with the suggested value n = 40 + 2k.Afterwards, the remaining objects are assigned to their closest medoid.The run with the least TD (on the entire data) is returned.If the sample size is chosen n ∈ O(k) as suggested, the run time reduces to about O(k 3 + n), which explains why the approach is typically used only with small k (Lucasius et al., 1993).Because CLARA uses PAM internally, it will directly benefit from our improvements proposed in this article.Lucasius et al. (1993) propose a genetic algorithm for k-medoids, by performing a randomized exploration of the search space based on "mutation" of the best solutions found so far.Crossover mutations correspond to taking some medoids from both "parents", whereas mutations replace medoids with random objects.It is not obvious that this will efficiently provide a sufficient coverage of the enormous search space (there are n k = n! k!(n−k)!possible sets of medoids) for a large k.In order to benefit from the proposed improvements, a more systematic mutation strategy would need to be adopted, making the method similar to CLARANS below.Wei et al. (2003) found the genetic methods to work only for small data sets, small k, and well separated symmetric clusters, and it was usually outperformed by CLARANS.The algorithm CLARANS (Ng andHan, 1994, 2002) interprets the search space as a high-dimensional hypergraph, where each edge corresponds to swapping a medoid and non-medoid.On this graph it performs a randomized greedy exploration, where the first edge that reduces the loss TD is followed until no edge can be found with p = 1.25% • k(n − k) attempts.In Section 3.4 we will outline how our approach can be used to explore the k edges corresponding to all medoids at a time efficiently; this will allow exploring a larger part of the search space in similar time, but we expect the savings to be relatively small compared to the improvements gained in PAM.
Other proposals include optimizations for Euclidean space (e.g., Estivill-Castro and Houle, 2001;Estivill-Castro and Yang, 2004), simulated annealing (Murray and Loss: 10 Swap 1: Loss: 7 Swap 2: Loss: 6 Figure 1: Problematic example for the alternating heuristic: the k-means style approach is stuck in the top solution, while the SWAP heuristic can reach better solutions by reassinging points during the swap.Church, 1996), variable neighborhood search (Mladenovic and Hansen, 1997), and tabu search heuristics (e.g., Rolland et al., 1997).Estivill-Castro and Murray (1998) suggested stopping early when observing diminishing returns with a "fast interchange" heuristic as used by Whitaker (1983).Newling and Fleuret (2017b) propose an interesting sub-quadratic algorithm, but it requires the distances to be metric (an additional problem is discussed in the following section).For a broad survey of related techniques in operations research, see Reese (2006).Reynolds et al. (2006) discuss an interesting trick to speed up PAM.They show how to decompose the change in the loss function into two components, where the first depends only on the medoid removed, the second part only on the new point.This decomposition forms the base for our approach, and we will thus discuss it in Section 3 in more detail.

Alternating k-medoids Algorithm
Park and Jun (2009) propose a "k-means like" algorithm for k-medoids that is supposedly "simple and fast" (actually this was already considered before by, e.g., Maranzana, 1963;Hastie et al., 2001;Reynolds et al., 2006).Newling and Fleuret (2017b) propose a sub-quadratic variant for this algorithm for metric data, but unfortunately do not compare the result quality to alternatives (in Newling and Fleuret, 2017a, the authors observed that CLARANS produced much better results than this approach, in neither work they included PAM).This approach is well known in operations research literature by the name "alternate" because it consists of two alternating steps, where in each iteration the medoid is chosen to be the object with the smallest distance sum to other members of the cluster, then each point is assigned to the nearest medoid until TD no longer decreases.Choosing cluster medoids with a k-means like strategy takes O(n 2 ) time per iteration because we have to assume the clusters to be unbalanced, and contain up to O(n) objects; making this k times faster than regular PAM.(The runtime of O(nk) claimed by Park and Jun, 2009, clearly is incorrect.)This heuristic is, unfortunately, not very effective at improving the clustering: new medoids always have to cover the entire current cluster.This misses many improvements where cluster members can be reassigned to other clusters with little loss; such improvements are considered by SWAP.Furthermore, the arithmetic mean as used in k-means changes when we move any point to a different cluster, but the medoids will very often remain the same even when objects are added to or removed from the cluster.Because of its discrete nature, k-medoids is much more likely to get stuck in a local optimum using this strategy.Performing only few swaps reduces the number of iterations (and hence reduces the run time), but also produces significantly worse results, as previously observed for example by Teitz and Bart (1968), Rosing et al. (1979), andReynolds et al. (2006).The problem of the k-means style "alternating" heuristic is illustrated in Figure 1.Given the solution in the top row, assigning each object to the closest medoid yields the indicated boxes as partitions; choosing the medoid (or median, as this is a one-dimensional data set) in each box reproduces the same medoidsthe "alternating" heuristic is stuck in a local optimum.The SWAP heuristic can escape this situation easily: swapping A with medoid M 1 = B yields an improvement, because objects M 1 = B and C are reassigned to M 2 = E now.In a second swap, M 2 = E will then be swapped with D. The essential difference between the capabilities of the "alternating" and the SWAP heuristic is that in the "alternating" heuristic, the new medoid must cover all assigned points, whereas when swapping medoids, points will be reassigned during the swap.Since SWAP can reassign B and C to M 2 = E during the swap, it can choose a much better replacement medoid than in the k-means style approach.This example also can serve as a proof that the solutions found with this heuristic can be arbitrarily much worse than the solution found by PAM, if we move point A further to the left.(A similar situation can be found in Figure 1 of Newling and Fleuret, 2017a, which serves as their motivation to use CLARANS for initializing k-means, and a similar example is used by Hochbaum, 1982 to discuss known theoretical bounds on the quality: the longer the maximal edges in the graph are, the worse approximations to the result become; Kanungo et al., 2004 used a similar example to show that the worst case of the standard k-means algorithm is also not bounded, and propose a PAM-style swapping approach for k-means to guarantee an approximation quality).
If we try to improve the "alternating" heuristic by allowing to reassign points to other medoids, it essentially becomes a restricted variant of SWAP, where points may only be swapped with the medoid they are currently assigned to.This yields little benefit over the original SWAP, and even less over the accelerated version discussed in Section 3.1 where we can consider all medoids at once.

Alternative Initializations
While PAM's BUILD is considered a state of the art heuristic to initialize k-medoids (it is also known by the name "Greedy" in operations research), some alternatives have been used, such as randomly choosing initial medoids.Captivo (1991) integrates the "Alternating" approach into the greedy BUILD heuristic, updating the medoid of the existing clusters when the next center is added.This reportedly gives better starting conditions than BUILD, but also requires more effort.This approach is known as "GreedyG", and we will include it in our experiments.Arthur and Vassilvitskii (2007) noted that their k-means++ initialization heuristic can also be used with linear error, and hence it can be used for k-medoids.Schubert and Rousseeuw (2019) experimented with this heuristic, but noted on the benchmarks that it is very slow when used with PAM; we will investigate this below.Park and Jun (2009) propose an unusual O(n 2 ) initialization, that unfortunately tends to choose all initial medoids close to the center of the data set.They choose the k objects with the smallest normalized distance sums, but ignore the dependency of the selected medoids to each other; and hence this usually ends up choosing all medoids close to the 1-medoid.This is not a very beneficial way of initializing these algorithms, and indeed this tends to perform even worse than random sampling.This was also previously observed by Newling and Fleuret (2017b).

Variants for Large Data Sets
Since PAM needs O(n 2 ) memory for the distance matrix, it is not usable on big data.Therefore, people have proposed various approximations to PAM, such as CLARA and CLARANS discussed before.Yang and Lian (2014) parallelize the "k-means like" variant with map-reduce, parallelizing over the cluster in the reduce step.When cluster sizes vary substantially, this needs O(n 2 ) memory in the reducer, and may yield next to no speedup in the worst case.CLARA can be trivially parallelized by randomly partitioning the data, then running PAM on each partition (Kaufman et al., 1988).This approach will obviously benefit from our improvements the same way as CLARA and PAM benefit.A recent example is PAMAE (Song et al., 2017), which essentially is CLARA with an additional refinement step: it draws random samples and runs any k-medoids approach on each; chooses the best medoids found, and refines them with a single iteration of an approximate parallel version of the "k-means like" update; this will benefit from our improvements to CLARA.Papers have rarely considered using large k values, although this makes sense in the context of approximating a big data set, where you want to reduce the data set to k representative samples.Many of the attempts at distributing and parallelizing PAM employ PAM as a subroutine on a subset of the data, and hence can trivially integrate our improvements.

Finding the Best Swap
We focus on improving the original PAM algorithm here, which is a commonly used subroutine even in the faster variants such as CLARA.We also discuss how we can obtain similar improvements for CLARANS in Section 3.4.
PAM's algorithm SWAP evaluates every possible swap of each medoid m i with any non-medoid candidate x c .Recomputing the resulting TD using Equation 2 every time would require finding the nearest medoid for every point every time, which causes many redundant computations.Instead, PAM computes the change in TD for each object x o if we swap m i with x c : In the function ∆(x o , m i , x c ) we can often detect when a point remains assigned to its current medoid (if nearest(o) i, and this distance is also smaller than the distance to x c ), and then immediately return 0. We rewrite the original "if" case distinctions used in Kaufman and Rousseeuw (1990c) into the equation: where d n (o) is the distance to the nearest medoid of o, and d s (o) is the distance to the second nearest medoid.The labels (a), (b1), (b2), and (c) indicate the if cases considered by Kaufman and Rousseeuw (1990c).If nearest(o), d n (o), and d s (o) are known, we can compute ∆(x o , m i , x c ) using Equation 5in O(1), and hence ∆TD using Equation 4in O(n − k) by skipping the selected medoids.A naive approach would require O(k) for ∆(x o , m i , x c ) respectively O(nk) for computing ∆TD.
Reynolds et al. ( 2006) note that we can decompose ∆TD into: (i) the (positive) loss of removing medoid m i , and assigning all of its members to the next best alternative, which can be computed as , corresponding to case (b2) above, and (ii) the (negative) loss of adding the replacement medoid x c , and reassigning all objects closest to this new medoid, computed as max{d( is the distance to the nearest medoid except m i which we are replacing.Since (i) as well as d −i n (o) do not depend on the choice of x c , we can make the loop over all medoids m i outermost, reassign all points of the current medoid to the second nearest medoid, cache these distances to the now nearest neighbor as d −i n (o), and compute the resulting loss as: This combines cases (a) with ( b2) and ( b1) with (c), moving the case distinction within these pairs of situations outside of the innermost loop.These two cases still need to be distinguished in form of the max operation.The authors observed roughly a two-fold speedup using this approach, and so do we in our experiments.The FastPAM idea is based on a similar idea of exploiting redundancy in these computations, but we want to eliminate the nested loop over the medoids m i , hence removing the factor k in the run time complexity.This is possible because the four cases are not occurring equally often.No change is the most common situation (at least for large k), and nearest(o) = i only holds for roughly one of k cases.The common cases (a) and (c) in Equation 5do not depend on the exact choice of i, as long as it is not the nearest medoid.In fact, in a loop over all medoids i, all but one iteration perform the same computations.When d(x o , x c ) < d n (o) we get the same result independently of i from cases (b1) and (c).Hence we can rewrite this to: In Schubert and Rousseeuw 2019 we still handled the first case with an if-guarded nested loop over the medoids, and argued that the loop is executed only for a subset of cases.This does not allow for a worst-case guarantee, but based on the suggestion by Karl Bringmann in personal communication we found a way to eliminate the nested loop altogether: The first case can be efficiently handled by adding the change to a shared accumulator ∆TD +x c .For the next two cases, we use an array with one entry for each medoid m i , and if d(x o , x c ) ≥ d n (o), we collect the loss change for this one medoid m i there separately.The last case does not need to be handled, because it is 0.
The final loss can then be obtained by adding the shared accumulator ∆TD +x c to all array entries (but only needed for the best).Formally, this can be expressed as which can be computed in a single pass over the objects.But when benchmarking this method, we found that it was slower than that of Reynolds et al. for k ≤ 3, while for larger k it was much faster.Then we realized that we can integrate this idea, too.While it remains slower for k = 2, it further increases the performance of the algorithm; and k = 2 could become a special case in an optimized implementation.Our final FastPAM SWAP proposal uses where ∆TD −m i is the same as in our presentation of Reynolds et al., except that we compute and store this for all m i in one pass, and do not use d −m i n .∆TD +x c is the same as just introduced.The last term-the only part which depends on x o -has become a more complicated expression, but it is frequently zero (and that is when we save effort), and the first condition already needs to be evaluated for computing ∆TD +x c .In Appendix A we prove the relationship of Equation 11 to Equation 4.
In this article, we combine several optimization techniques: (A) removal of the nested loop over the medoids, the key contribution of the initial FastPAM presented in Schubert and Rousseeuw (2019).(B) introduction of the shared accumulator ∆TD +x c (necessary to completely remove the loop, in initial FastPAM it just became conditional) (C) precomputation of removal loss based on the idea of Reynolds et al. (2006) (D) eager execution of swaps, going beyond the additional swaps we performed in Schubert and Rousseeuw (2019), but similar to, e.g., CLARANS With only techniques (A)-(C) implemented, we can still guarantee to find the exact same results as the original PAM algorithm, we will denote this interim step as Fast-PAM1.The full variant implementing (A)-(D) will be called FastEagerPAM, or short: "FasterPAM".Schubert and Rousseeuw (2019) used only (A) and for FastPAM2 a simpler version of (D) that would perform up to k swaps per iteration, one for each medoid.

Making PAM SWAP Faster: FastPAM1
Algorithm 3 shows the improved SWAP algorithm.As we do not yet decide which medoid to remove, we use an array of ∆TD i s for each possible medoid to replace, initialized with the per-medoid removal loss ∆TD −m i .Additionally, we employ a shared accumulator ∆TD +x c to collect the loss change independent of the medoid removed.At the beginning of the iteration in line 6, we use these precomputed values as initial permedoid accumulator values.We can now iterate over all points, and check which of the three cases discussed above applies, and the cases can be easily distinguished by using the distance to the new candidate medoid d(x o , x c ), and the two cached distances d nearest (o), and d second (o).If the new medoid is closest, the change applies to ∆TD +x c (case (i), line 11), but we have adjustments for the case that the nearest is removed (to not double-count this).If the new medoid is second closest, we only have to handle the case where its current nearest is removed (line 14), and assign it to the new medoid then instead.After iterating over all points we choose the best medoid, add the shared loss accumulator ∆TD +x c , and remember the overall best swap.Note that if we always prefer the smaller index i on ties, FastPAM1 carries out exactly the same swap as the original PAM algorithm.
At the slight cost of precomputing the k removal loss ∆TD −m i s, temporarily storing the accumulator ∆TD +x c and one ∆TD i for each medoid m i (compared to the cost of the Algorithm 3: FastPAM1: Improved SWAP algorithm distance matrix and the distances to the nearest and second nearest medoids, the cost of this is negligible), we are able to remove the nested loop over all medoids, hence making the PAM algorithm O(k) faster.

FasterPAM: FastPAM1 with Eager Swapping
Choosing the single best swap is not crucial to the optimization problem, and for example "fast interchange" of Whitaker (1983) and CLARANS (Ng andHan, 1994, 2002) are two approaches that greedily perform the first swap that yields some improvement of the loss function.FastPAM2 (Schubert and Rousseeuw, 2019) was an interim solution: it first computed the best swap for each medoid with a small modification to FastPAM1; then executed up to k swaps per iteration (one for each medoid).In this section we will show that the simple eager approach is desirable to use.Estivill-Castro and Murray (1998) point out that such "local hill-climbing" methods nevertheless guarantee to find a local optimum, but may provide much better runtime.In facility location, both original PAM and FastPAM belong to the family of "local search" algorithms.A detailed analysis of result quality obtainable with such methods can be found, for example, in Arya et al. (2001Arya et al. ( , 2004)), who showed that the results obtained by singleswap interchanges have a locality gap of exactly 5 as k tends to infinity; by considering multiple swaps this can be improved to a 3 + ε bound at considerable computational effort.The k-means++ algorithm is only O(log k) competitive (Arthur and Vassilvitskii, 2007), meaning that it can potentially yield much worse results than a swap heuristic (c.f., Kanungo et al., 2004).
We can modify PAM as well as FastPAM to immediately perform any swap that yields an improvement.We then continue searching at the next object, until we have performed one entire pass over the data without finding an improvement.Clearly this strategy can find many swaps per iteration.Again the improvement of our modification saves a loop over the medoids, meaning that FastEagerPAM is O(k) faster than EagerPAM per iteration.Because we compute all k medoids in parallel, we can swap each candidate point with the best of the k medoids (if any yields a loss decrease).We hence do not strictly perform the first swap, but the best of each batch of k.This means this approach will find different results than both PAM and EagerPAM, but there is no reason to assume that picking the best of k choices instead of the first yields worse results (i.e., that the results would be worse than those of eager swapping with regular PAM).It is less obvious that this will yield as good results as a non-eager approach (PAM, respectively FastPAM1) that considers the best swap only; the concern that we may get stuck in a worse local optimum is larger when we do not try hard to choose the best swap.But if the solution space is fairly smooth -and we would assume that the true best solution is fairly well separated from inferior solutions if the clusters are substantial and not random (this is a bit different from facility location, where we are interested in finding optimal solutions even for data that does not cluster; we may be interested to service all customers in the U.S. from a given k facilities even though the "natural" clustering would place one facility in each larger city).
In Algorithm 4 we provide the pseudocode for this algorithm, based on FastPAM1 (as we have only one candidate x c at any time).

Faster Initialization
With these optimizations to the PAM SWAP algorithm, reducing the run time from O(k(n − k) 2 ) to O((n − k)n), the bottleneck of PAM becomes the BUILD phase.In the experiments with the 100 plant species leaves data sets in Section 4 with k = 100, PAM spends 95% of the run time in SWAP.With FastPAM1, this reduced to about 19%; and with FasterPAM only 3.7% are spent in SWAP.Matrix computation takes 0.24%, 3.6%, respectively 4.1% of the total runtime.The amount of time spent in BUILD however increases from 5.3% to 76% respectively 91%.Since the complexity of BUILD is in O(kn 2 ), this should not come at a surprise that it now has become the dominant bottleneck in the algorithm.But because we made SWAP much faster, we can afford to begin with slightly worse starting conditions, although this means we need more iterations of SWAP afterwards (in the experiments of Section 4, we will see that we need to do many more swaps with worse starting conditions, and that BUILD was indeed beneficial to use with the original PAM).
An elegant way of initializing k-means is k-means++ (Arthur and Vassilvitskii, 2007).The beautiful idea of this approach is to choose seeds with the probability proportional to their squared distance to the nearest seed (the first seed is picked uniformly).While this approach is commonly known as k-means++, an earlier version Algorithm 4: FasterPAM: FastPAM1 with eager swapping and analysis of this idea can already be found in Meyerson (2001) in the context of the online facility location problem, while Ostrovsky et al. (2006) published a variant that also uses importance weighting for the first seed.If we assume there exists a cluster of several points and no seed nearby, the aggregated probability mass of this cluster is substantial, and we are likely to place a seed there; afterwards the probability mass of this cluster reduces and we are unlikely to place a second seed there.Outliers on the other hand will have a high individual weight, but as they are rare their total mass remains low enough to usually not be chosen.This initialization is (in expectation) O(log k) competitive to the optimal solution, so it will theoretically generate better starting conditions than uniform random sampling.But as seen in our experiments, this guarantee is pretty loose; and BUILD empirically produces much better starting conditions than k-means++ (although theoretical bounds known due to Cornuejols, Fisher and Nemhauser, 1977;Hochbaum, 1982 are not encouraging).But it is easy to see that in BUILD each medoid is chosen as a current optimum with respect to TD; whereas k-means++ picks the first point randomly, and subsequent points are (in expectation) random points from different clusters, but k-means++ makes no effort to find good centers of the clusters (which is not that important for k-means, where the mean is in between of the data points).Therefore, with k-means++-style initialization we need around k additional swaps to pick the medoid of each cluster (and hence, k SWAP iterations of original PAM and FastPAM1).Because a single iteration of SWAP used to take as much time as BUILD, the k-means++ initialization only begins to shine if we use FastPAM1 to reduce the cost of iterating together with the eager swapping of FasterPAM doing as many swaps as possible in each iteration.A second important benefit of k-means++ is that the algorithm is randomized; and we can run it multiple times and keep the best result.This helps if there is some local optimium that we might get caught in, and we can use multiple runs to increase our likelihood of finding the true optimum.Lijffijt et al. (2015) previously used k-means++ for PAM and CLARA; but they mistake the "alternating" algorithm for PAM, and their experiments only used small k.Our experiments (in Section 4.4) show that k-means++ initialization takes many more iterations to converge than with the original BUILD initialization; so without the improvements introduced in this article, it is usually not beneficial to use k-means++ with the original SWAP algorithm for speed (the benefit of randomness, the ability to get different results, remains; and so do the theoretical guarantees).Schubert and Rousseeuw (2019) proposed a strategy called LAB (Linear Approximative BUILD), a linear approximation of the original PAM BUILD (c.f., Algorithm 1).In order to achieve runtime linear in n, we simply subsample the data set.Before choosing each medoid, we sample 10+ √ n points from all non-medoid points.From this subsample we choose the one with the largest decrease ∆TD with respect to the current subsample only, similar to BUILD.Other similar sampling-based heuristics have been used, for example, by Resende and Werneck (2004), whose strategy yields a complexity of O(kn log 2 n k ), and this approach performed best in their experiments.While LAB reduced the number of swaps necessary for convergence (and hence the number of iterations for PAM and FastPAM1) substantially, this benefit was offset when we introduced eager swapping in FasterPAM.With FasterPAM, we recommend using either uniform random sampling or k-means++-style initialization.

Integration: FastCLARA and FastCLARANS
Since CLARA (Kaufman andRousseeuw, 1986, 1990a) uses PAM as a subroutine, we can trivially use our improved FastPAM with CLARA.In the experiments (the implementations are provided as open-source) we will denote this variant as FastCLARA.
Because of the sampling strategy used in CLARA, the runtime is O(k3 • i) because it performs PAM clustering with a sample size of s = 40 + 2k (the number of iterations i does, however, contain some hidden dependency on k).By replacing PAM with Fast-PAM, we immediately obtain a runtime of O(k 2 • i) for CLARA this way.With modern hardware we do, however, suggest to also use at least twice as many samples as in the original recommendation, i.e., s = 80 + 4k.By choosing a sample size in O( √ n), we can also obtain a O(n • i) time approximation to PAM, for example we may choose to use s = √ n + 4k.While we must assume that the worst case for i is, unfortunately, similar to k-means and hence in the order of i ∈ O(2 √ s ) for sample size s (c.f., Arthur and Vassilvitskii, 2006), this will give decent results in seemingly linear time for many practical purposes.
CLARANS (Ng andHan, 1994, 2002) uses a randomized search instead of considering all possible swaps.For this, it chooses a random pair of a non-medoid object and a medoid, computes whether this improves the current loss, and then eagerly performs this swap.Adapting the idea from FastPAM1 to the random exploration approach of CLARANS, we pick only the non-medoid object at random, but can consider all medoids at a similar run time to looking at a single medoid.This means we can either explore k times as many edges of the graph, or we can reduce the number of samples to draw by a factor of k.In our experiments we opted for the second choice (sampling 2.5% • (n − k) non-medoids, rather than 1.25% • k • (n − k) edges), to make the results scale similar to the original CLARANS.By varying the subsampling rate, the user can control the tradeoff between computation time and exploration.

Experiments
By eliminating the nested loop over the medoids, we must expect an O(k) speedup of FastPAM1 over the original PAM algorithm (in contrast to much work published in recent years, the speedup is not just empirical).Nevertheless constant factors and implementation details can make a big difference (Kriegel et al., 2017), and we want to ensure that we do not pay big overheads for theoretical gains that would only manifest for infinite data. 3Because of constant factors, it could for example be possible that we need a certain minimum k for this approach to be beneficial over the original PAM.For the additional benefit of eager swapping in FastEagerPAM we do not have theoretical guarantees for an additional speedup over FastPAM1; the resulting speedup is expected to be a much smaller factor due to the reduction in iterations, at the price of performing more swaps.In contrast to FastPAM1, these do not guarantee the identicial results as original PAM; therefore we also want to verify that they are of the expected equivalent quality.But because we observed the initialization time to become a bottleneck now, and we also propose to use different initialization techniques, we will first of all evaluate picking the initial medoids; then proceed with the experiments regarding the scalability in k and n as well as result quality.
As discussed before, all these algorithms belong to the family of local search optimization algorithms.As long as there is a reachable state that yields an improvement, this change is applied.The approaches considered in our experiments differ by (i) which local changes are considered, and (ii) whether only the best change is applied, or the first.Because of the local search, these algorithms can get stuck in local minima; where one would expect that (i) has much more effect on the result quality, whereas (ii) primarily affects performance.

Data Sets
In order to test the quality of the different initialization methods and approaches, we use a classic library of 40 k-median problems from operations research.These can be found in the OR-Library4 (Beasley, 1990), which is known for its traveling salesman problem sets.By todays standards these problems are fairly small (up to 900 instances), but for these problems the true optimum solution has been determined.Hence, we can compute the gap between the solution found by the algorithm and the true optimum solution on some well-known example problems.In Section 3.2 we discussed that a local minimum can be up to 5 times larger than the optimum, but we will show here that the difference usually is much smaller, making these approximations sufficient for many applications.In addition to the ORlib problems, we also consider three variants of problems from this data set first used by Senne and Lorena (2000): sl700, sl800, and sl900 are problems pmed34, pmed37, and pmed40 respectively, but with a larger k > 200 (making our improvements even more effective for these problems).The data sets gr100 and gr150 are graph data sets that originate from Galvão and ReVelle (1996), and have been used by Senne and Lorena (2000) for benchmarking k-medoids.There are true optimum results given for k = 5, 10, 15, 20, 25, 30, 40, 50 by Resende and Werneck (2004),5 which we use for evaluation.
To test scalability we need larger data sets than these classic problems.We will be using three data sets from the well-known UCI repsitory (Dua and Graff, 2019): First we will showcase results using the texture features from the "one-hundred plant species leaves" data set, which we chose because it has 100 classes, and 1600 instances, a fairly small size that regular PAM can still easily handle.Naively, one would expect that k = 100 is a good choice on this data set, but some leaf species are likely not distinguishable by unsupervised learning.The same experiment is repated on the "Optical Recognition of Handwritten Digits" data set with n = 5620 instances, d = 64 variables, and 10 natural classes and the well-known MNIST data set, which has 784 variables (each corresponding to a pixel in a 28×28 grid) and 60000 instances (PAM will not be able to handle this size in reasonable time anymore).
We used the ELKI open-source data mining toolkit (Schubert and Zimek, 2019) in Java to develop our version.In prior work (Schubert and Rousseeuw, 2019), we also reproduced the results with the R cluster package, which is based on the original PAM source code and written in C. We omit results of the R version for redundancy in this article.The experiments were processed using server Intel Xeon E5-2697v2 CPUs at 2.70GHz in a small cluster.There is an additional (but not significant) speedup possible by closer integration of the initialization with the algorithm in some cases, where the closest and second medoid are already computed in the initialization, and hence line 1 in SWAP (Algorithm 2, similar in the other algorithms) could be avoided for some initialization methods (but not for uniform random initialization, obviously).For the special case of k = 2, several additional optimizations are possible, as the second nearest medoid must always be the only other medoid; we also did not want to over-optimize for this case in our experiments, although it may arise in many practical applications.

Initialization
In Table 1 we compare the quality of different initializations, and how they affect the outcome of different algorithms.As data sets we use the ORlib problems and the SL variants thereof, along with the Galvão and ReVelle graph data sets because the optimum result is known for these problems.For our quality measure we use a normalization inspired by Captivo (1991), because the simpler approach TD/TD optimum can be trivially "improved" by adding a constant to all non-zero distances.A score of 0% is obtained by the optimum solution, while picking medoids uniformly at random yields an error close to 100%.We estimate TD random using 100 random sampled medoids.For methods that involve randomness, we report the average over 10 random restarts and permutations of the data set, and we additionally average over the 59 data sets.We always report the standard deviation over the different data sets; for methods with randomness we also report the average standard deviation over the restarts.Additionally, we report in how many of the 59 problems the optimum solution could be found with 10 restarts.Because some methods can swap multiple medoids in each iteration, we list both the number of iterations and the number of swaps performed.
If we first look at the initialization in isolation, GreedyG is the clear winner, closely followed by BUILD.This is to be expected, because both try to optimize the objective function TD directly, and GreedyG includes an additional refinement step.BUILD and GreedyG were able to find the optimum solution for 4 resp.6 of the problems.These two are, however, also the most costly.While they are deterministic methods, the small variance observed is due to ties when we shuffle the input data.LAB offers a good performance, at a substantial quality improvement over the others.k-means++ is already rather close to random in quality (this is expected, because it attempts to distribute the centers, not to identify the most central object of each partition).Random, by definition, must score close to 100% on this measure.The initialization of Park and Jun (2009) performs even worse than random, because all medoids are placed very close to each other in the center of the data set.It also shows a high standard deviation across data sets, and would often perform twice as bad as random medoids.Scores are normalized to the known optimum solution (0%) and the average of 100 random medoids (100%)."Mean" and "min" are the average and best result over all data sets."Optimal" denotes how many problems were solved optimally.σ rand is the average standard deviation over 10 random restarts each; σ data is the standard deviation of the mean over all data sets.Time and number iterations are averaged over all restarts and data sets.
If we now look at the result after running PAM, we make the unexpected observation that a bad initialization does not have much effect on the result quality of PAM; but it primarily affects run time.By swapping, bad initial medoids can easily be replaced with better alternatives.GreedyG still yields the best results on average, and despite being the slowest initialization, it yields the fastest runtime for PAM.This is due to reducing the number of swaps necessary for convergence significantly.The initialization of Park scores among the best in average quality (despite the bad starting points), but offers the worst runtime because of the high number of iterations.Surprisingly slow with PAM is k-means++ (also replicated on a second system), supposedly because it is more likely to pick far points; and indeed the runtime behavior closely matches that of a farthest-points initialization (not included in the table).So while even bad initialization still allows us to find good results, it has a major impact on runtime.Methods such as BUILD and GreedyG reduce the runtime of PAM significantly, whereas heuristics that picking too-far or too-central points can be worse than random.For PAM, the number of iterations is exactly the number of swaps performed plus one, which is the final iteration where no improving swap can be found anymore.Considering PAM with BUILD and GreedyG, both of which are deterministic methods, we nevertheless observe a standard deviation of 0.2% depending on the input order and duplicate distances.These 0.2% establish a baseline that we must consider to be random deviations when interpreting the entire table.
In k-means clustering experience has shown that multiple restarts can be beneficial; and the random-based initializations (uniform random, k-means++, and LAB because of the sampling) can find better or worse results.It is worth noting that with just 10 random restarts and keeping the result with lowest TD (which is our unsupervised optimization criterion) we can get significantly better results with these three approaches than the averages reported in the first column.As shown in the column titled "optimal", beginning with random centers and 10 restarts was able to find the optimum solution on 33 of the data sets, while the deterministic GreedyG initialization could only solve 25.Unfortunately, the runtime with PAM and random initialization is several times higher, and we may hence not be able to afford many restarts with PAM.
But there is no good reason to use the original PAM anymore -FastPAM1 will find the exact same results faster, and this experiment illustrates both that the results are the same, and that we see substantial speedups.The runtime includes initialization, there is some overhead included, and it is aggregated over data sets of different size, difficulty, and different k so we cannot expect the results to be exactly k times faster.But GreedyG now becomes the slowest method -the runtime for this combination is now dominated by the initialization cost.Because FastPAM1 spends 89% of the time in initialization with GreedyG; the overall speedup is only about 2.8×.With PAM BUILD, the speedup is about 6×, and with random initialization, where almost the entire time is spent in optimization, FastPAM1 was 67× faster.The average k of the problems in this experiment is about 51 (when weighted with n 2 , the average k increases to 83, as larger problems tend to have larger k).
As explained before, we expect sub-optimal swaps to not negatively affect quality; and while we can save the time searching for a better swap, we have to pay the price of performing maybe unnecessary swaps and updating our data structures.For this, we investigate eagerly performing swaps; either based on the original PAM al-gorithm (iterating over existing medoids, then non-medoids) denoted as EagerPAM, or based on the FastPAM variant (iterating over non-medoids; combined computation of all medoids).These two no longer find the exact same result, because EagerPAM iterates over medoids in the outer loop, while FasterPAM only iterates over candidates and then considers the best medoid for swapping only.Interestingly, while FasterPAM is able to choose better swaps (the best of k; and hence it needs fewer swaps and iterations), EagerPAM appears to find slightly better results than the others.But these differences are within the 0.2% margin of error that we attribute to different processing order, and on other data sets the FasterPAM approach also sometimes produces better results.On the ORLib data, FasterPAM ran about twice as fast as FastPAM1.
It is remarkable that the random initializations like LAB, k-means++, and even uniform random become attractive now, because by performing multiple swaps per iteration, bad starting conditions do not affect performance that much anymore.Much of the drawbacks of k-means++ initialization observed in our prior work (Schubert and Rousseeuw, 2019) have now disappeared with FasterPAM, where the average number of iterations is now similar to our proposed LAB initialization.The number of swaps performed to convergence still differs substantially between good and bad starting conditions, but this has much less impact on the number of iterations or the overall runtime now.In this experiment, we cannot identify a clear winner between LAB initialization (the slowest initialization, requiring the fewest iterations), k-means++, and uniform random initialization; this is laregly because the performance of the swapping procedure has become so good that the differences disappear in the measurement uncertainty.Because of this, and because the performance seems to be largely independent of the starting conditions, it seems adequate to prefer the simplest initialization: uniform random sampling; and rather use multiple restarts than computing better starting positions.But we will study starting conditions again for example in Table 2.
Last we want to look at the "Alternating" algorithm.This is also a fast approach, but with a hefty quality decrease.A good initialization with GreedyG and BUILD becomes very important.With random initialization, the performance would usually be only about half-way between the optimum and a completely random result; and with the problematic initialization of Park and Jun (2009) with 69.7% much closer to random than to the optimum.Even random initializations with 10 restarts do not help much here, with the average best result still being 39.7% worse than the optimum (compared to values of about 0.4% with 10 restarts of swap-based approaches).Hence we advise against using the k-means style "Alternating" approach for k-medoids because of result quality (contrary to the claims of Park and Jun 2009, who only used very simple data sets; but in line with earlier observations by, e.g., Teitz and Bart 1968, Rosing et al. 1979, and Reynolds et al. 2006).

Run Time Speedup with Increasing k
Next we want to explore the speedup as we increase k on a data set.While the theoretical speedup of both FastPAM and FasterPAM is O(k), this is only beneficial if it is measureable for realistic values of k, not just asymptotically.We use the "onehundred plant species leaves" data set here, because given one hundred species in this data set, k = 100 should be a reasonable value.In Figure 2, we vary k from 2 to 200, and plot the run time of the PAM SWAP phase only (the cost of computing the distance matrix and the BUILD phase is not included).We compare the original PAM, FastPAM1, EagerPAM, and FasterPAM variants first.
Figure 2a shows the run time in linear space, to visualize the drastic run time differences observed.Because the faster methods cannot be distinguished here, we use log-log-space in Figure 2b.Compared to the other methods, FasterPAM runtime only increases very little with k, making this method particular attractive for large k: for k = 200, the FasterPAM swap run time was 169 milliseconds, while FastPAM1 took 2.4 seconds (both using random initialization) and PAM took 151 seconds (using, but not including, the slower BUILD initialization).
In Figure 2c we plot the speedup over PAM.The FastPAM1 improvement gives an empirical speedup factor of about 0.75•k on this particular data set, while the additional improvements contributed an additional speedup of about 2-7× by reducing the number of iterations.In the most extreme case tested, a speedup of about 1190× at k = 200 for the swap proceduce (using BUILD initialization, 898× with with faster but worse random initialization instead) is measured -but because the speedup is expected to depend on O(k), the exact values are meaningless, furthermore, we excluded the distance matrix computation and initialization in this experiment.
In Figure 2d, we experimentally test the scalability in k on this data set.We normalize the runtime by the parameter k, such that a runtime that is linear in k should approximately yield a horizontal line.While the runtime of PAM SWAP per iteration is O(k(n−k) 2 ), and hence one would assume a sublinear complexity, the number of iterations increases with k, because PAM SWAP only performs a single swap per iteration.EagerPAM still exhibits a runtime that increases faster than linear in k (because the runtime of SWAP is still linear in k, and the number of iterations increases slowly with k) while FastPAM1 and FasterPAM empirically have sub-linear runtime in k because they eliminate the factor of k within the swap iterations.We will study the dependency of the number of iterations to the parameter k in Section 4.4.
If we consider the complete run time and not just the optimization, the result does not change fundamentally: in Figure 3 we include the distance matrix computation and initialization time.We only present the log-log space plots, because of the extreme differences.Because for FasterPAM a substantial amount of time is spent computing the distance matrix, the overall run time appears to be "almost constant".In the speedup factor, FastPAM1 and FasterPAM can now benefit from using random initialization rather than BUILD (for readability of the figures, we do not include combinations such as FasterPAM with BUILD where runtime is dominated by BUILD).
In Table 2 we study different algorithms in combination with different initialization methods.Similar to the experiments on the ORLib problems, GreedyG initialization by itself already yields good results, but is fairly expensive.In combination with PAM, GreedyG outperforms BUILD initialization.With each algorithm, the deterministic initialization with GreedyG yields the best result on average compared to random initialization.Also similar to the ORLib results, the Alternating algorithm does not work very well, and does not improve much over good starting conditions.The approach of Park and Jun (2009) works worst by itself, but the swapping algorithms can still reach good results (usually at a slow runtime, though).The proposed combination with the Alternating algorithm yields results barely better than random initialization.Eager-  PAM, FastPAM1, and FasterPAM yield substantial speedups over the original PAM.FasterPAM is fastest, as it combines both the benefits of EagerPAM and FastPAM1.FastPAM1 yields the exact same results as PAM, but the eager versions yield slightly worse results in this scenario.However, this quality difference is not significant; on the ORLib data sets, the eager variants were slightly better, and we will evaluate this in more detail in Section 4.5.It is worth noting, however, that the results with random initialization can be improved by repeating the procedure multiple times.The overall best solution in this experiment is found with PAM and FastPAM1 using random initialiaztion: the normalized loss is about 0.0639%, while GreedyG initialization only found a solution of 0.0973%.An even slightly better solution was found with the 1000 restarts we used for normalization.Running a fast approach (such as FasterPAM with random initialization, at 337 ms per restart; plus the time to compute the distance matrix once of 115 ms) multiple times will usually find a better solution faster than the deterministic solution with the best average quality at 4232 ms (FastPAM1 with GreedyG initialization) on this data set.While k-means++ initialization has a slight runtime advantage in this experiment, this difference is not substantial; one may as well use uniform random initialization instead.While LAB initialization produced better starting conditions, this would neither result in a measureable runtime or quality advantage.Because of this, we do not present results on LAB initialization in much detail -it does not exhibit a substantial advantage over the trivial random initialization.

Number of Iterations
We are not aware of theoretical results on the number of iterations needed for PAM.The worst case may be superpolynomial like k-means (Arthur and Vassilvitskii, 2006), albeit in practice a "few" iterations are usually enough.Because of this, we are also interested in studying the number of iterations depending on the choice of k and the initialization method.
For the smaller ORLib data sets, we already presented results in Table 1 in Section 4.2. Figure 4 shows the number of iterations needed and the number of swaps performed with different methods.In line with previous empirical results (e.g., Whitaker 1983), only "few" iterations are necessary.Because PAM only performs the best swap in each iteration, a linear dependency on k is to be assumed; interestingly enough we usually observed fewer than k iterations with BUILD initialization, so many medoids remain unchanged from their initial values (note that this may be due to the small data set size).GreedyG initialization manages to reduce the number of iterations and swaps almost by half for large k compared to BUILD.With random initialization, k-means++ and also LAB, the number of iterations becomes approximately k, indicating that for almost every cluster, a better medoid has to be chosen.Because the k-means++ initial-  ization requires roughly 2-4× as many iterations for PAM; with the original algorithm where each iteration would cost about as much as the BUILD initialization, this choice (although suggested by Lijffijt et al., 2015) is detrimental even for small k (c.f.Table 2, where PAM with k-means++ initialization took approximately 2.45× the run time of PAM with BUILD).With the improvements of this paper, these additional iterations are cheaper than the rather slow BUILD initialization by a factor of O(k) now, hence we can now begin with a worse but cheaper starting point.Furthermore, because Ea-gerPAM and FasterPAM perform multiple swaps per pass over the data set, these two drastically reduce the number of iterations.On this data set, the maximum number of iterations observed with EagerPAM was 13 (the worst average was 9.9), and with FasterPAM it was 7 for a single run resp.6 on average (because for large k, FasterPAM performs the best of up to k swaps).
We do not include the Alternating approach in these figures, because while it usually uses the fewest iterations, it also produces substantially worse results, as seen in Table 2 and on the ORLib data.Instead we will compare it to subsample-based algorithms such as CLARA next.

Quality
Any algorithmic change and optimization comes at the risk of breaking some things, or negatively affecting numerics (see, e.g., Schubert and Gertz 2018 on how common numerical issues are, even with basic statistics such as variance in SQL databases).In order to check for such issues, we made sure that our implementations pass the same unit tests as the other algorithms in both ELKI and R. We do not expect numerical problems, and PAM and FastPAM1 are supposed to give the exact same result (and do so in the experiments, so we exclude FastPAM1 from the following plots).EagerPAM and FasterPAM perform the first swap they find to improve the results, and may therefore converge to a different solution.Beginning with uniform random initialization may also yield different results.We expect that all of these are of the same quality (because all are local optima, and neither performs a global optimization), which we will verify experimentally.We use the normalized loss (Equation 12), and to improve readability of the plots we study the PAM variants separately from approximate methods such as CLARA and CLARANS.
In Figure 5a we can see that PAM and its variants (including PAM with random initialization) find results of similar quality.Deterministic initialization with BUILD or GreedyG is usually better than the worst solution found with random initialization.For some k such as 4 and 20, BUILD worked very well, and for k such as 8 and 90 GreedyG managed to find very good starting conditions.But for other values of k such as 10, 30, and 40 the random and eager variants were better; GreedyG was the worst choice for k = 40, demonstrating that it is good to try different starting conditions.Judging from this single experiment, it may be worth considering BUILD and GreedyG for small k < 10.The worst of these results is still substantially better than the average performance of any of the subsample-based methods, as seen in Figure 5b (note the different scale on the y axis).While the Alternating algorithm considers the entire data set, it barely manages to outperform sampling based CLARA and CLARANS in quality.FastCLARA and FasterCLARA yield better results than CLARA because we doubled the sampling rate; otherwise they would be similar (we omit the line from the plot for readability).The decreasing loss of CLARA variants to the right of the plot is because the sampling size, 40 + 2k resp.80 + 4k, approaches 25% respectively 50% of the data set size, at which point these methods become as expensive as using the entire data set.FastCLARANS performs best in this experiment: compared to CLARANS it evaluates k times as many possible swaps at a similar run time cost, but as we will see later it is usually better to rather use FasterPAM instead.As we can see in Figure 5c, at around k = 10, FasterPAM becomes faster than FastCLARANS, and at around k = 100 it outperforms also the CLARA variants.The main benefit of CLARA is the reduced memory requirement for very large n, because they do not use a full distance matrix, but instead only use one for each sample.On a small data set such as this, they are not competitive.

Optical Digits Dataset
We also repeated our experiments on the slightly larger "Optical Recognition of Handwritten Digits" data set from UCI (Dua and Graff, 2019) with n = 5620 instances, d = 64 variables, and 10 natural classes.Example results for this data are shown in Table 3 and Figure 6.This time, we observe better quality for the "eager" methods (finding the best solution independently of the initialization), supporting the theory that both approaches are of the same quality, and only differ in the local optima found due to processing order.FasterPAM offers the best run time (over 10 times faster at k = 10, over 200 times faster total at k = 100 as seen in Figure 6 even when including the time needed to compute the distance matrix), and there is little benefit of using k-means++ or LAB over random initialization.The Alternating approach again produces significantly worse results than the swap-based heuristics, and does not outperform the GreedyG initialization.Clearly, the benefits on this data set are similar.

Scalability Experiments
Just as PAM, our method also uses a precomputed distance matrix; the high number of distance computations necessary makes any different use prohibitive.This will require O(n 2 ) time and memory, making the method as-is unsuitable for big data.Our contributions in this paper focus on reducing the dependency on k, while we claim quadratic runtime in O(n 2 ) per iteration, and must assume the number of iterations to potentially grow with n.For larger data sets, the use of sampling-based methods such as FasterCLARA and FastCLARANS is possible, and many scalable and distributed variations of PAM from literature can be trivially adapted to use FasterPAM instead of In this experiment, we use the well-known MNIST data set from the UCI repository (Dua and Graff, 2019), which has 784 variables (each corresponding to a pixel in a 28 × 28 grid) and 60000 instances.We used the first n = 5000, 10000, . . ., 35000 instances and compare k = 10 and k = 100.The high number of variables makes this data set expensive when that distances are recomputed on the fly rather than using a distance matrix, as done with CLARANS, as we will see next.
The quadratic runtime growth is easily seen in the linear scale plots Figure 7a and 7b.As a reference, we give the time needed for computing the distance matrix as dotted line, which is also quadratic.To make the plots more interpretable, we normalize the runtime by the expected scaling factor of n 2 in Figure 8 and plot this in log-log space.We do not include CLARA in this plot, because it subsamples the data set to a size independent of n.In these plots, we can clearly see that FasterPAM is only slightly more expensive than computing the distance matrix, and improves substantially over FastPAM1 and PAM.CLARANS is slow in this experiment, because it does not use a distance matrix and has to (re-)compute many distances in 784 dimensions.On lowdimensional data, it would perform better.The authors assumed that distances are cheap to compute, and noted that it may be necessary to cache the distances in one way or another.For k = 10, FastCLARANS is able to outperform the distance ma- trix computation (and hence all PAM variants), and hence it clearly does not need to compute all pairwise distances.For k = 100, it is outperformed by the new FasterPAM approach.Hence, for expensive distance functions (on high-dimensional data, but also, e.g., Dynamic Time Warping for time series data) and large k, we recommend using FasterPAM as long as it is possible to keep a distance matrix in memory.While the scalability in n is quadratic, as expected, the most notable result is the following: we observe that if you can afford to compute and store the pairwise distance matrix, then you will now also be able to run FasterPAM.For k = 100 and n = 35000, the average run time of FasterPAM was 743 seconds, of which about 655 seconds or 88% were used for computing the distance matrix.The original PAM algorithm, on the other hand, took 21 hours, over 100 times longer; excluding the matrix computation time, the speedup factor is even 879×.The main scalability problem is the memory consumption and computation of the distance matrix, not the clustering anymore.
If computing the distance matrix is prohibitive, it may still be possible to use FasterCLARA (CLARA with our improved FasterPAM on the individual samples), which will scale linearly in n (in the final assignment step).But as seen in Figure 9a and Figure 9b, CLARA will usually give worse results.Even when doubling the sampling size as suggested before, the results of CLARA (as well as the Alternating algorithm) are about 30% worse than the best results found in this benchmark with k = 10.Wwith k = 100 the quality of CLARA is barely better than using random medoids (because of the low sampling rate).The quality obtainable with CLARANS and FastCLARANS is much better.FastCLARANS outperforms CLARANS with respect to quality because it considers a k times larger search space at a similar run time cost.FastPAM1 and FasterPAM both yield results of the same quality as the original PAM.But on the other hand, FastCLARANS is only advisable for inexpensive distance functions such as (low-dimensional) Euclidean distance, and requires a non-trivial distance cache otherwise.On complex data, we have seen that FasterPAM can be faster and give much better results, hence FasterPAM should be the preferred method for most users.

Outlook
In this article we considered only the classic k-medoids clustering scenario, but we have noted the close relationship to facility location problems.Substantial research effort has been put into these problems (c.f.Reese 2006), but we nevertheless hope that this research is also valuable to operations research.The implementation used for the experiments is available as open-source, and has already been prepared for the bichromatic case where we have separate locations for consumers and possible suppliers, and have to find the optimal facility placement.We use a simple bichromatic adaptation of the medoid definition (Equation 3) for possible locations L and consumers C: medoid(L, C) := arg min x l ∈L x c ∈C d(x c , x l ) .
Other constraints popular in facility location, such as capacity constraints and facility opening costs, are less obvious to integrate.On the other hand, k-medoids clustering can likely benefit from some of this research for clustering data without having to choose the parameter k beforehand.
In the future, we also plan on working towards optimization for sparse instances, where not all consumers can be serviced from all possible supplier locations.Such problems arise naturally when planning or simulating for example power networks (Kays et al., 2017), where cluster centers correspond to power substations, consumers to households, and possible connections are restricted to follow the road network.Because of physical limitations such as voltage drop over distance, not every consumer can be economically serviced from every facility location, and the procedure can be further optimized using sparse data structures.Given a sparse graph with e edges, n consumers, m possible facility locations, we assume that the algorithm presented here can be implemented in O(e + n + m) time for each iteration, which is beneficial if e n • m.However, this poses the additional challenge of finding an admissible starting solution (considering missing edges as infinite loss), as the problem may be unsolvable for a small k.Hence, a solver for sparse problems will likely need to also dynamically adapt k.

Conclusions
In this article we proposed a modification of the popular PAM algorithm that yields a provable O(k) fold speedup, by clever caching of partial results in order to avoid recomputation.By eagerly executing the first improvement found, we also reduced the number of iterations substantially.We provide theoretical arguments and present experimental evidence that this does not cause a loss in quality.The major speedups obtained with this approach enable the use of this classic clustering method on much larger problems, in particular with large k.
Compared to our earlier work (Schubert and Rousseeuw, 2019), the speedup in now provable, and the eager execution approach proposed here is both simpler and more effective than the earlier FastPAM2 method described there.Also the initialization is much simpler now.We also given an example why k-means-like strategy yields much worse results, and should not be used.
This caching was discovered by changing the nesting order of the loops in the algorithm, showing once more how much seemingly minor looking implementation details can matter (Kriegel et al., 2017) and can lead to major improvements.It is hard to devise such things on the drawing board -such solutions more naturally arise when trying to low-level optimize the code, such as when and when not to allocate memory for buffers, and trying to avoid recomputing the same values repeatedly.Today's compilers are reasonably good at performing local optimization (at least when it does not affect numerical precision, Schubert and Gertz, 2018), but will not introduce an additional array to cache such values, nor split the sums automatically.With the faster refinement procedure of FasterPAM, it becomes possible to use cheaper initialization methods.In contrast to our earlier work, the new experiments suggest that combining FasterPAM with a simple uniform random initialization is often fastest and attains high quality.We could not observe systematic improvements with either LAB or k-means++, but the latter yield similar performance and still remain useful.
Methods based on PAM, such as CLARA, CLARANS, and the many parallel and distributed variants of these algorithms for big data, all benefit from this improvement, as they either use PAM as a subroutine (CLARA), or employ a similar swapping method (CLARANS) that can be modified accordingly as seen in Section 3.4.
The proposed methods included in the open-source framework ELKI (Schubert and Zimek, 2019) will be updated to this version, and so will the version included in the R cluster package via the "pamonce" option, to make it easy for others to benefit from these improvements.With the availability in two major clustering tools, we hope that many users will find using PAM, CLARA, and CLARANS, and later derived methods, possible on much larger data sets with higher k than before.
Figure 2: Run time of PAM SWAP (SWAP only, without DAISY, without BUILD)

Figure 3 :
Figure 3: Run time comparison of different variations and derived algorithms.
Number of swaps performed until convergence

Figure 4 :
Figure 4: Number of iterations and swaps Run time of approximative methods

Figure 6 :
Figure 6: Speedup on Optical Digits data

Figure 7 :
Figure 7: Run time on MNIST data

Figure 8 :
Figure 8: Results normalized by n 2 on MNIST data with k = 10 (top) and k = 100 (bottom) Quality compared to the best solution found 9: Results on MNIST data with k = 10 (top) and k = 100 (bottom)

Table 1 :
Comparison of different initialization procedures on the extended ORlib data sets.

Table 2 :
Runtime and relative loss on 100plants with k = 100 (number of classes in this data set).Relative time is with respect to the classic PAM algorithm with BUILD initialization, while loss is normalized such that 0% corresponds to the best solution found with 100 restarts and 100% to the average loss of random initialization.The minimum is the best solution of 10 restarts.

Table 3 :
Runtime for optdigits data with k = 10