Computers and Operations Research

A matheuristic for large-scale capacitated clustering

Clustering addresses the problem of assigning similar objects to groups. Since the size of the clusters is often constrained in practical clustering applications, various capacitated clustering problems have received increasing attention. We consider here the capacitated p-median problem (CPMP) in which p objects are selected as cluster centers (medians) such that the total distance from these medians to their assigned objects is minimized. Each object is associated with a weight, and the total weight in each cluster must not exceed a given capacity. Numerous exact and heuristic solution approaches have been proposed for the CPMP. The state-of-the-art approach performs well for instances with up to 5,000 objects but becomes computationally expensive for instances with a much larger number of objects. We propose a matheuristic with new problem decomposition strategies that can deal with instances comprising up to 500,000 objects. In a computational experiment, the proposed matheuristic consistently outperformed the state-of-the-art approach on medium- and large-scale instances while having similar performance for small-scale instances. As an extension, we show that our matheuristic can be applied to related capacitated clustering problems, such as the capacitated centered clustering problem (CCCP). For several test instances of the CCCP, our matheuristic found new best-known solutions.


Introduction
Clustering is the task of assigning similar objects to groups (clusters), where the similarity between a pair of objects is determined by a distance measure based on features of the objects. Since clustering is used in many different domains for a broad range of applications, numerous different clustering problems have been discussed in the literature. The widely studied p-median problem is an example of such a clustering problem. This problem consists of selecting a given number of objects as cluster centers (medians) such that the total distance between the objects and their nearest median is minimized. The p-median problem has been studied mainly in the context of facility location but also in other contexts, such as large-scale data mining (e.g., Avella et al. 2012). In practical clustering applications, the size of the clusters is often constrained. For example, when grouping customers to form sales force territories, the workload of an individual salesperson must be restricted to guarantee adequate service quality (Mulvey and Beck, 1984). This gives rise to an extension of the p-median problem, namely, the capacitated p-median problem (CPMP).
The CPMP can be stated as follows (e.g., Lorena and Senne 2004). Given a set of weighted objects that are described by features, the goal is to form a prescribed number of clusters by selecting objects as medians and by assigning the objects to these medians such that the total distance (e.g., the Euclidean distance) between the objects and their assigned medians is minimized, while the total weight in each cluster must not exceed a given capacity. The state-of-the-art approach of Stefanello et al. (2015) constructs an initial solution with a randomized greedy procedure and improves this initial solution by first solving a mathematical model for the entire problem and then iteratively for subproblems. Reduction heuristics are applied to eliminate variables from the models. In a comprehensive computational experiment based on instances with up to 5,000 objects, Stefanello et al. (2015) demonstrated the superior performance of their approach in comparison to recent benchmark approaches from the literature. For instances that comprise many more than 5,000 objects, however, the approach of Stefanello et al. (2015) becomes computationally expensive for three reasons. First, the randomized procedure requires many iterations to construct a good initial solution, especially when the capacity limit is tight. Second, solving the mathematical model for the entire problem becomes intractable for large-scale instances, despite the reduction heuristics. Third, the subproblem selection procedure does not prioritize subproblems with great potential for improving the objective function value. Another challenge that was not specifically discussed by Stefanello et al. (2015) is that the number of distances between objects and potential medians grows quadratically with an increasing number of objects. For instances with many more than 5,000 objects, the computation of all these distances becomes prohibitively time consuming and exceeds the available memory of most standard workstations.
In this paper, we propose a matheuristic with new problem decomposition strategies that are specifically designed for large-scale instances. These strategies (a) focus on subproblems with the potential for substantially improving the objective function value, (b) exploit the power of binary linear programming to ensure feasibility with respect to the capacity constraints during the entire solution process, and (c) apply efficient data structures (k-d trees; Bentley 1975) to avoid computing a large number of pairwise distances. The proposed matheuristic comprises two phases: a global optimization phase in which the subproblems involve all objects and a local optimization phase in which the subproblems involve only a subset of the objects. In the global optimization phase, we decompose the CPMP into a series of generalized assignment problems, which are formulated as binary linear programs and solved using a mathematical programming solver. In each of these subproblems, objects are optimally assigned to fixed medians subject to the capacity constraints. The fixed medians are updated between the solution of two consecutive subproblems. By fixing the medians and allowing objects to be assigned only to one of their k-nearest fixed medians, the number of required distance computations is reduced from n(n − 1)/2 to kn per subproblem, where the parameter k can be controlled by the user. To efficiently identify the k-nearest fixed medians of each object and to compute the corresponding distances, we use k-d trees. In the local optimization phase, we decompose the entire problem into subproblems that comprise groups of clusters only. A binary linear programming formulation of the CPMP is then solved for these groups of clusters individually using a mathematical programming solver. The proposed subproblem selection procedure focuses on groups of clusters with spare capacity and thus prioritizes subproblems with the potential for substantially improving the objective function value.
We also use k-d trees in the local optimization phase to considerably reduce the number of required distance computations.
In a computational experiment, we compare the performance of the proposed matheuristic to the performance of the state-of-the-art approach proposed by Stefanello et al. (2015). Furthermore, we provide the results of an exact approach based on the binary linear program presented by Lorena and Senne (2004) and a mathematical programming solver. We apply all three approaches to a set of standard test instances from the literature, including the largest existing instances. In comparison to instances for the uncapacitated p-median problem, these largest existing instances for the CPMP are considered small-scale (e.g., Avella et al. 2012). To assess the performance of the three approaches on instances that are comparable in size to large instances of the uncapacitated p-median problem, we additionally generate medium-scale instances with up to approximately 50,000 objects and large-scale instances with up to approximately 500,000 objects. It turns out that, for small-scale instances, the proposed matheuristic matches the performance of the state-of-the-art approach, and for medium- and large-scale instances, the proposed matheuristic consistently delivers superior results. For the largest instances, only the proposed matheuristic identifies feasible solutions given the available computational resources. Furthermore, we generated high-dimensional instances with up to approximately 800 features. The proposed matheuristic also performs best among the tested approaches for these high-dimensional instances. Note that the implementation of the proposed matheuristic and the generated instances are publicly available.

Table 1
Notation used for the binary linear program (M-CPMP).

Parameters and sets
  n      Number of objects
  I      Set of objects (I = {1, …, n})
  p      Number of clusters
  f_i    Feature vector of object i ∈ I
  q_i    Weight of object i ∈ I
  Q      Capacity limit
Decision variables
  x_ij   = 1, if object i is assigned to median j; = 0, otherwise (i, j ∈ I)
As an extension, we show that the proposed matheuristic can easily be applied to other capacitated clustering problems, such as the capacitated centered clustering problem (CCCP) (Negreiros and Palhano, 2006). In the CCCP, the cluster centers are computed as the geometric mean of the assigned objects and are not selected among the set of objects. For the largest problem instances of the CCCP tested in this paper, we were able to find new best-known solutions.
The remainder of this paper is organized as follows. In Section 2, we describe the CPMP in more detail. In Section 3, we review the related literature. In Section 4, we describe the proposed matheuristic. In Section 5, we report the computational results. In Section 6, we apply the proposed matheuristic to the CCCP, and in Section 7, we provide some conclusions and give an outlook on future research.

Capacitated p-median problem
In this section, we describe the CPMP in more detail (Section 2.1) and provide a small illustrative example (Section 2.2).

Description of the problem
The CPMP can be stated as follows (e.g., Lorena and Senne 2004). Given is a set of n objects denoted as I = {1, …, n}. Each object i ∈ I corresponds to an m-dimensional feature vector f_i ∈ R^m and is associated with a weight q_i. Based on these feature vectors, a distance d(i, j) can be computed for each pair of objects i, j ∈ I (e.g., the Euclidean distance). Note that the distances do not constitute an input to the problem, as they must be calculated based on the feature vectors. The goal is to partition the objects into a prescribed number p of clusters by selecting p objects as cluster centers (medians) and by assigning the objects to these medians such that the total distance between the medians and their assigned objects is minimized. In doing so, the total weight in each cluster must not exceed a given capacity limit Q.
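To make the problem statement concrete, the following sketch solves a tiny CPMP instance by exhaustive enumeration. This is an illustration only, with hypothetical data and function names; enumeration is hopeless beyond a handful of objects, which is precisely why the solution approaches discussed in this paper exist.

```python
from itertools import combinations, product

def solve_cpmp(points, weights, p, Q):
    """Exhaustively solve a tiny CPMP instance: try every choice of p medians
    and every capacity-feasible assignment, and return the best total distance
    together with the corresponding assignment vector."""
    n = len(points)
    dist = lambda a, b: sum((u - v) ** 2 for u, v in zip(points[a], points[b])) ** 0.5
    best = (float('inf'), None)
    for medians in combinations(range(n), p):
        # enumerate every assignment of the n objects to the chosen medians
        for assign in product(medians, repeat=n):
            if any(assign[m] != m for m in medians):
                continue  # a selected median must be assigned to itself
            load = {m: 0 for m in medians}
            for i, m in enumerate(assign):
                load[m] += weights[i]
            if any(load[m] > Q for m in medians):
                continue  # capacity limit violated in some cluster
            total = sum(dist(i, assign[i]) for i in range(n))
            if total < best[0]:
                best = (total, assign)
    return best
```

On a toy instance with two well-separated columns of three unit-weight points each, p = 2 and Q = 3, the optimum selects the middle point of each column as the median.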
The CPMP can be formulated as a binary linear program (Lorena and Senne, 2004); the notation used is summarized in Table 1. Note that an object j ∈ I is selected as a median if it is assigned to itself, i.e., x_jj = 1.

Table 2
Coordinates and numbers of employees of the stores in the illustrative example.
The CPMP has various real-world applications that have been discussed in the literature. Many of these real-world applications arise in facility location (e.g., Medaglia et al. 2009, Jánošíková et al. 2017). Other exemplary applications are the consolidation of customer orders into truckload shipments (Koskosidis and Powell, 1992) and the structuring of multiprotocol-label switching networks (El-Alfy, 2007). For a broad overview of real-world applications of the CPMP, we refer to Ahmadi and Osman (2005).

Illustrative example
We present a small example to illustrate the description of the CPMP provided above. Furthermore, we use this example to illustrate the proposed matheuristic in Section 4.4.
We consider a coffeehouse chain that wants to group its stores into a given number of clusters such that stores in the same cluster are close to each other. A manager is then put in charge of each resulting cluster. The selected median of a cluster represents the store at which the office of the assigned manager should be located. To ensure that the stores within a given cluster can be managed adequately, capacity constraints are required that limit the total number of employees within a cluster.
The coffeehouse chain has n = 15 stores that must be grouped into p = 4 clusters. The coordinates (feature vectors) and the numbers of employees (weights) of the stores are given in Table 2. The total number of employees within a cluster is limited to Q = 8. The pairwise distances between the stores are calculated as Euclidean distances. In Fig. 1, an optimal solution for the illustrative example is depicted; the objective function value (OFV) of the depicted solution is provided in the bottom-right corner. The size of a point in the figure represents the number of employees of the corresponding store. Stores that are selected as medians are indicated with a red circle, and the assignments of the stores to the medians are indicated with green lines.

Literature review
Capacitated clustering has received a lot of attention recently. In this section, we focus only on solution approaches for the CPMP. Other closely related capacitated clustering problems differ from the CPMP with respect to the objective function (e.g., Brimberg et al. 2019, Zhou et al. 2019, Puerto et al. 2020), the constraints (e.g., Espejo et al. 2021), or both (e.g., Ríos-Mercado et al. 2021).
We categorize and discuss the papers dealing with the CPMP according to the types of proposed solution approaches. In Sections 3.1 to 3.4, we review exact approaches, classic heuristics, metaheuristics, and matheuristics. Table 3 gives an overview of the discussed approaches and lists the number of objects of the largest instance that was used to test the corresponding approach. Note that given the considerable improvement in available software and hardware, the approaches might be applicable to larger instances than those listed in the last column of Table 3.

Exact approaches
Almost all papers listed in Table 3 provide a formulation of the CPMP as a binary linear program. These formulations can be used to solve small-scale instances to optimality by applying a mathematical programming solver such as Gurobi or CPLEX. In addition, a few problem-specific exact approaches have been proposed. Pirkul (1987) proposed a branch-and-bound algorithm for the capacitated concentrator location problem that can be adapted to the CPMP. Baldacci et al. (2002) presented an exact approach based on a set partitioning formulation of the CPMP, and Ceselli and Righini (2005) proposed a branch-and-price algorithm with different branching strategies and pricing methods. Finally, Boccia et al. (2008) developed a cutting plane algorithm based on Fenchel cuts.

Classic heuristics
The category of classic heuristics comprises problem-specific heuristics that are not based on metaheuristic concepts. Mulvey and Beck (1984) proposed a classic heuristic based on alternately applying an object-assignment step and a median-update step. The objects are assigned in a greedy manner to their nearest median that has sufficient unused capacity. The order in which the objects are assigned is determined based on a regret value that is computed for each object. The regret value of an object is defined as the difference between the distance to its second nearest fixed median and the distance to its first nearest fixed median. Furthermore, an improvement heuristic based on local switches of objects between clusters was proposed. The approach of Mulvey and Beck (1984) was extended in Koskosidis and Powell (1992) by new initialization methods for the initial set of fixed medians and a new definition of the regret value. Lorena and Senne (2003) presented a local search heuristic based on Lagrangian/surrogate relaxation techniques introduced by Senne and Lorena (2000) for the uncapacitated p-median problem. The best upper bounds obtained by the local search heuristic of Lorena and Senne (2003) were compared in Lorena and Senne (2004) with lower bounds devised by a column generation approach based on a set partitioning formulation of the CPMP. Finally, Mai et al. (2018) proposed a construction and an improvement heuristic for the CPMP. The construction heuristic uses a Gaussian mixture modeling approach that incorporates the capacity constraints. The improvement heuristic shifts or swaps objects between different clusters.
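The regret-based greedy assignment described above can be sketched as follows. This is a minimal illustration with hypothetical names that covers only the object-assignment step of Mulvey and Beck (1984), not the median-update step.

```python
def regret_greedy_assign(points, weights, medians, Q):
    """Assign each non-median object to its nearest median with spare capacity,
    processing objects in decreasing order of regret (distance to the
    second-nearest median minus distance to the nearest one)."""
    dist = lambda a, b: sum((u - v) ** 2 for u, v in zip(points[a], points[b])) ** 0.5
    assignment = {m: m for m in medians}   # each median accommodates itself
    load = {m: weights[m] for m in medians}
    regrets = []
    for i in range(len(points)):
        if i in assignment:
            continue                       # skip the medians themselves
        d = sorted(dist(i, m) for m in medians)
        regrets.append((d[1] - d[0], i))   # regret value of object i
    for _, i in sorted(regrets, reverse=True):
        for _, m in sorted((dist(i, m), m) for m in medians):
            if load[m] + weights[i] <= Q:  # nearest median with spare capacity
                assignment[i] = m
                load[m] += weights[i]
                break
        else:
            raise ValueError("no feasible greedy assignment found")
    return assignment
```

As noted below, for large instances with tight capacities such greedy assignment may need many attempts before a feasible solution is found, which motivates the binary linear programming assignment used in the proposed matheuristic.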
These classic heuristics are based on the idea of performing many iterations, where each iteration slightly improves the solution quality. For instances that comprise up to approximately 5,000 objects (Table 3), good-quality solutions can be devised since each iteration can be performed extremely fast. For instances that comprise much more than 5,000 objects, however, each iteration becomes expensive in terms of the required running time. Moreover, for large-scale instances with tight capacities, the classic heuristics that are based on a greedy assignment strategy often need many time-consuming attempts to even generate a first feasible solution. Because of these limitations, individual objects are often aggregated to reduce the problem size. Lorena and Senne (2003), for example, reduced the problem size by aggregating houses (apartments) to blocks. This aggregation, however, leads to a loss of information and aggregation errors in the solutions (e.g., Erkut and Bozkaya 1999).

Metaheuristics
In addition to classic heuristics, many metaheuristics have been proposed, such as the bionomic algorithm presented by Maniezzo et al. (1998), the problem-space search algorithm developed by Ahmadi and Osman (2004), the scatter search heuristics proposed by Scheuerer and Wendolsky (2006), the guided construction search heuristics introduced by Osman and Ahmadi (2007), and the grouping evolutionary algorithms developed by Landa-Torres et al. (2012). In addition, various approaches have been proposed that combine multiple metaheuristic concepts. Osman and Christofides (1994) combined the concepts of simulated annealing and tabu search, Ahmadi and Osman (2005) merged a greedy random adaptive search procedure and adaptive memory programming, Díaz and Fernandez (2006) proposed an approach that combines scatter search and path relinking, and Chaves et al. (2007) linked the concepts of clustering search and simulated annealing.
Like the classic heuristics, these metaheuristics also require many iterations to substantially improve the solution quality, which becomes costly in terms of the required running time for large-scale instances. In addition, they either apply manual checks while generating new solutions to guarantee that the capacity constraints are satisfied, or they apply repair operators to fix newly generated infeasible solutions. Both tasks are time consuming as well for large-scale instances.

Matheuristics
Recently, matheuristics have received increasing attention. Matheuristics, in general, are a powerful tool because they combine heuristic approaches with the continuously improved performance of mathematical programming solvers (e.g., Carrizosa et al. 2018, Gnägi and Strub 2020). Fleszar and Hindi (2008) presented a variable neighborhood search matheuristic. Neighbors are found by randomly switching some selected medians of the current best solution to objects that are currently not selected as medians. To quickly assess the quality of the neighbors, approximation methods are used, such as assigning the objects to their nearest selected median without considering the capacity constraints. For the most promising neighbors, feasible solutions to the CPMP are devised by solving a generalized assignment problem formulated as a binary linear program. Stefanello et al. (2015) proposed their iterated reduction matheuristic algorithm (IRMA), which comprises three phases. First, a simplified version of the greedy construction heuristic of Mulvey and Beck (1984) is applied. In contrast to the approach of Mulvey and Beck (1984), the order in which the objects are assigned to the fixed medians is drawn randomly and is not determined based on a regret value. Second, a mathematical programming solver is used to solve a binary linear programming formulation of the CPMP until an optimal solution is found or a time limit is reached. Third, if optimality has not been proven in the second phase, a local search heuristic is applied that iteratively solves a binary linear programming formulation of the CPMP for subsets of clusters only. In the second and third phases, two heuristics (referred to as reduction heuristics) are applied to eliminate variables that are unlikely to be nonzero in an optimal solution. Finally, Jánošíková et al. (2017) presented two combinations of a genetic algorithm with binary linear programming.
Binary linear programming is either used to generate elite individuals during the solution process of the genetic algorithm or as a postprocessing technique to improve the best solution returned by the genetic algorithm.
These matheuristics overcome some of the abovementioned limits of classic heuristics and metaheuristics by applying binary linear programming to efficiently handle the capacity constraints. However, they are either combined with a greedy assignment heuristic and/or need to compute and store large distance matrices at some point, which is challenging in terms of the required running time and prohibitive in terms of the required storage space.

Proposed matheuristic
In this section, we present the global optimization phase (Section 4.1) and the local optimization phase (Section 4.2) of our proposed matheuristic in detail. Moreover, we briefly describe the k-d tree data structure (Section 4.3), which we use in both phases of the proposed matheuristic. Finally, we illustrate the proposed matheuristic by means of the illustrative example provided in Section 2.2. For the total running time of the proposed matheuristic, we prescribe a maximum time limit denoted as τ.

Global optimization phase
In the global optimization phase, we aim to devise a good-quality feasible solution by performing only a few iterations of the procedure described below; this phase builds on the approach of Baumann (2019) that was proposed for the CCCP. First, we identify and fix a set of promising medians, denoted as M with |M| = p. Second, we assign the remaining objects to the fixed medians by using the binary linear program (M-G) provided below. By fixing the medians, we avoid computing all n(n − 1)/2 pairwise distances, such that only distances between objects and fixed medians must be computed. To further reduce the number of required distance computations, we exploit the idea that objects are rarely assigned to medians that are far away and thus only allow objects to be assigned to their k-nearest fixed medians. The k-nearest medians of each object can be determined efficiently using k-d trees without computing all pairwise distances (Section 4.3). Consequently, only the distances between each object and its k-nearest fixed medians must be computed. We denote the set that comprises the k-nearest medians of object i ∈ I∖M as M_i with M_i ⊆ M. Accordingly, we denote the set consisting of all objects that are not selected as fixed medians and that have the fixed median j ∈ M among their k-nearest medians as I_j with I_j ⊆ I∖M. The binary linear program that we use to assign the objects to the fixed medians, referred to as (M-G), reads as follows:

(M-G)  Minimize    sum_{i ∈ I∖M} sum_{j ∈ M_i} d(i, j) x_ij                 (2a)
       subject to  sum_{j ∈ M_i} x_ij = 1          (i ∈ I∖M)                (2b)
                   sum_{i ∈ I_j} q_i x_ij ≤ Q − q_j  (j ∈ M)                (2c)
                   x_ij ∈ {0, 1}                   (i ∈ I∖M; j ∈ M_i)       (2d)

The objective function (2a) captures the total distance between the fixed medians and their assigned objects. Constraints (2b) ensure that each object is assigned to exactly one fixed median. Constraints (2c) impose the capacity limit for each of the fixed medians; the capacity limit for fixed median j ∈ M is Q − q_j because the median is assigned to itself a priori and thus must accommodate its own weight.
Finally, the domains of the decision variables are defined in (2d). The binary linear program (M-G) represents a special case of the generalized assignment problem in which the weight of an object is independent of the cluster to which the object is assigned. We continue by alternating between a median-update step and an object-assignment step with the goal of improving the quality of the initial solution. In the object-assignment step, we again use the binary linear program (M-G) as described above to assign the objects to the currently fixed medians.
In the median-update step, we update the currently fixed medians based on the new assignments obtained in the previous object-assignment step. We determine for each cluster the object that minimizes the total distance to all other objects assigned to this cluster. These objects are then used as the new fixed medians in the next object-assignment step.
We perform these two steps iteratively until the current solution can no longer be improved.
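The median-update step described above can be sketched as follows. Both the exact variant and the approximate center-of-gravity variant proposed further below for large instances are shown (hypothetical names, toy distance computations):

```python
def update_median(points, cluster):
    """Exact median update: return the cluster member that minimizes the
    total distance to all other objects in the cluster."""
    dist = lambda a, b: sum((u - v) ** 2 for u, v in zip(points[a], points[b])) ** 0.5
    return min(cluster, key=lambda j: sum(dist(j, i) for i in cluster))

def update_median_approx(points, cluster):
    """Approximate median update for large instances: return the cluster
    member nearest to the center of gravity of the cluster, avoiding the
    quadratic number of intra-cluster distance computations."""
    dim = len(points[cluster[0]])
    center = [sum(points[i][d] for i in cluster) / len(cluster) for d in range(dim)]
    return min(cluster, key=lambda j: sum((points[j][d] - center[d]) ** 2 for d in range(dim)))
```

The exact variant requires all pairwise distances within a cluster, whereas the approximate variant needs only one distance computation per cluster member.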
Algorithm 1 describes the global optimization phase in detail. We start by initializing the parameter k, which defines the number of nearest medians to which an object can be assigned. Moreover, we initialize the set of fixed medians M. For this initialization, we propose two alternative methods, which we describe further below. Then, to set up the binary linear program (M-G), we determine the sets M_i for the objects i ∈ I∖M and I_j for the fixed medians j ∈ M based on the current value of k, and we calculate the distances between the objects i ∈ I∖M and the medians j ∈ M_i. Thereafter, we attempt to solve the binary linear program (M-G) using a mathematical programming solver. We stop the solver as soon as the MIP gap reaches a value of 1% or lower. This setup exploits our observation that optimal or near-optimal solutions are often found quickly, while a rather long additional running time is spent on proving the optimality of these solutions. If no feasible solution exists, we double the value of k, set up the binary linear program (M-G) based on the increased value of k, and try to solve it again. This process is repeated until a feasible solution, denoted as S, has been found. Then, we update each median of the solution S (lines 12-17) by determining the objects assigned to the median, calculating the pairwise distances between these assigned objects, and determining the object that minimizes the total distance to all other assigned objects. For instances with n > 10,000, we propose applying an approximate median-update step to further reduce the number of distance computations. In this case, we update each median by selecting the object that is nearest to the center of gravity of the assigned objects.
After all medians have been updated, we evaluate whether a new best solution, denoted as S*, has been found (lines 18-24). If this is the case, we reset the value of k to its initial value, we update the set of fixed medians M to comprise the updated medians in the new best solution S*, and we start the next iteration if the time limit has not been reached.
Otherwise, we stop the algorithm and return the best solution found.
Algorithm 1 Global optimization phase
 1: procedure GlobalOptimization(k0, τ, m)
 2:   k ← k0;
 3:   M ← set of initial medians with |M| = p using method m;
 4:   while time limit τ has not been reached do
 5:     Determine sets M_i for objects i ∈ I∖M and sets I_j for medians j ∈ M;
 6:     Calculate distances d(i, j) between objects i ∈ I∖M and medians j ∈ M_i;
 7:     Solve (M-G) until MIP gap ≤ 1%;
 8:     if no feasible solution exists then
 9:       k ← k × 2;
10:     else
11:       S ← new feasible solution found;
12:       for medians j ∈ M do
13:         A_j ← set of objects assigned to median j in solution S;
14:         Calculate distances d(i, i′) between objects i, i′ ∈ A_j;
15:         j′ ← object in A_j that minimizes the total distance to all other objects in A_j;
16:         Replace median j by j′ in S;
17:       end for
18:       if S is better than the best solution S* found so far then
19:         S* ← S;
20:         k ← k0;
21:         M ← set of medians in S*;
22:       else
23:         return S*;
24:       end if
25:   end while
26:   return S*;

To determine the initial medians (line 3 of Algorithm 1), we consider two alternative methods:

• The k-means++ algorithm proposed by Arthur and Vassilvitskii (2007). This method aims at spreading out the initial medians as far as possible. Thereby, objects are iteratively selected to be medians with a probability that is proportional to the squared distance between an object and its closest already selected median. Note that this method does not consider any capacity constraints.

• A capacity-based initialization method that corresponds to an accelerated version of the procedure GlobalOptimization(k0, τ, m) with m = k-means++. To accelerate this procedure, we introduce the following two modifications. First, instead of assigning the objects to the fixed medians by using the binary linear program (M-G) (line 7 of Algorithm 1), each object is assigned greedily to its nearest fixed median that has a sufficient amount of unused capacity. The order in which the objects are assigned is determined based on the regret function proposed by Mulvey and Beck (1984), i.e., a regret value r_i = d(i, j″) − d(i, j′) is calculated for each object i, where j′ and j″ are the first and second nearest medians of object i. The objects are assigned in decreasing order of their regret values. Second, instead of returning the best feasible solution, only the set of medians in the best feasible solution found is returned.
The k-means++ algorithm quickly returns a set of initial medians but neglects the capacity restrictions. The novel capacity-based method requires a longer running time but takes the capacity restrictions into account and thus provides more promising initial medians. A set of promising initial medians allows a smaller value to be chosen for k. This is especially beneficial for large-scale instances since it considerably reduces the size of the binary linear program (M-G).
Note that both initialization methods use the k-means++ algorithm, which is a randomized procedure. We generate multiple different solutions by running the global optimization phase multiple times, each time with a different random seed, and we then return the best solution found over all runs. We denote the number of runs of the global optimization phase as r.
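The k-means++ seeding used by both initialization methods can be sketched as follows (a minimal illustration with hypothetical names, assuming Python's standard random module and low-dimensional points):

```python
import random

def kmeans_pp_seeding(points, p, rng):
    """Select p objects as initial medians in the style of Arthur and
    Vassilvitskii (2007): the first median uniformly at random, each further
    one with probability proportional to the squared distance to its closest
    already selected median."""
    d2 = lambda a, b: sum((u - v) ** 2 for u, v in zip(a, b))
    medians = [rng.randrange(len(points))]
    while len(medians) < p:
        # squared distance of every object to its closest selected median
        w = [min(d2(points[i], points[m]) for m in medians) for i in range(len(points))]
        medians.append(rng.choices(range(len(points)), weights=w, k=1)[0])
    return medians
```

Because an already selected object has weight zero, the returned medians are distinct whenever the points are distinct; running the procedure with different seeds yields the different starting sets mentioned above.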

Local optimization phase
In the local optimization phase, we iteratively apply the following procedure to further improve the best solution obtained from the global optimization phase. First, we select a subset of s clusters from the set of clusters in the current best solution. The cluster-selection procedure starts with selecting the median that has the largest amount of unused capacity. Then, the s − 1 medians that are nearest to the already selected median are selected. These nearest medians can be determined efficiently using k-d trees (Section 4.3). Second, we identify the set of objects that are assigned to the selected medians. We denote this set of objects as I′ with I′ ⊆ I. Third, we solve the binary linear program (M-L) given below for this subset of objects only. To speed up the solution process of the binary linear program, we consider as potential new medians only the medians of the selected subset of clusters and their k-nearest objects. This procedure is similar to the neighborhood median size-reduction heuristic proposed by Stefanello et al. (2015). The k-nearest objects of each median can again be determined efficiently using k-d trees (Section 4.3). We denote the set of potential new medians as P with P ⊆ I′. Starting with a small subset of clusters, we enlarge the size of the subset after several iterations without improvement. Furthermore, we track the clusters (represented by their medians) of the current best solution for which no further local improvement has been found. The binary linear program that we use during the local optimization phase, referred to as (M-L), reads as follows:

(M-L)  Minimize    sum_{i ∈ I′} sum_{j ∈ P} d(i, j) x_ij                    (3a)
       subject to  sum_{j ∈ P} x_jj = s                                     (3b)
                   sum_{j ∈ P} x_ij = 1            (i ∈ I′)                 (3c)
                   sum_{i ∈ I′} q_i x_ij ≤ Q x_jj  (j ∈ P)                  (3d)
                   x_ij ≤ x_jj                     (i ∈ I′; j ∈ P)          (3e)
                   x_ij ∈ {0, 1}                   (i ∈ I′; j ∈ P)          (3f)

The objective function (3a) captures the total distance between the medians and their assigned objects in the selected subset of clusters. Constraint (3b) ensures that exactly s objects are selected as medians. Constraints (3c) ensure that each object is assigned to a median, and constraints (3d) impose the capacity limits.
Constraints (3e) are valid inequalities that substantially speed up the solution process since they tighten the linear relaxation of the binary linear program (e.g., Ceselli and Righini 2005, Deng and Bard 2011, Kramer et al. 2019). Finally, the domains of the decision variables are defined in (3f).
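The cluster-selection step described above can be sketched as follows (hypothetical names; `load` is assumed to map each median to the total weight currently assigned to its cluster, and a plain sort stands in for the k-d tree query used for large numbers of clusters):

```python
def select_clusters(points, medians, load, Q, s, excluded=frozenset()):
    """Select s clusters for a subproblem: start with the non-excluded median
    that has the largest amount of unused capacity, then add the s - 1
    medians nearest to it."""
    dist = lambda a, b: sum((u - v) ** 2 for u, v in zip(points[a], points[b])) ** 0.5
    # seed cluster: largest spare capacity Q - load[m] among non-excluded medians
    seed = max((m for m in medians if m not in excluded), key=lambda m: Q - load[m])
    others = sorted((m for m in medians if m != seed), key=lambda m: dist(seed, m))
    return [seed] + others[:s - 1]
```

Prioritizing spare capacity focuses the subproblem on regions where objects can still be shifted between clusters, which is where the largest objective improvements are typically available.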
Algorithm 2 describes the local optimization phase in detail. We start by initializing the best solution found so far, denoted as S*, with the initial solution S, which represents the best solution obtained at the end of the global optimization phase. We additionally initialize the parameter s, i.e., the number of clusters per subproblem, based on an input parameter that represents a target number of objects in the initial subset I′. A suitable value for this parameter should be chosen such that a mathematical programming solver can solve the resulting binary linear program (M-L) within a reasonable running time. Moreover, we initialize an empty set E, which is used throughout the local optimization phase to track the medians for which no further local improvement has been found. To set up the binary linear program (M-L), we perform the following steps (lines 6-15). First, we determine the set P of potential new medians and the set I′ of objects that belong to the selected clusters. We start with the median j′ that has the largest amount of unused capacity in the current best solution S* and that has not been marked in E as a median that was optimized in previous iterations without finding any local improvement. Then, we include median j′ and the s − 1 medians that are nearest to median j′ in the set M′. At this point, we copy set M′ and denote the copy as P. Then, we determine the set I′ by including all objects assigned to the medians j ∈ M′ in the current best solution S*. We then further enlarge set P by including the k-nearest objects i ∈ I′ of each median j ∈ M′. Second, we calculate the distances between the objects i ∈ I′ and the medians j ∈ P. Thereafter, we try to improve the current best solution using the binary linear program (M-L) and a mathematical programming solver. To avoid very long running times for single iterations, we stop the solver after a prescribed time limit for a single iteration has been reached, or as soon as the MIP gap reaches a value of 1% or lower. Furthermore, we provide a warm start based on the current best solution S* and the set of selected clusters.
This typically speeds up the solution process of the solver since a first feasible solution is already provided. If the current best solution has been improved, we update the set of non-improving medians by excluding all selected medians (line 25) such that these medians can be selected again in subsequent iterations; otherwise, we include all selected medians in this set (line 27). This process of setting up the binary linear program (M-L), trying to solve it, and updating the set of non-improving medians is performed iteratively until the time limit is reached, or all medians selected in the current best solution are in this set. In the latter case, we double the number of clusters to be selected per subproblem and empty the set of non-improving medians (lines 29 and 30 of Algorithm 2). As soon as the binary linear program (M-L) has been optimized with the number of clusters to be selected being equal to the total number of clusters, we stop the algorithm and return the best solution found. A novelty of the local optimization phase that distinguishes the proposed approach from other local search matheuristics is the cluster-selection procedure. The procedure starts with the cluster that has the largest amount of unused capacity as the initial cluster. In contrast, the local search matheuristic of Stefanello et al. (2015), for example, does not prioritize subproblems with great potential for improving the objective function value. In Appendix A, we provide results which indicate that this cluster-selection procedure outperforms other cluster-selection procedures.
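The cluster-selection step described above can be illustrated with a small sketch (a simplification under assumed data structures; all names are hypothetical, and the enlargement of the subproblem by the nearest objects of each median is omitted for brevity):

```python
import math

def select_subproblem(pos, medians, assign, weights, capacity, tabu, m):
    """Pick the non-tabu seed median with the largest unused capacity
    (ball-shaped selection), add its m-1 nearest medians, and collect
    all objects of the selected clusters as the subproblem for (M-L).

    pos: dict id -> coordinates; medians: list of median ids;
    assign: dict object id -> median id; weights: dict object id -> weight;
    capacity: cluster capacity; tabu: set of non-improving medians.
    """
    # used capacity of each cluster
    used = {j: 0 for j in medians}
    for i, j in assign.items():
        used[j] += weights[i]
    # seed: non-tabu median with the largest amount of unused capacity
    candidates = [j for j in medians if j not in tabu]
    seed = max(candidates, key=lambda j: capacity - used[j])
    # ball-shaped selection: the m-1 medians nearest to the seed
    others = sorted((j for j in medians if j != seed),
                    key=lambda j: math.dist(pos[seed], pos[j]))
    selected = [seed] + others[:m - 1]
    # objects of the selected clusters form the subproblem
    chosen = set(selected)
    objects = [i for i, j in assign.items() if j in chosen]
    return selected, objects
```

In the full algorithm, the resulting subproblem is handed to the solver with a warm start and a time limit, and the tabu set is updated depending on whether an improvement was found.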

Search for nearest neighbors using k-d trees
Both phases of the proposed matheuristic require searching for nearest neighbors. This task can be performed efficiently using k-d trees without calculating all pairwise distances (Bentley, 1975). A k-d tree is a binary tree in which each node represents an object. Moreover, each node is associated with a feature by which the feature space is divided into two halves. Once such a k-d tree has been constructed, the tree structure can be exploited to eliminate large portions of the search space, such that the nearest neighbors of any object described by the same features can be determined efficiently. Note that in high-dimensional feature spaces, smaller parts of the search space can be eliminated than in low-dimensional feature spaces, such that the search for nearest neighbors using k-d trees becomes more time-consuming. Fig. 2 illustrates the solution process of the proposed matheuristic for the illustrative example from Section 2.2. Stores that are selected as medians are indicated with a red circle, and the assignments of the stores to the medians are indicated with green lines.
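For illustration, here is a minimal pure-Python k-d tree with the pruning logic described above (the paper's implementation is not reproduced here; production code would typically rely on an optimized library implementation):

```python
import math

def build_kdtree(points, depth=0):
    """Recursively build a k-d tree; each level splits on one feature."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {"point": points[mid],
            "left": build_kdtree(points[:mid], depth + 1),
            "right": build_kdtree(points[mid + 1:], depth + 1)}

def nearest(node, target, depth=0, best=None):
    """Return the stored point nearest to `target`, pruning subtrees
    that cannot contain a point closer than the current best."""
    if node is None:
        return best
    if best is None or math.dist(node["point"], target) < math.dist(best, target):
        best = node["point"]
    axis = depth % len(target)
    diff = target[axis] - node["point"][axis]
    near, far = ((node["left"], node["right"]) if diff < 0
                 else (node["right"], node["left"]))
    best = nearest(near, target, depth + 1, best)
    # descend into the far half only if it can still contain a closer point
    if abs(diff) < math.dist(best, target):
        best = nearest(far, target, depth + 1, best)
    return best
```

The pruning test `abs(diff) < dist(best, target)` is exactly why high-dimensional spaces are harder: the splitting hyperplane is rarely far enough from the query point, so fewer subtrees can be skipped.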

Illustrative example
The first column of Fig. 2 depicts the solution process of the global optimization phase. The objective function value (OFV) after each iteration is provided in the bottom-right corner of the corresponding subfigures. Starting with an initial set of medians determined by the k-means++ algorithm, three iterations (denoted as G1 to G3) are performed. In the first iteration, a first feasible solution is found. In the second iteration, the current solution is improved. The third iteration does not improve the current solution and is therefore not depicted in Fig. 2. Thus, the global optimization phase terminates with the solution depicted in the subfigure at the bottom of the first column of Fig. 2.
The second column of Fig. 2 depicts the solution process of the local optimization phase. We provide the objective function value (OFV) after each iteration in the bottom-right corner of the corresponding subfigures. Starting with the solution returned at the end of the global optimization phase, four iterations (denoted as L1 to L4) are performed. We start with a subset of two clusters, which includes the cluster with the largest amount of unused capacity. In the first iteration, a new best feasible solution is found by solving the binary linear program (M-L) for the two selected clusters only. In the second and third iterations, two subsets of clusters are considered for which no improvement is found. At this point, we double the number of clusters to be selected to four because all clusters in the current best solution have been examined once without achieving any improvement. Hence, in the fourth iteration, all four clusters are selected, and thus, all stores are considered. At the end of the fourth iteration, after again finding a new best feasible solution, we terminate the local optimization phase since all stores have been considered. Note that the best solution found at the end of the fourth iteration corresponds to an optimal solution. However, there is no guarantee of finding an optimal solution as long as the number of clusters selected per subproblem is smaller than the total number of clusters.

Computational experiment
In this section, we evaluate the performance of the proposed matheuristic in terms of solution quality and running time. In Section 5.1, we introduce the test set. In Section 5.2, we choose the control parameters of our matheuristic. In Section 5.3, we explain the design of our computational experiment. In Section 5.4, we report and discuss the computational results.

Test set
The complete test set comprises the 31 instances described in Table 4. The first 16 instances are well-known test instances from the literature. These include the six instances from the group SJC that were introduced by Lorena and Senne (2003) and the ten instances from the groups p3038 and fnl4461 that were generated by Lorena and Senne (2004) and Stefanello et al. (2015), respectively. The instances from the group SJC comprise real-world data from the central area of the Brazilian city São José dos Campos. The objects of these instances correspond to blocks, and the weights represent the number of houses belonging to each block. The authors did not provide any additional information regarding the interpretation of the clusters or the capacity limit. The ten instances from groups p3038 and fnl4461 are generic instances that are based on two TSP instances from the TSPLIB (Reinelt, 1991) with 3,038 and 4,461 nodes, respectively. To the best of our knowledge, the instances of group fnl4461 are the largest publicly available test instances that have been tested in the literature so far.
Since large-scale instances for the uncapacitated p-median problem comprise up to 100,000 objects (e.g., Hansen et al. 2009), these largest existing instances for the CPMP are considered small-scale. Therefore, we generated a set of medium- and large-scale instances based on four TSP instances from the VLSI collection (Rohe, 2013) ranging from approximately 10,000 up to approximately 500,000 nodes. Following the procedure proposed by Stefanello et al. (2015), we took the coordinates of the nodes of the TSP instances as features for the objects of our CPMP instances. Furthermore, for each object, we determined a weight by drawing a random integer from the set {1, 2, … , 100}. Finally, for each of the four TSP instances, we generated a group of three CPMP instances with p ∈ {100, 1,000, 2,000} being the number of clusters to be found. Based on these numbers of clusters, we determined the capacity by using Eq. (4), where the tightness factor was randomly drawn from the range [0.8, 0.9].
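The instance-generation step can be sketched as follows. Note that the capacity formula is an assumption, since Eq. (4) is not reproduced in this excerpt; we assume the common choice of dividing the total weight by the number of clusters scaled by the tightness factor:

```python
import math
import random

def generate_instance(coords, p, seed=0):
    """Sketch of the generation procedure described above.

    coords: node coordinates taken from a TSP instance;
    p: number of clusters to be found.
    The capacity formula Q = ceil(total weight / (tau * p)) is an
    assumption standing in for Eq. (4), with a tightness factor tau
    drawn from [0.8, 0.9].
    """
    rng = random.Random(seed)
    # weights drawn as random integers from {1, ..., 100}
    weights = [rng.randint(1, 100) for _ in coords]
    tau = rng.uniform(0.8, 0.9)
    capacity = math.ceil(sum(weights) / (tau * p))
    return weights, capacity
```

With tau below one, the total capacity p * Q exceeds the total weight, so feasible assignments exist but clusters are reasonably tight.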
Note that instances 7 to 28 are not based on real-world data. These generic instances were generated with the aim of testing the performance of different solution approaches when the number of objects to be clustered increases.
Additionally, we generated three high-dimensional instances with up to approximately 800 features. Since the clustering of images is an important task in data mining (e.g., Ushakov and Vasilyev 2019), the objects of the first two instances represent images, and we took the grayscale levels of the pixels as features for the objects. The instance cancer5000_10_784 is based on the images from the collection of textures in colorectal cancer histology (Kather et al., 2016). The instance digits60000_100_784 is based on the images from the MNIST database of handwritten digits (LeCun et al., 1998). The number of features of both instances corresponds to the resolution (28 × 28 pixels) of the images. The third instance KDD145751_100_74 is based on the KDD protein homology dataset (KDD, 2004), which is a popular dataset for testing clustering methods (e.g., Bachem et al. 2016). The objects of this instance correspond to protein sequences that are described by 74 different features. For each of these high-dimensional instances, we determined weights and capacities following the procedure described above. The number of clusters to be found was selected arbitrarily to be p = 10 for the small-scale instance cancer5000_10_784 and p = 100 for the medium- and large-scale instances digits60000_100_784 and KDD145751_100_74. We generated these instances with the aim of assessing whether the proposed matheuristic can also deal with instances that comprise more than two features.
Note that the instances from the literature and the instances generated here all prescribe the same capacity for all clusters. However, Mulvey and Beck (1984), for example, considered the more general case in which the capacities may differ between clusters. The proposed matheuristic is implemented such that it can also deal with this more general case.
The generated medium- and large-scale instances, as well as the generated high-dimensional instances, can be downloaded from https://github.com/phil85/GB21-MH.

Selection of control parameters
The proposed matheuristic has several parameters that can control the trade-off between solution quality and running time. The goal of this subsection is to provide guidelines on how to select values for these control parameters.
We recommend performing three runs of the global optimization phase due to the random initialization. The size of the subproblems in the local optimization phase should be determined based on the performance of the mathematical programming solver. Currently, state-of-the-art solvers can generally cope quite well with subproblems that include up to 200 objects. We therefore chose a target number of 200 objects per subproblem.
To avoid investing an inordinate amount of time in proving the optimality of the solution found for a subproblem in the local optimization phase, we recommend imposing a time limit of 300 seconds per subproblem. To devise promising values for the remaining control parameters, we conducted the following parameter tuning: we applied the matheuristic to a subset of nine test instances with different values for these parameters. The selected instances comprise three small-, three medium-, and three large-scale instances. We tested the values {5, 10, 20} for the initial number of nearest medians to which an object can be assigned, the values {10, 50, 100} for the number of nearest objects considered per median, and the two initialization methods k-means++ and capacity-based. For each instance and each combination of parameter values (3 × 3 × 2 = 18 combinations), we ran the proposed matheuristic with a total running time limit of 3,600 seconds per run. The results of this parameter tuning are summarized in Table 5. We report for each combination of parameter values the average relative gap between the objective function value of the feasible solution found and the objective function value of the best feasible solution found among all tested combinations of parameter values. Bold values indicate the best results among all tested combinations of parameter values for the small-scale instances (columns three to five) and for the medium- and large-scale instances (columns six to eight). Based on these results, we determined the parameter setting presented below in Table 6 for the main computational experiment.
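The tuning protocol described above amounts to a full factorial grid over the parameter values, scored by the relative gap to the best configuration. A small sketch (parameter names are hypothetical, since the original symbols are not reproduced in this excerpt):

```python
from itertools import product

def relative_gap(ofv, best_ofv):
    """Relative gap (in percent) between a solution's objective function
    value and the best value found among all tested configurations."""
    return 100.0 * (ofv - best_ofv) / best_ofv

# hypothetical parameter grid mirroring the tuning described above:
# 3 values x 3 values x 2 initialization methods = 18 combinations
grid = list(product([5, 10, 20],
                    [10, 50, 100],
                    ["k-means++", "capacity-based"]))
assert len(grid) == 18
```

Each of the nine tuning instances is then run once per combination, and the gaps are averaged per combination to fill Table 5.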

Experimental design
We compared the performance of the proposed matheuristic with that of the state-of-the-art approach for the CPMP presented by Stefanello et al. (2015). Since the implementation of this approach is not publicly available, we reimplemented it according to the description given in the paper. We also report results for an exact approach based on the binary linear programming formulation presented by Lorena and Senne (2004) and a mathematical programming solver. We implemented all three approaches in Python 3.7, and we used the Gurobi Optimizer 8.1 as the mathematical programming solver with default settings unless stated otherwise. Our implementation is publicly available at https://github.com/phil85/GB21-MH. Our computations were performed on an HP workstation with an Intel(R) Xeon(R) Silver 4114 CPU (2.20 GHz) with ten cores and 128 GB of RAM. For the proposed matheuristic, we used the parameter values listed in Table 6. For the matheuristic of Stefanello et al. (2015), we used the default parameter values proposed in their paper. We ran each approach on each tested instance three times, and for each run, we imposed a running time limit of 3,600 seconds.

Numerical results
First, we compared the three approaches in terms of solution quality and their suitability for different problem sizes. These results are summarized in Table 7 for the small-scale instances and in Table 8 for the medium-scale, the large-scale, and the high-dimensional instances. We refer to the proposed matheuristic as GB21-MH, to the matheuristic proposed by Stefanello et al. (2015) as SAM15-MH, and to the exact approach based on the binary linear programming formulation presented by Lorena and Senne (2004) as LS04-BLP. Bold values indicate the best results among all tested approaches. In the first five columns of both tables, we list the characteristics of the instances. In the sixth column of Table 7, we report the objective function value of the best known solution reported by Stefanello et al. (2015) (OFV SAM15), and in the sixth column of Table 8, we report the objective function value of the best feasible solution found among all tested approaches within the prescribed time limit (OFV BEST). In the seventh column of both tables, we report the best lower bound on the objective function value obtained with the exact approach based on the binary linear programming formulation (OFV LB). Finally, in the last six columns of both tables, we report for each tested approach the relative gap between the objective function value of the best feasible solution found over the three runs and the reported value for OFV SAM15 and OFV BEST, respectively, as well as the total required running time for the run in which the best feasible solution was found. In contrast to the two tested approaches from the literature, the proposed matheuristic found feasible solutions for all instances within the prescribed time limit.
For the small-scale instances, the new matheuristic matched the performance of the other approaches, and for the medium- and large-scale instances, it consistently delivered the best feasible solutions. Additionally, for the high-dimensional instances, the proposed matheuristic performed best among all tested approaches.
The approach SAM15-MH could not find a feasible solution for the three largest instances because the computation of all pairwise distances exceeded the memory of the workstation used. We also tested the approach SAM15-MH with an extended running time limit. It turns out that the benchmark approach cannot find better solutions than the proposed matheuristic even if the running time limit is extended to 10 hours. For the instance SRA104814_2000, for example, the relative gap could only be reduced from 9.56% to 5.84%.
Next, we compared the two heuristic approaches GB21-MH and SAM15-MH with respect to their ability to find good first feasible solutions quickly. These results are summarized in Table 9 for the small-scale instances and in Table 10 for the medium-scale, the large-scale, and the high-dimensional instances. The first six columns of both tables provide information analogously to Tables 7 and 8. In the following four columns, we report the relative gap between the objective function value of the best first feasible solution found over the three runs and the reported value for OFV SAM15 and OFV BEST, respectively, as well as the running time required to find the best first feasible solution. In the last column, we report the speed-up factor between the proposed matheuristic and the benchmark approach regarding the running time required to find the best first feasible solution.
With only one exception, the proposed matheuristic devised first feasible solutions with lower objective function values than the benchmark approach. Furthermore, the proposed matheuristic required substantially less running time to devise the first feasible solution for most of the tested instances. For some instances, the proposed matheuristic found the first feasible solution approximately 135 times faster than the benchmark approach. The ability of the proposed matheuristic to provide good first feasible solutions quickly might be valuable, for example, for clustering problems in which the number of clusters p is not known in advance. In this case, the proposed matheuristic can be run multiple times, each time with a different value of p, since each run can be performed quickly.
Finally, we highlight the importance of the two phases of the proposed matheuristic. Fig. 3 depicts the improvement of the best feasible solution in terms of the objective function value over time for the proposed matheuristic and, as a reference, also for the approach SAM15-MH. Each subfigure reports the results of the first of the three runs for one of the following three exemplary instances: fnl4461_1000 (small-scale), FNA52057_1000 (medium-scale) and LRA498378_1000 (large-scale). Note that in the third subfigure, no results are shown for the approach SAM15-MH since no feasible solution was found for this instance. For small-scale instances, the majority of the running time is spent in the local optimization phase, during which substantial improvements in solution quality can be achieved. For medium-scale instances, both phases consume a similar amount of running time. While the best feasible solution can be improved quickly at the beginning, no further improvements are attained after spending some time in the global optimization phase. At this point, the local optimization phase starts, during which substantial improvements can be achieved by locally reoptimizing the best feasible solution obtained at the end of the global optimization phase. For large-scale instances, the entire running time is spent in the global optimization phase. The first feasible solution is devised fairly quickly, particularly when considering the large number of objects to be clustered. Then, the best feasible solution is successively improved until the prescribed maximum running time has elapsed.

Capacitated centered clustering problem
In this section, we show that the proposed matheuristic can also be applied to other variants of capacitated clustering problems, such as the capacitated centered clustering problem (CCCP).
The CCCP differs from the CPMP as follows (Negreiros and Palhano, 2006). The cluster centers must correspond to the geometric centers of the objects assigned to the clusters and are not selected from among the objects themselves. Like the CPMP, the CCCP is NP-hard (e.g., Chaves et al. 2018). Compared to the extensive literature dealing with the CPMP, only a few papers have considered the CCCP. The problem was first discussed by Negreiros and Palhano (2006), who proposed an approach comprising two phases that are based on a construction heuristic and a variable neighborhood search heuristic. Most recently, Chaves et al. (2018) proposed an adaptive biased random-key genetic algorithm with a clustering search, and Baumann (2019) presented a matheuristic based on the well-known k-means algorithm. These two approaches devised numerous new best-known solutions for standard test instances from the literature. The matheuristic proposed in this paper can be applied to the CCCP since a given feasible solution to an instance of the CPMP can be converted into a feasible solution to an instance of the CCCP that comprises the same objects and that prescribes the same capacity limit for all clusters. For each cluster of the feasible solution to be converted, the selected median must be replaced by the geometric center of the assigned objects. The objective function value is then calculated as the total distance between the newly computed cluster centers and their assigned objects. The first 16 instances used in our computational experiment for the CPMP are also often analyzed in the literature dealing with the CCCP. For these instances, we converted the best feasible solutions obtained by the proposed matheuristic into feasible solutions of the CCCP by applying the procedure described above. In Table 11, in the first four columns, we list the characteristics of these instances.
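The conversion described above can be sketched as follows (names are hypothetical; feasibility is preserved because the assignment of objects to clusters, and hence the cluster weights, remain unchanged):

```python
import math

def cpmp_to_cccp(pos, assign):
    """Convert a feasible CPMP solution into a feasible CCCP solution:
    replace each median by the geometric center (centroid) of its cluster
    and recompute the objective value as the total distance from the
    centers to their assigned objects.

    pos: dict object id -> coordinates; assign: dict object id -> median id.
    """
    clusters = {}
    for i, j in assign.items():
        clusters.setdefault(j, []).append(i)
    centers, ofv = {}, 0.0
    for j, members in clusters.items():
        dim = len(pos[members[0]])
        center = tuple(sum(pos[i][d] for i in members) / len(members)
                       for d in range(dim))
        centers[j] = center
        ofv += sum(math.dist(pos[i], center) for i in members)
    return centers, ofv
```

Since the centroid minimizes the sum of squared distances rather than the sum of distances, this conversion yields a feasible (not necessarily locally optimal) CCCP solution, which is all the procedure above requires.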
In the next column (OFV BKS), we list the objective function values of the best-known solutions reported by Chaves et al. (2018) and Baumann (2019). Finally, in the last three columns, we report the objective function value of the best feasible solution found for the CCCP by the proposed matheuristic (OFV MIN), the relative gap between the reported value for OFV MIN and the reported value for OFV BKS, as well as the total required running time. For the six smaller instances, the proposed matheuristic provides solutions with small gaps to the best-known solutions reported in the literature. For nine of the ten larger instances, we found new best-known solutions even though the proposed matheuristic was not specifically designed for the CCCP.

Conclusions
In this paper, we considered the capacitated p-median problem (CPMP). For this problem, we proposed a matheuristic that is specifically designed for instances with a large number of objects. In a computational experiment, the proposed matheuristic consistently outperformed the state-of-the-art approach for the CPMP on medium- and large-scale instances while matching its performance on small-scale instances. Furthermore, we showed that the proposed matheuristic can also be applied to related capacitated clustering problems, such as the capacitated centered clustering problem (CCCP). For the largest problem instances of the CCCP tested in this paper, the proposed matheuristic was able to find new best-known solutions.
We suggest the following directions for future research. The proposed matheuristic can be adapted to problems with additional side constraints, such as lower bounds on the capacities of the clusters or must-link and cannot-link constraints as in Baumann (2020). Since the proposed approach is based on binary linear programming, such additional side constraints can easily be incorporated. Moreover, the proposed problem decomposition strategies can be applied to related problems such as the capacitated p-center problem (CPCP), which differs from the CPMP only with respect to the objective function (Kramer et al., 2019). Instead of minimizing the total distance between the objects and their assigned medians, the CPCP minimizes the maximum distance over all assignments of objects to medians.

Appendix A. Analysis of different cluster-selection procedures
We tested four different cluster-selection procedures for the local optimization phase. The cluster-selection procedures are summarized in Table A.1. The procedures differ with respect to the selection of the first cluster and the selection of the remaining clusters. The first cluster is either selected randomly or based on the largest amount of unused capacity. The remaining clusters are either determined by iteratively selecting the cluster whose median is nearest to the median of the most recently selected cluster, or by selecting the clusters whose medians are nearest to the median of the first cluster. While the former procedure results in queue-shaped cluster selections, the latter results in ball-shaped cluster selections. For this analysis, we used the medium-scale instances since we expected the effect of different cluster-selection procedures to be largest for these instances. For each of these instances and each cluster-selection procedure, we ran the proposed matheuristic three times with a total running time limit of 3,600 seconds per run, and we used the default parameter values provided in Table 6.
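The two selection shapes described above can be sketched as follows (a simplified illustration with hypothetical data structures; the random and capacity-based choices of the first cluster are omitted):

```python
import math

def ball_shaped(pos, medians, first, m):
    """Select the first cluster plus the m-1 clusters whose medians are
    nearest to the first median, yielding a ball-shaped selection."""
    rest = sorted((j for j in medians if j != first),
                  key=lambda j: math.dist(pos[first], pos[j]))
    return [first] + rest[:m - 1]

def queue_shaped(pos, medians, first, m):
    """Iteratively select the cluster whose median is nearest to the most
    recently selected median, yielding a chain (queue) of clusters."""
    selected, remaining = [first], set(medians) - {first}
    while len(selected) < m and remaining:
        nxt = min(remaining,
                  key=lambda j: math.dist(pos[selected[-1]], pos[j]))
        selected.append(nxt)
        remaining.remove(nxt)
    return selected
```

The two procedures can produce different subsets: a queue may drift away from the first median, whereas a ball keeps every selected median close to it.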
The results of this analysis are summarized in Table A.2. In the first two columns, we list the number and name of the instances. In the third column, we report the objective function value of the best feasible solution found among all tested cluster-selection procedures (OFV BEST ). In the last eight columns, we report for each cluster-selection procedure the relative gap between the objective function value of the best feasible solution found over the three runs and the reported value for OFV BEST , as well as the average relative gap between the objective function value of the feasible solutions found over the three runs and the reported value for OFV BEST . From these results, we concluded that the fourth procedure performs best among all tested procedures. We therefore implemented this procedure in our matheuristic presented in Section 4.2.

Appendix B. Further statistics of computational experiment
In Tables B.1 and B.2, we provide further statistics for the results of our computational experiment. The first six columns of both tables provide information analogously to Tables 7 and 8. In the last four columns of both tables, we report the average and the standard deviation over the three runs of the relative gaps between the objective function value of the best feasible solutions found and the reported value for OFV SAM15 and OFV BEST , respectively.