Solution strategy based on Gaussian mixture models and dispersion reduction for the capacitated centered clustering problem

ABSTRACT

The Capacitated Centered Clustering Problem (CCCP) - a multi-facility location (MFL) model - is very important within the logistics and supply chain management fields due to its impact on industrial transportation and distribution. However, solving the CCCP is a challenging task due to its computational complexity. In this work, a strategy based on Gaussian mixture models (GMMs) and dispersion reduction is presented to obtain the most likely locations of facilities for sets of client points, considering their distribution patterns. Experiments performed on large CCCP instances, considering updated best-known solutions, led to an estimate of the performance of the GMMs approach, termed Dispersion Reduction GMMs (DRG), with a mean error gap smaller than 2.6%. This result is more competitive than those of VNS (Variable Neighborhood Search), SA (Simulated Annealing), GA (Genetic Algorithm), and CKM (CKMeans), and is faster to achieve than the best-known solutions obtained by Tabu-Search (TS) and Clustering Search (CS).


Facilities are very important infrastructure within the supply chain as they support production, distribution and warehousing. Due to this, many operative processes associated with facilities are subject to optimization. Fields such as facility layout planning are crucial for smooth material handling and production flow (Mohamadghasemi, A. and Hadi-Vencheh, A., 2012; Hadi-Vencheh, A. and Mohamadghasemi, A., 2013; Niroomand, S. et al., 2015).

On the other hand, where to locate facilities within specific regions is a central problem for strategic decisions of transportation and distribution (Chaves, A.A. et al., 2007). This is because the distance between the facilities and the customers (demand or client points) is crucial to provide efficient transportation and distribution services.

• At each iteration, Gaussian distribution-based clustering performs, for a given point, a "soft-assignment" to a particular cluster (there is a degree of uncertainty regarding the assignment). In contrast, centroid-based clustering performs a hard-assignment (or direct assignment) where a given point is assigned to a particular cluster and there is no uncertainty.

Due to these differences, GMM-based clustering was considered as an alternative to generate feasible solutions for the CCCP. In terms of the CCCP formulation described in the Introduction, a cluster can be modeled by a single Gaussian probability density function (PDF). Hence, the location "patterns" of a set of clients X can be modeled by a mixture of K Gaussian PDFs where each PDF models a single cluster. If the set contains N clients, then X = [x_1, x_2, ..., x_N] and the mixture can be expressed as:

p(x_i) = Σ_{k=1}^{K} P_k · p(x_i | k),

where k = 1, ..., K, K is the number of Gaussian PDFs, p(x_i | k) represents the probability of each Gaussian PDF describing the client x_i (Theodoridis, S. and Koutroumbas, K., 2010), and P_k is the weight associated with each Gaussian PDF (hence, Σ_{k=1}^{K} P_k = 1.0). Each Gaussian component can be expressed as:

p(x_i | k) = (1 / ((2π)^{l/2} |S_k|^{1/2})) · exp(−(1/2) (x_i − m_k)^T S_k^{−1} (x_i − m_k)),

where m_k is the mean vector and S_k is the covariance matrix of the k-th Gaussian PDF or cluster, and l is the dimension of the points x_i.
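As a concrete illustration, the mixture density above can be evaluated in a few lines of Python with NumPy. This is a minimal sketch; the function names and toy parameters are illustrative and not part of the original method:

```python
import numpy as np

def gaussian_pdf(x, m, S):
    """Multivariate Gaussian density p(x | k) with mean vector m and covariance S."""
    l = len(m)
    diff = x - m
    norm = 1.0 / np.sqrt((2 * np.pi) ** l * np.linalg.det(S))
    return norm * np.exp(-0.5 * diff @ np.linalg.inv(S) @ diff)

def mixture_pdf(x, weights, means, covs):
    """p(x) = sum_k P_k * p(x | k), with the weights P_k summing to 1."""
    return sum(P * gaussian_pdf(x, m, S)
               for P, m, S in zip(weights, means, covs))
```

For example, a single standard Gaussian component centered at the origin (l = 2, S = I) evaluates to 1/(2π) at x = (0, 0).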

The advantage of this Gaussian approach for clustering is that faster inference about the points x_i that belong to a specific cluster k may be obtained considering all points. In this context, it is important to mention the probabilistic nature of the inference process.

The EM algorithm starts with initial values for m_k, S_k and P_k. Values for m_k and S_k were randomly generated as follows:

m_k = [U(cx_min, cx_max), U(cy_min, cy_max)]^T,  S_k = I_l,

where (cx_min, cx_max) and (cy_min, cy_max) are the minimum and maximum values throughout all compressed and coded x and y coordinates respectively, U(·, ·) denotes a uniform random value within the given interval, and I_l is the identity matrix of size l × l. For P_k, a lower bound for K was obtained by considering the total demand of the points x_i and the capacity of the clusters C_k. Because all clusters have the same capacity, C_k = C. Then, K and P_k were obtained as follows:

K = ⌈(Σ_{i=1}^{N} d_i) / C⌉,  P_k = 1/K,

where d_i is the demand of client x_i. The Expectation stage starts with these initial values for m_k, S_k and P_k. An initial computation of assignment or "responsibility" scores γ(z_ik) is performed to determine which x_i is more likely to be associated with a particular cluster (and thus, to belong to this cluster). If a point has the same maximum likelihood for more than one cluster, then one of them is randomly assigned. An example of this assignment process is presented in Figure 3.
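Under the assumptions stated above (uniform initial means within the coordinate bounding box, identity covariances, K as the ceiling of total demand over capacity, and equal weights P_k = 1/K), the initialization and the responsibility computation can be sketched as follows. Function names are illustrative only:

```python
import numpy as np

def init_parameters(points, demands, C, rng):
    """Initialize EM: K from total demand vs. capacity C, uniform means
    inside the bounding box of the coordinates, identity covariances."""
    K = int(np.ceil(demands.sum() / C))            # lower bound on K
    (cx_min, cy_min), (cx_max, cy_max) = points.min(axis=0), points.max(axis=0)
    means = np.column_stack([rng.uniform(cx_min, cx_max, K),
                             rng.uniform(cy_min, cy_max, K)])
    covs = np.array([np.eye(2) for _ in range(K)])  # S_k = I_l
    weights = np.full(K, 1.0 / K)                   # P_k = 1/K
    return weights, means, covs

def responsibilities(points, weights, means, covs):
    """gamma(z_ik) = P_k p(x_i|k) / sum_j P_j p(x_i|j), for l = 2."""
    N, K = len(points), len(weights)
    dens = np.empty((N, K))
    for k in range(K):
        diff = points - means[k]
        inv = np.linalg.inv(covs[k])
        quad = np.einsum('ij,jk,ik->i', diff, inv, diff)  # (x-m)^T S^-1 (x-m)
        dens[:, k] = weights[k] * np.exp(-0.5 * quad) / (
            2 * np.pi * np.sqrt(np.linalg.det(covs[k])))
    return dens / dens.sum(axis=1, keepdims=True)
```

Each row of the returned matrix sums to 1, so each point distributes a unit of "responsibility" across the K clusters (the soft-assignment discussed earlier).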
Figure 3. The matrix of dimension K × N contains the γ(z_ik) scores for the set of x_i points; x_1 is more likely to be generated by (or associated with) the cluster k with the largest score.

By determining the unique assignment of each point x_i to each cluster k at Step 2 of the EM algorithm (see Figure 2), the number of points assigned to each cluster is obtained. This leads to determining the cumulative demand of the points assigned to each cluster. This information is stored in the vector:

D = [D_1, D_2, ..., D_K],

where D_k represents the cumulative demand of the points assigned to cluster k and it must satisfy D_k ≤ C_k. This vector is important to comply with the capacity restrictions because it was found that homogenization of the cumulative demands D_k contributes to this objective. Homogenization is achieved by minimizing the coefficient of variation between all cumulative demands:

CV(D) = σ_D / μ_D,   (14)

where σ_D and μ_D are the standard deviation and the mean of the cumulative demands D_k. The objective function defined by Eq. (14) is integrated within the evaluation step of the EM algorithm. For the experiments, the instances described in Table 1 were considered.
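The cumulative-demand vector D and the coefficient of variation of Eq. (14) can be computed directly. A minimal NumPy sketch (function name illustrative):

```python
import numpy as np

def demand_cv(assignments, demands, K):
    """Cumulative demand D_k of the points assigned to each cluster k, and
    CV = std(D) / mean(D), the quantity minimized to homogenize loads."""
    D = np.zeros(K)
    np.add.at(D, assignments, demands)   # accumulate demand per cluster
    return D, D.std() / D.mean()
```

A perfectly balanced assignment yields CV = 0, while concentrating demand in few clusters increases CV; minimizing it therefore pushes the D_k toward each other and helps keep every D_k below the capacity C_k.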

Manuscript to be reviewed
Computer Science

Table 1. Instances considered in the experiments (instance name, number of client points, number of clusters):

Instance  Points  Clusters
doni1     1000    6
doni2     2000    6
doni3     3000    8
doni4     4000    10
doni5     5000    12
doni6     10000   23
doni7     13221   30
SJC1      100     10
SJC2      200     15
SJC3a     300     25
SJC4a     402     30
TA25      25      5
TA50      50      5
TA60      60      5
TA70      70      5
TA80      80      7
TA90      90      4
TA100     100     6

• In order to compute the error, gap or deviation from the updated best-known solutions, the error metric presented by (Yousefikhoshbakht, M. and Khorram, E., 2012) was considered:

error = 100 × (a − b) / b,

where a is the cost or distance of the best solution found by the algorithm for a given instance while b is the best-known solution for the same instance. In this case it is important to mention that the best-known solution is not necessarily the optimal solution due to the NP-hard complexity of the CCCP. Initially, this metric was computed for the DRG, VNS, SA, CS, TS and GA methods because the reference data was available for all sets of instances. Table 2 presents the best results of the DRG meta-heuristic for the considered instances.
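Assuming the standard percentage form of this metric, the gap computation reduces to a one-liner in Python:

```python
def error_gap(a, b):
    """Percentage deviation of the solution cost a from the best-known cost b."""
    return 100.0 * (a - b) / b
```

For instance, a solution of cost 102.6 against a best-known cost of 100.0 yields a 2.6% gap, the order of the mean gap reported for DRG.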

Information regarding the runs performed by each method to report the best result is also presented when available. Also, information regarding the programming language and the hardware used by the authors of the reviewed methods was included. For a further comparison, data of the CKM and TCG methods were available. As presented in Table 3, the DRG meta-heuristic is more competitive than the CKM method. Also, as previously observed, the DRG is more consistent.

When compared to the TCG method, TCG is more competitive than the DRG approach, even though the error gaps are minimal (less than 1.5%).

Regarding speed, Figure 6 presents the computational (running) times reported by the reviewed methods. While TS and CS are the benchmark methods, they take over 25000 seconds to reach the best-known solution for the largest instance. Note that for these methods, the computational times increase exponentially for instances larger than 6000 points.

In contrast, SA is very consistent, with a computational time of approximately 1000 seconds through all instances. The computational time of GA significantly increases for instances larger than 6000 points (up to 7000 seconds for the largest instance). However, these methods have the largest error gaps, as reviewed in Figure 5. The speed pattern of DRG is very similar to that of GA; however, as reviewed in Figure 5, its error gap is the closest to the benchmark methods for instances larger than 6000 points.

It is important to mention that this comparison may not be fair due to the differences in the programming language and the hardware used for implementation and testing of all the methods.

In order to compare running speed, all methods should be tested with the same hardware and programming language.
