FC-Kmeans: Fixed-centered K-means algorithm
Introduction
Demand in the service and logistics sector has been increasing rapidly in recent years, especially due to pandemic conditions. With this growth, the location of warehouses and the boundaries of demand regions, or clusters, have become very important. It is therefore crucial to analyze big data derived from demand records and vehicle GPS data. Demand clusters can be determined by analyzing these data and clustering the demand points, and a cluster center (warehouse) can then be determined for each demand cluster, so that service can be provided from cluster centers to demand points at a low cost. However, the warehouses that serve as cluster centers may already exist in some cases, and it may be desired to determine additional cluster centers (warehouses) and demand clusters in response to increasing demand. In such cases, defining new cluster centers in place of the existing ones is not realistic, because both closing an existing center and establishing a new one incur extra costs. Therefore, there is a need for an algorithm that clusters the demand points while taking the existing centers into account and determines the locations of the additional cluster centers. In this way, all clusters and the additional cluster centers are determined by clustering the demand points without closing the existing centers.
Clustering is an unsupervised machine learning method. Its purpose is to assign similar points to the same cluster under specific criteria and dissimilar points to different clusters. There are many clustering algorithms, and the K-means algorithm is one of the most efficient yet straightforward among them (Han, Kamber, & Pei, 2012). However, the K-means algorithm is sensitive to the ‘k’ randomly selected initial cluster centers, and these randomly selected centers significantly affect the clustering quality and, in real life, the cost (Celebi et al., 2013; Rahman and Islam, 2014).
Changing the initial centers leads to different clusters and cluster centers. If parking lots or warehouses are considered as cluster centers, their locations will change with the clusters each time the initial cluster centers change. Considering the cost of purchasing or renting parking lots or warehouses in cities, this instability leads to highly variable costs.
The clustering algorithms used to generate demand clusters, and thus to identify warehouses or parking lots, do not take already existing sites (cluster centers) into account. For example, a fleet management company with three parking lots (fixed centers), where its vehicles are held to meet customer demand, may plan to lease or purchase a fourth parking area (non-fixed center) because of increased demand. To determine which of the four parking lots will serve each demand point, the location of the new cluster center must be determined in addition to the existing cluster centers, and each demand point must be assigned to one of these centers. The company can then serve the demand clusters at a low cost with the fourth parking lot, without changing the locations of the three existing ones. However, the clustering algorithms in the literature cannot cluster demand points while keeping some of the cluster centers fixed. Therefore, there is a need for a clustering algorithm that fixes the existing cluster centers and determines the location of the additional center(s) while assigning the demand points to clusters. As the example shows, clustering the demand points without moving the existing parking lots is a real necessity, especially in metropolitan cities, where rental and purchase costs vary widely between parking lot locations. The developed algorithm will be useful for the many organizations that have existing centers and are making new investments. In this study, the K-means algorithm is modified, and two new algorithms, FC-Kmeans and FC-Kmeans 2, are developed to solve the problems described above. In the proposed algorithms, some cluster centers are fixed and are not updated, as opposed to the K-means algorithm, in which all cluster centers are updated iteratively.
To the best of our knowledge, there is no such study in the literature. An experimental study is carried out to determine the fixed center ratio at which the proposed algorithms achieve performance similar to that of the K-means algorithm. Experimental results show that the clustering performance of the FC-Kmeans algorithm reaches K-means performance at a fixed center ratio of around 40%, although the FC-Kmeans algorithm runs under more restricted conditions.
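The general idea of holding some centers fixed while the remaining centers are updated can be illustrated with a minimal Python sketch. This is not the authors' FC-Kmeans or FC-Kmeans 2 procedure, whose phases and initialization are not reproduced in this text; it only shows a Lloyd-style loop in which the update step skips the fixed centers. The function name `fixed_center_kmeans`, the `free_centers` argument (initial guesses for the additional centers), and the toy coordinates are all assumptions made for illustration.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(members):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(members)
    return tuple(sum(coords) / n for coords in zip(*members))


def fixed_center_kmeans(points, fixed_centers, free_centers, max_iter=100):
    """Illustrative Lloyd-style loop in which only the non-fixed centers move.

    fixed_centers: existing sites; they are never updated.
    free_centers:  initial guesses for the additional centers (hypothetical input).
    """
    fixed = [tuple(c) for c in fixed_centers]
    free = [tuple(c) for c in free_centers]
    labels = []
    for _ in range(max_iter):
        centers = fixed + free
        # Assignment step: each point goes to its nearest center, fixed or not.
        labels = [
            min(range(len(centers)), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        # Update step: recompute the free centers only.
        new_free = []
        for j in range(len(fixed), len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            # An empty cluster keeps its current center.
            new_free.append(mean_point(members) if members else centers[j])
        if new_free == free:
            break
        free = new_free
    return fixed + free, labels


# Toy data: two existing sites and one additional center to be located.
demand = [(0.0, 1.0), (0.0, -1.0), (10.0, 1.0), (10.0, -1.0), (5.0, 8.0), (5.0, 10.0)]
existing = [(0.0, 0.0), (10.0, 0.0)]
centers, labels = fixed_center_kmeans(demand, existing, free_centers=[(4.0, 4.0)])
print(centers)
print(labels)
```

In this toy run the two existing sites stay where they are and only the third center moves toward the demand points that neither existing site serves well. How the additional centers should be initialized is exactly where the two proposed algorithms differ, so the explicit `free_centers` argument here is a placeholder rather than a recommendation.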
In Section 2, the literature review is presented. Section 3 explains the K-means and proposed algorithms FC-Kmeans and FC-Kmeans 2. Section 4 presents an extensive computational study. In Section 5, an analysis of the clustering performances of the proposed algorithms is presented. Section 6 contains comments on the conclusions and future work.
Related work
The K-means algorithm and its various extensions are popular and commonly used in several clustering studies presented in Table 1. The K-means algorithm was proposed by MacQueen (1967). The algorithm is an unsupervised machine learning method used in data mining and pattern recognition. The K-means algorithm has the advantages of brevity, efficiency, and speed. Minimizing the sum of squared errors, which is the objective function, is the basis of this algorithm. The algorithm tries to find ‘k’
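The sum-of-squared-errors objective mentioned above can be made concrete with a short Python sketch. It assumes squared Euclidean distance and that each point is charged to its nearest center; the function name `sum_of_squared_errors` and the sample coordinates are illustrative and do not come from the paper.

```python
def sum_of_squared_errors(points, centers):
    """Sum over all points of the squared Euclidean distance to the nearest center."""
    total = 0.0
    for p in points:
        total += min(
            sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers
        )
    return total


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]

# Two centers placed in the middle of each pair of points.
good = sum_of_squared_errors(points, [(0.0, 1.0), (10.0, 1.0)])
# A single center placed between the two pairs.
poor = sum_of_squared_errors(points, [(5.0, 1.0)])
print(good, poor)
```

A lower value means the centers sit closer to the points they serve, which is what the K-means iterations try to achieve.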
Fixed-Centered K-means algorithms
The K-means algorithm randomly determines the initial cluster centers. The pseudo-code of the algorithm can be described as Algorithm 1 (Kliemann & Sanders, 2016).
Algorithm 1. K-means algorithm
Input: max_iter: number of allowed maximum iterations
1: Assign initial centers randomly
2: repeat
3: for each do
4: Let
5:
6: for do
7: if then
8: until the cluster centers
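Because the mathematical notation of Algorithm 1 did not survive in this text, the following Python sketch shows the standard K-means (Lloyd) iteration rather than a transcription of the cited pseudo-code. It assumes squared Euclidean distance, initial centers drawn at random from the data points, and termination when the centers stop changing or `max_iter` is reached. The `seed` parameter and the helper names are additions for reproducibility and readability.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest_index(p, centers):
    """Index of the center closest to point p (first one wins on ties)."""
    return min(range(len(centers)), key=lambda j: squared_distance(p, centers[j]))


def mean_point(members):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(members)
    return tuple(sum(coords) / n for coords in zip(*members))


def kmeans(points, k, max_iter=100, seed=0):
    """Standard K-means with random initial centers (illustrative sketch)."""
    rng = random.Random(seed)
    # Assign initial centers randomly: here, k distinct data points.
    centers = rng.sample(points, k)
    labels = []
    for _ in range(max_iter):
        # Assign every point to its nearest center.
        labels = [nearest_index(p, centers) for p in points]
        # Move each center to the mean of its members; empty clusters stay put.
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            new_centers.append(mean_point(members) if members else centers[j])
        # Stop once the cluster centers no longer change.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels


points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (100.0, 0.0), (101.0, 0.0), (102.0, 0.0)]
centers, labels = kmeans(points, k=2, seed=0)
print(sorted(centers), labels)
```

On data as well separated as this toy set, every choice of initial centers ends at the same two centers. On less clean data, different seeds can end in different local optima, which is the sensitivity to initial centers discussed in the Introduction.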
Experimental setup
A real-world Istanbul dataset and a publicly available Uber dataset (Dataworld, 2014), consisting of the latitude and longitude features of 69,902 and 60,000 geographical points, respectively, are used in the experiments. Fig. 4 shows the parking lots (red points) used as fixed centers and the demand points (blue) of the Istanbul data. The latitude and longitude values given as points in the dataset represent the demand locations where a company requested a vehicle for its customers. At the same time, the fixed centers
Results & discussion
All algorithms are implemented in Python and run on a personal computer with an Intel Core i7-7700 3.6 GHz processor and 8 GB of RAM. The time complexity of the K-means and K-means++ algorithms is O(n) (Yang et al., 2019). Phase-I and Phase-II of the FC-Kmeans algorithm perform the same operations as K-means. Therefore, the time complexity of each phase is O(n), and the time complexity of the FC-Kmeans algorithm is O(n). Also, the time complexities of Phase-I and Phase-II of FC-Kmeans 2 are O(k-m) and O(n)
Conclusion
In this study, a problem that may arise in real life, especially in clustering studies on spatial data, has been identified, and two algorithms, FC-Kmeans and FC-Kmeans 2, have been developed to determine the location of additional cluster centers in cases where some cluster centers are fixed. These two algorithms, both based on K-means, take different approaches to determining the initial cluster centers and allow some of the cluster centers to be fixed. The purpose of this study is not to improve
CRediT authorship contribution statement
Merhad Ay: Conceptualization, Methodology, Software, Validation, Formal analysis, Resources, Data curation, Writing – original draft, Writing – review & editing, Visualization. Lale Özbakır: Conceptualization, Methodology, Formal analysis, Writing – review & editing, Supervision, Project administration. Sinem Kulluk: Conceptualization, Methodology, Data curation, Supervision. Burak Gülmez: Software, Validation, Investigation, Resources, Data curation. Güney Öztürk: Methodology, Resources, Data
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This work was supported by the Erciyes University Technology Transfer Office (068552-156). We are also thankful to the Turkcell company for their data which helped us in this research.
References (31)
- et al. (2003). Cluster validation techniques for genome expression data. Signal Processing.
- et al. (2021). Supervised kernel density estimation K-means. Expert Systems with Applications.
- et al. (2013). A comparative study of efficient initialization methods for the k-means clustering algorithm. Expert Systems with Applications.
- et al. (2021). Robust deep k-means: An effective and simple method for data clustering. Pattern Recognition.
- et al. (2014). DSKmeans: A new kmeans-type approach to discriminative subspace clustering. Knowledge-Based Systems.
- et al. (2018). Combining K-Means and a genetic algorithm through a novel arrangement of genetic operators for high quality clustering. Expert Systems with Applications.
- et al. (2012). Stores clustering using a data mining approach for distributing automotive spare-parts to reduce transportation costs. Expert Systems with Applications.
- et al. (2012). A clustering method based on K-Means algorithm. Physics Procedia.
- et al. (2011). An efficient hybrid algorithm based on modified imperialist competitive algorithm and K-means for data clustering. Engineering Applications of Artificial Intelligence.
- et al. (2014). A hybrid clustering technique combining a novel genetic algorithm with K-Means. Knowledge-Based Systems.
- Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics.
- Clustering analysis using an adaptive fused distance. Engineering Applications of Artificial Intelligence.
- A new geometric shape-based genetic clustering algorithm for the multi-depot vehicle routing problem. Expert Systems with Applications.
- An automatic K-Means clustering algorithm of GPS data combining a novel niche genetic algorithm with noise and density. ISPRS International Journal of Geo-Information.
- K-means++: The advantages of careful seeding.
1. ORCID: 0000-0002-8103-7715.