FC-Kmeans: Fixed-centered K-means algorithm

https://doi.org/10.1016/j.eswa.2022.118656

Highlights

  • Two algorithms that determine different initial cluster centers are proposed.

  • The aim is to perform clustering while taking existing fixed cluster centers into account.

  • The proposed algorithms are applied to real-life spatial data having fixed centers.

  • FC-Kmeans converged to K-means performance at fixed center ratios of 43% and below.

Abstract

Clustering is a data mining method that partitions large data sets into subgroups according to their similarities. The K-means clustering algorithm works well on large data sets with spherical or convex distributions. Most K-means-based algorithms have focused on the selection of initial cluster centers or on cluster distribution. However, these algorithms may not satisfy some practical requirements. This paper presents the FC-Kmeans algorithm, which enables clustering with some cluster centers fixed to reflect real conditions. While some of the cluster centers are fixed, the algorithm seeks the most appropriate centers for the remaining clusters and the best assignment of the data to the clusters. The K-means clustering algorithm is compared with two fixed-centered clustering algorithms, FC-Kmeans and FC-Kmeans 2. The experimental results show that, although the FC-Kmeans algorithm operates under more restrictions than K-means, it converges to the performance of the K-means algorithm according to performance indicators such as SSE, the DB Index and the Silhouette Index.

Introduction

Demand in the service and logistics sector has been increasing rapidly in recent years, especially due to pandemic conditions. With this increase, the location of warehouses and the boundaries of demand regions or clusters have become very important. In this sense, it is crucial to analyze big data derived from demand records and vehicle GPS data. The demand clusters can be determined by analyzing this data and clustering the demand points. Cluster centers (warehouses) can then be determined for each demand cluster, so that service can be provided from cluster centers to demand points at a low cost. However, the warehouses referred to as cluster centers may already exist in some cases, and it may be desired to determine additional cluster centers (warehouses) and demand clusters in response to increasing demand. In such cases, defining new cluster centers in place of the existing ones does not suit real-world circumstances, because both canceling an existing center and establishing a new one incur extra costs. Therefore, there is a need for an algorithm that clusters the demand points while taking the existing centers into account and determines the location of the additional cluster centers. In this way, the locations of all clusters and additional cluster centers are determined by clustering demand points without canceling the existing centers.

Clustering is an unsupervised machine learning method. Its purpose is to assign similar points to the same cluster under specific criteria and dissimilar points to other clusters. There are many clustering algorithms, and the K-means algorithm is one of the most efficient yet straightforward among them (Han, Kamber, & Pei, 2012). However, the K-means algorithm is sensitive to its 'k' randomly selected initial cluster centers, and these randomly selected centers significantly affect the clustering quality and the real-life cost (Celebi et al., 2013, Rahman and Islam, 2014). Changing the initial centers leads to different clusters and cluster centers. If parking lots or warehouses are treated as cluster centers, their locations will keep changing along with the clusters, depending on the initial cluster centers. Considering the cost of purchasing or renting parking lots or warehouses in cities, this instability will lead to highly variable costs.

The clustering algorithms used to generate demand clusters, and thus to identify warehouses or parking lots, do not take already existing locations (cluster centers) into account. For example, a fleet management company with three parking lots (fixed centers), where its vehicles are held to meet customer demands, may plan to lease or purchase a fourth parking area (non-fixed center) because of increased demand. To determine which of the four parking lots will serve each demand, the location of the new cluster center must be determined in addition to the existing cluster centers, and each demand must be assigned to one of these centers. The company can then serve the demand clusters at a low cost with the fourth parking lot, without changing the location of the three existing ones. However, the clustering algorithms in the literature cannot cluster demand points while keeping some existing cluster centers fixed. Therefore, there is a need for a clustering algorithm that fixes existing cluster centers and determines the location of the additional center(s) while assigning the demand points to the clusters. As the example shows, clustering the demand points without changing the existing parking lots is a real necessity, especially in metropolitan cities, where the rental or purchase cost varies more between parking lot locations. The developed algorithm will be useful for many organizations that have existing centers when they make new investments.

In this study, the K-means algorithm is modified and two new algorithms, FC-Kmeans and FC-Kmeans 2, are developed to solve the problems described above. In the proposed algorithms, some cluster centers are fixed and are not updated, as opposed to the K-means algorithm, which iteratively updates all cluster centers. To the best of our knowledge, there is no such study in the literature. An experimental study is carried out to determine the fixed center ratio at which the proposed algorithms achieve performance similar to that of the K-means algorithm. Experimental results show that the clustering performance of the FC-Kmeans algorithm reaches K-means performance at a fixed center ratio of around 40%, although the FC-Kmeans algorithm runs under more restricted conditions.
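The idea of keeping some centers fixed while the others are updated can be illustrated with a minimal Python sketch. It is not the authors' FC-Kmeans or FC-Kmeans 2 (their Phase-I and Phase-II details are not given in this excerpt); it is an ordinary K-means loop in which the rows holding fixed centers are simply never updated. The function names, the parameters `n_free`, `free_init` and `seed`, and the random initialization of the free centers are all assumptions made for illustration.

```python
import numpy as np


def nearest(points, centers):
    """Return, for every point, the index of its closest center."""
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1)


def fixed_center_kmeans(points, fixed_centers, n_free, free_init=None,
                        max_iter=100, seed=0):
    """Illustrative K-means variant: fixed centers are never moved."""
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)
    fixed = np.asarray(fixed_centers, dtype=float)
    m = len(fixed)
    if free_init is None:
        # Assumption: free centers start at randomly chosen data points.
        free = points[rng.choice(len(points), size=n_free, replace=False)]
    else:
        free = np.asarray(free_init, dtype=float)
    # Rows 0..m-1 hold the fixed centers, rows m.. hold the free ones.
    centers = np.vstack([fixed, free])
    for _ in range(max_iter):
        labels = nearest(points, centers)
        new_centers = centers.copy()
        for i in range(m, m + n_free):  # only free centers are updated
            members = points[labels == i]
            if len(members) > 0:
                new_centers[i] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, nearest(points, centers)
```

In the parking lot example, `fixed_centers` would hold the three existing lots and `n_free=1` would ask for one additional location. Like K-means itself, this sketch depends on where the free centers start, so different seeds can give different results.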

In Section 2, the literature review is presented. Section 3 explains the K-means and proposed algorithms FC-Kmeans and FC-Kmeans 2. Section 4 presents an extensive computational study. In Section 5, an analysis of the clustering performances of the proposed algorithms is presented. Section 6 contains comments on the conclusions and future work.

Section snippets

Related work

The K-means algorithm and its various extensions are popular and commonly used in several clustering studies presented in Table 1. The K-means algorithm was proposed by MacQueen (1967). The algorithm is an unsupervised machine learning method used in data mining and pattern recognition. The K-means algorithm has the advantages of brevity, efficiency, and speed. Minimizing the sum of squared errors, which is the objective function, is the basis of this algorithm. The algorithm tries to find ‘k

Fixed-Centered K-means algorithms

The K-means algorithm randomly determines the initial cluster centers. The pseudo-code of the algorithm can be described as Algorithm 1 (Kliemann & Sanders, 2016).

Algorithm 1. K-means algorithm
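Input:
 $p_1, p_2, p_3, \ldots, p_n$: set of points $P$
 $C_1, C_2, C_3, \ldots, C_k$: set of clusters $C$
 $x, y \in p$
 max_iter: number of allowed maximum iterations
1: Assign initial centers $\mu_1, \mu_2, \mu_3, \ldots, \mu_k$ randomly
2: repeat
3:  $C_1, \ldots, C_k \leftarrow \emptyset$
4:  for each $p \in P$ do
5:   Let $i = \arg\min_{i = 1, \ldots, k} \lVert p - \mu_i \rVert^2$
6:   $C_i \leftarrow C_i \cup \{p\}$
7:  for $i = 1, \ldots, k$ do
8:   if $C_i \neq \emptyset$ then $\mu_i = \frac{1}{|C_i|} \sum_{p \in C_i} p$
9: until the cluster centers $\mu_1$,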
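The steps of Algorithm 1 can be sketched in Python with NumPy. The stopping condition is cut off in the listing above, so the sketch assumes the usual rule: stop when the centers no longer change or when `max_iter` is reached. The function name `kmeans`, the `seed` parameter and the choice of random data points as initial centers are illustrative assumptions.

```python
import numpy as np


def kmeans(points, k, max_iter=100, seed=0):
    """Plain K-means: assign points to nearest center, then recompute means."""
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)

    def assign(centers):
        # Distance from every point to every center, shape (n, k).
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        return dists.argmin(axis=1)

    # Random initial centers: k distinct data points (an assumption).
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        labels = assign(centers)
        new_centers = centers.copy()
        for i in range(k):
            members = points[labels == i]
            if len(members) > 0:  # an empty cluster keeps its old center
                new_centers[i] = members.mean(axis=0)
        if np.allclose(new_centers, centers):  # assumed stopping rule
            break
        centers = new_centers
    return centers, assign(centers)
```

A call such as `kmeans(data, k=4)` returns the final centers and one cluster label per point. Every center is free to move here, which is the behavior the fixed-centered algorithms restrict.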

Experimental setup

Two datasets are used in the experiments: a real-world Istanbul dataset and a publicly available Uber dataset (Dataworld, 2014), consisting of the latitude and longitude of 69,902 and 60,000 geographical points, respectively. Fig. 4 shows the parking lots (red points), which serve as fixed centers, and the demand points (blue) of the Istanbul data. The latitude and longitude values in the dataset represent the demand locations where a company requested a vehicle for its customers. At the same time, the fixed centers

Results & discussion

All algorithms are implemented in Python and run on a personal computer with an Intel Core i7-7700 3.6 GHz processor and 8 GB of RAM. The time complexity of the K-means and K-means++ algorithms is O(n) (Yang et al., 2019). Phase-I and Phase-II of the FC-Kmeans algorithm perform the same operations as K-means. Therefore, the time complexity of each phase is O(n), and the time complexity of the FC-Kmeans algorithm is O(n). Also, the time complexities of FC-Kmeans 2 Phase-I and Phase-II are O(k-m) and O(n)

Conclusion

In this study, a problem that may arise in real life, especially in clustering studies on spatial data, has been identified, and two algorithms, FC-Kmeans and FC-Kmeans 2, have been developed to determine the location of additional cluster centers in cases where some cluster centers are fixed. These two algorithms, both based on K-means, take different approaches to determining the initial cluster centers and allow some of the cluster centers to be fixed. The purpose of this study is not to improve

CRediT authorship contribution statement

Merhad Ay: Conceptualization, Methodology, Software, Validation, Formal analysis, Resources, Data curation, Writing – original draft, Writing – review & editing, Visualization. Lale Özbakır: Conceptualization, Methodology, Formal analysis, Writing – review & editing, Supervision, Project administration. Sinem Kulluk: Conceptualization, Methodology, Data curation, Supervision. Burak Gülmez: Software, Validation, Investigation, Resources, Data curation. Güney Öztürk: Methodology, Resources, Data

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported by the Erciyes University Technology Transfer Office (068552-156). We are also thankful to the Turkcell company for their data which helped us in this research.

References (31)

  • P.J. Rousseeuw. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics (1987)
  • K.K. Sharma et al. Clustering analysis using an adaptive fused distance. Engineering Applications of Artificial Intelligence (2020)
  • G.N. Yucenur et al. A new geometric shape-based genetic clustering algorithm for the multi-depot vehicle routing problem. Expert Systems with Applications (2011)
  • X. Zhou et al. An automatic K-Means clustering algorithm of GPS data combining a novel niche genetic algorithm with noise and density. ISPRS International Journal of Geo-Information (2017)
  • D. Arthur et al. K-means++: The advantages of careful seeding


    1

    ORCID: 0000-0002-8103-7715.
