An effective clustering scheme for high-dimensional data

Multimedia Tools and Applications

Abstract

Although the classical K-means algorithm is widely used in many fields, it still has several defects. This paper therefore proposes a scheme to improve the clustering quality of the K-means algorithm. Farthest initial center selection and the min–max rule replace the random initialization of K-means, which avoids empty clusters in the clustering results. For high-dimensional data sets, standardized feature scaling rescales each feature to zero mean and unit variance, and supervised linear discriminant analysis (LDA) is used to reduce the data dimension and facilitate visualization. An empirical rule is used to estimate the range of the number of clusters. Within this range, the number of clusters is estimated visually by locating the elbow of the sum-of-squared-errors (SSE) curve. Further, a novel clustering validity function f(K) is proposed to determine the optimal number of clusters for complex real-world data sets. Through silhouette analysis, the clustering quality can be evaluated intuitively by calculating the silhouette coefficient of each cluster and observing its size. Simulation results on different types of data sets show that this scheme not only improves the clustering quality of the K-means algorithm but also provides a visual cluster analysis method for high-dimensional data sets.
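
As an illustration of the initialization and elbow steps described above, the following Python sketch picks initial centers with a farthest-first (max-min) rule, runs standard Lloyd iterations, and tabulates the SSE for a range of K. It is not the authors' implementation: the function names, the choice of the first center (the point farthest from the data mean), and the toy data are assumptions made for this example, and the validity function f(K), the LDA step, and silhouette analysis are not reproduced.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def farthest_first_centers(points, k):
    """Pick k initial centers with a max-min (farthest-first) rule.

    Assumption: the first center is the point farthest from the data mean;
    the paper's own choice of first center may differ.
    """
    dim = len(points[0])
    mean = [sum(p[d] for p in points) / len(points) for d in range(dim)]
    centers = [max(points, key=lambda p: squared_distance(p, mean))]
    while len(centers) < k:
        # Next center: the point whose nearest chosen center is farthest away.
        centers.append(max(
            points,
            key=lambda p: min(squared_distance(p, c) for c in centers),
        ))
    return centers


def kmeans(points, k, max_iter=100):
    """Lloyd's K-means started from farthest-first centers.

    Returns (centers, labels, sse).
    """
    centers = [list(c) for c in farthest_first_centers(points, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest center.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the run has converged
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster ever empties
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    sse = sum(squared_distance(p, centers[lab]) for p, lab in zip(points, labels))
    return centers, labels, sse


def sse_curve(points, k_max):
    """Map each K in 1..k_max to the SSE reached by kmeans(points, K)."""
    return {k: kmeans(points, k)[2] for k in range(1, k_max + 1)}


# Hypothetical toy data: three well-separated groups of three points each.
points = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
    (10.0, 10.0), (10.0, 11.0), (11.0, 10.0),
    (20.0, 0.0), (20.0, 1.0), (21.0, 0.0),
]
for k, sse in sse_curve(points, 5).items():
    print(k, round(sse, 3))
```

On this toy data the SSE falls steeply up to K = 3, where it is about 4, and changes little afterward; that bend is the elbow the abstract refers to. Because the initial centers are chosen deterministically rather than at random, repeated runs give the same partition.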

Data Availability

The data used in this article are all from publicly available datasets in the UCI Machine Learning Repository, http://archive.ics.uci.edu/.

Acknowledgements

This work was supported in part by the National Natural Science Foundation of China (No. 71972013) and in part by the Special Projects in Key Fields of Ordinary Colleges and Universities in Guangdong Province (New Generation Information Technology) (No. 2021ZDZX1035).

Author information

Contributions

Xuansen He designed and performed the research and wrote the manuscript. Fan He participated in part of the research and put forward meaningful suggestions. Yueping Fan, Lingmin Jiang, and Runzong Liu collated some of the research content. Allam Maalla proposed some revisions to the manuscript.

Corresponding author

Correspondence to Xuansen He.

Ethics declarations

Conflict of interest

The authors have no competing interests to declare that are relevant to the content of this article.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

He, X., He, F., Fan, Y. et al. An effective clustering scheme for high-dimensional data. Multimed Tools Appl 83, 45001–45045 (2024). https://doi.org/10.1007/s11042-023-17129-4

  • DOI: https://doi.org/10.1007/s11042-023-17129-4
