An effective clustering scheme for high-dimensional data

He, Xuansen; He, Fan; Fan, Yueping; Jiang, Lingmin; Liu, Runzong; Maalla, Allam

doi:10.1007/s11042-023-17129-4

Published: 19 October 2023

Volume 83, pages 45001–45045, (2024)

Multimedia Tools and Applications

Xuansen He¹,² (ORCID: orcid.org/0000-0002-1541-7832),
Fan He³,
Yueping Fan¹,
Lingmin Jiang¹,
Runzong Liu¹ &
Allam Maalla⁴

Abstract

The classical K-means algorithm is widely used in many fields, but it still has some defects. This paper therefore proposes a scheme to improve the clustering quality of the K-means algorithm. Farthest initial center selection and the min–max rule replace the random initialization of K-means, which avoids empty clusters in the clustering results. For high-dimensional data sets, standardized feature scaling rescales each feature to zero mean and unit variance, and supervised linear discriminant analysis (LDA) is used to reduce the data dimension and facilitate visualization. An empirical rule is used to estimate the range of the number of clusters. Within this range, the number of clusters is estimated visually by locating the elbow of the sum-of-squared-errors (SSE) curve. Further, a novel clustering validity function f(K) is proposed to determine the optimal number of clusters for complex real-world data sets. Through silhouette analysis, the clustering quality can be evaluated intuitively by calculating the silhouette coefficient of each cluster and observing its size. Simulation results on different types of data sets show that this scheme not only improves the clustering quality of the K-means algorithm but also provides a visual cluster analysis method for high-dimensional data sets.
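
As a rough illustration of the steps named in the abstract, the following pure-Python sketch seeds K-means with a farthest-point rule, scans a range of K values, and reports the SSE and the mean silhouette coefficient for each K. It is not the authors' implementation: the seeding details (first seed farthest from the data mean, later seeds by a max-min distance rule), the bound K ≤ √n used as the empirical rule, the toy data, and all function names are assumptions made for this example. The validity function f(K) and the LDA step are not reproduced because the abstract does not define them.

```python
import math

def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    """Component-wise mean of a non-empty list of points."""
    return tuple(sum(col) / len(points) for col in zip(*points))

def farthest_first_centers(points, k):
    """Assumed seeding: farthest point from the data mean, then a max-min rule."""
    centroid = mean(points)
    centers = [max(points, key=lambda p: dist2(p, centroid))]
    while len(centers) < k:
        # Next seed: the point whose nearest existing seed is farthest away.
        centers.append(max(points, key=lambda p: min(dist2(p, c) for c in centers)))
    return centers

def kmeans(points, k, max_iter=100):
    """Lloyd iterations from the seeds above; returns (labels, centers, sse)."""
    centers = farthest_first_centers(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: dist2(p, centers[j])) for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the previous center if a cluster ever empties
                centers[j] = mean(members)
    sse = sum(dist2(p, centers[lab]) for p, lab in zip(points, labels))
    return labels, centers, sse

def mean_silhouette(points, labels):
    """Average silhouette coefficient s = (b - a) / max(a, b) over all points."""
    scores = []
    for i, p in enumerate(points):
        by_cluster = {}
        for j, q in enumerate(points):
            if j != i:
                by_cluster.setdefault(labels[j], []).append(math.dist(p, q))
        own = by_cluster.pop(labels[i], None)
        if not own or not by_cluster:
            scores.append(0.0)  # singleton cluster or only one cluster
            continue
        a = sum(own) / len(own)                                 # cohesion
        b = min(sum(d) / len(d) for d in by_cluster.values())  # separation
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)

# Toy data (hypothetical): three well-separated groups of six 2-D points.
offsets = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.2), (0.2, 0.7)]
bases = [(0, 0), (10, 0), (5, 9)]
points = [(bx + ox, by + oy) for bx, by in bases for ox, oy in offsets]

k_max = int(math.sqrt(len(points)))  # assumed empirical rule: K <= sqrt(n)
results = {}
for k in range(2, k_max + 1):
    labels, centers, sse = kmeans(points, k)
    results[k] = (sse, mean_silhouette(points, labels))
    print(f"K={k}  SSE={sse:.2f}  mean silhouette={results[k][1]:.3f}")

best_k = max(results, key=lambda k: results[k][1])
print("K with the highest mean silhouette:", best_k)
```

Because every seed is a distinct data point, each cluster starts with at least one member, which is one way to read the abstract's claim about avoiding empty clusters. On real data the SSE values would normally be plotted against K to look for the elbow rather than printed.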

Data Availability

The data used in this article are all from publicly available datasets in the UCI Machine Learning Repository, http://archive.ics.uci.edu/.

Acknowledgements

This work was supported in part by the National Natural Science Foundation of China (No. 71972013) and in part by the Special Projects in Key Fields of Ordinary Colleges and Universities in Guangdong Province (New Generation Information Technology) (No. 2021ZDZX1035).

Author information

Authors and Affiliations

School of Information Technology and Engineering, Guangzhou College of Commerce, Guangzhou, China
Xuansen He, Yueping Fan, Lingmin Jiang & Runzong Liu
College of Information Science and Engineering, Hunan University, Changsha, China
Xuansen He
School of Management and Economics, Beijing Institute of Technology, Beijing, China
Fan He
School of Engineering, Guangzhou College of Technology and Business, Guangzhou, China
Allam Maalla

Contributions

Xuansen He designed and performed the research. Fan He participated in part of the research and put forward meaningful suggestions. Xuansen He wrote the manuscript. Yueping Fan, Lingmin Jiang, and Runzong Liu provided the collation of some research contents. Allam Maalla proposed some revisions to the manuscript.

Corresponding author

Correspondence to Xuansen He.

Ethics declarations

Conflict of interest

The authors have no competing interests to declare that are relevant to the content of this article.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

He, X., He, F., Fan, Y. et al. An effective clustering scheme for high-dimensional data. Multimed Tools Appl 83, 45001–45045 (2024). https://doi.org/10.1007/s11042-023-17129-4

Received: 21 February 2023
Revised: 09 July 2023
Accepted: 15 September 2023
Published: 19 October 2023
Issue Date: May 2024
DOI: https://doi.org/10.1007/s11042-023-17129-4
