Abstract
Genomic clustering is a big data application that uses K-means (KM) clustering to discover hidden patterns and trends in genes for disease diagnosis, biological analysis, and tissue detection. The KM algorithm is highly sensitive to the initial centroids, which determine its effectiveness, efficiency, resource consumption, and susceptibility to local optima. Existing initial centroid initialization approaches become trapped in local optima because of randomization and incur high computational cost because of the enormous number of interrelated dimensions. As a result, the KM algorithm produces low-quality clusters and consumes excessive computation time and resources. To address this issue, this study presents the Min–Max Kurtosis Mean Distance (MKMD) algorithm for big data clustering in a single-machine environment. The MKMD algorithm enhances the effectiveness and efficiency of the KM algorithm by measuring the distance between the data points of the minimum- and maximum-kurtosis dimensions and their mean. The performance of the presented algorithm has been compared against the KM, KM++, ADV, MKM, Mean-KM, NFD, K-MAM, NRKM2, FMNN, and MuKM algorithms using internal and external effectiveness criteria together with an efficiency assessment on sixteen genomic datasets. The experimental results reveal that the MKMD-initialized KM (MKMDKM) algorithm reduces iterations, distance computations, data comparisons, local optima, and resource consumption, and improves cluster quality, effectiveness, and efficiency with stable convergence compared to the other algorithms. Statistical analysis using the Friedman test and a post hoc test confirms that the improvements of the proposed MKMDKM algorithm are statistically significant.
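The abstract's description of MKMD (distances between data points of the minimum- and maximum-kurtosis dimensions and their mean) can be illustrated with a short sketch. The sketch below is one plausible reading, not the paper's exact procedure: the function name `mkmd_init`, the pure-NumPy kurtosis estimate, and the equal-size partitioning of points sorted by mean distance are all assumptions made for illustration.

```python
import numpy as np

def mkmd_init(X, k):
    """Illustrative min-max-kurtosis mean-distance seeding (assumed variant).

    X : (n, d) data matrix, k : desired number of clusters.
    Returns a (k, d) array of initial centroids.
    """
    # Excess kurtosis of every dimension (pure-NumPy estimate).
    mu, sd = X.mean(axis=0), X.std(axis=0)
    kurt = ((X - mu) ** 4).mean(axis=0) / sd**4 - 3

    # Project onto the minimum- and maximum-kurtosis dimensions.
    dims = [int(np.argmin(kurt)), int(np.argmax(kurt))]
    sub = X[:, dims]

    # Distance of every point to the mean in that 2-D subspace.
    dist = np.linalg.norm(sub - sub.mean(axis=0), axis=1)

    # Sort points by that distance, split into k groups,
    # and use each group's mean (in full dimension) as a seed.
    groups = np.array_split(np.argsort(dist), k)
    return np.vstack([X[g].mean(axis=0) for g in groups])
```

Such deterministic, distance-ordered seeding avoids the randomization that the abstract identifies as the cause of unstable convergence, while touching only two of the data's dimensions per point.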
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Cite this article
Pandey, K.K., Shukla, D. Min–max kurtosis mean distance based k-means initial centroid initialization method for big genomic data clustering. Evol. Intel. 16, 1055–1076 (2023). https://doi.org/10.1007/s12065-022-00720-3