
Min–max kurtosis mean distance based k-means initial centroid initialization method for big genomic data clustering

Research Paper · Published in Evolutionary Intelligence

Abstract

Genomic clustering is a big data application that uses K-means (KM) clustering to discover hidden patterns and trends in genes for disease diagnosis, biological analysis, and tissue detection. The KM algorithm depends heavily on its initial centroids, which determine the effectiveness, efficiency, resource consumption, and susceptibility to local optima of the clustering. Existing centroid initialization approaches get trapped in local optima because of randomization and incur high computational cost because of the enormous number of interrelated dimensions. As a result, KM produces low-quality clusters while maximizing computation time and resource consumption. To address this issue, this study presents the Min–Max Kurtosis Mean Distance (MKMD) initialization algorithm for big data clustering in a single-machine environment. MKMD improves the effectiveness and efficiency of KM by measuring the distance between the data points of the minimum- and maximum-kurtosis dimensions and their mean. The performance of the presented algorithm is compared against the KM, KM++, ADV, MKM, Mean-KM, NFD, K-MAM, NRKM2, FMNN, and MuKM algorithms using internal and external effectiveness criteria together with an efficiency assessment on sixteen genomic datasets. The experimental results reveal that the MKMD-seeded KM algorithm (MKMDKM) reduces iterations, distance computations, data comparisons, local optima, and resource consumption, and improves cluster quality, effectiveness, and efficiency with stable convergence and results compared with the other algorithms. A Friedman test with post hoc analysis confirms that the improvements of the proposed MKMDKM algorithm are statistically significant.
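To make the seeding idea concrete, the sketch below illustrates one plausible reading of what the abstract describes: score every dimension by kurtosis, keep only the minimum- and maximum-kurtosis dimensions, rank points by their distance from the mean along those dimensions, and average k equal-sized bands of that ranking into initial centroids. This is a minimal sketch based solely on the abstract; the helper name mkmd_init, the equal-band split, and the scikit-learn hand-off are assumptions, not the authors' published procedure.

```python
import numpy as np
from scipy.stats import kurtosis
from sklearn.cluster import KMeans

def mkmd_init(X, k):
    """Kurtosis-guided seeding in the spirit of MKMD (an illustrative
    guess at the idea, not the authors' exact algorithm)."""
    # Kurtosis of every dimension; keep only the least- and
    # most-kurtotic ones so we never scan all dimensions per point.
    kurt = kurtosis(X, axis=0)
    dims = [int(np.argmin(kurt)), int(np.argmax(kurt))]
    sub = X[:, dims]
    # Distance of each point from the mean of the selected dimensions.
    d = np.linalg.norm(sub - sub.mean(axis=0), axis=1)
    # Sort points by that distance, split the ordering into k equal
    # bands, and use each band's mean as one initial centroid.
    order = np.argsort(d)
    return np.array([X[band].mean(axis=0)
                     for band in np.array_split(order, k)])

# Usage: seed scikit-learn's KMeans with the deterministic centroids.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 50))
C = mkmd_init(X, k=5)
km = KMeans(n_clusters=5, init=C, n_init=1).fit(X)
```

Because the seeding is deterministic, repeated runs converge to the same result, which is consistent with the stable convergence the abstract claims for MKMDKM.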



Author information

Correspondence to Kamlesh Kumar Pandey.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.


Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Pandey, K.K., Shukla, D. Min–max kurtosis mean distance based k-means initial centroid initialization method for big genomic data clustering. Evol. Intel. 16, 1055–1076 (2023). https://doi.org/10.1007/s12065-022-00720-3

