Abstract
In the digitalized world, massive data generation makes efficient big data clustering necessary. Clustering algorithms play an important role in reducing computational complexity and in mining significant information from a dataset. Big data arriving from various sources are commonly processed in the MapReduce framework (MRF) with the help of clustering algorithms. However, applying clustering approaches to big data is difficult, chiefly because of the computational cost and the need to finish within a reasonable time. Hence, this research introduces the Competitive Jaya Leader Harris Hawks Optimization assisted Entropy Weighted Power K-Means Clustering (CJayaLHHO_EWPKMC) for big data clustering. The overall processing of the devised method is carried out in the MapReduce (MR) framework. In the mapper, feature selection is done using Support Vector Machine-Recursive Feature Elimination (SVM-RFE) assisted by Jaya Leader Harris Hawks Optimization (JayaLHHO). In the reducer, big data clustering is performed using the EWPKMC method, wherein the weights of EWPKMC are tuned with the CJayaLHHO algorithm to obtain the clustering outcome. The proposed method is scalable, simple, cost-effective, and able to integrate with other technologies. The experimental results show that the developed method outperforms conventional methods, attaining a clustering accuracy of 0.937, a Jaccard coefficient of 0.913, and a Rand coefficient of 0.912.
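For readers unfamiliar with the two pair-based metrics reported above, the following is a minimal sketch of how a Rand coefficient and a Jaccard coefficient can be computed from a reference labeling and a clustering result. It assumes the standard pair-counting definitions; the abstract does not state which exact formulation the authors used, and the function names (`pair_counts`, `rand_coefficient`, `jaccard_coefficient`) are hypothetical, introduced only for illustration.

```python
from itertools import combinations


def pair_counts(labels_true, labels_pred):
    """Count sample pairs by how the two labelings group them.

    Returns (ss, sd, ds, dd):
      ss: same group in both labelings
      sd: same group in the reference only
      ds: same group in the prediction only
      dd: different groups in both labelings
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("label lists must have the same length")
    ss = sd = ds = dd = 0
    # O(n^2) pair enumeration: fine for illustration, not for big data.
    for i, j in combinations(range(len(labels_true)), 2):
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        if same_true and same_pred:
            ss += 1
        elif same_true:
            sd += 1
        elif same_pred:
            ds += 1
        else:
            dd += 1
    return ss, sd, ds, dd


def rand_coefficient(labels_true, labels_pred):
    """Fraction of pairs on which the two labelings agree (assumed definition)."""
    ss, sd, ds, dd = pair_counts(labels_true, labels_pred)
    total = ss + sd + ds + dd
    return (ss + dd) / total if total else 1.0


def jaccard_coefficient(labels_true, labels_pred):
    """Pairs grouped together in both, over pairs grouped together in either."""
    ss, sd, ds, _ = pair_counts(labels_true, labels_pred)
    denom = ss + sd + ds
    return ss / denom if denom else 1.0


if __name__ == "__main__":
    reference = [0, 0, 1, 1]
    predicted = [0, 1, 1, 1]
    print(rand_coefficient(reference, predicted))
    print(jaccard_coefficient(reference, predicted))
```

Both scores depend only on how samples are grouped, not on the cluster identifiers themselves, so a clustering that matches the reference up to a relabeling of clusters scores 1.0 on each.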
Data availability
The data underlying this article are available in the MHEALTH Dataset and the Skin Segmentation Dataset, at http://archive.ics.uci.edu/ml/datasets/mhealth+dataset# and https://archive.ics.uci.edu/ml/datasets/Skin+Segmentation, respectively.
Acknowledgements
I would like to express my very great appreciation to the co-authors of this manuscript for their valuable and constructive suggestions during the planning and development of this research work.
Funding
This research did not receive any specific funding.
Author information
Contributions
All authors have made substantial contributions to conception and design, revising the manuscript, and the final approval of the version to be published. Also, all authors agreed to be accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved.
Ethics declarations
Conflict of interest
The authors declare no conflict of interest.
Informed consent
Not Applicable.
Ethical approval
Not Applicable.
About this article
Cite this article
Madan, S., C, K., Bhatia, M.K. et al. An optimized SVM-RFE based feature selection and weighted entropy K-means approach for big data clustering in mapreduce. Multimed Tools Appl (2024). https://doi.org/10.1007/s11042-023-18044-4
DOI: https://doi.org/10.1007/s11042-023-18044-4