An optimized SVM-RFE based feature selection and weighted entropy K-means approach for big data clustering in mapreduce

Madan, Suman; C, Komalavalli; Bhatia, Manjot Kaur; Laroiya, Chetna; Arora, Monika

doi:10.1007/s11042-023-18044-4

Published: 15 February 2024

Multimedia Tools and Applications (2024)

Suman Madan¹, Komalavalli C², Manjot Kaur Bhatia¹, Chetna Laroiya¹ & Monika Arora³

Abstract

In a digitalized world that generates data at massive scale, efficient big data clustering is necessary. The choice of clustering algorithm plays an important role in managing computational complexity. Big data arriving from various sources is processed in the MapReduce framework (MRF), guided by clustering algorithms, and clustering is also useful for mining significant information from a dataset. Applying clustering to big data is difficult, however, mainly because of computation cost and the need to finish in reasonable time. Hence, this research introduces Competitive Jaya Leader Harris Hawks Optimization assisted Entropy Weighted Power K-Means Clustering (CJayaLHHO_EWPKMC) for big data clustering. The entire method runs in the MapReduce (MR) framework. In the mapper, feature selection is performed using Support Vector Machine-Recursive Feature Elimination (SVM-RFE) assisted by Jaya Leader Harris Hawks Optimization (JayaLHHO). In the reducer, big data clustering is performed with the EWPKMC method, whose weights are tuned by the CJayaLHHO algorithm to produce the final clustering. The proposed method is scalable, simple, cost-effective, and able to integrate with other technologies. Experimental results show that the developed method outperforms conventional methods, attaining a clustering accuracy of 0.937, a Jaccard coefficient of 0.913, and a Rand coefficient of 0.912.
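The abstract reports a Jaccard coefficient and a Rand coefficient without defining them. As a minimal sketch, assuming the standard pair-counting definitions (the preview does not confirm which variants the authors used), both can be computed by comparing every pair of samples under the reference labels and the predicted clusters. The function names below are illustrative and do not come from the article.

```python
from itertools import combinations


def pair_counts(true_labels, pred_labels):
    """Count sample pairs by how the two labelings group them.

    Returns (ss, sd, ds, dd):
      ss: same group in both labelings
      sd: same in true labels, different in predicted clusters
      ds: different in true labels, same in predicted clusters
      dd: different in both
    """
    if len(true_labels) != len(pred_labels):
        raise ValueError("label sequences must have the same length")
    ss = sd = ds = dd = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_true = true_labels[i] == true_labels[j]
        same_pred = pred_labels[i] == pred_labels[j]
        if same_true and same_pred:
            ss += 1
        elif same_true:
            sd += 1
        elif same_pred:
            ds += 1
        else:
            dd += 1
    return ss, sd, ds, dd


def rand_coefficient(true_labels, pred_labels):
    """Fraction of pairs on which the two labelings agree."""
    ss, sd, ds, dd = pair_counts(true_labels, pred_labels)
    total = ss + sd + ds + dd
    return (ss + dd) / total if total else 1.0


def jaccard_coefficient(true_labels, pred_labels):
    """Agreement on co-clustered pairs, ignoring pairs split in both."""
    ss, sd, ds, _ = pair_counts(true_labels, pred_labels)
    denom = ss + sd + ds
    return ss / denom if denom else 1.0


if __name__ == "__main__":
    # Toy labels chosen for illustration only; not data from the article.
    truth = [0, 0, 0, 1, 1, 1]
    clusters = [0, 0, 1, 1, 1, 1]
    print(rand_coefficient(truth, clusters))
    print(jaccard_coefficient(truth, clusters))
```

Enumerating all pairs is quadratic in the number of samples, so this form suits small checks only; at the scale the article targets, the same counts would more plausibly be derived from a contingency table of label co-occurrences.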

Data availability

The data underlying this article are available in the MHEALTH Dataset at http://archive.ics.uci.edu/ml/datasets/mhealth+dataset# and the Skin Segmentation Dataset at https://archive.ics.uci.edu/ml/datasets/Skin+Segmentation.

Acknowledgements

I would like to express my very great appreciation to the co-authors of this manuscript for their valuable and constructive suggestions during the planning and development of this research work.

Funding

This research did not receive any specific funding.

Author information

Authors and Affiliations

Department of Information Technology, Jagan Institute of Management Studies, Sector 5, Rohini, New Delhi, India
Suman Madan, Manjot Kaur Bhatia & Chetna Laroiya
School of CSE and IS, Presidency University, Bengaluru, India
Komalavalli C
Department of CSE, Bhagwan Parshuram Institute of Technology, Rohini, Delhi, India
Monika Arora

Contributions

All authors have made substantial contributions to conception and design, revising the manuscript, and the final approval of the version to be published. Also, all authors agreed to be accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved.

Corresponding author

Correspondence to Suman Madan.

Ethics declarations

Conflict of interest

The authors declare no conflict of interest.

Informed consent

Not Applicable.

Ethical approval

Not Applicable.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Madan, S., C, K., Bhatia, M.K. et al. An optimized SVM-RFE based feature selection and weighted entropy K-means approach for big data clustering in mapreduce. Multimed Tools Appl (2024). https://doi.org/10.1007/s11042-023-18044-4

Received: 28 February 2023
Revised: 17 November 2023
Accepted: 26 December 2023
Published: 15 February 2024
DOI: https://doi.org/10.1007/s11042-023-18044-4
