An optimized SVM-RFE based feature selection and weighted entropy K-means approach for big data clustering in mapreduce

Multimedia Tools and Applications

Abstract

In a digitalized world that generates data at massive scale, efficient big data clustering is a necessity. The choice of clustering algorithm plays an important role in managing computational complexity. Big data arriving from various sources is processed in the MapReduce framework (MRF) with the help of clustering algorithms, which are also useful for mining significant information from a dataset. Applying clustering to big data remains difficult, however, chiefly because of the computation cost and the need to finish in reasonable time. Hence, this research introduces the Competitive Jaya Leader Harris Hawks Optimization assisted Entropy Weighted Power K-Means Clustering (CJayaLHHO_EWPKMC) for big data clustering. The overall processing of the devised method is carried out in the MapReduce framework. In the mapper, feature selection is performed using Support Vector Machine-Recursive Feature Elimination (SVM-RFE) together with Jaya Leader Harris Hawks Optimization (JayaLHHO). In the reducer, clustering is performed using the EWPKMC method, whose weights are tuned by the CJayaLHHO algorithm to produce the clustering outcome. The proposed method is scalable, simple, cost-effective, and able to integrate with other technologies. The experimental results show that the developed method outperforms conventional methods, attaining a clustering accuracy of 0.937, a Jaccard coefficient of 0.913, and a Rand coefficient of 0.912.
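
The abstract reports a Jaccard coefficient and a Rand coefficient but does not state how they are computed. The sketch below assumes the standard pair-counting definitions, in which every pair of points is classified by whether a reference labeling and the clustering output place the two points together or apart. The function names are illustrative and are not taken from the article, and the quadratic loop over pairs is meant for small samples, not for the big data setting the article targets.

```python
from itertools import combinations


def pair_counts(true_labels, pred_labels):
    """Classify every pair of points by agreement between two labelings.

    Returns (a, b, c, d):
      a: pair is together in both labelings
      b: pair is together in true_labels only
      c: pair is together in pred_labels only
      d: pair is apart in both labelings
    """
    if len(true_labels) != len(pred_labels):
        raise ValueError("label sequences must have equal length")
    a = b = c = d = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_true = true_labels[i] == true_labels[j]
        same_pred = pred_labels[i] == pred_labels[j]
        if same_true and same_pred:
            a += 1
        elif same_true:
            b += 1
        elif same_pred:
            c += 1
        else:
            d += 1
    return a, b, c, d


def rand_coefficient(true_labels, pred_labels):
    """Fraction of pairs on which the two labelings agree: (a + d) / all pairs."""
    a, b, c, d = pair_counts(true_labels, pred_labels)
    total = a + b + c + d
    return (a + d) / total if total else 1.0


def jaccard_coefficient(true_labels, pred_labels):
    """Pairs together in both, over pairs together in at least one: a / (a + b + c)."""
    a, b, c, _ = pair_counts(true_labels, pred_labels)
    denom = a + b + c
    return a / denom if denom else 1.0


if __name__ == "__main__":
    reference = [0, 0, 1, 1]
    clustering = [0, 0, 1, 2]  # splits the second reference group
    print(rand_coefficient(reference, clustering))
    print(jaccard_coefficient(reference, clustering))
```

Both measures depend only on which points are grouped together, so renaming the cluster identifiers leaves the scores unchanged. The Jaccard coefficient ignores pairs that are apart in both labelings, which is why it is usually the lower of the two.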

Data availability

The data underlying this article are available in the MHEALTH Dataset at http://archive.ics.uci.edu/ml/datasets/mhealth+dataset# and the Skin Segmentation Dataset at https://archive.ics.uci.edu/ml/datasets/Skin+Segmentation.

Acknowledgements

I would like to express my very great appreciation to the co-authors of this manuscript for their valuable and constructive suggestions during the planning and development of this research work.

Funding

This research did not receive any specific funding.

Author information

Contributions

All authors have made substantial contributions to conception and design, revising the manuscript, and the final approval of the version to be published. Also, all authors agreed to be accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved.

Corresponding author

Correspondence to Suman Madan.

Ethics declarations

Conflict of interest

The authors declare no conflict of interest.

Informed consent

Not Applicable.

Ethical approval

Not Applicable.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Madan, S., C, K., Bhatia, M.K. et al. An optimized SVM-RFE based feature selection and weighted entropy K-means approach for big data clustering in mapreduce. Multimed Tools Appl (2024). https://doi.org/10.1007/s11042-023-18044-4

  • DOI: https://doi.org/10.1007/s11042-023-18044-4
