
Distributed ReliefF-based feature selection in Spark

  • Regular Paper
  • Published in: Knowledge and Information Systems

Abstract

Feature selection (FS) is a key research area in machine learning and data mining; removing irrelevant and redundant features usually reduces the effort required to process a dataset while maintaining or even improving the accuracy of the subsequent learning algorithm. However, traditional algorithms designed to execute on a single machine lack the scalability needed to deal with the ever-increasing amounts of data available in the current Big Data era. ReliefF is one of the most important FS algorithms and has been successfully applied in many domains. In this paper, we present DiReliefF, a completely redesigned distributed version of the popular ReliefF algorithm built on the Spark cluster computing model. The effectiveness of our proposal is tested on four publicly available datasets, all with a large number of instances and two also with a large number of features. Subsets of these datasets were also used to compare the results against a non-distributed implementation of the algorithm. The results show that the non-distributed implementation cannot handle such large volumes of data without specialized hardware, while our design processes them in a scalable way with much better processing times and memory usage.
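At its core, a Relief-family method estimates a weight W(f) for each feature f by repeatedly sampling an instance R, finding its nearest neighbour of the same class (the hit H) and of a different class (the miss M), and applying the classic update W(f) ← W(f) − diff(f, R, H)/m + diff(f, R, M)/m over m samples. The actual DiReliefF design is the subject of the paper and its repository (see Notes); purely as an illustration of how such an update can be distributed in Spark, the following minimal Scala sketch assumes two classes, numeric features, and k = 1 neighbour, and uses hypothetical names (Instance, ReliefSketch, estimateWeights) that are not the published API:

```scala
import org.apache.spark.rdd.RDD

// A labelled instance: numeric feature vector plus class label.
case class Instance(features: Array[Double], label: Int)

object ReliefSketch extends Serializable {

  // Euclidean distance between two instances.
  def dist(a: Instance, b: Instance): Double =
    math.sqrt(a.features.zip(b.features).map { case (x, y) => (x - y) * (x - y) }.sum)

  // Running nearest hit (same class) and nearest miss (other class)
  // for one sampled instance.
  case class Nearest(hitDist: Double, hit: Option[Instance],
                     missDist: Double, miss: Option[Instance]) {
    def merge(o: Nearest): Nearest = {
      val (hd, h)  = if (hitDist <= o.hitDist) (hitDist, hit) else (o.hitDist, o.hit)
      val (md, ms) = if (missDist <= o.missDist) (missDist, miss) else (o.missDist, o.miss)
      Nearest(hd, h, md, ms)
    }
  }

  /** Estimate Relief-style feature weights from m sampled instances,
    * using one nearest hit and one nearest miss per sample (k = 1). */
  def estimateWeights(data: RDD[Instance], m: Int, numFeatures: Int): Array[Double] = {
    val samples = data.sparkContext.broadcast(
      data.takeSample(withReplacement = false, m))

    // One pass over the distributed data: every instance proposes itself as a
    // candidate neighbour of every sampled instance, and reduceByKey keeps
    // only the nearest hit and nearest miss per sample.
    val nearest = data.flatMap { inst =>
      samples.value.zipWithIndex.map { case (s, i) =>
        val d = dist(s, inst)
        val n =
          if (d == 0.0) // the sample itself (or an exact duplicate): skip it
            Nearest(Double.MaxValue, None, Double.MaxValue, None)
          else if (inst.label == s.label)
            Nearest(d, Some(inst), Double.MaxValue, None)
          else
            Nearest(Double.MaxValue, None, d, Some(inst))
        (i, n)
      }
    }.reduceByKey(_ merge _).collectAsMap()

    // Apply W(f) <- W(f) - diff(f, R, H)/m + diff(f, R, M)/m on the driver.
    val w = Array.fill(numFeatures)(0.0)
    for ((i, n) <- nearest; h <- n.hit; ms <- n.miss; f <- 0 until numFeatures) {
      val r = samples.value(i)
      w(f) += (math.abs(r.features(f) - ms.features(f)) -
               math.abs(r.features(f) - h.features(f))) / m
    }
    w
  }
}
```

The sketch broadcasts the m sampled instances and locates each one's nearest hit and miss in a single distributed pass over the data, avoiding a full pairwise distance computation; the published DiReliefF redesign, its evaluation, and its exact API are described in the paper and the repository linked in the Notes.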


Notes

  1. https://github.com/rauljosepalma/DiReliefF

  2. http://www.sas.com/en_us/software/high-performance-analytics.html

  3. http://largescale.ml.tu-berlin.de/about/


Acknowledgements

The authors thank the anonymous reviewers for their help in improving the manuscript, as well as the National Autonomous University of Honduras and the University of Alcalá. R. Palma-Mendoza holds a scholarship from the Spanish Fundación Carolina. This research was also supported by projects BadgePeople (TIN2016-76956-C3-3-R) and SEBASENet (TIN2015-71841-REDT).

Author information

Correspondence to Raul-Jose Palma-Mendoza.


Cite this article

Palma-Mendoza, RJ., Rodriguez, D. & de-Marcos, L. Distributed ReliefF-based feature selection in Spark. Knowl Inf Syst 57, 1–20 (2018). https://doi.org/10.1007/s10115-017-1145-y
