
     Research Journal of Applied Sciences, Engineering and Technology


A Novel Feature Selection Based on One-Way ANOVA F-Test for E-Mail Spam Classification

Nadir Omer Fadl Elssied, Othman Ibrahim and Ahmed Hamza Osman
Faculty of Computing, University Technology Malaysia, 81310, Skudai, Johor Bahru, Malaysia
Research Journal of Applied Sciences, Engineering and Technology  2014  3:625-638
http://dx.doi.org/10.19026/rjaset.7.299  |  © The Author(s) 2014
Received: June 21, 2013  |  Accepted: July 09, 2013  |  Published: January 20, 2014

Abstract

Spam is commonly defined as unwanted e-mail, and it has become a global threat to e-mail users. Although the Support Vector Machine (SVM) has been commonly used in e-mail spam classification, the problem of high dimensionality of the feature space, caused by the massive number of e-mails and features in the datasets, still exists. To address this limitation of SVM, reduce the computational complexity (efficiency) and enhance the classification accuracy (effectiveness), this study applies a feature selection scheme based on one-way ANOVA F-test statistics to determine the most important features contributing to e-mail spam classification. The scheme reduces the dimensionality of the feature space before the classification process. The proposed scheme was evaluated experimentally on the well-known Spambase benchmark dataset to assess its feasibility. Comparisons are made across different datasets, categorization algorithms and success measures. The experimental results on the English-language Spambase dataset show that the enhanced SVM (FSSVM) significantly outperforms SVM and many other recent spam classification methods in terms of computational complexity and dimension reduction.
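The abstract does not spell out the scoring procedure, so the following is only a minimal sketch of how features might be ranked with a one-way ANOVA F statistic before a classifier is trained. It is not the authors' implementation: the function names `anova_f` and `select_top_features`, the toy data and the choice of `top_k` are all assumptions made for illustration, and the SVM training step is left out. The sketch also assumes at least two classes and more samples than classes.

```python
from statistics import mean

def anova_f(values, labels):
    """One-way ANOVA F statistic for a single feature, grouped by class label."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(y, []).append(v)
    n, k = len(values), len(groups)
    grand = mean(values)
    # Between-group sum of squares: spread of class means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups.values())
    # Within-group sum of squares: spread of samples around their own class mean.
    ss_within = 0.0
    for g in groups.values():
        m = mean(g)
        ss_within += sum((v - m) ** 2 for v in g)
    if ss_within == 0:
        # Classes are perfectly separated (or the feature is constant).
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / (k - 1)) / (ss_within / (n - k))

def select_top_features(rows, labels, top_k):
    """Return the indices of the top_k highest-scoring features and all scores."""
    n_features = len(rows[0])
    scores = [anova_f([r[j] for r in rows], labels) for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return ranked[:top_k], scores

# Hypothetical toy data: 3 features per e-mail, label 1 = spam, 0 = legitimate.
rows = [
    [0.9, 0.2, 5.0],
    [0.8, 0.1, 1.0],
    [0.1, 0.2, 4.0],
    [0.2, 0.1, 2.0],
]
labels = [1, 1, 0, 0]
selected, scores = select_top_features(rows, labels, top_k=1)
```

In this toy example only the first feature differs between the two classes, so it receives by far the largest F score and is the one retained; a classifier such as an SVM would then be trained on the reduced feature set.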

Keywords:

Feature selection, machine learning, one-way ANOVA F-test, spam detection, SVM



Competing interests

The authors have no competing interests.

Open Access Policy

This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

ISSN (Online):  2040-7467
ISSN (Print):   2040-7459
Copyright © 2024. MAXWELL Scientific Publication Corp., All rights reserved