Review on Email Spam Filtering Techniques

doi:10.23940/ijpe.21.02.p2.178190

Abstract

A sharp increase in the volume of spam email has created a need for more reliable and robust anti-spam filters that keep such messages out of inboxes. Machine learning-based methods have been predominant and efficient in classifying emails as spam. This paper presents a broad review of successful and current machine learning-based methods that have been employed in email spam filtering. It also compares the strengths and limitations of these approaches, to guide researchers in dealing efficiently with the threat of spam in the future.

Key words: E-mail, Spam filtering, Machine Learning, Spam, Ham, Classification

Naina Nisar, Nitin Rakesh, and Megha Chhabra. Review on Email Spam Filtering Techniques [J]. Int J Performability Eng, 2021, 17(2): 178-190.


