A study of spam filtering using support vector machines

Amayri, Ola; Bouguila, Nizar

doi:10.1007/s10462-010-9166-x

Published: 23 May 2010

Artificial Intelligence Review, volume 34, pages 73–108 (2010)

Abstract

Electronic mail has revolutionized traditional communication systems because it is convenient, economical, fast, and easy to use. A major bottleneck in electronic communications is the enormous dissemination of unwanted, harmful emails known as spam. A major concern is the development of suitable filters that can adequately capture those emails and achieve a high performance rate. Machine learning (ML) researchers have developed many approaches to tackle this problem. Within machine learning, support vector machines (SVM) have made a large contribution to the development of spam email filtering. Based on SVM, different schemes have been proposed through text classification (TC) approaches. A crucial problem when using SVM is the choice of kernel, as it directly affects the separation of emails in the feature space. This paper presents a thorough investigation of several distance-based kernels and specifies spam filtering behaviors using SVM. The majority of kernels used in recent studies concern continuous data and neglect the structure of the text. In contrast to classical kernels, we propose the use of various string kernels for spam filtering. We show how effectively string kernels suit the spam filtering problem. In addition, data preprocessing is a vital part of text classification, where the objective is to generate feature vectors usable by SVM kernels. We detail feature mapping variants in TC that yield improved performance for the standard SVM in the filtering task. Furthermore, to cope with real-time scenarios, we propose an online active framework for spam filtering. We present empirical results from an extensive study of online, transductive, and online active methods for classifying spam emails in real time. We show that the active online method using string kernels achieves higher precision and recall rates.
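
To make the idea of a string kernel concrete, the following is a minimal sketch of one common variant, the p-spectrum kernel, which compares two texts by the contiguous substrings of length p that they share. The abstract does not say which string kernels the paper evaluates, so the choice of kernel, the function names, and the default p = 3 are all assumptions made for illustration.

```python
from collections import Counter
import math


def spectrum_features(text: str, p: int = 3) -> Counter:
    """Count every contiguous substring of length p in text (hypothetical helper)."""
    return Counter(text[i:i + p] for i in range(len(text) - p + 1))


def spectrum_kernel(s: str, t: str, p: int = 3) -> float:
    """Inner product of the p-gram count vectors of s and t."""
    fs, ft = spectrum_features(s, p), spectrum_features(t, p)
    # Iterate over the smaller feature map; a Counter returns 0 for missing keys.
    if len(fs) > len(ft):
        fs, ft = ft, fs
    return float(sum(count * ft[gram] for gram, count in fs.items()))


def normalized_spectrum_kernel(s: str, t: str, p: int = 3) -> float:
    """Cosine-normalized kernel, so that message length does not dominate."""
    denom = math.sqrt(spectrum_kernel(s, s, p) * spectrum_kernel(t, t, p))
    return spectrum_kernel(s, t, p) / denom if denom else 0.0
```

A kernel of this kind operates on the raw character sequence, so no tokenization or feature vector construction is needed beforehand. Evaluating it on every pair of training emails yields a Gram matrix, which could then be handed to any SVM implementation that accepts precomputed kernels.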

Author information

Authors and Affiliations

Electrical and Computer Engineering, Concordia University, Montreal, QC, Canada
Ola Amayri
Concordia Institute for Information Systems Engineering, Concordia University, Montreal, QC, Canada
Nizar Bouguila

Corresponding author

Correspondence to Ola Amayri.

About this article

Cite this article

Amayri, O., Bouguila, N. A study of spam filtering using support vector machines. Artif Intell Rev 34, 73–108 (2010). https://doi.org/10.1007/s10462-010-9166-x

Published: 23 May 2010
Issue Date: June 2010
DOI: https://doi.org/10.1007/s10462-010-9166-x
