A study of spam filtering using support vector machines

Artificial Intelligence Review

Abstract

Electronic mail has revolutionized traditional communication systems because it is convenient, economical, fast, and easy to use. A major bottleneck in electronic communication is the massive spread of unwanted, harmful emails known as spam. A central concern is therefore the development of filters that capture those emails reliably and achieve a high performance rate. Machine learning (ML) researchers have developed many approaches to tackle this problem. Within machine learning, support vector machines (SVM) have contributed substantially to the development of spam email filtering, and different SVM-based schemes have been proposed through text classification (TC) approaches. A crucial problem when using SVM is the choice of kernel, because the kernel directly affects how emails are separated in the feature space. This paper presents a thorough investigation of several distance-based kernels and characterizes their spam filtering behavior with SVM. Most kernels used in recent studies are designed for continuous data and neglect the structure of the text. In contrast to these classical kernels, we propose the use of various string kernels for spam filtering, and we show how well string kernels suit the spam filtering problem. Data preprocessing is also a vital part of text classification, where the objective is to generate feature vectors usable by SVM kernels. We detail feature mapping variants in TC that yield improved performance for the standard SVM in the filtering task. Furthermore, to cope with real-time scenarios, we propose an online active framework for spam filtering. We present empirical results from an extensive study of online, transductive, and online active methods for classifying spam emails in real time. We show that the online active method using string kernels achieves higher precision and recall rates.
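
To make the idea of a string kernel concrete, the following is a minimal sketch of one common variant, the k-spectrum kernel, which compares two messages by the contiguous character substrings of length k that they share. It is an illustration under stated assumptions, not the specific kernels evaluated in the paper: the function names, the choice k = 3, and the toy messages are all hypothetical.

```python
from collections import Counter
from math import sqrt


def spectrum_features(text, k=3):
    """Count every contiguous substring of length k in text."""
    return Counter(text[i:i + k] for i in range(len(text) - k + 1))


def spectrum_kernel(s, t, k=3):
    """Inner product of the k-mer count vectors of s and t."""
    fs, ft = spectrum_features(s, k), spectrum_features(t, k)
    # Iterate over the smaller map; Counter returns 0 for missing k-mers.
    if len(fs) > len(ft):
        fs, ft = ft, fs
    return sum(count * ft[kmer] for kmer, count in fs.items())


def normalized_spectrum_kernel(s, t, k=3):
    """Cosine-style normalization so that k(s, s) == 1 for non-trivial s."""
    denom = sqrt(spectrum_kernel(s, s, k) * spectrum_kernel(t, t, k))
    return spectrum_kernel(s, t, k) / denom if denom else 0.0


def gram_matrix(messages, k=3):
    """Pairwise normalized kernel values for a list of messages."""
    return [[normalized_spectrum_kernel(a, b, k) for b in messages]
            for a in messages]


if __name__ == "__main__":
    # Toy messages, invented for illustration only.
    msgs = ["cheap meds online now", "cheap m3ds online n0w", "meeting at noon"]
    for row in gram_matrix(msgs):
        print([round(v, 3) for v in row])
```

Because the kernel works on raw characters, an obfuscated spelling such as "m3ds" still shares most of its substrings with "meds", which is one plausible reason string kernels suit spam text. A Gram matrix of this kind could be handed to any SVM implementation that accepts a precomputed kernel.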


Author information

Corresponding author

Correspondence to Ola Amayri.

About this article

Cite this article

Amayri, O., Bouguila, N. A study of spam filtering using support vector machines. Artif Intell Rev 34, 73–108 (2010). https://doi.org/10.1007/s10462-010-9166-x

  • DOI: https://doi.org/10.1007/s10462-010-9166-x
