
Highly discriminative statistical features for email classification

  • Regular Paper
  • Published in Knowledge and Information Systems

Abstract

This paper reports on email classification and filtering, more specifically on spam versus ham and phishing versus spam classification, based on content features. We test the validity of several novel statistical feature extraction methods. The methods rely on dimensionality reduction in order to retain the most informative and discriminative features. We successfully test our methods under two schemas. The first is a classic classification scenario using a 10-fold cross-validation technique on several corpora, including four ground-truth standard corpora (Ling-Spam, SpamAssassin, PU1, and a subset of the TREC 2007 spam corpus) and one proprietary corpus. In the second schema, we test the anticipatory properties of the extracted features and classification models on two proprietary datasets of phishing and spam emails sorted by date, and on the public TREC 2007 spam corpus. The contributions of this work are an exhaustive comparison of several feature selection and extraction methods for email classification on different benchmarking corpora, and the evidence that the technique of biased discriminant analysis in particular offers more discriminative features for classification, gives stable classification results regardless of the number of features chosen, and yields features that robustly retain their discriminative value over time and across data setups. These findings are especially useful in a commercial setting, where short profile rules built from a limited number of features are used to filter emails.
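The abstract singles out biased discriminant analysis (BDA) as the most robust of the tested feature extraction methods. As a rough illustration of that family of techniques (a minimal sketch, not the authors' implementation; the regularisation term, the eigen-solver, and the choice of ten components are assumptions), the snippet below computes a few BDA projection directions from dense bag-of-words email matrices `X_pos` / `X_neg`, treating one class (e.g. spam) as the "relevant" class whose scatter is minimised relative to the scatter of the other class around the relevant-class mean.

```python
# Hypothetical sketch of biased discriminant analysis (BDA) for
# dimensionality reduction on email term-frequency vectors.
import numpy as np

def bda_components(X_pos, X_neg, n_components=10, reg=1e-3):
    """Return a (d, n_components) projection matrix that maximises the
    scatter of the negative class around the positive-class mean relative
    to the scatter of the positive class around its own mean, solved as a
    generalised eigenvalue problem."""
    mu_pos = X_pos.mean(axis=0)
    # Scatter of the "relevant" class (e.g. spam) around its mean,
    # regularised so the matrix stays invertible for sparse text features.
    Dp = X_pos - mu_pos
    S_pos = Dp.T @ Dp + reg * np.eye(X_pos.shape[1])
    # Scatter of the "irrelevant" class measured around the SAME mean.
    Dn = X_neg - mu_pos
    S_neg = Dn.T @ Dn
    # Solve S_pos^{-1} S_neg w = lambda w and keep the leading eigenvectors.
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_pos, S_neg))
    order = np.argsort(eigvals.real)[::-1][:n_components]
    return eigvecs.real[:, order]

# Usage sketch: project training and test emails onto the BDA directions
# (X_red = X @ W) and train any classifier on the reduced representation.
# For the paper's second schema, the split would be chronological: fit on
# earlier emails, evaluate on later ones.
```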



Author information


Corresponding author

Correspondence to Juan Carlos Gomez.


About this article

Cite this article

Gomez, J.C., Boiy, E. & Moens, M.-F. Highly discriminative statistical features for email classification. Knowl Inf Syst 31, 23–53 (2012). https://doi.org/10.1007/s10115-011-0403-7



