Learning to detect spyware using end user license agreements

  • Regular Paper
  • Knowledge and Information Systems

Abstract

The amount of software that hosts spyware has increased dramatically. To avoid legal repercussions, vendors need to inform users about the inclusion of spyware via end user license agreements (EULAs) during the installation of an application. However, this information is intentionally written in a way that is hard for users to comprehend. We investigate how to automatically discriminate between legitimate software and spyware-associated software by mining EULAs. For this purpose, we compile a data set consisting of 996 EULAs, of which 9.6% are associated with spyware. We compare the performance of 17 learning algorithms with that of a baseline algorithm on two data sets based on a bag-of-words model and a metadata model. The majority of learning algorithms significantly outperform the baseline regardless of which data representation is used. However, a non-parametric test indicates that the bag-of-words model is more suitable than the metadata model. Our conclusion is that automatic EULA classification can be applied to assist users in making informed decisions about whether to install an application without having read the EULA. We therefore outline the design of a spyware prevention tool and suggest how to select suitable learning algorithms for the tool by using a multi-criteria evaluation approach.
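
To make the bag-of-words idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes a multinomial naive Bayes classifier with add-one smoothing as a stand-in for one learning algorithm, and a majority-class predictor as a stand-in for the baseline; the abstract does not name either. The six training snippets and all function names are hypothetical and invented for illustration only; they are not taken from the 996-EULA data set.

```python
import math
from collections import Counter

def tokenize(text):
    """Lower-case the text and keep purely alphabetic tokens (bag of words)."""
    return [w for w in text.lower().split() if w.isalpha()]

def train_multinomial_nb(docs, labels):
    """Count documents per class and word occurrences per class."""
    class_docs = Counter(labels)
    word_counts = {c: Counter() for c in class_docs}
    for text, label in zip(docs, labels):
        word_counts[label].update(tokenize(text))
    vocab = {w for counts in word_counts.values() for w in counts}
    return class_docs, word_counts, vocab

def predict(model, text):
    """Return the class with the highest log-posterior (add-one smoothing)."""
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_docs):
        score = math.log(class_docs[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word in vocab:  # words never seen in training are ignored
                score += math.log(
                    (word_counts[label][word] + 1) / (total_words + len(vocab))
                )
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def majority_baseline(labels):
    """Assumed baseline: always predict the most frequent training class."""
    return Counter(labels).most_common(1)[0][0]

# Hypothetical toy EULA snippets, invented for illustration only.
train_docs = [
    "the software may display advertisements and collect browsing information",
    "third party partners may collect information and display advertisements",
    "you may install one copy of the software on one computer",
    "the software is provided without warranty of any kind",
    "this license permits you to use one copy of the software",
    "you may not reverse engineer the software",
]
train_labels = ["spyware", "spyware", "legitimate",
                "legitimate", "legitimate", "legitimate"]

model = train_multinomial_nb(train_docs, train_labels)
unseen = "we collect browsing information and display advertisements"
print("classifier:", predict(model, unseen))
print("baseline:  ", majority_baseline(train_labels))
```

On a class-imbalanced data set such as the one described above, a majority-class predictor labels every EULA as legitimate, which is why comparing each learning algorithm against a baseline is informative.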


Author information

Corresponding author

Correspondence to Niklas Lavesson.

About this article

Cite this article

Lavesson, N., Boldt, M., Davidsson, P. et al. Learning to detect spyware using end user license agreements. Knowl Inf Syst 26, 285–307 (2011). https://doi.org/10.1007/s10115-009-0278-z

  • DOI: https://doi.org/10.1007/s10115-009-0278-z
