Learning to detect spyware using end user license agreements

Lavesson, Niklas; Boldt, Martin; Davidsson, Paul; Jacobsson, Andreas

doi:10.1007/s10115-009-0278-z

Regular Paper
Published: 16 January 2010

Volume 26, pages 285–307 (2011)

Knowledge and Information Systems

Niklas Lavesson¹,
Martin Boldt¹,
Paul Davidsson¹,² &
Andreas Jacobsson²

Abstract

The amount of software that hosts spyware has increased dramatically. To avoid legal repercussions, vendors need to inform users about the inclusion of spyware via end user license agreements (EULAs) during the installation of an application. However, this information is intentionally written in a way that is hard for users to comprehend. We investigate how to automatically discriminate between legitimate software and spyware-associated software by mining EULAs. For this purpose, we compile a data set consisting of 996 EULAs, of which 9.6% are associated with spyware. We compare the performance of 17 learning algorithms with that of a baseline algorithm on two data sets based on a bag-of-words model and a metadata model. The majority of learning algorithms significantly outperform the baseline regardless of which data representation is used. However, a non-parametric test indicates that bag-of-words is more suitable than the metadata model. Our conclusion is that automatic EULA classification can be applied to assist users in making informed decisions about whether to install an application without having read the EULA. We therefore outline the design of a spyware prevention tool and suggest how to select suitable learning algorithms for the tool by using a multi-criteria evaluation approach.
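To make the bag-of-words idea concrete, the following is a minimal sketch of EULA classification, not the authors' implementation. It assumes a multinomial naive Bayes learner (with add-one smoothing) as a stand-in for one of the 17 algorithms and a majority-class predictor as the baseline; the abstract does not name either. The EULA snippets, the class names, and the function names are hypothetical and are not drawn from the paper's data set.

```python
import math
from collections import Counter

# Hypothetical toy EULA snippets; NOT taken from the paper's 996-EULA data set.
TRAIN = [
    ("this software may display advertisements and collect browsing "
     "information for third parties", "spyware"),
    ("we may install additional partner software and share usage data "
     "with advertisers", "spyware"),
    ("you may install and use one copy of the software on a single computer",
     "legitimate"),
    ("the software is provided as is without warranty of any kind",
     "legitimate"),
    ("you may not reverse engineer decompile or disassemble the software",
     "legitimate"),
]


def bag_of_words(text):
    """Represent a document as word -> count, ignoring word order."""
    return Counter(text.lower().split())


class MajorityBaseline:
    """Assumed baseline: always predict the most frequent training label."""

    def fit(self, docs, labels):
        self.label = Counter(labels).most_common(1)[0][0]
        return self

    def predict(self, doc):
        return self.label


class MultinomialNB:
    """Multinomial naive Bayes over bag-of-words counts, add-one smoothing."""

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.log_prior = {
            c: math.log(labels.count(c) / len(labels)) for c in self.classes
        }
        self.counts = {c: Counter() for c in self.classes}
        for doc, label in zip(docs, labels):
            self.counts[label].update(bag_of_words(doc))
        self.vocab = set().union(*self.counts.values())
        self.totals = {c: sum(self.counts[c].values()) for c in self.classes}
        return self

    def _score(self, doc, c):
        score = self.log_prior[c]
        denom = self.totals[c] + len(self.vocab)
        for word, n in bag_of_words(doc).items():
            if word in self.vocab:  # words never seen in training are skipped
                score += n * math.log((self.counts[c][word] + 1) / denom)
        return score

    def predict(self, doc):
        return max(self.classes, key=lambda c: self._score(doc, c))


if __name__ == "__main__":
    docs = [d for d, _ in TRAIN]
    labels = [l for _, l in TRAIN]
    learner = MultinomialNB().fit(docs, labels)
    baseline = MajorityBaseline().fit(docs, labels)
    new_eula = "the program will collect information and display advertisements"
    print("naive Bayes:", learner.predict(new_eula))
    print("baseline:   ", baseline.predict(new_eula))
```

On a data set where 9.6% of the EULAs are associated with spyware, a majority-class predictor of this kind would label every EULA as legitimate. It would be correct on roughly 90.4% of them while flagging no spyware at all, which illustrates why a learner has to be compared against a baseline rather than judged on raw accuracy alone.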

Author information

Authors and Affiliations

School of Computing, Blekinge Institute of Technology, Ronneby, 371 25, Sweden
Niklas Lavesson, Martin Boldt & Paul Davidsson
School of Technology, Malmö University, Malmö, Sweden
Paul Davidsson & Andreas Jacobsson

Corresponding author

Correspondence to Niklas Lavesson.

About this article

Cite this article

Lavesson, N., Boldt, M., Davidsson, P. et al. Learning to detect spyware using end user license agreements. Knowl Inf Syst 26, 285–307 (2011). https://doi.org/10.1007/s10115-009-0278-z

Received: 11 February 2009
Revised: 05 November 2009
Accepted: 13 November 2009
Published: 16 January 2010
Issue Date: February 2011
DOI: https://doi.org/10.1007/s10115-009-0278-z
