A practical approach on clustering malicious PDF documents

Vatamanu, Cristina; Gavriluţ, Dragoş; Benchea, Răzvan

doi:10.1007/s11416-012-0166-z

Original Paper
Published: 15 June 2012

Volume 8, pages 151–163, (2012)

Journal in Computer Virology

Cristina Vatamanu^1,2,
Dragoş Gavriluţ^1,3 &
Răzvan Benchea^1,3

Abstract

Since 2009, the number of advanced persistent threat attacks has increased. In all of the researched cases, this kind of attack uses a zero-day exploit, usually found in a frequently used application. Most of the time, the user has to visit a malicious page or open an infected document sent via e-mail. Even though the attack vector can take many forms, this paper addresses the case in which the attack relies on PDF files to deliver the payload. We chose the PDF format both because of the high number of attacks in which it was used and because of the key advantages it offers to the attacker. From an attacker's perspective, the advantage is clear: PDF files can be opened by an application on the user's computer or in a browser, since most browsers support plug-ins that render PDF files. The use of JavaScript inside PDF files offers two further advantages. The first is that code can be executed on the victim's computer while the attack evades various protection methods. The second is that the JavaScript code can be polymorphic, in that two files with the same functionality may look very different. This paper presents a clustering method, based on tokenization of the JavaScript code inside PDF files, that is resistant to most of the obfuscation techniques used in script-based malware. Our clustering method relies on the fact that most infected PDF files (over 93 %) use JavaScript code. By tokenizing the JavaScript code, describing it in an abstract manner and eliminating the operators used in polymorphism, we obtain classes of syntactically very similar files that can be easily clustered using different methods. Because virus analysts can then analyse classes of files rather than isolated files, their workload is significantly reduced.
The method of abstraction can be taken one step further and used as a detection mechanism: a technique to evaluate prevalent data or to obtain a subset from a large set without losing data variability.
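
The idea of abstracting tokenized JavaScript so that polymorphic variants fall into the same class can be illustrated with a minimal sketch. It is not the authors' implementation: the token classes, the choice to drop only the `+` operator, the MD5 signature, and the names `abstract_tokens`, `signature` and `cluster_by_signature` are all assumptions made for illustration, and the script text is assumed to have already been extracted from the PDF files.

```python
import hashlib
import re
from collections import defaultdict

# Assumed keyword subset; the abstract does not list the real token set.
KEYWORDS = {"var", "function", "return", "if", "else", "for", "while", "new"}

# Assumption: string concatenation is the only polymorphism operator removed here.
DROPPED_OPS = {"+"}

TOKEN_RE = re.compile(
    r"""
    (?P<STR>"(?:\\.|[^"\\])*"|'(?:\\.|[^'\\])*')    # string literals
  | (?P<NUM>0[xX][0-9a-fA-F]+|\d+(?:\.\d+)?)        # numeric literals
  | (?P<ID>[A-Za-z_$][A-Za-z0-9_$]*)                # identifiers and keywords
  | (?P<OP>[^\sA-Za-z0-9_$])                        # any other single character
    """,
    re.VERBOSE,
)


def abstract_tokens(js_source):
    """Turn JavaScript text into an abstract token sequence."""
    tokens = []
    for match in TOKEN_RE.finditer(js_source):
        kind, text = match.lastgroup, match.group()
        if kind == "ID":
            # Keep keywords, hide variable and function names.
            tokens.append(text if text in KEYWORDS else "ID")
        elif kind == "OP":
            if text not in DROPPED_OPS:
                tokens.append(text)
        else:
            tokens.append(kind)  # "STR" or "NUM": the literal value is hidden

    # Literals that were split by a dropped operator collapse into one token.
    collapsed = []
    for tok in tokens:
        if collapsed and tok == collapsed[-1] and tok in ("STR", "NUM"):
            continue
        collapsed.append(tok)
    return collapsed


def signature(js_source):
    """Hash the abstract token sequence into a fixed-size signature."""
    joined = " ".join(abstract_tokens(js_source))
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


def cluster_by_signature(scripts):
    """Group script names whose abstract token sequences are identical."""
    clusters = defaultdict(list)
    for name, source in scripts.items():
        clusters[signature(source)].append(name)
    return list(clusters.values())


if __name__ == "__main__":
    samples = {
        "a.pdf": 'var x = "ab" + "cd"; eval(x);',
        "b.pdf": "var qz9 = 'a' + 'b' + 'cd'; eval(qz9);",
        "c.pdf": "function f(){ return 1; }",
    }
    print(cluster_by_signature(samples))
```

In this sketch the first two samples differ in variable names and in how the string is split, yet they produce the same abstract sequence and land in one cluster. Grouping by exact signature is the simplest possible choice; a similarity measure over the token sequences could be substituted without changing the abstraction step.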

Author information

Authors and Affiliations

BitDefender AntiMalware Laboratory, 37 Sfântul Lazăr Street, Solomons Building, Iaşi, Romania
Cristina Vatamanu, Dragoş Gavriluţ & Răzvan Benchea
Gheorghe Asachi University, Iaşi, Romania
Cristina Vatamanu
Alexandru Ioan Cuza University, Iaşi, Romania
Dragoş Gavriluţ & Răzvan Benchea

Corresponding author

Correspondence to Dragoş Gavriluţ.

About this article

Cite this article

Vatamanu, C., Gavriluţ, D. & Benchea, R. A practical approach on clustering malicious PDF documents. J Comput Virol 8, 151–163 (2012). https://doi.org/10.1007/s11416-012-0166-z

Received: 20 December 2011
Accepted: 15 May 2012
Published: 15 June 2012
Issue Date: November 2012
DOI: https://doi.org/10.1007/s11416-012-0166-z
