Elsevier

Computers & Security

Volume 99, December 2020, 102037

Recurrent neural network for detecting malware

https://doi.org/10.1016/j.cose.2020.102037

Abstract

In this paper, we propose an efficient Recurrent Neural Network (RNN) to detect malware. An RNN is a class of artificial neural network in which the connections between nodes form a directed graph along a temporal sequence. We conducted several experiments using different values of hyper parameters. From these experiments, we found that the step size is a more important factor than the input size when using an RNN for malware classification. To support the proof of concept for the RNN as an efficient approach to malware detection, we measured its performance with three different feature vectors under these hyper parameters: the “one-hot encoding feature vector”, the “random feature vector”, and the “Word2Vec feature vector”. We also performed a pairwise t-test to check whether the differences between the results are statistically significant. Our results show that the RNN with the Word2Vec feature vector achieved the highest Area Under the Curve (AUC) value and a good variance among the three feature vectors. From the empirical analysis, we conclude that an RNN with feature vectors pretrained by the Skip-gram architecture of the Word2Vec model is best for malware detection, with high performance and stability.

Introduction

Malware is a program developed with malicious intent, and it has become a major cyber threat around the world. There are many methods to detect malware. Signature-based methods are the most widely used: they detect malware using signatures collected from previously detected malware. The disadvantage of this approach is that it struggles to detect unseen or modified malware. For this reason, researchers are increasingly using behavior-based and machine learning methods. A behavior-based method detects malware by observing the program in a controlled environment, such as a sandbox. At the same time, as digital technology improves, malware is also evolving in various ways to avoid detection. Therefore, more complex and fine-grained methods, such as machine learning, are required. Among machine learning techniques, deep learning is especially well suited to handling variants of data, because it can not only learn from the given features but also automatically extract features from the data to achieve the goal of the classification task. In the case of malware detection, this helps in classifying whether a particular file is malware or not. Recent research has shown that deep learning techniques can be applied to detect malware.

In this paper, we use an RNN model to effectively detect malware. As the first step, we carefully investigate feature vectors for texts (documents) and use three techniques from NLP (Natural Language Processing): one-hot encoding, random vector representation, and trained vector representation using a Word2Vec model. We then run several experiments to find the influence of hyper parameters (e.g. step size, input size) on the RNN. As the second step, we perform experiments using the RNN with the three feature vectors mentioned above. We discuss their performance and stability, and propose an efficient RNN-based method for malware detection.
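To make the first two representations concrete, the following is a minimal sketch, not the authors' implementation. It assumes each sample has already been tokenized into a sequence of strings; the opcode-like tokens, the function names, and the vector dimension are all hypothetical. One-hot vectors depend only on the vocabulary dictionary, while random vectors depend on the seed.

```python
import random


def build_vocab(sequences):
    """Map each distinct token to an index, in order of first appearance."""
    vocab = {}
    for seq in sequences:
        for tok in seq:
            vocab.setdefault(tok, len(vocab))
    return vocab


def one_hot(token, vocab):
    """Return a vector of len(vocab) with a single 1.0 at the token's index."""
    vec = [0.0] * len(vocab)
    vec[vocab[token]] = 1.0
    return vec


def random_table(vocab, dim, seed):
    """Assign each token a fixed random vector; the seed decides the values."""
    rng = random.Random(seed)
    return {tok: [rng.uniform(-1.0, 1.0) for _ in range(dim)] for tok in vocab}


# Hypothetical token sequences (placeholders, not data from the paper).
sequences = [["push", "mov", "call"], ["mov", "xor", "ret"]]
vocab = build_vocab(sequences)
table = random_table(vocab, dim=4, seed=0)

print(one_hot("mov", vocab))
print(table["xor"])
```

Note that the one-hot vector length grows with the vocabulary, whereas the random (and trained) vectors have a dimension chosen independently of it.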

The ‘random feature vectors’ and ‘Word2Vec feature vectors’ use different random seeds, whereas the one-hot encoding feature vectors use different vocabulary dictionaries. This paper compares the performance of the feature vectors by conducting rigorous experiments for each one.

We performed a paired t-test at the 95% confidence level and found that the difference in Area Under the Curve (AUC) value for each pair is significant. The result for the F1 score was the same as for the AUC value. However, in our experiments, the random method produced the largest variance in terms of accuracy.
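As an illustration of the paired t-test, the sketch below computes the test statistic from the per-run differences between two methods. The AUC values are invented placeholders, not results from the paper, and the function name is hypothetical. With five paired runs there are 4 degrees of freedom, for which the two-sided critical value at the 95% level is 2.776.

```python
import math
import statistics


def paired_t_statistic(a, b):
    """Return (t, degrees_of_freedom) for paired samples a and b.

    Assumes the per-pair differences are not all identical, otherwise the
    standard deviation is zero and the statistic is undefined.
    """
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two samples of equal length >= 2")
    diffs = [x - y for x, y in zip(a, b)]
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    t = mean_d / (sd_d / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


# Placeholder AUC values for two feature vectors over five paired runs.
auc_a = [0.95, 0.96, 0.94, 0.97, 0.95]
auc_b = [0.90, 0.93, 0.91, 0.92, 0.91]

t, df = paired_t_statistic(auc_a, auc_b)
T_CRITICAL_DF4 = 2.776  # two-sided, alpha = 0.05, 4 degrees of freedom
print(t, df, abs(t) > T_CRITICAL_DF4)
```

The test is paired because each run of one method is matched with the corresponding run of the other, so only the differences enter the statistic.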

Machine learning-based solutions have already been used as a tool to supersede signature-based anti-malware systems. Andrade et al. (2019) described how malicious and legitimate samples are used with an RNN to estimate statistical differences in order to create adversarial examples, and reported that these are better than those of signature-based methods. Similarly, the vulnerability of machine learning algorithms in malware detection is evaluated by classifying this algorithm using a discriminant function on a set of data points. However, from our results, we see that this eventually gives a higher misclassification rate.

In Word2Vec, the ‘continuous skip-gram’ architecture weighs nearby context words more heavily than more distant context words. Although word order is still not captured, each context vector is weighed and compared independently. In our proposed work, by contrast, it is weighed against the average context. The results show that the RNN with the Word2Vec feature vector has the highest Area Under the Curve (AUC) value and gives good variance compared with the other feature vectors. This results in very high performance and stability, which makes our proposed model the best fit for malware detection.
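The skip-gram idea can be sketched as the generation of (center, context) training pairs. This is a simplified illustration, not the authors' code; the function name and tokens are hypothetical. When a random generator is supplied, the effective window is drawn between 1 and the maximum for each center word, which is one common way nearby words end up sampled more often than distant ones.

```python
import random


def skipgram_pairs(tokens, window, rng=None):
    """Return (center, context) pairs for every position in tokens.

    If rng is given, the window is shrunk at random per center word, so
    closer context words appear in more pairs on average.
    """
    pairs = []
    for i, center in enumerate(tokens):
        span = rng.randint(1, window) if rng is not None else window
        lo = max(0, i - span)
        hi = min(len(tokens), i + span + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


# Fixed window of 1 on a three-token sequence.
print(skipgram_pairs(["a", "b", "c"], window=1))

# Randomly shrunk window, seeded for repeatability.
sampled = skipgram_pairs(["a", "b", "c", "d", "e"], window=2, rng=random.Random(0))
print(len(sampled))
```

Each pair is treated independently during training, which matches the observation that word order within the window is not captured.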

Novelty: From the analytical results of all the vector approaches, we found that the Word2Vec feature vector has the highest average value and a significant variance. Therefore, we recommend the trained Word2Vec feature vector with a small input size for malware detection using an RNN; this is the novelty of our work.

Contributions: Our major contribution in this work is the extensive use of the Word2Vec model. We applied the Word2Vec model in feature vectors, natural language processing, encoding, and random and trained vector representations to check and validate the influence of hyper parameters such as step size and input size on RNNs. Our rigorous experiments using the Word2Vec model indicate better performance and stability. No previous researchers have succeeded in using Word2Vec to achieve this degree of performance and stability. Therefore, we conclude that our method is the most efficient method for detecting malware using an RNN.

Related work

In this section, we first provide a brief discussion of malware detection techniques, with an emphasis on feature extraction, families of malware, Word2Vec, and classifiers, which are the basis for the work presented in this paper. We also review relevant related work on malware detection using recurrent neural networks. Finally, we discuss AUC, which gives us a convenient means to quantify the various experiments that we have conducted. There are many approaches to the malware detection

Experimental results

In the first step of the second phase, we ran experiments to investigate the effect of hyper parameters. Hyper parameters are variables that determine the network structure and how the network is trained. They are set before training, that is, before optimizing the weights and biases.
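The two hyper parameters studied here can be illustrated with a minimal forward pass of a single-layer RNN. This is a sketch under assumptions, not the network used in the paper: "step size" is taken to mean the number of time steps the network is unrolled over, "input size" the dimension of each input vector, and the hidden size, weight ranges, and function names are hypothetical.

```python
import math
import random


def init_params(input_size, hidden_size, seed):
    """Small random weights for a single-layer RNN (illustrative only)."""
    rng = random.Random(seed)
    w_xh = [[rng.uniform(-0.1, 0.1) for _ in range(input_size)]
            for _ in range(hidden_size)]
    w_hh = [[rng.uniform(-0.1, 0.1) for _ in range(hidden_size)]
            for _ in range(hidden_size)]
    b_h = [0.0] * hidden_size
    return w_xh, w_hh, b_h


def rnn_forward(sequence, w_xh, w_hh, b_h):
    """Run h_t = tanh(W_xh x_t + W_hh h_(t-1) + b) over all time steps."""
    hidden_size = len(b_h)
    h = [0.0] * hidden_size
    for x in sequence:  # one iteration per time step
        h = [
            math.tanh(
                sum(w_xh[k][i] * x[i] for i in range(len(x)))
                + sum(w_hh[k][j] * h[j] for j in range(hidden_size))
                + b_h[k]
            )
            for k in range(hidden_size)
        ]
    return h


# Hyper parameters are fixed before any weights are optimized.
step_size, input_size, hidden_size = 3, 4, 2
w_xh, w_hh, b_h = init_params(input_size, hidden_size, seed=0)

# A sequence of step_size vectors, each of length input_size.
sequence = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
final_state = rnn_forward(sequence, w_xh, w_hh, b_h)
print(final_state)
```

Changing the input size changes the shape of the input weight matrix, while changing the step size only changes how many times the same weights are applied.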

First, we measured the performance for different input sizes with fixed step sizes. Values of hyper parameters that are used in this

Performance evaluation

From the experimental results, we can see that the input size did not have a significant effect on the performance of the RNN. We then used the same feature vectors for different step sizes with a fixed input size of 50. The results show that the performance of the RNN degrades as the step size increases. The reason was that the number of datasets was too high, which raised conflicts in the classifier, so a larger step size gives lower performance when using an RNN for malware detection.

Next,

Conclusion

In this paper, we proposed an efficient method for malware detection using an RNN. We ran experiments with different feature vectors to investigate their efficiency: the one-hot encoding feature vector, the random feature vector, and the Word2Vec feature vector. As a result, we found that the trained Word2Vec vector is efficient with an RNN for malware classification. We also ran experiments with different values of the RNN hyper parameters to see their effect. We found that the step size

Limitations and future work

Most malware analysis is done by quickly checking the features of a particular piece of code and then examining them. After this, the current malware is compared with previously detected malicious code. However, according to Rhode et al. (2018), behavioral data collected during file execution is more difficult to obfuscate [29]. On the other hand, it takes a comparatively long time to identify such malware. In other words, in most of the cases, the malicious payload gets recovered by

CRediT authorship contribution statement

Sudan Jha: Supervision, Conceptualization, Methodology, Investigation. Deepak Prashar: Data curation, Software, Writing - original draft. Hoang Viet Long: Supervision, Validation. David Taniar: Supervision, Validation.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper, and that they have no financial interests or personal relationships which may be considered as potential competing interests. All the authors agree that the above information is true and correct.

References (28)

  • EO Andrade et al. A model based on LSTM neural networks to identify five different types of malware. Procedia Comput. Sci. (2019)
  • M Rhode et al. Early-stage malware prediction using recurrent neural networks. Comput. Secur. (2018)
  • F. Ahmed et al. Using spatio-temporal information in API calls with machine learning algorithms for malware detection. Proceedings of the 2nd ACM workshop on security and artificial intelligence (2009)
  • TH Ahn et al. Malware detection method using opcode and windows API calls. J. Inst. Internet, Broadcast. Commun. (2017)
  • B Athiwaratkun et al. Malware classification with LSTM and GRU language models and a character-level CNN
  • Y Awad et al. Modeling malware as a language
  • P Bojanowski et al. Enriching word vectors with subword information. Trans. Assoc. Comput. Linguist. (2017)
  • A Damodaran et al. A comparison of static, dynamic, and hybrid analysis for malware detection. J. Comput. Virol. Hack Technol. (2017)
  • A. Géron. Hands-on Machine Learning with Scikit-Learn & TensorFlow, First ed. O'Reilly Media (2017)
  • D Gibert et al. Convolutional neural networks for classification of malware assembly code
  • D. Gibert. Convolutional Neural Networks for Malware Classification. A thesis presented for the degree of Master in Artificial Intelligence (2016)
  • Grosse K, Papernot N, Manoharan P, Backes M, McDaniel P. Adversarial perturbations against deep neural networks for...
  • S. Hiai et al. Sarcasm Detection Using RNN with Relation Vector. Int. J. Data Warehousing Min. (2019)
  • Kaggle. Microsoft malware classification challenge, https://www.kaggle.com/c/malware-classification, 2015, (accessed 15...

Hoang Viet Long is currently a researcher at the Institute for Computational Science at Ton Duc Thang University, Ho Chi Minh City, Vietnam. He obtained his Ph.D. in Computer Science at Hanoi University of Science and Technology in 2011, where he defended his thesis in the field of fuzzy and soft computing with applications to electronic engineering. He has been an Associate Professor in Information Technology since 2017. Recently, his interests have included Cybersecurity, Machine Learning, Bitcoin and Blockchain, and he has published more than 40 papers in ISI-covered journals.

Deepak Prashar received the B.Tech. in Information Technology from Uttar Pradesh Technical University, UP, India. He completed his M.Tech. in Information Technology and Communication (ICT) at the School of Information Technology and Communication, GBU Greater Noida, India, in 2012 and is currently pursuing a Ph.D. in Computer Science at the School of Computer and Systems Sciences, JNU, New Delhi, India. His primary research interests are Wireless Sensor Body Area Networks and Network Security.

Sudan Jha was born in Kathmandu, Nepal. He completed the proficiency certificate level at Saint Xavier's College, Kathmandu, and received the B.E. degree in electronics engineering from the Motilal Nehru Regional College, Allahabad, Uttar Pradesh, India, in 2001. He joined the Nepal Engineering College (nec) as a Lecturer; nec is one of the premier and largest engineering colleges and the first in the private domain in Nepal, and it gave him full sponsorship to pursue a master's degree in computer science. After completing his master's degree, he was an Assistant Professor with the Department of Computer Science and Engineering and soon became the Head of the Computer Science and Engineering Department, Nepal Engineering College. Over time, he chaired and organized five international conferences, and some of the proceedings of those conferences were published by Springer Verlag, World Science Series, and Imperial Press London.

David Taniar received the bachelor's, master's, and Ph.D. degrees, all in computer science, specializing in databases. His research areas are big data processing, data warehousing, and mobile and spatial query processing. He has published a book, High-Performance Parallel Database Processing (Wiley, 2008), and over 200 journal papers. He is a regular keynote speaker at international conferences, delivering lectures and speeches on big data. He is a founding Editor-in-Chief of the International Journal of Web and Grid Services and the International Journal of Data Warehousing and Mining. He is currently an Associate Professor with Monash University, Australia.
