Abstract
Phishing remains a persistent security threat, causing global losses exceeding 3.5 billion USD in 2019, according to the FBI’s Internet Crime Complaint Center. The Anti-Phishing Working Group (APWG) reported as many as 2,172 unique phishing websites detected per day in 2019. Most methods proposed by the scientific community for detecting phishing websites apply classical classification algorithms to phishing datasets with hand-extracted features. Although these methods demonstrate high accuracy, they are sensitive to a changing environment: phishers can learn the most relevant URL features and adapt their attacks to evade the security check. Therefore, in search of less sensitive methods, researchers have started to employ deep neural networks, as they do not require manual feature extraction and can learn a representation directly from the URL’s sequence of characters. The purpose of this research is to propose a new method for phishing website URL detection based on ensembles of recurrent neural networks and other types of deep neural networks. The results of our approach are presented in this paper and compared with the performance of other recurrent neural networks. These results are additionally compared with the performance of classical classification algorithms on the same dataset with 48 extracted features. Our method, which uses no manually extracted features, gives a significant increase in classification accuracy compared with single recurrent neural networks and matches the accuracy of classical classification ensembles with manually extracted features.
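To make the idea of learning from the URL’s sequence of characters and combining several networks more concrete, the following minimal Python sketch shows one possible way to encode a URL as character indices and to average the outputs of several classifiers (soft voting). It is an illustration under stated assumptions, not the method evaluated in the paper: the vocabulary, the maximum length of 200, the 0.5 threshold, the names `encode_url` and `soft_vote`, and the stand-in models are all hypothetical.

```python
import string
from typing import Callable, List, Sequence

# Assumed vocabulary: printable ASCII characters.
# Index 0 is reserved for padding, index 1 for out-of-vocabulary characters.
PAD, UNK = 0, 1
VOCAB = {ch: i + 2 for i, ch in enumerate(string.printable)}


def encode_url(url: str, max_len: int = 200) -> List[int]:
    """Map a URL to a fixed-length list of character indices.

    The URL is truncated to max_len characters and right-padded with PAD.
    """
    ids = [VOCAB.get(ch, UNK) for ch in url[:max_len]]
    return ids + [PAD] * (max_len - len(ids))


def soft_vote(
    models: Sequence[Callable[[List[int]], float]],
    encoded: List[int],
    threshold: float = 0.5,
) -> bool:
    """Average the phishing probabilities of several models.

    Returns True when the mean probability reaches the threshold.
    """
    mean_prob = sum(model(encoded) for model in models) / len(models)
    return mean_prob >= threshold


# Stand-in models returning fixed probabilities; in practice each would be
# a trained network that consumes the encoded character sequence.
stub_models = [lambda x: 0.9, lambda x: 0.4, lambda x: 0.6]
encoded_example = encode_url("http://example.com/login")
is_phishing = soft_vote(stub_models, encoded_example)
```

In this sketch the ensemble is a plain average of member outputs; other combination rules, such as majority voting or a trained meta-classifier, would fit the same interface.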
References
Adebowale, M., Lwin, K., Sánchez, E., Hossain, M.: Intelligent web-phishing detection and protection scheme using integrated features of Images, frames and text. Expert Systems with Applications 115, 300–313 (2019). https://doi.org/10.1016/J.ESWA.2018.07.067, https://www.sciencedirect.com/science/article/pii/S0957417418304925?via%3Dihub
Anti-Phishing Working Group, I.: Phishing Activity Trends Reports (2019). https://apwg.org/resources/apwg-reports/
Bahnsen, A.C., Bohorquez, E.C., Villegas, S., Vargas, J., Gonzalez, F.A.: Classifying phishing URLs using recurrent neural networks. In: 2017 APWG Symposium on Electronic Crime Research (eCrime), pp. 1–8 (2017). https://doi.org/10.1109/ECRIME.2017.7945048, http://ieeexplore.ieee.org/document/7945048/
Bengio, Y., Simard, P., Frasconi, P.: Learning long-term dependencies with gradient descent is difficult. IEEE Trans. Neural Netw. 5(2), 157–166 (1994). https://doi.org/10.1109/72.279181
Chiew, K.L., Tan, C.L., Wong, K., Yong, K.S., Tiong, W.K.: A new hybrid ensemble feature selection framework for machine learning-based phishing detection system. Information Sciences 484, 153–166 (2019). https://doi.org/10.1016/j.ins.2019.01.064, https://www.sciencedirect.com/science/article/pii/S0020025519300763?via%3Dihub
Cho, K., et al.: Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. arXiv:1406.1078v3 (2014)
Cui, B., He, S., Yao, X., Shi, P.: Malicious URL detection with feature extraction based on machine learning. Int. J. High Performance Comput. Network. 12(2), 166 (2018). https://doi.org/10.1504/ijhpcn.2018.10015545, http://www.inderscience.com/link.php?id=94367
Gers, F.A., Schmidhuber, J., Cummins, F.: Learning to forget: continual prediction with LSTM. In: Proceedings ICANN 1999 International Conference on Artificial Neural Network, vol. 2, pp. 850–855. IDSIA (1999). http://www.idsia.ch/
Han, J., Moraga, C.: The influence of the sigmoid function parameters on the speed of backpropagation learning. In: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 930, pp. 195–201. Springer, Cham (1995). https://doi.org/10.1007/3-540-59497-3_175
Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997), http://www7.informatik.tu-muenchen.de/~hochreit, www.idsia.ch/~juergen
Internet Crime Complaint Center: Internet Crime Report 2019. Tech. rep., Internet Crime Complaint Center at the Federal Bureau of Investigation of United States of America (2020). https://www.ic3.gov/media/annualreport/2019_IC3Report.pdf
Kleinbaum, D.G., Klein, M.: Introduction to logistic regression. In: Logistic Regression, pp. 1–39. Springer, New York, NY (2010). https://doi.org/10.1007/978-1-4419-1742-3_1, http://link.springer.com/10.1007/978-1-4419-1742-3_1
Lewis, D.D.: Naive (Bayes) at forty: The independence assumption in information retrieval. In: ECML 1998: Machine Learning: ECML-1998, pp. 4–15. Springer, Heidelberg (1998). https://doi.org/10.1007/BFb0026666
Lin Tan, C., et al.: PhishWHO: Phishing webpage detection via identity keywords extraction and target domain name finder. Decision Support Systems 88, 18–27 (2016). https://doi.org/10.1016/j.dss.2016.05.005
Marchal, S., Armano, G., Grondahl, T., Saari, K., Singh, N., Asokan, N.: Off-the-hook: an efficient and usable client-side phishing prevention application. IEEE Trans. Comput. 66(10), 1717–1733 (2017). https://doi.org/10.1109/TC.2017.2703808
Opara, C., Wei, B., Chen, Y.: HTMLPhish: Enabling Accurate Phishing Web Page Detection by Applying Deep Learning Techniques on HTML Analysis. arXiv:1909.01135 (2019), http://arxiv.org/abs/1909.01135, http://www.phishtank.com
Pedregosa, F., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011), http://jmlr.csail.mit.edu/papers/v12/pedregosa11a.html, https://scikit-learn.org/stable/
Saxe, J., Berlin, K.: eXpose: A character-level convolutional neural network with embeddings for detecting malicious URLs, file paths and registry keys. arXiv preprint arXiv:1702.08568, February 2017, http://arxiv.org/abs/1702.08568
Seifert, C., Welch, I., Komisarczuk, P.: Identification of malicious web pages with static heuristics. In: 2008 Australasian Telecommunication Networks and Applications Conference, pp. 91–96. IEEE, December 2008. https://doi.org/10.1109/ATNAC.2008.4783302, http://ieeexplore.ieee.org/document/4783302/
Shapiro, S.S., Wilk, M.B.: An analysis of variance test for normality (complete samples). Biometrika 52(3/4), 591–611 (1965)
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15(1), 1929–1958 (2014)
Vaitkevicius, P., Marcinkevicius, V.: Comparison of classification algorithms for detection of phishing websites. Informatica 31(1), 143–160 (2020). https://doi.org/10.15388/20-infor404
Vazhayil, A., Vinayakumar, R., Soman, K.: Comparative study of the detection of malicious URLs using shallow and deep networks. In: 2018 9th International Conference on Computing, Communication and Networking Technologies (ICCCNT). pp. 1–6. IEEE, July 2018. https://doi.org/10.1109/ICCCNT.2018.8494159, https://ieeexplore.ieee.org/document/8494159/
Verma, R., Das, A.: What’s in a URL. In: Proceedings of the 3rd ACM on International Workshop on Security and Privacy Analytics - IWSPA 2017, pp. 55–63. ACM Press, New York, New York (2017). https://doi.org/10.1145/3041008.3041016, http://dl.acm.org/citation.cfm?doid=3041008.3041016
Wei, B., Hamad, R.A., Yang, L., He, X., Wang, H., Gao, B., Woo, W.L.: A deep-learning-driven light-weight phishing detection sensor. Sensors 19(19), 4258 (2019). https://doi.org/10.3390/s19194258
Whittaker, C., Ryner, B., Nazif, M.: Large-scale automatic classification of phishing pages. The 17th Annual Network and Distributed System Security Symposium (NDSS 2010) (2010). https://doi.org/10.1109/TDSC.2013.3, http://www.isoc.org/isoc/conferences/ndss/10/pdf/08.pdf, research.google.com/pubs/pub35580.html
Xiang, G., Hong, J., Rose, C.P., Cranor, L.: CANTINA+: A feature-rich machine learning framework for detecting phishing web sites. ACM Trans. Inf. Syst. Secur. 14(2), 1–28 (2011). https://doi.org/10.1145/2019599.2019606, https://www.ml.cmu.edu/research/dap-papers/dap-guang-xiang.pdf
Yang, P., Zhao, G., Zeng, P.: Phishing website detection based on multidimensional features driven by deep learning. IEEE Access 7, 15196–15209 (2019). https://doi.org/10.1109/ACCESS.2019.2892066
Zhao, J., Wang, N., Ma, Q., Cheng, Z.: Classifying malicious URLs using gated recurrent neural networks. In: International Conference on Innovative Mobile and Internet Services in Ubiquitous Computing, pp. 385–394. Springer, Heidelberg (2019). https://doi.org/10.1007/978-3-319-93554-6_36
Zhao, P., Hoi, S.C.: Cost-sensitive online active learning with application to malicious URL detection. In: Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD 2013, p. 919. ACM Press, New York (2013). https://doi.org/10.1145/2487575.2487647, http://dl.acm.org/citation.cfm?doid=2487575.2487647
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Vaitkevicius, P., Marcinkevicius, V. (2020). Composition of Ensembles of Recurrent Neural Networks for Phishing Websites Detection. In: Robal, T., Haav, HM., Penjam, J., Matulevičius, R. (eds) Databases and Information Systems. DB&IS 2020. Communications in Computer and Information Science, vol 1243. Springer, Cham. https://doi.org/10.1007/978-3-030-57672-1_22
DOI: https://doi.org/10.1007/978-3-030-57672-1_22
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-57671-4
Online ISBN: 978-3-030-57672-1
eBook Packages: Computer Science (R0)