Abstract
Phishing emails, a type of cyberattack that relies on fraudulent messages, are difficult to recognize because of the sophisticated techniques attackers employ. In this paper, we use an approach based on natural language processing (NLP) and machine learning (ML) to detect phishing emails. We compare the efficacy of six different ML algorithms for this purpose. An empirical evaluation on two public datasets demonstrates that our approach detects phishing emails with high accuracy, precision, and recall. The findings from this work are useful in devising more efficient techniques for recognizing and preventing phishing attacks.
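The three evaluation metrics named in the abstract can be illustrated with a minimal sketch. It is not the chapter's implementation: the function name `evaluate`, the `Metrics` container, and the sample labels are hypothetical, and the labels are made up for illustration rather than drawn from the datasets used in the study. Phishing is treated as the positive class.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    accuracy: float
    precision: float
    recall: float


def evaluate(y_true, y_pred, positive=1):
    """Compute accuracy, precision, and recall for binary labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("inputs must not be empty")

    pairs = list(zip(y_true, y_pred))
    # True positives: phishing emails correctly flagged as phishing.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    # False positives: legitimate emails wrongly flagged as phishing.
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    # False negatives: phishing emails that were missed.
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # True negatives: everything that remains.
    tn = len(pairs) - tp - fp - fn

    accuracy = (tp + tn) / len(pairs)
    # Fall back to 0.0 when a denominator is zero (an assumed convention).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return Metrics(accuracy, precision, recall)


# Hypothetical labels: 1 = phishing, 0 = legitimate.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
m = evaluate(y_true, y_pred)
print(m)
```

Precision penalizes legitimate emails flagged as phishing, while recall penalizes phishing emails that slip through, which is why reporting both alongside accuracy gives a fuller picture of a detector than accuracy alone.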
Acknowledgements
This work is supported in part by a grant from the Center for Advanced Energy Studies (CAES) in Idaho, USA.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this chapter
Cite this chapter
Rabbi, M.F., Champa, A.I., Zibran, M.F. (2024). Phishy? Detecting Phishing Emails Using Machine Learning and Natural Language Processing. In: Lee, R. (eds) Software Engineering and Management: Theory and Application. Studies in Computational Intelligence, vol 1137. Springer, Cham. https://doi.org/10.1007/978-3-031-55174-1_9
DOI: https://doi.org/10.1007/978-3-031-55174-1_9
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-55173-4
Online ISBN: 978-3-031-55174-1
eBook Packages: Intelligent Technologies and Robotics (R0)