Phishy? Detecting Phishing Emails Using Machine Learning and Natural Language Processing

Rabbi, Md. Fazle; Champa, Arifa I.; Zibran, Minhaz F.

doi:10.1007/978-3-031-55174-1_9


Part of the book series: Studies in Computational Intelligence (SCI, volume 1137)


Abstract

Phishing emails, a form of cyberattack that relies on fraudulent messages, are difficult to recognize because of the sophisticated techniques attackers employ. In this paper, we use an approach based on natural language processing (NLP) and machine learning (ML) to detect phishing emails. We compare the efficacy of six different ML algorithms for this purpose. An empirical evaluation on two public datasets demonstrates that our approach detects phishing emails with high accuracy, precision, and recall. The findings from this work are useful in devising more efficient techniques for recognizing and preventing phishing attacks.
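For readers unfamiliar with the three evaluation measures named in the abstract, the following is a minimal sketch of how accuracy, precision, and recall are conventionally computed for a binary phishing-versus-legitimate classifier. It is not the authors' evaluation code: the function name `binary_metrics`, the label encoding (1 = phishing, 0 = legitimate), and the sample labels are all assumptions made for illustration.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, and recall for binary labels.

    Hypothetical helper for illustration; `positive` marks the phishing class.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # phishing caught
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # false alarms
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # phishing missed
    tn = len(pairs) - tp - fp - fn                                    # legitimate passed

    total = len(pairs)
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall}


# Made-up labels, not data from the chapter: 1 = phishing, 0 = legitimate.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
scores = binary_metrics(y_true, y_pred)
print(scores)
```

Precision penalizes legitimate mail wrongly flagged as phishing, while recall penalizes phishing mail that slips through, which is why the two are usually reported alongside accuracy.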



Acknowledgements

This work is supported in part by a grant from the Center for Advanced Energy Studies (CAES) in Idaho, USA.

Author information

Authors and Affiliations

Idaho State University, Pocatello, ID, USA
Md. Fazle Rabbi, Arifa I. Champa & Minhaz F. Zibran


Corresponding author

Correspondence to Md. Fazle Rabbi.

Editor information

Editors and Affiliations

Software Engineering and Information Technology Institute, Central Michigan University, Mount Pleasant, MI, USA
Roger Lee


About this chapter

Cite this chapter

Rabbi, M.F., Champa, A.I., Zibran, M.F. (2024). Phishy? Detecting Phishing Emails Using Machine Learning and Natural Language Processing. In: Lee, R. (eds) Software Engineering and Management: Theory and Application. Studies in Computational Intelligence, vol 1137. Springer, Cham. https://doi.org/10.1007/978-3-031-55174-1_9


DOI: https://doi.org/10.1007/978-3-031-55174-1_9
Published: 03 May 2024
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-55173-4
Online ISBN: 978-3-031-55174-1
eBook Packages: Intelligent Technologies and Robotics (R0)
