Phishy? Detecting Phishing Emails Using Machine Learning and Natural Language Processing

Software Engineering and Management: Theory and Application

Part of the book series: Studies in Computational Intelligence (SCI, volume 1137)

Abstract

Phishing emails, a type of cyberattack that relies on fake emails, are difficult to recognize because of the sophisticated techniques attackers employ. In this paper, we use an approach based on natural language processing (NLP) and machine learning (ML) to detect phishing emails. We compare the efficacy of six different ML algorithms for this purpose. An empirical evaluation on two public datasets demonstrates that our approach detects phishing emails with high accuracy, precision, and recall. The findings from this work are useful in devising more efficient techniques for recognizing and preventing phishing attacks.
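
The abstract reports results in terms of accuracy, precision, and recall. As a minimal sketch of how these three metrics are conventionally computed for a binary phishing classifier, the Python snippet below may help. It is not the authors' evaluation code: the function name classification_metrics, the label encoding (1 = phishing, 0 = legitimate), and the sample labels are all assumptions made for illustration, and the numbers are not results from the chapter.

```python
from typing import Dict, Sequence


def classification_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Compute accuracy, precision, and recall for binary labels.

    Assumed encoding (hypothetical, not from the chapter):
    1 = phishing, 0 = legitimate.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # phishing caught
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # legitimate passed
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # legitimate flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # phishing missed

    total = tp + tn + fp + fn
    return {
        # Share of all emails classified correctly.
        "accuracy": (tp + tn) / total if total else 0.0,
        # Of the emails flagged as phishing, the share that really were.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        # Of the real phishing emails, the share that were flagged.
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Made-up labels for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Reporting precision and recall alongside accuracy matters for phishing detection because the two error types have different costs: a false positive hides a legitimate email, while a false negative lets an attack through.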

Acknowledgements

This work is supported in part by a grant from the Center for Advanced Energy Studies (CAES) in Idaho, USA.

Author information

Corresponding author

Correspondence to Md. Fazle Rabbi.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this chapter

Cite this chapter

Rabbi, M.F., Champa, A.I., Zibran, M.F. (2024). Phishy? Detecting Phishing Emails Using Machine Learning and Natural Language Processing. In: Lee, R. (eds) Software Engineering and Management: Theory and Application. Studies in Computational Intelligence, vol 1137. Springer, Cham. https://doi.org/10.1007/978-3-031-55174-1_9
