Elsevier

Future Generation Computer Systems

Volume 125, December 2021, Pages 652-660

Textual analysis of traitor-based dataset through semi supervised machine learning

https://doi.org/10.1016/j.future.2021.06.036

Highlights

  • Insider threat to employers and companies is a complex and growing challenge.

  • Research devoted to “traitor detection” remains very limited compared with “masquerader detection”.

  • Insider threat detection through textual analysis, big data, and email logs is worthwhile.

  • In this research, class labels are identified through a clustering algorithm.

  • Malicious emails are predicted using multiple machine learning classifiers.

Abstract

Insider threats are among the most challenging and growing security threats that government agencies, organizations, and institutions face. In such scenarios, malicious (red) activities are performed by authorized individuals within the company, which makes an insider threat harder to identify than other attacks. Along with other monitoring parameters, email logs play a vital role in many research areas, such as tracking insider threats involving collaborating traitors, textual analysis, and social media exploration. This paper presents a semi-supervised machine learning framework that combines pre-processing and classification techniques for unlabeled datasets, i.e., emails. The Enron Corporation dataset is used for experiments and TWOS for evaluation of the proposed framework. Initially, the dataset is transformed into vector form using Term Frequency–Inverse Document Frequency (TF–IDF). Thereafter, K-Means is used to cluster the emails based on message content. Finally, the Decision Tree (DT) machine learning algorithm is applied to classify the malicious activities. The proposed framework has also been tested with other algorithms, namely Logistic Regression (LR), Naive Bayes (NB), KNN, Support Vector Machine (SVM), Random Forest (RF), and Neural Network (NN). However, Decision Tree (DT) combined with the pre-processing steps gave the desired results, with 99.96% accuracy and 0.994 AUC for identification of malicious content.

Introduction

The “insider threat” involves the activities of a privileged and trusted user who gains access to systems and inappropriately disseminates secret information. Over time, communication methods have proven their worth and have become an essential element of advanced information technology. Among these methods, email has proven very effective in the public and private sectors because of its low sending cost, accessibility, and fast message transfer. Every organization generates an immense amount of textual data every day. Hence, textual data is the main area of interest, and threats from inside cannot be ruled out [1].

Textual analysis, email logs, big data, and link analysis are worthwhile for research in many fields. The major bottleneck in these fields is the lack of suitable real-world datasets, so researchers are constrained to use synthetic data for experiments. To address this problem, the Enron data has become a benchmark for research work. This dataset has many similarities to the data gathered for insider threats, counter-terrorism, and fraud detection. Therefore, it is a suitable testing platform for analyzing the efficacy of proposed models [2].

The Federal Energy Regulatory Commission released the Enron email dataset during its investigation. Initially, the dataset had many problems, such as integrity and veracity issues. Later, Melinda Gervasio at SRI International collected and prepared a version of the dataset for the CALO (A Cognitive Assistant that Learns and Organizes) project, which resolved the majority of these issues. The dataset consists of personal and official emails of the organization and is publicly available to researchers at http://www-2.cs.cmu.edu/enron/. This version comprises around 517,431 emails distributed across 3500 folders and contains the information of each of the 151 employees. Although these emails do not include attachments, each message contains the email addresses of the sender and receiver, the subject, date and time, body, and other technical details [3].

Our prior research [4] analyzed unstructured datasets, while unlabeled datasets were yet to be explored. To extend that work, we propose a complete framework for handling unlabeled datasets. The core focus of the paper is to classify emails using different machine learning techniques. First, Term Frequency–Inverse Document Frequency (TF–IDF) is applied to find the important and useful words in emails. Next, the unsupervised K-means technique is used to group the emails into clusters. Afterwards, supervised learning algorithms including Decision Tree (DT), Naive Bayes (NB), Logistic Regression (LR), KNN, Support Vector Machine (SVM), Random Forest (RF), and Neural Network (NN) are applied to the labeled datasets. The experimental results are analyzed and compared with state-of-the-art techniques.
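As an illustration of how TF–IDF singles out the important words in an email, the following minimal sketch computes the weights with the Python standard library only. The weighting variant (term count divided by message length, multiplied by the natural logarithm of N/df), the function name `tfidf`, and the toy messages are assumptions made for this example; they are not taken from the paper, and library implementations typically add smoothing and normalization.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document.

    Assumed variant: tf = count / document length, idf = ln(N / df).
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical toy messages, not drawn from the Enron or TWOS data.
emails = [
    "please review the meeting agenda",
    "please send the quarterly report",
    "please wire the funds offshore tonight",
]
vectors = tfidf(emails)
```

Words that occur in every message, such as "please", receive a weight of zero under this variant, while a word confined to a single message, such as "offshore", receives a positive weight.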

The paper is structured as follows. Section 2 discusses related work in the literature. The procedure for building the prediction framework is detailed in Section 3. The experiments are described in Section 4, and the results are discussed in Section 5. Finally, we conclude the paper in Section 6.

Section snippets

Insider threats

One of the serious concerns for any organization is the damage caused by insiders, and the problem has received significant attention from both the research and industrial communities. It is very difficult to completely stop a malicious insider attack at its launch stage, while it is being executed. However, researchers have proposed different models to stop, reduce, and even predict malicious attacks.

Duc C. Le, et al. [5] recommended a smart user centric framework for the

Proposed methodology

The proposed framework consists of the following two main steps:

  • Class label identification through clustering algorithm

  • Prediction of malicious emails by using machine learning classifiers

Fig. 2 illustrates the proposed model, which contains four components: dataset selection and cleaning, dataset pre-processing, transformation, and data labeling. Each component is detailed in the following subsections.

The data acquisition step obtains data from the Enron repository and the TWOS research lab.
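The first of the two steps, class label identification through clustering, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it is a plain Lloyd-style K-means written with the standard library, the initial centroids are simply the first k points to keep the run deterministic, and the two-dimensional vectors stand in for TF–IDF features. The names `kmeans`, `points`, and `labelled` are hypothetical.

```python
def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm; returns one cluster id per point.

    Assumption: the first k points seed the centroids. Real
    implementations use more careful seeding.
    """
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [
            min(
                range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])),
            )
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical 2-D feature vectors standing in for TF-IDF email vectors.
points = [[0.9, 0.1], [0.1, 0.8], [0.8, 0.2], [0.2, 0.9], [1.0, 0.0], [0.0, 1.0]]
labels = kmeans(points, k=2)

# The cluster ids now act as class labels for a supervised classifier.
labelled = list(zip(points, labels))
```

The cluster ids carry no meaning by themselves. Which cluster corresponds to malicious content has to be decided separately, and this snippet does not state how the paper makes that mapping.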

Experimentation and results

The framework was developed in Python using the TensorFlow library in an Anaconda setup. The proposed framework is represented in Algorithm 1. We tuned our model with different parameters to obtain the optimum results.

A semi-supervised technique is used, combining an unsupervised clustering algorithm with supervised learning classifiers. We used multiple classifiers to attain a wider spectrum of diversity. The extent to which each distinct classifier disagrees about

Comparative analysis

Initial experiments were carried out on the unstructured (raw) dataset using multiple single classifiers, namely Decision Tree (DT), K Nearest Neighbor (KNN), Neural Network (NN), Logistic Regression (LR), Random Forest (RF), Naive Bayes (NB), and Support Vector Machine (SVM). The results showed an average accuracy of 73% and an AUC of 0.72, which were inadequate. Hence, to attain the best results, the experiments were re-conducted on the proposed model (explained in Section 3). The classifiers Decision Tree (DT) and
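For readers unfamiliar with the two metrics used in this comparison, the sketch below shows one common way to compute accuracy and AUC for a binary classifier. It is an illustration under assumptions: AUC is computed as the fraction of positive/negative pairs in which the positive example receives the higher score (ties count half), the 0.5 decision threshold is arbitrary, and the labels and scores are invented toy values rather than results from the paper.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def roc_auc(y_true, scores):
    """AUC as the probability that a random positive outscores a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    # Count pairwise wins; a tie contributes half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = malicious) and classifier scores.
y_true = [1, 0, 1, 0, 1, 0, 0, 0]
scores = [0.9, 0.2, 0.7, 0.6, 0.4, 0.1, 0.3, 0.5]
y_pred = [int(s >= 0.5) for s in scores]  # assumed 0.5 threshold

acc = accuracy(y_true, y_pred)
auc = roc_auc(y_true, scores)
```

Accuracy depends on the chosen threshold, whereas AUC depends only on how the scores rank the examples, which is why the two figures are reported side by side.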

Conclusion

This paper presented a semi-supervised framework. We performed two different analyses to better understand our model’s behavior. In the first, we took the unstructured dataset and applied multiple single classifiers. Because the model was trained on the raw dataset, the accuracy score for all users was 72%. In the second analysis, we studied pre-processing methods such as TF–IDF and the data labeling technique K-Means, combined with machine learning algorithms. For experiments,

CRediT authorship contribution statement

Faisal Janjua: Conceptualization, Methodology, Software (Python and Machine Learning Classifiers), Validation, Formal analysis, Data curation, Writing and documentation. Asif Masood: Methodology, Validation, Investigation methodology and results, Data curation, Writing and documentation. Haider Abbas: Conceptualization, Validation, Investigation methodology and results, Writing and documentation, Review and editing. Imran Rashid: Formal analysis, Investigation methodology and results, Review

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.


References (27)

  • Janjua F. et al. Handling insider threat through supervised machine learning techniques. Procedia Comput. Sci. (2020)
  • Soh Charlie et al. Employee profiling via aspect-based sentiment and network for insider threats detection. Expert Syst. Appl. (2019)
  • L. Acevado, The advantages of email in business communication [Online]. Available:...
  • Klimt B. et al.
  • https://pdfs.semanticscholar.org/0cf7/0295f779d20cf38d34d3e948af9298e1c367.pdf (Accessed on...
  • Le Duc C. et al. Analyzing data granularity levels for insider threat detection using machine learning. IEEE Trans. Netw. Serv. Manag. (2020)
  • Sheykhkanloo Naghmeh Moradpoor et al. Insider threat detection using supervised machine learning algorithms on an extremely imbalanced dataset. Int. J. Cyber Warfare Terrorism (IJCWT) (2020)
  • Esteban Castillo, et al. Email threat detection using distinct neural network approaches, in: Proceedings for the First...
  • Maki Mohamed Abdulhussain Ali Madan et al. Using an artificial neural network to improve email security
  • Yu Gaoqing. An explainable method of phishing emails generation and its application
  • H. Chi, C. Scarllet, Z.G. Prodanoff, D. Hubbard, Determining predisposition to insider threat activities by using text...
  • J. Jiang, et al. Prediction and detection of malicious insiders motivation based on sentiment profile on webpages and...
  • Homoliak I. et al. Insight into Insiders and IT: A Survey of Insider Threat Taxonomies, Analysis, Modeling, and Countermeasures, Vol. 52 (2019)

    Mr Faisal Janjua completed his BE in Computer Software Engineering at NUST. He completed his first MS in Computer Software Engineering at NUST in 2008 and his second Masters, in Engineering Project Management, at Melbourne University, Australia, in 2014. Mr Janjua is pursuing his Ph.D. at NUST; having completed his Ph.D. coursework, he is now in the research phase. His Ph.D. research topic is “Insider Threat detection using AI/ Machine Learning Techniques”. He has various international research publications to his credit.

    Dr Asif Masood completed his BE in Software Engineering at NUST in 1999. He completed his MS and Ph.D. in Computer Science at the University of Engineering and Technology Lahore in 2007. He is currently working as Dean at NUST and is an active researcher. His research publications include 1 book, 8 book chapters, 18 journal papers and 15 conference papers. In recognition of his research contributions, he was awarded the Best Research Paper award by HEC in 2011, the Research Productivity Award 2010–2011 by the Pakistan Council for Science and Technology, and the Dr M. N. Azam prize in Computer Science in 2009 by the Pakistan Academy of Sciences. His biography has been published in Who is Who in the World (in 2009 & 2010) and Top 100 Engineers 2009 by the International Biography Centre, Cambridge, England. He is also a reviewer for various international journals and conferences.

    Dr Haider Abbas received the M.S. degree in engineering and management of information systems and the Ph.D. degree in information security from the KTH-Royal Institute of Technology, Stockholm, Sweden, in 2006 and 2010, respectively. He is currently heading the R&D department and the National Cyber Security Auditing and Evaluation Lab (NCSAEL) at MCS NUST. He is a Cyber Security Professional, an Academician, a Researcher, and an Industry Consultant who has taken professional trainings and certifications from Massachusetts Institute of Technology, Cambridge, USA; Stockholm University, Stockholm; Stockholm School of Entrepreneurship, Stockholm; IBM, USA; and EC-Council. His professional career consists of activities ranging from research and development and industry consultations (government and private), through multinational research projects, research fellowships, doctoral studies advisory services, international journal editorships, conferences/workshops chair, invited/keynote speaker, technical program committee member, and reviewer for several international journals and conferences. In recognition of Dr. Abbas's services to the international research community and excellence in professional standing, he has been named one of the youngest Fellows of the Institution of Engineering and Technology (IET), UK; a Fellow of the British Computer Society (BCS), UK; and a Fellow of the Institute of Science and Technology, UK.

    Dr Imran Rashid completed his B.E. in Electrical (Telecomm) Engineering at National University of Sciences and Technology, Pakistan, in 1999. He received his M.Sc. degree in Telecomm Engineering (Optical Communication) from D.T.U Denmark in 2004 and his Ph.D. in Mobile Communication from University of Manchester, UK in 2011. He has earned four EC-Council certifications, i.e., Certified Ethical Hacker (CEH), Computer Hacking Forensic Investigator (CHFI), EC-Council Certified Security Analyst (ECSA) and EC-Council Certified Incident Handler (ECIH). He is also a Certified EC-Council Instructor (CEI) and has conducted numerous trainings. Currently, he is Chief Instructor (Engineering Wing) at National University of Sciences and Technology, Pakistan. His research interests are Mobile and Wireless Communication, MIMO Systems, Compressed Sensing for MIMO OFDM systems, Massive MIMO Systems, M2M for Mobile systems, Cognitive Radio Networks, Cyber Security and Information Assurance.

    Dr. Malik Muhammad Zaki Murtaza Khan completed his Ph.D. in Computer Science in 2012 and his Masters in Science (Computer Science) in 2006 at the University of Southern California, Los Angeles. He completed his postdoc in High-Performance Computing at Norges Teknisk-Naturvitenskapelige Universitet (NTNU), Trondheim, Norway, in 2018.
