Efficient Classification of Enciphered SCADA Network Traffic in Smart Factory using Decision Tree Algorithm

Vulnerability detection in the Supervisory Control and Data Acquisition (SCADA) network of a Smart Factory (SF) is a high-priority research area in the cyber-security domain. Choosing an efficient Machine Learning (ML) algorithm for intrusion detection is a major challenge. This study investigated the classification ability of various ML models on public cyber-security datasets to determine the best model. Based on the performance evaluation, all variants of Decision Tree (DT) and K-Nearest Neighbour (KNN) are, in terms of accuracy, training time, MCE, and prediction speed, the most suitable ML models for resolving security issues in the SCADA system.


I. INTRODUCTION
Due to the constant advancement of technology in present-day smart factories (SF) and its implementation, the need for advanced systems to control industrial processes, either locally or at remote locations, has become pertinent. The SCADA system monitors, gathers and processes real-time data, thereby maintaining efficiency. It is required to ensure secure communication at every level of the system, connecting directly with devices such as pumps, sensors, motors, valves, and more through the human-machine interface (HMI) software. With the advent of Cloud Computing, smart factories, the Industrial Internet of Things (IIoT), and the addition of recent IT models, protocols and usage such as web-based applications, SCADA will continue to gain importance. However, its vulnerability is exploited due to a lack of efficient security and access authorization methods [1]-[3].
In a 5G+ enabled smart factory, the Domain Name System (DNS) is the hub of communication and data transmission, providing a mapping between human-readable host names and machine-recognizable Internet Protocol (IP) addresses. Numerous firewalls do not scrutinize the frequency and nature of DNS packets, which leaves room for covert data communication. Also, since the traffic passes through modified name servers via a variety of hops during DNS iteration, the transmission is quite tough to detect and mitigate. This helps intruders gain a measure of control over network devices and use them to plan attacks. Various approaches have been used to identify irregular domain names, applying customary processes like IP blacklisting, domain blacklisting, and discarding suspicious DNS packets to attain DNS restriction [4]-[6]. Most security researchers criticize DNS over HTTPS (DoH) for making DNS channels difficult to detect and mitigate.
To safeguard DNS traffic, the idea of enciphering DNS over HTTPS, alternatively known as DoH, was introduced to enhance client authentication and access control security [4]. DoH wraps the DNS packets in encrypted HTTPS traffic, which is invisible to the network infrastructure between the DoH server and the malware. It effectively renders detection techniques based on inspecting DNS traffic obsolete for firewalls. However, this technique did not eliminate DNS intrusion, considering the nature of 5G+ connectivity.
Several mitigation methodologies exist, such as firewalls, Network-Based Intrusion Detection Systems (NB-IDS), SCADA hardware security and encryption methods [7]. NB-IDS has become a popular approach in practice and is widely used for security evaluation in networked systems (such as SCADA); it is therefore the basis for this study. Existing studies have implemented various NB-IDS to tackle the vulnerability challenges in SCADA. These NB-IDS techniques are Machine Learning (ML)-based because of their proficiency in identifying and categorizing attacks. Nevertheless, the basis for choosing the type of ML and its attributes remains a research challenge. There is always uncertainty about how to select and confirm the best ML prospect.
Notwithstanding the attempts by researchers, there has been no focus on a decisive choice of the most appropriate ML algorithm for SCADA vulnerability detection. Hence, developers face uncertainty in arriving at the algorithm of choice. Some analytical questions therefore arise: What strategies and measures influenced the ML prospects? What are the reasons for the poor performance of some ML techniques? Moreover, where two ML techniques achieve similar results, what drives the decision for the preferred ML prospect? This study aims to address these questions.
Prompted by the above problems, this study aims to accomplish the following: 1) An assessment of distinct ML algorithms suitable for SCADA network attack detection, providing a quantitative insight into the rationale for each ML algorithm and its performance. This is pertinent to researchers, as it helps developers focus on the most suitable ML approach for their studies. 2) The determination and application of the most suitable ML algorithms for SCADA network attack detection. 3) The selection of the best-performing, lightweight ML algorithm based on computational complexity, accuracy and model training time, considering the constrained processing ability of industrial equipment and operations in a smart factory. The rest of the study is organized as follows: Section II presents related works, which give insight into current studies on ML-based IDS classification algorithms. Section III describes the concept of algorithms and the methodology of the study. Section IV presents the performance evaluation and a careful analysis of the various ML prospects simulated. Section V concludes the study with a presentation of a prospective NIDS for the SCADA network system.

II. RELATED WORKS
The authors in [8]-[10] presented studies on current cyber-security attacks and mitigation approaches using machine and deep learning (ML/DL). These studies highlighted the benefits of using ML and DL in IDS. The researchers in [11], [12] examined recent advancements in cybersecurity risk evaluation as applied to SCADA systems, using established investigation procedures. Their studies cover a variety of security and vulnerability issues linked to research on SCADA. In [13], the authors provided an extensive review of approaches that can be applied to detect vulnerabilities in the system and to ascertain the extent of protection against possible attacks. Examples of such approaches are simulated SCADA attacks, testbeds, and simulation frameworks.
Internet connectivity offers numerous functional services such as remote interconnection and flexibility; on the other hand, it endangers systems such as SCADA by exposing them to global cybersecurity attacks. Several studies on IDS algorithms for SCADA vulnerability have therefore been reported in the literature. Ensemble learning approaches use a blend of distinct classifiers to make predictions. This yields enhanced performance for various attack types and protocols used in IoT networks [14]-[16]. Though this approach delivered superior results, it can be complex and computationally slow. The work in [17] presented an autoencoder-based IDS targeted at the Distributed Network Protocol 3 (DNP3) of a SCADA system. The proposed model performed better than other IDS solutions. However, this achievement cannot be generalized because it was restricted to a small dataset and specific DNP3 operating environments.
In recent times, the convolutional neural network (CNN) framework in deep learning has made a tremendous impact on computer vision. It is associated with two important features: hierarchical feature representations and learning long-term dependencies in large-scale sequence data. The study in [18] presented an approach for detecting attacks using LSTM and CNN, evaluated on the popular conventional NSL-KDD dataset. Though the model had an excellent performance, its classification speed needs improvement. In the same vein, [19] proposed an approach that attempts to recognize and classify DNS over HTTPS (DoH) traffic in two layers using classifiers. The authors claimed that the main strength of the study is the ability to detect and classify DoH traffic using a small amount of input data. Hence, it is not a robust model and is therefore not suitable for smart factory operations. Another study [20] presented a hybridization method with a universal optimization approach for detecting DDoS attacks in IoT. This approach was evaluated on the early version of the CICIDS2017 dataset. Though the model seemed efficient, it lacked computational speed and is not robust enough for a smart factory; the authors hope to try it on a distributed IDS.
Various ML algorithms have emerged as a series of studies addressed the development of IDS, with the aim of determining and deploying the most viable and efficient algorithms. The study in [21] presented flow-based intrusion detection for a SCADA system using a deep ANN. The proposed model evaluated attacks online and offline and had an excellent performance. However, it requires extension to multiple attacks. In [7], the authors used the Decision Tree, Random Forest, Naive Bayes, Logistic Regression and K-Nearest Neighbour (KNN) classifiers on a dataset generated from a water storage facility testbed. In their evaluation, the classifiers performed well in accuracy and false alarm rate for the offline and online phases with an unbalanced dataset. However, the study did not specify the training time of the models. It is important to note that for critical industrial systems, accuracy alone is not the ideal measure of a model's capability; other metrics, such as false alarm rate and training time, are needed for comparison.
Besides, the study in [22] compared the performance of the NB, KNN and ANN techniques. The individual classifiers performed poorly; hence, an ensemble technique was employed to enhance performance. In another study [23], the authors implemented an ensemble approach to boost performance. The trial results show that the proposed ensemble framework performs well in precision, accuracy, recall and detection rate. However, such a fusion needs to keep computational cost low, which is one of the key factors in time-critical operations.

III. SYSTEM MODEL

A. UNDERSTANDING ALGORITHMS AND GROWTH RATE FUNCTIONS
An algorithm is a sequence of computational steps that transforms input to output, applied in situations that require human ingenuity. It provides efficient solutions for emerging trends in IDS and helps determine the best solution while balancing memory usage and running time in a computational environment. In choosing an algorithm, efficiency and correctness are the key considerations. Performance can also be affected by the hardware and software implementation. Nevertheless, a solid base of algorithmic knowledge and techniques is vital.
In implementing algorithms, a good understanding of the relationship among the features of variables in a dataset is crucial. This guides the choice of solution based on the algorithm growth rate. Analysis of an algorithm helps to determine its behavioural pattern as the input size varies. This change in the behaviour of an algorithm is known as the asymptotic growth rate. In the equations for the algorithm growth functions, such as the quadratic function, a and x are constants and M is the input size. This implies that as M increases, the value of the growth function changes at a rate determined by a and x. Consider the asymptotic growth rate based on the Big-O notation; see Table 1 for the functions of algorithm growth rate. Table 2 describes the actions performed by the various algorithms.
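As a rough illustration of asymptotic growth, the sketch below tabulates a few textbook growth classes for increasing input size M. The particular set of functions and the names `GROWTH_FUNCTIONS`, `growth_table` and `doubling_ratio` are assumptions made for this example; they are not taken from Table 1.

```python
import math

# Textbook growth classes (an assumed selection, not the study's Table 1).
GROWTH_FUNCTIONS = {
    "constant O(1)": lambda m: 1.0,
    "logarithmic O(log M)": lambda m: math.log2(m),
    "linear O(M)": lambda m: float(m),
    "linearithmic O(M log M)": lambda m: m * math.log2(m),
    "quadratic O(M^2)": lambda m: float(m) ** 2,
}


def growth_table(sizes):
    """Return {name: [f(M) for M in sizes]} for every growth function."""
    return {name: [f(m) for m in sizes] for name, f in GROWTH_FUNCTIONS.items()}


def doubling_ratio(f, m):
    """Factor by which the step count grows when the input size doubles."""
    return f(2 * m) / f(m)


if __name__ == "__main__":
    for name, values in growth_table([8, 64, 512]).items():
        print(f"{name:26s}", [round(v, 1) for v in values])
```

Doubling the input leaves a constant-time step count unchanged, doubles a linear one and quadruples a quadratic one, which is why the growth class matters more than constant factors as datasets get large.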
With rising complexity, SCADA has become so relevant that it requires safeguarding. IDS in SCADA helps recognize real-time observable boundaries and filter DNS traffic between established and un-established name servers using forward-or-discard rules. IDS in SCADA has been handled by different studies, as presented in Section II. This work analyzed algorithmic functions and growth rates in addition to comparing various ML algorithms. The ML algorithms were developed to monitor SCADA network traffic and identify abnormal behaviour in the traffic in order to address security issues in the 5G+ enabled smart factory.
The proposed architecture consists of three main phases: training, testing, and model selection (see Fig. 1). During the training and testing phase, the various datasets used in this study are split into training and testing sets and imported into the ML algorithms after implementing a five (5)

B. CASE STUDY/DATASET DESCRIPTION
This study applied three (3) datasets as case studies to evaluate the efficiency and performance of the different ML algorithms. This enables the validation and determination of the most suitable solution for prevailing and emerging vulnerability situations. The data for each case study was split 70%:30% for training and test validation. For even data
VOLUME 4, 2016
1) Case Study 1 (CIRA-CIC-DoHBrw-2020 Dataset)
The preprocessed cyber-security intrusion dataset (CIRA-CIC-DoHBrw-2020) is accessible at [24]. The dataset was generated using web browsing activities for the benign DoH traffic, while DNS tunnelling mechanisms were used to generate the malicious DoH traffic [19]. The cybersecurity traffic data contains a total of 226406 observations, 28 predictors and two (2) responses, representing benign and malicious scenarios.
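The 70%:30% split mentioned in this subsection can be sketched as follows. This is a minimal, library-free illustration: the function name `stratified_split`, the per-class (stratified) shuffling and the fixed seed are assumptions for the example, since the study does not state how its split was drawn.

```python
import random
from collections import defaultdict


def stratified_split(rows, labels, train_fraction=0.7, seed=0):
    """Split (rows, labels) into train/test parts, keeping class proportions."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)  # fixed seed so the split is repeatable
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = round(len(indices) * train_fraction)
        train_idx.extend(indices[:cut])
        test_idx.extend(indices[cut:])

    train = [(rows[i], labels[i]) for i in train_idx]
    test = [(rows[i], labels[i]) for i in test_idx]
    return train, test


if __name__ == "__main__":
    # Toy stand-in data, not drawn from any of the study's datasets.
    rows = [[float(i)] for i in range(20)]
    labels = ["benign"] * 10 + ["malicious"] * 10
    train, test = stratified_split(rows, labels)
    print(len(train), len(test))
```

Splitting per class keeps the benign-to-malicious ratio the same in both parts, which matters when one class is much rarer than the other.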

2) Case Study 2 and 3 (NSL-KDD Dataset)
The NSL-KDD dataset is a refinement derived from an analysis of the KDD Cup'99 dataset, which had previously served as the conventional benchmark. The challenge with the KDD Cup'99 dataset is the presence of redundant records, which biases results towards frequently repeated data. The NSL-KDD dataset was proposed to address these issues [18] and has since been used as a standard dataset. It is divided into KDDTest+ and KDDTrain+, comprising a total of 22543 observations, 41 predictors and 2 responses, representing anomalous and normal scenarios.
In the training phase of the model, the dataset is fed into the model followed by data pre-processing, which is as follows:

C. DATA PRE-PROCESSING
This stage starts with data cleaning, replacing infinite (∞) and null entries in a column with the mean value of that column. This ensures that only meaningful values are passed into the model. Subsequently, since the dataset contains correlated features, as seen in the correlation matrix in Fig. 2, the Pearson Correlation Coefficient (PCC) was computed on the dataset. This step was necessary because it helps reduce over-fitting. PCC was computed for continuous variables, giving a correlation score between -1 and 1 as represented in equation 7, and variables with a high correlation value were selected at a threshold of ±1. This helps ensure that only reliably significant features are selected, thereby enhancing model performance. Fig. 3 depicts the selection of the correlated variables using a threshold of ±1.
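$$X = \frac{\sum_{i}(a_i - \bar{a})(b_i - \bar{b})}{\sqrt{\sum_{i}(a_i - \bar{a})^2}\,\sqrt{\sum_{i}(b_i - \bar{b})^2}} \qquad (7)$$
where $X$ denotes the Pearson Correlation Coefficient, $a_i$ are the values of the a variable in the dataset, $\bar{a}$ is the mean of the a values, $b_i$ are the values of the b variable in the sample, and $\bar{b}$ is the mean of the b values.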
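The cleaning and correlation steps described in this subsection can be sketched in plain Python as below. This is only an illustration under stated assumptions: the helper names (`impute_with_column_mean`, `pearson`, `highly_correlated_pairs`) are hypothetical, and what is done with the flagged feature pairs afterwards is left open because the study's exact selection rule is not spelled out here.

```python
import math


def impute_with_column_mean(column):
    """Replace None, NaN and +/-infinity with the mean of the finite values."""
    finite = [v for v in column if v is not None and math.isfinite(v)]
    mean = sum(finite) / len(finite)
    return [v if v is not None and math.isfinite(v) else mean for v in column]


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length numeric columns.

    Raises ZeroDivisionError if either column is constant (zero variance).
    """
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def highly_correlated_pairs(columns, threshold):
    """Return [(name_i, name_j, r)] for column pairs with |r| >= threshold."""
    names = list(columns)
    pairs = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            r = pearson(columns[first], columns[second])
            if abs(r) >= threshold:
                pairs.append((first, second, r))
    return pairs


if __name__ == "__main__":
    # Hypothetical feature names and values, for illustration only.
    raw = {
        "duration": [1.0, float("inf"), None, 3.0],
        "bytes_sent": [2.0, 4.0, 4.0, 6.0],
        "flag": [5.0, 1.0, 9.0, 2.0],
    }
    cleaned = {name: impute_with_column_mean(col) for name, col in raw.items()}
    print(highly_correlated_pairs(cleaned, threshold=0.9))
```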

D. DESCRIPTION OF MACHINE LEARNING PROSPECTS
In this study, MATLAB R2019b was used to examine all ML prospects, such as:
1) Trees: In this model, the value of a target attribute is predicted. A tree representation is used to learn basic decision rules deduced from the data characteristics. To achieve this, each internal node of the tree corresponds to a characteristic (feature), and each leaf node corresponds to a class label. A distinct constant value is assigned to each of the n regions, $A_1$ to $A_n$, as indicated in equation 8, where $Q_i(B) = 1$ when $B \in R_i$. This enables the model to predict a class A based on a feature B.
2) Support Vector Machines (SVM): This classifier builds a set of decision boundaries that help classify data values. A good separation is obtained by the decision boundary that has the largest distance to the nearest training data point of any class. Hence, the generalization error of the classifier is determined by the extent of the margin (either higher or lower). Given a training set $a_i \in Q^p$, for $i = 1, \ldots, n$, in two classes, and a vector $y \in \{1, -1\}^n$, the aim is to find $\alpha \in Q^p$ and $b \in Q$ such that the prediction given by $\mathrm{sign}(\alpha^T \varphi(a) + b)$ is accurate for most samples. SVM is represented as shown in equation 9, together with its constraints.
3) Ensemble Learning (EL) Algorithms: The objective of this algorithm, as depicted in equation 10, is to combine individual evaluators into a single major one, with the goal of enhancing its reliability over that of an individual evaluator.
where ρ is the dataset array, β shows the variable feedback. ∈ is the method of combination, σ marks the diverse hyper-parameter tuning and types of learners while represents the ensemble learning method. Then, outlines the correlation between the attributes.
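To make the idea of combining several evaluators concrete, here is a minimal sketch of majority voting over simple one-rule classifiers. Majority voting is only one possible combination method and is an assumption of this example; the text does not state which combination rule equation 10 uses. The names `make_stump` and `majority_vote`, and the thresholds, are hypothetical.

```python
from collections import Counter


def make_stump(feature_index, threshold):
    """Return a one-rule classifier: 'malicious' if the feature exceeds threshold."""
    def stump(sample):
        return "malicious" if sample[feature_index] > threshold else "benign"
    return stump


def majority_vote(classifiers, sample):
    """Combine base classifiers by returning the most common predicted label."""
    votes = [classify(sample) for classify in classifiers]
    return Counter(votes).most_common(1)[0][0]


if __name__ == "__main__":
    # Three illustrative base learners; an odd count avoids ties for two classes.
    ensemble = [make_stump(0, 5), make_stump(1, 2), make_stump(0, 10)]
    for sample in ([7, 1], [7, 3], [12, 3]):
        print(sample, majority_vote(ensemble, sample))
```

The trade-off noted in the related works applies here as well: every additional base learner adds prediction cost, which matters for time-critical operations.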

IV. PERFORMANCE EVALUATION
Accuracy appears to be the most accessible metric when choosing an algorithm for an ML task. Nevertheless, accuracy alone is not enough to select the best algorithm. The model needs to meet other conditions, such as data relationship, training and prediction time, interpretability, data format and other performance metrics. Combining a broad range of these factors supports a more confident decision.
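The point that accuracy alone is insufficient can be illustrated with a short sketch that derives several metrics from the same predictions and times a call. The function names are hypothetical, and treating the misclassification rate as the share of wrong predictions is an assumption of this example; it may not match the exact definition of MCE used in the study.

```python
import time


def confusion_counts(actual, predicted, positive="malicious"):
    """Return (tp, fp, tn, fn) for a binary problem."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif a == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def summarize(actual, predicted, positive="malicious"):
    """Compute accuracy alongside error-oriented metrics."""
    tp, fp, tn, fn = confusion_counts(actual, predicted, positive)
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "misclassification_rate": (fp + fn) / total,
        "false_alarm_rate": fp / (fp + tn) if fp + tn else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


def timed(function, *args):
    """Run function(*args) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = function(*args)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    # Toy labels for illustration only.
    actual = ["malicious", "malicious", "benign", "benign", "benign"]
    predicted = ["malicious", "benign", "benign", "malicious", "benign"]
    metrics, seconds = timed(summarize, actual, predicted)
    print(metrics, f"{seconds:.6f}s")
```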

A. EVALUATION OF THE DECISION TREES TECHNIQUE
In the class of Decision Tree (DT) models, the Fine, Coarse, Medium, and Optimizable variants were analyzed for all the datasets.

C. EVALUATION OF SUPPORT VECTOR MACHINES (SVMS) TECHNIQUE
This study examined all variants of SVM. The evaluation shows that, for all the datasets, the Fine, Medium and Coarse Gaussian SVMs recorded the best accuracy, in contrast to the Linear, Quadratic and Cubic SVMs, which recorded the lowest accuracy in all three (3) datasets. This technique is best suited to modelling non-linear decision boundaries [25].

D. EVALUATION OF LOGISTIC REGRESSION (LR) TECHNIQUE
The LR recorded accuracies of 96.3%, 93.6% and 91.5% for the NSL-KDDTrain+, CIRA-CIC-DoHBrw-2020 and NSL-KDDTest+ datasets, respectively. This is not acceptable given the importance of high accuracy in forestalling SCADA network attacks. Besides, LR is most fitting for probabilistic classification rather than for a linear relationship [26].

E. EVALUATION OF DISCRIMINANT ANALYSIS TECHNIQUE
The investigation reveals that the optimizable and quadratic discriminants had 92.9% accuracy, while the linear discriminant option had 92.8%, for CIRA-CIC-DoHBrw-2020. However, the algorithm failed on NSL-KDDTrain+ and NSL-KDDTest+. Consequently, the result shows the unsuitability of discriminant ML for attack detection and classification.
This is seen in the ensemble subspace discriminant approach and in the non-applicability of the algorithm to the NSL-KDDTest+ and NSL-KDDTrain+ datasets.

F. EVALUATION OF K-NEAREST NEIGHBOURS (KNN) TECHNIQUE
In the MATLAB toolbox, the KNN technique comprises the Fine, Medium, Cosine, Coarse, Cubic, Weighted and Optimizable variants. All the KNN algorithms had above 99% accuracy on the CIRA-CIC-DoHBrw-2020 dataset, except the Cosine and Coarse KNN with 98.8% and 98.4% accuracy, respectively. Although it offers high interpretability through feature importance, it is most suitable for linearly related data. However, it was observed that this algorithm is unsuitable for, and inapplicable to, the NSL-KDDTrain+ and NSL-KDDTest+ datasets. It is therefore evident that the KNN algorithm is sensitive to the relationships among data features.

G. EVALUATION OF NAIVE BAYES (NB) TECHNIQUE
The Kernel Naive Bayes (KNB) recorded a discouraging accuracy in the three (3) datasets, unlike the optimizable Naive Bayes (OGNB), which had an accuracy of 96% and 96.5% across the datasets, respectively. Overall, this shows that NB is not fit for countering attacks in the SCADA network.

H. REVIEW OF THE ANALYSIS OF PERFORMANCE EVALUATION
In selecting an ML algorithm, knowing how to make the choice that is most suitable for the specific problem is imperative. A comprehensive understanding of the relationship among the data features is critical, as are training and prediction time, interpretability, and data format. A total of twenty-nine (29) models were simulated; see Table 3 for the features of the model parameters. The results of the evaluation for the best- and least-performing techniques on the three (3) cybersecurity datasets are shown in Tables 4 and 5. A comparative analysis of the three (3) cyber-security datasets was carried out to further validate the proposed choice of models for SCADA IDS. Fig. 4 shows the confusion matrix of the best-performing algorithm on the respective datasets, and Fig. 5 shows the ROC curve of the best-performing algorithm across the three (3) datasets based on the MCE.
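One way to turn such a multi-criteria comparison into a repeatable rule is sketched below: shortlist every model whose accuracy is within a tolerance of the best, then prefer the shortest training time and, on a tie, the highest prediction speed. The rule itself, the tolerance, the field names and the candidate values are all assumptions for illustration; the numbers are placeholders and not results from this study.

```python
def select_model(candidates, accuracy_tolerance=0.5):
    """Pick the cheapest model among those close to the best accuracy.

    candidates: list of dicts with 'name', 'accuracy' (percent),
    'training_time_s' and 'prediction_speed_obs_per_s'.
    """
    best_accuracy = max(c["accuracy"] for c in candidates)
    shortlist = [
        c for c in candidates
        if best_accuracy - c["accuracy"] <= accuracy_tolerance
    ]
    # Shorter training time wins; higher prediction speed breaks ties.
    return min(
        shortlist,
        key=lambda c: (c["training_time_s"], -c["prediction_speed_obs_per_s"]),
    )


if __name__ == "__main__":
    # Placeholder values only, chosen to exercise the rule.
    candidates = [
        {"name": "model_a", "accuracy": 99.0, "training_time_s": 40.0,
         "prediction_speed_obs_per_s": 1000},
        {"name": "model_b", "accuracy": 98.8, "training_time_s": 5.0,
         "prediction_speed_obs_per_s": 50000},
        {"name": "model_c", "accuracy": 95.0, "training_time_s": 1.0,
         "prediction_speed_obs_per_s": 90000},
    ]
    print(select_model(candidates)["name"])
```

A lexicographic rule like this keeps the decision transparent, at the cost of requiring the tolerance to be justified for the application at hand.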