NETWORK INTRUSION DETECTION SYSTEMS USING SUPERVISED MACHINE LEARNING CLASSIFICATION AND DIMENSIONALITY REDUCTION TECHNIQUES: A SYSTEMATIC REVIEW

Protecting the confidentiality, integrity and availability of cyberspace and network (NW) assets has become an increasing concern. The rapid growth of the Internet and the emergence of new computing systems (such as the Cloud) are creating great incentives for intruders. Therefore, security engineers have to develop new technologies to match growing threats to NWs. New and advanced technologies based on machine learning (ML) and dimensionality reduction techniques have emerged to help security engineers build more effective NW Intrusion Detection Systems (NIDSs). This systematic review provides a comprehensive review of the most recent NIDSs that use supervised ML classification and dimensionality reduction techniques. It shows how the ML classifiers, dimensionality reduction techniques and evaluation metrics used have improved NIDS construction. The key aim of this study is to provide up-to-date knowledge for researchers new to the field.


INTRODUCTION
With new developments in NWs and communications, cybersecurity has become a vital requirement to defend against new cyber-attacks [1]. Recently, IDSs in general and NIDSs in particular have been increasingly used as tools to constantly monitor NW traffic and provide the desired security protection against cyber-attacks [2]. The earliest IDS was produced in 1980 by Jim Anderson and since then, such systems have been continuously developed and improved to keep pace with the rapid growth in the NW and communication fields [3]. The growth of cyberspace has introduced the Big Data concept to the IDS field, in which massive volumes of data are continually generated around the Internet. Security engineers have used this Big Data with different ML techniques for further IDS improvements [1]. Supervised ML NIDS depends on pre-collected datasets to learn how to distinguish between normal and abnormal NW traffic, so that it can detect intrusions in the future [3].
The main purpose of this systematic review is to provide a broad analysis of developments in modern supervised ML NIDSs. The core idea is to provide updated information on supervised ML NIDSs as a starting point for new researchers who want to explore this field. This study undertakes three main objectives to contribute to existing knowledge: (1) To conduct a systematic review of selected research papers concerned with supervised ML NIDS published from 2017 to March 2021 in the Science-Direct (Elsevier), Springer-Link (Springer) and IEEE-Explore (IEEE) libraries; (2) To review each research paper extensively and discuss the ML classifiers, dimensionality reduction algorithms and evaluation metrics it used; and (3) To highlight recent trends in using such technologies for building NIDSs and various future challenges.
There are many survey papers in the literature providing reviews on NIDSs, but this study is unique in applying a systematic approach to collect the most relevant research papers on NIDSs designed with supervised ML classification and dimensionality reduction techniques. This study reviews recent research papers published from 2017 to March 2021, providing up-to-date knowledge for researchers.
Section 2 reviews related studies in this area to present background information about IDSs and Section 3 details IDS categorization. Section 4 explains the research methodology, followed by the application of supervised ML and dimensionality reduction techniques in Section 5. Section 6 presents the evaluation metrics. Section 7 discusses the salient findings and identified challenges, while Section 8 concludes the paper.

RELATED WORKS
Numerous researchers have taken an interest in NIDSs and machine learning, and a variety of surveys and systematic reviews have summarized previous studies in this field. Zebari et al. [29] conducted a comprehensive review of dimensionality reduction techniques used in previous IDSs. For each study, they provided details about the algorithms, datasets and dimensionality reduction techniques used (categorized into feature selection and feature extraction) and summarized the achieved results. Although they analyzed recent studies (between 2018 and 2020), they did not follow a systematic approach. The current study, in contrast, collects the analyzed research papers systematically, to make the data collection more accurate and comprehensive.
Martins et al. [56] presented a systematic review of ML-based systems to detect intrusion and malware scenarios. They reviewed 20 research papers from multiple scientific e-libraries and compared them based on attack techniques, algorithms used, datasets, evaluation metrics and results. The limitation of their study was that they did not provide details about their systematic approach and did not mention whether the analyzed studies were recent or not. In our study, we provide a detailed description of the approach we followed.
Ahmad et al. [1] reviewed recent studies (from 2017 to 2020) that generally used machine learning and deep-learning techniques. Their review was notable in identifying the strengths and weaknesses for each reviewed study, which we have also applied in the current work.
Some studies that introduced the software system IDS were analyzed by Ramaki et al. [57]. They limited their study to ML techniques that used "Hidden Markov" models and did not provide a systematic approach for collecting research papers for analysis. This systematic review spans a wider domain, including NIDSs based on supervised ML techniques. Gonzalez et al. [58] developed a method for improving security inside secure military self-protected software and comprehensively analyzed the software's present position and potential responses to threats. Their method consisted of three stages: user detection, analysis of the current situation and reactive action. The detection phase consists of analyzing location, timing at the present location and identifying user type (friend or foe). The analysis phase entails determining whether self-protected software should be present at the current site, predicting future locations and analyzing the level of hazard at the current location. Analytical results showed that self-protected software that incorporates user detection provides higher protection than self-protected software that does not contain such detection capability.
Nassif et al. [59] analyzed ML approaches utilized to detect cloud system attacks with a detailed systematic review for 63 relevant research papers from 2004 to 2021. For each study, they identified the related security threats, ML techniques used and evaluation metrics' results. This systematic review provides more comparison criteria between the analyzed research papers.

IDS CATEGORIZATION
An IDS identifies abnormal events by constantly monitoring network traffic, keeping a network log and alerting the network administrator in the event of any intrusion. The IDS copies the network traffic for read-only analysis, to detect any suspicious events and notify the administrators about what is going on (so that they can take manual responsive actions). The IDS is deployed out-of-band (off the network line), so it does not affect the network data flow [4]. IDSs can be categorized based on their monitoring environment and detection approach. According to monitoring environment, the types of IDS are host-based (HIDS), NW-based (NIDS) and hybrid IDS, while according to detection approach they can be classified as signature-based, anomaly-based and hybrid [5] (Figure 1).

IDSs According to Monitoring Environment
HIDS operates on a local machine and detects local abnormal behaviours, such as changes to the host registry, unauthorized access attempts or attacks that cannot be detected by firewalls [5]. HIDS is considered a reliable system, because it analyzes the log files and can therefore efficiently determine whether an attack is active or not [6]. NIDS operates on an NW node, where it monitors and analyzes the traffic passing through that node to detect any abnormal traffic [7]. Some NIDSs analyze the payload of each NW packet (packet level), while others analyze only the packet header (flow level) [7]. Hybrid IDS integrates HIDS and NIDS in an effective way [2].

IDSs According to Detection Approach
Signature-based IDS (also called "misuse detection IDS" or "knowledge-based IDS") uses a blacklist of predefined intrusions and attacks. When any intrusion in the blacklist occurs, this IDS can detect it accurately, with no false alarms [8]. The disadvantages of this type are the required storage size and its inability to detect any novel intrusion that is not predefined on its blacklist. The blacklist therefore requires constant updates to be able to detect new intrusions [2].
Anomaly-based IDS (also called "behaviour-based IDS") uses a definition of normal NW traffic, and any deviation from that normality is detected as an intrusion. It compares the actual NW traffic with the predefined characteristics of normal traffic to detect any intrusions [9]. It can detect novel intrusions, but it suffers from a high false-alarm rate, as it is difficult to define uniform traffic among all NWs [2].
Hybrid IDS efficiently combines signature-based and anomaly-based approaches, to detect known attacks in the blacklist while simultaneously detecting new ones [2]. Anomaly-based NIDS is the main focus of this study; it can be developed using supervised, unsupervised or reinforcement ML techniques [7].

RESEARCH METHODOLOGY
The methodology used in this study is adapted from [10], collecting papers on NIDSs built with ML techniques published from 2017 to March 2021 in the Science-Direct (Elsevier), Springer-Link (Springer) and IEEE-Explore (IEEE) libraries. Search keywords were used to obtain results related to the research questions, which were then filtered through the inclusion and exclusion criteria phases (Figure 2).

RQ1.
What are the proposed supervised ML classification and dimensionality reduction techniques used to build the NIDS?
This question describes the supervised ML classification and dimensionality reduction techniques that previous studies have used to build NIDSs against cyber-attacks, in order to identify the popular techniques that can drive further improvement in this domain.

RQ2.
What are the evaluation metrics used to evaluate the proposed NIDS?

RQ3.
What are the best supervised ML classification and dimensionality reduction techniques used to build the NIDS?
The main purpose of NIDSs is to detect intrusions in real time, with high sensitivity and few false alarms. This question explores whether the built NIDS provides a noticeable enhancement in this domain and identifies techniques that enhance NIDS sensitivity without increasing processing overhead or affecting real-time detection.

E-Library Search Phase
Three e-libraries were selected to conduct this systematic review: Science-Direct (Elsevier), Springer-Link (Springer) and IEEE-Explore (IEEE). All are indexed by Scopus, the biggest database of peer-reviewed research papers. The search was conducted directly in the selected e-libraries for the period 2017-2021, using the search keywords shown in Table 1.

Selecting Pre-processing Phase
The initial search using the chosen keywords returned many hits. Their titles were cross-checked against the research questions and the inclusion and exclusion criteria to eliminate papers not directly related to machine learning-based NIDSs, leaving 550 papers. All authors independently scanned the titles and abstracts of these 550 research papers, which were categorized as unrelated (NR), partially related (PR) or related (R). At this stage, papers were excluded if their abstracts did not mention any techniques for supervised ML NIDS classification, feature selection, feature transformation or dimensionality reduction. A total of 170 research papers were marked NR or PR in the first review, and a second review of the unmarked set judged 44 papers to be R. Further reviewing was conducted to avoid any bias, and all reviewers later met to verify the exclusion of the papers deemed NR and PR. The final set of research papers was approved by all reviewers as related to this study, as shown in Figure 2. A total of 34 research papers remained, and the quality assessment criteria were applied again during the final full-text analysis of these 34 papers.

Quality Assessment
The quality assessment eliminated bias in the selection of research papers and ensured that clear criteria were used to determine the quality of the selected papers, as shown in Table 2. Each criterion was scored as follows: a score of 1 indicates that a research paper explicitly meets the criterion; a score of 0 indicates that it clearly does not; and a score of 0.5 was given to papers that were suspected to be related but needed more analysis and clarification, or that did not fully meet the criterion. Section 5 analyzes in detail the papers that achieved over 50% in the quality assessment.

Table 2. Quality assessment criteria.
No. | Assessment question | Assessment
Q1 | Does the paper topic cover the NIDS domain? | 1 / 0 / 0.5
Q2 | Does the paper use "machine learning techniques" or "machine learning and optimization techniques" or "machine learning and dimensionality reduction techniques"? | 1 / 0 / 0.5
Q3 | Is the proposed methodology fully defined? | 1 / 0 / 0.5
Q4 | Are the research results verified by clearly defined evaluation metrics? | 1 / 0 / 0.5
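
The scoring rule above can be illustrated with a short sketch. It assumes that a paper's overall result is the sum of its four scores expressed as a percentage of the maximum, which the text implies but does not state. The function names and the example scores are hypothetical.

```python
# Assumption: the overall result is the sum of the Q1..Q4 scores divided by the maximum.
ALLOWED_SCORES = {0.0, 0.5, 1.0}


def quality_percentage(scores):
    """Return a paper's total score as a percentage of the maximum possible score."""
    if any(s not in ALLOWED_SCORES for s in scores):
        raise ValueError("each score must be 1, 0.5 or 0")
    return 100.0 * sum(scores) / len(scores)


def passes_quality_assessment(scores, threshold=50.0):
    """A paper is analysed in detail only if it scores over the threshold."""
    return quality_percentage(scores) > threshold


example_scores = [1, 0.5, 1, 0]  # illustrative values, not taken from the review
print(quality_percentage(example_scores))
print(passes_quality_assessment(example_scores))
```

Under this reading, a paper scoring 0.5 on every question reaches exactly 50% and would not be analysed in detail, because the threshold is strict.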

Information Extraction
The research questions require extracting information from the selected research papers, such as the use of: ML classification algorithms; dimensionality reduction techniques; and evaluation metrics and their results.

SUPERVISED ML AND DIMENSIONALITY REDUCTION TECHNIQUES
Answering RQ1 requires a complete analysis of the most popular supervised ML techniques (their implementation and algorithms) used to build NIDSs and detailed analysis of the dimensionality reduction techniques used.

Building Supervised ML NIDSs
Supervised ML provides an intelligence technique to extract patterns from previously labelled datasets [11], learning from previous datasets to predict future values [12]. Studies built NIDSs through several phases, including data pre-processing, training and testing and evaluation.

Data Pre-processing Phase
Careful dataset preparation is required in supervised ML NIDSs to achieve the highest prediction accuracy and the most efficient performance in real-time intrusion detection; higher data quality leads to a more efficient NIDS [1]. The data pre-processing stages depend on the dataset, the requirements of the ML algorithm and the researcher's experience [1], [13]-[15]. In dataset cleaning, all duplicated or missing values are handled: duplicate values are deleted, and rows with missing values may be deleted or filled with the median, mean or most frequent corresponding values. All string values are transformed into numeric values, to be in a suitable format for the classification algorithm. For feature selection, unnecessary features are dropped either manually [14] or automatically (using dimensionality reduction techniques, as explained below). Some ML algorithms (e.g. the K-NN algorithm) require normalization (data scaling) to ensure a uniform range between values. Data splitting is applied to both the columns and the rows of the dataset. The columns are split into X, comprising all columns with independent variables, and Y, the column of the dependent variable that classifies rows (normal or abnormal traffic). Y is called the "label column" and is the key data classification element in supervised learning. Tables 3-5 show the research paper results for the ScienceDirect, IEEE and Springer-Link databases (respectively).
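
The pre-processing steps described above can be sketched as follows. This is a minimal illustration using only the Python standard library; the records, column meanings and helper names are hypothetical, and integer coding of strings is only one of several possible encodings.

```python
from statistics import median

# Hypothetical raw records: (duration, protocol, bytes, label); None marks a missing value.
raw_rows = [
    (2.0, "tcp", 100.0, "normal"),
    (2.0, "tcp", 100.0, "normal"),   # exact duplicate, to be dropped
    (None, "udp", 300.0, "attack"),  # missing duration, to be imputed
    (6.0, "icmp", 500.0, "attack"),
]


def drop_duplicates(rows):
    """Keep the first occurrence of each row, preserving order."""
    seen, unique = set(), []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique


def impute_median(rows, col):
    """Replace None in a numeric column with the median of the known values."""
    fill = median(r[col] for r in rows if r[col] is not None)
    return [r[:col] + (fill if r[col] is None else r[col],) + r[col + 1:] for r in rows]


def encode_strings(rows, col):
    """Map each distinct string in a column to a numeric code."""
    codes = {v: i for i, v in enumerate(sorted({r[col] for r in rows}))}
    return [r[:col] + (float(codes[r[col]]),) + r[col + 1:] for r in rows]


def min_max_scale(matrix):
    """Scale every column of a numeric matrix to the range [0, 1]."""
    columns = list(zip(*matrix))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0 for v, lo, hi in zip(row, lows, highs)]
        for row in matrix
    ]


rows = drop_duplicates(raw_rows)
rows = impute_median(rows, col=0)
rows = encode_strings(rows, col=1)
X = min_max_scale([list(r[:-1]) for r in rows])     # independent variables
Y = [1 if r[-1] == "attack" else 0 for r in rows]   # label column
```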

Training Phase
To learn, the supervised ML algorithm needs a partition of the dataset, called the training set [16]. The supervised ML classifier is fed with the independent (X) and dependent (Y) variables in the training set, to be able to predict Y values on its own in the future [12]. The size of the training set is important to help the ML algorithm learn efficiently, with a highly accurate prediction rate in the least amount of time [17]. Most commonly, the training set consists of 70-80% of the original dataset, with the remainder used as the testing set [16].
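
A minimal sketch of such a split is given below. The function name, the 80% default and the fixed seed are illustrative assumptions; shuffling before the cut is a common practice rather than something the reviewed papers are known to share.

```python
import random


def train_test_split(rows, train_fraction=0.8, seed=0):
    """Shuffle a copy of the rows and cut it into a training and a testing set."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


dataset = list(range(10))  # stand-in for ten labelled traffic records
train_set, test_set = train_test_split(dataset, train_fraction=0.8)
print(len(train_set), len(test_set))
```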

Testing and Evaluation Phase
The testing set is fed to the trained ML algorithm with only the X values, to test its ability to predict Y values. Predicted and actual Y values are then compared using evaluation metrics [16] (Section 6), to measure the trained ML algorithm's prediction ability and test its suitability with real NW traffic [18]. Supervised ML classification algorithms thus use independent (X) and dependent (Y) values and learn how they relate to each other in the training phase, then the trained algorithm is provided X values to evaluate performance in predicting Y values in the testing phase. Finally, the predicted results are evaluated using evaluation metrics [12].

Decision Tree (DT)
The DT algorithm represents the feature values as nodes in a hierarchical tree, to divide the classification problem into sub-sets [19]. A DT consists of nodes that represent features, branches that represent decision rules and leaves that represent a class value (e.g. malicious or normal traffic) [12]. The DT algorithm forecasts class values based on decision rules learned from the features [20]. DT algorithms include ID3; C4.5, an extension of ID3 that is available as the open-source Java implementation J48 [21]; and REP-Tree [22], another open-source DT implementation [23]. Aljawarneh et al. [19] proposed an anomaly ML NIDS using the REP-Tree classifier, with pre-processing (feature selection using a Vote scheme), training and testing phases. Their proposed NIDS obtained highly accurate results for detecting NW intrusions.
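
The node, branch and leaf structure can be illustrated with a hand-written tree. This sketch only shows how a trained tree classifies a sample. The feature names and thresholds are hypothetical, and a real DT algorithm would learn its splits from the training set.

```python
# A hand-written tree for illustration only; a real DT learns its splits from data.
tree = {
    "feature": "failed_logins",          # internal node: tests one feature
    "threshold": 3,
    "left": {"label": "normal"},         # branch taken when failed_logins <= 3
    "right": {                           # branch taken when failed_logins > 3
        "feature": "bytes_sent",
        "threshold": 10_000,
        "left": {"label": "normal"},     # leaf: class value
        "right": {"label": "malicious"},
    },
}


def classify(node, sample):
    """Follow the branches from the root until a leaf is reached."""
    while "label" not in node:
        branch = "left" if sample[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["label"]


print(classify(tree, {"failed_logins": 5, "bytes_sent": 50_000}))
```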

K-Nearest Neighbour (K-NN)
The K-NN algorithm represents the given training data as neighbour points in a graph and assigns a new data point to the class of its K nearest neighbour points. Figure 3 shows K-NN performance with K = 5.
The distance between the new data point (X1, Y1) and any other neighbour point (X2, Y2) is calculated using the Manhattan (Eq. 1) or Euclidean (Eq. 2) equations [1]. After calculating the distances, the new data point is classified according to the closest points [12].
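
The following sketch shows one way to implement this procedure. It uses the standard definitions of the two distances, |X1 - X2| + |Y1 - Y2| for Manhattan and sqrt((X1 - X2)^2 + (Y1 - Y2)^2) for Euclidean, which are assumed to correspond to Eq. 1 and Eq. 2. The data points and the majority-vote rule among the K neighbours are illustrative.

```python
from collections import Counter


def manhattan(p, q):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))


def euclidean(p, q):
    """Square root of the sum of squared coordinate differences."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5


def knn_predict(train_points, train_labels, new_point, k=3, distance=euclidean):
    """Classify new_point by majority vote among its k nearest training points."""
    ranked = sorted(
        zip(train_points, train_labels),
        key=lambda pair: distance(pair[0], new_point),
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical two-feature training data.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
labels = ["normal", "normal", "normal", "attack", "attack", "attack"]
print(knn_predict(points, labels, (2, 2), k=3))
print(knn_predict(points, labels, (7, 9), k=3, distance=manhattan))
```

Choosing an odd K for a two-class problem avoids tied votes.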

Naïve Bayes (NB)
NB is one of the most common machine learning classifiers. NB classification is based on Bayes' theorem [24]. NB measures the likelihood of a given prediction based on the available features, with each feature assumed to contribute independently to predicting unknown data [2].
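
For reference, the rule that NB applies can be written as follows. This is the textbook formulation, with y denoting the class (e.g. normal or abnormal traffic) and x_1, ..., x_n the feature values. The symbols are introduced here for illustration and do not appear in the reviewed papers.

```latex
% Bayes' theorem, followed by the "naive" assumption that features are
% conditionally independent given the class y.
P(y \mid x_1, \dots, x_n)
  = \frac{P(y)\, P(x_1, \dots, x_n \mid y)}{P(x_1, \dots, x_n)}
  \;\propto\; P(y) \prod_{i=1}^{n} P(x_i \mid y)

% Predicted class: the one with the highest posterior probability.
\hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i \mid y)
```

The denominator does not depend on y, which is why it can be dropped when comparing classes.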

Support Vector Machine (SVM)
SVM algorithm combines statistical theory with supervised learning by finding the best way to split data into two classes by adding a boundary between them, regardless of whether the data can be divided linearly or not [8]. Essentially, this algorithm finds the best possible boundaries in the data collection to distinguish between classes [24].

ML Ensemble Methods
Ensemble methods integrate several supervised ML classifiers to solve a complex problem and increase accuracy by pooling the strengths of the individual classifiers [20]. For example, some algorithms may perform well in detecting a certain type of attack but poorly in detecting others, so combining them forms a stronger classifier [25]. Several ML techniques (Random Forest (RF), AdaBoost, XG-Boost, etc.) use the ensemble method to enhance performance. The RF classifier integrates many DT classifiers instead of depending on a single decision tree, taking the prediction from each tree and producing the final prediction by majority vote [17]. AdaBoost improves the performance of binary classifiers by employing an iterative approach, learning from the errors of weak classifiers and transforming them into strong ones [26]. XG-Boost consists of multiple DTs to solve a wide range of data-mining problems quickly and accurately [27].
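
The majority-vote idea behind ensembles such as RF can be sketched as below. The three rule-based classifiers are hypothetical stand-ins for trained models, and the sketch shows only the voting step, not how RF, AdaBoost or XG-Boost train their members.

```python
from collections import Counter


# Hypothetical weak classifiers; each looks at a single feature.
def by_logins(sample):
    return "attack" if sample["failed_logins"] > 3 else "normal"


def by_bytes(sample):
    return "attack" if sample["bytes_sent"] > 10_000 else "normal"


def by_duration(sample):
    return "attack" if sample["duration"] < 1.0 else "normal"


def majority_vote(classifiers, sample):
    """Return the class predicted by most of the individual classifiers."""
    votes = Counter(clf(sample) for clf in classifiers)
    return votes.most_common(1)[0][0]


ensemble = [by_logins, by_bytes, by_duration]
sample = {"failed_logins": 5, "bytes_sent": 500, "duration": 0.2}
print(majority_vote(ensemble, sample))
```

Here one classifier votes "normal" while the other two vote "attack", so the ensemble reports an attack even though one of its members missed it.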

Dimensionality Reduction Techniques
Supervised ML and Big Data mining techniques are very complex and incur high computational costs due to the voluminous data processed [1]. Real-time detection and accurate detection rates in NIDSs are major concerns in relation to the "curse of dimensionality," which refers to the complexity an ML model acquires when the dataset has a large number of features, both necessary and unnecessary [28]. Dimensionality reduction techniques seek to reduce the number of features processed by selecting or extracting only the relevant ones from the feature set, excluding irrelevant, noisy or redundant ones [29]. Several algorithms reduce the feature space either by removing features that do not provide important information or by extracting relationships between the available features to produce a smaller space of new features [30]. This reduces complexity, increases understanding of the data, facilitates analysis, improves visualization and reduces processing costs and storage space requirements [6], [29]. The ML model learning process is thus enhanced, resulting in higher performance and prediction accuracy and enabling real-time prediction [30]. Dimensionality reduction can be conducted by two approaches.
First, the feature transformation/extraction approach transforms the available features into more beneficial ones using optimization algorithms [28]. Second, the feature selection approach selects features according to their relevance and effectiveness for the classification problem [29], without changing their representation [31].
Researchers can choose one of four methods to implement their feature selection approach, which differ in how the ML algorithm functions [31] (Figure 4), as discussed below.

Filter Method
Weights are assigned to features to determine their relevance and essence (dependency, consistency, etc.) using statistical standards, without involving the ML algorithm [29]. Depending on the assigned weights, features are either discarded or retained [31]. The filter method has been found to outperform other feature selection methods, with lower computational costs, more scalability in high-dimensional datasets and more efficiency [29], [32]. Its drawbacks are that it does not take into account the interaction between the selected subset and the ML algorithm [29] and that it is only suited to independent features [32].
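
A minimal sketch of the filter method follows. It assumes the absolute Pearson correlation between each feature and the label as the statistical weight, which is only one of many possible criteria. The feature names and values are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 for a constant column."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return 0.0 if sx == 0 or sy == 0 else cov / (sx * sy)


def filter_select(features, labels, keep=2):
    """Weight each feature by |correlation with the label| and keep the top ones."""
    weights = {name: abs(pearson(col, labels)) for name, col in features.items()}
    ranked = sorted(weights, key=weights.get, reverse=True)
    return ranked[:keep], weights


# Hypothetical feature columns; labels: 0 = normal, 1 = attack.
features = {
    "failed_logins": [0, 1, 5, 6],
    "bytes_sent": [100, 300, 200, 400],
    "packet_id": [7, 7, 7, 7],  # constant, carries no information
}
labels = [0, 0, 1, 1]
selected, weights = filter_select(features, labels, keep=2)
print(selected)
```

No classifier is trained at any point, which is what keeps the filter method cheap.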

Wrapper Method
Wrapping creates an interaction between the ML algorithm later used for classification and each selected feature subset. The ML algorithm is used as a black box with each candidate subset, to evaluate prediction accuracy and determine which subset yields the fewest errors [30]. The wrapper method is thus accurate and efficient [31], but it is time-consuming and expensive, and because the selected subsets work only with a particular ML algorithm, it may cause over-fitting [29].
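
The wrapper idea can be sketched as follows. The sketch makes several illustrative assumptions: the black-box classifier is a 1-nearest-neighbour model, subsets are scored by leave-one-out accuracy, and all subsets are enumerated exhaustively, which is feasible only for very few features. The data are hypothetical.

```python
from itertools import combinations


def nearest_neighbour_label(train, sample, cols):
    """1-NN prediction that uses only the feature indices in cols."""
    def dist(row):
        return sum((row[0][c] - sample[c]) ** 2 for c in cols)
    return min(train, key=dist)[1]


def loo_accuracy(rows, cols):
    """Leave-one-out accuracy of the black-box classifier on one feature subset."""
    hits = 0
    for i, (x, y) in enumerate(rows):
        rest = rows[:i] + rows[i + 1:]
        hits += nearest_neighbour_label(rest, x, cols) == y
    return hits / len(rows)


def wrapper_select(rows, n_features):
    """Try every non-empty subset; keep the first one with the highest accuracy."""
    best_cols, best_acc = None, -1.0
    for size in range(1, n_features + 1):
        for cols in combinations(range(n_features), size):
            acc = loo_accuracy(rows, cols)
            if acc > best_acc:  # strict '>' prefers smaller subsets on ties
                best_cols, best_acc = cols, acc
    return best_cols, best_acc


# Hypothetical data: feature 0 separates the classes, feature 1 is noise.
rows = [
    ((0.0, 5.0), 0), ((1.0, 0.0), 0), ((0.5, 9.0), 0),
    ((8.0, 5.5), 1), ((9.0, 0.5), 1), ((8.5, 9.5), 1),
]
print(wrapper_select(rows, n_features=2))
```

The classifier is re-evaluated for every candidate subset, which illustrates why the wrapper method is considered time-consuming.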

Embedded Method
Embedding feature selection with the ML algorithm assigns weights independently and the highly weighted features are recursively used to construct subsets until finding the optimal one; its prediction accuracy outperforms others with the ML algorithm [30], [31]. The embedded method reduces the computational cost and the possibility of over-fitting [29].

Hybrid Method
The hybrid combination of filter and wrapper methods is the most commonly used solution, accruing the advantages of both to achieve better performance [29]. This systematic review found that the adopted dimensionality reduction techniques vary according to the problem addressed by each research paper. In some problems, feature selection was avoided, because removing features from the dataset would be misleading. In others, feature selection was preferred, because it keeps the meaningful original features and confines dimensionality reduction to the selected features. Mazini et al. [33] proposed a hybrid anomaly NIDS to detect attacks that threaten network activities. They mentioned that data-mining techniques were implemented to overcome the disadvantages of imbalanced datasets and the complexity of feature values. Furthermore, to reach the best performance of the AdaBoost classification algorithm, they used the Artificial Bee Colony algorithm (a wrapper method) for feature selection. Selecting the most significant features to train the classifier increases the detection accuracy and reduces false alarms.

EVALUATION METRICS
Answering RQ2 and RQ3 requires a complete analysis of evaluation metrics used to evaluate the proposed NIDS in each research paper.

Evaluation Metrics
During the building of any ML model, particularly in the testing phase, many metrics are used to evaluate performance [7], [27], [32]. Most of these measures are derived from the confusion matrix, which consists of two columns displaying the predicted values and two rows displaying the actual values. In NIDS, predicted or actual values are positive if the NW traffic is abnormal (an intrusion) and negative if it is normal, as shown in Figure 5. Recall Value (Re-V), also called the detection rate, measures NIDS sensitivity [12], [36].
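
A short sketch of how such metrics can be derived from the confusion matrix is given below. It uses the standard definitions and assumes that AR denotes the accuracy rate, DR the detection rate (recall) and FPR the false positive rate. The label vectors are hypothetical.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count TP, FP, TN and FN; 'positive' marks abnormal (intrusion) traffic."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    tn = sum(a != positive and p != positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    return tp, fp, tn, fn


def nids_metrics(actual, predicted):
    """Standard definitions; the AR, DR and FPR names are assumed abbreviations."""
    tp, fp, tn, fn = confusion_counts(actual, predicted)
    return {
        "AR": (tp + tn) / (tp + fp + tn + fn),          # accuracy rate
        "DR": tp / (tp + fn) if tp + fn else 0.0,       # detection rate (recall)
        "FPR": fp / (fp + tn) if fp + tn else 0.0,      # false positive rate
    }


actual = [1, 1, 1, 0, 0, 0, 0, 1]      # 1 = intrusion, 0 = normal
predicted = [1, 1, 0, 0, 0, 1, 0, 1]
print(nids_metrics(actual, predicted))
```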
The Receiver Operating Characteristic - Area Under the Curve (ROC-AUC) rate is the area under the curve that visualizes the relation between the True Positive Rate (TPR) and the False Positive Rate (FPR) across the confusion matrices produced by every classification threshold in binary classification [8], [37]. The higher the TPR and the lower the FPR, the higher the ROC-AUC score [13]. For further evaluation of NIDS performance, researchers calculate the time consumed in the training and testing phases (Tr-T and Ts-T, respectively), to check that the NIDS is lightweight, easy to install and able to provide real-time detection of NW intrusions [12]. Tables 9-11 show that most researchers relied on AR, DR and FPR to evaluate their proposed NIDS, so these metrics are considered to answer RQ3 in the next section.
Table 9. Results of evaluation metrics for each research paper - ScienceDirect.

RQ2 and RQ3 Results
For RQ2, after analyzing the evaluation metrics used and the highest evaluation results achieved by the selected research papers, Tables 12-14 show that most researchers relied on AR, DR and FPR to evaluate their proposed NIDS, so these metrics are considered to answer RQ3. Determining the best ML and dimensionality reduction techniques used to build NIDS requires summarizing all techniques used in the selected research papers, showing their AR, DR and FPR (Tables 12-14). Figure 6 shows how many supervised ML classifiers are used in the selected research papers. It can be observed that the RF classifier is generally preferred, due to its accurate classification performance (i.e., its ability to detect zero-day attacks) and low computational costs in real time (Table 6). Among the selected research papers, feature selection is the most used dimensionality reduction technique for the proposed NIDSs (Figure 7). These techniques reduce feature dimensionality to reduce the complexity of the training and testing phases, ultimately ensuring real-time detection, but at the cost of more computational resources. Figure 8 shows that the most used evaluation metrics are AR, DR and FPR. An efficient NIDS requires high AR and DR, with low FPR. Thus, these values must be calculated to evaluate the efficiency of an NIDS.

Research Challenges
Most of the proposed NIDSs were constructed in laboratory conditions (not in a real environment), using predefined datasets, and there is no proof of their efficiency in real-world implementations. Testing NIDS effectiveness in real NW traffic remains a research challenge. The proposed NIDSs are complex and their computational and time costs are considerable, which may affect real-time detection. Although dimensionality reduction techniques are being used for this purpose, more improvement is still needed in the field.
Figure 6. Use of supervised ML classifier.
Figure 7. Use of dimensionality reduction techniques.
Figure 8. Use of evaluation metrics.

CONCLUSIONS
This systematic review extensively analyzed NIDSs based on supervised ML classifiers and dimensionality reduction techniques to provide updated knowledge for researchers new to this field. A systematic approach was adopted to select relevant research papers to answer the RQs. According to the results, RF is the most used supervised ML classifier, due to its accurate classification performance and low computational costs. Feature selection techniques are the most used for dimensionality reduction in recently proposed NIDSs. These techniques reduce feature dimensionality to reduce the complexity of the training and testing phases and eventually ensure accurate real-time detection, but they need more computational resources. The most commonly used metrics are AR, DR and FPR. An efficient NIDS requires high AR and DR, with low FPR; these values must be determined to evaluate NIDS efficiency. This systematic review concludes that despite all efforts in the ML NIDS field, there are still some challenges facing interested researchers, including proving the effectiveness of the proposed ML NIDS implementations in a real NW traffic environment and reducing their complexity to ensure real-time detection. This systematic review is limited to only 34 research papers within the domain of supervised anomaly ML-based NIDSs. Future work needs to address more research papers in a broader domain, including ML and deep learning techniques.