Deep Learning-Based Hybrid Intelligent Intrusion Detection System

: Machine learning (ML) algorithms are often used to design effective intrusion detection (ID) systems for appropriate mitigation and effective detection of malicious cyber threats at the host and network levels. However, cybersecurity attacks are still increasing. An ID system can play a vital role in detecting such threats. Existing ID systems are unable to detect malicious threats, primarily because they adopt approaches that are based on traditional ML techniques, which are less concerned with the accurate classification and feature selection. Thus, developing an accurate and intelligent ID system is a priority. The main objective of this study was to develop a hybrid intelligent intrusion detection system (HIIDS) to learn crucial features representation efficiently and automatically from massive unlabeled raw network traffic data. Many ID datasets are publicly available to the cybersecurity research community. As such, we used a spark MLlib (machine learning library)-based robust classifier, such as logistic regression (LR), extreme gradient boosting (XGB) was used for anomaly detection, and a state-of-the-art DL, such as a long short-term memory autoencoder (LSTMAE) for misuse attack was used to develop an efficient and HIIDS to detect and classify unpredictable attacks. Our approach utilized LSTM to detect temporal features and an AE to more efficiently detect global features. Therefore, to evaluate the efficacy of our proposed approach, experiments were conducted on a publicly existing dataset, the contemporary real-life ISCX-UNB dataset. The simulation results demonstrate that our proposed spark MLlib and LSTMAE-based HIIDS significantly outperformed existing ID approaches, achieving a high accuracy rate of up to 97.52% for the ISCX-UNB dataset respectively 10-fold cross-validation test. It is quite promising to use our proposed HIIDS in real-world circumstances on a large-scale.


Introduction
The intrusion detection (ID) system is a renowned solution for detecting malicious activities in a network. The types of malicious network attacks have grown exponentially, and the ID system has become an essential component of defense in addition to network security infrastructure. In 1931, John Anderson published the rst signi cant paper on ID, Computer Security surveillance, and threat monitoring [1]. An ID system usually monitors all internal and external packets of a network to detect whether a packet has a sign of intrusion. A well-made ID system can determine the properties of numerous malicious activities and automatically respond to them by sending cautions.
In general, there are three common ID system classes; these classes are based on detection approaches. The rst class is the signature-based system (SBS), which includes the misuse detection technique. The second is the anomaly-based system (ABS), also known simply as "anomaly." The third one is the stateful protocol analysis detection [2]. SBS relies upon a pattern matching technique, taking a database of known attack signatures and comparing these to signatures present in the observed data. An alarm goes off when a match is identi ed. SBS detects attacks based on existing knowledge; as such, the misuse detection technique is also recognized as a knowledge-based technique. The misuse detection technique features a minimum false alarm rate and maximum accuracy; it cannot, however, identify strange attacks. Similarly, the behaviorbased ID system, also known as ABS, can detect intrusion by matching normal behavior to an abnormal one. The stateful protocol ID method compares the known malicious activities and identi es the eccentricity of protocol activity, taking advantage of both anomaly and signaturebased ID techniques. ID systems can be further categorized into three types according to their architectures: Network-based detection system (NIDS), Host-based detection system (HIDS), and the hybrid approach [3]. For a HIDS application, the software is xed, and the host computer plays an important role in evaluating and monitoring system behavior and event log les play active roles in ID [4]. Unlike a HIDS, which analyzes each host separately, a NIDS analyzes the packets that ow above the network. This gives the NIDS an edge over the HIDS because it can test the whole network with a unique system structure. However, while the NIDS is superior in terms of installation cost and time of application software, it is vulnerable to distribution into a system over the network and affects the complete network. The hybrid IDS combine both the HIDS and NIDS with better-quality security mechanisms. The hybrid system joins the spatial sensors to identify vulnerabilities, which can occur at a particular point or over the whole network. There are two main ID system types, which are de ned according to the system's deployment structure: distributed structure and non-distributed structure. A distributed structure involves several ID subsystems that communicate with each other over an extensive network. In contrast, a non-distributed system can be mounted only at a single, unique location, for example, an open-source snort.
Most approaches currently used in ID systems are unable to deal with the complex and dynamic nature of malicious threats on computer networks. Therefore, effective adaptive methods, such as several ML techniques, can achieve a higher intrusion detection rate (DR), a low false alarm rate (FAR), reasonable communication, and computation costs. There are various customary approaches to intrusion identi cation, including access control, cryptography, and rewalls. These traditional ID techniques have few limitations in fully protecting a system; most notably, when systems are facing a high volume of malicious attacks, DOS and systems can obtain high values of FP and FN attack DR. Recently, numerous researchers have used ML techniques for ID to improve ID rates. Several studies have been done to enhance and apply this method to the ID system. In the present study, we also reviewed several articles that use state-of-the-art ML methods for ID. The ML models were found to have many issues that slow down the training process; these issues included the size of the dataset and the optimal parameters for the most suitable model. These kinds of problems prompted the researchers to look for the most effective methodology. The use of open source clustering computing, such as Spark (a big data analytics system), is one potential solution to such problems. We propose a unique tactic to improve ID system performance. However simple ML approaches are limited, while intrusion methods are expanding and growing increasingly complex. Advanced learning approaches are essential, especially in the analysis of intrusion and feature extraction. Hinton et al. [5] stated that DL has attained great achievements in domains such as image processing, NLP, and weather prediction.
With the growth of cyber-security capabilities, cyber-attacks have risen to the challenge of breaching new security defenses. Considering the possibility of simultaneous attacks, it is vital to select an appropriate action by proactively predicting and evaluating the effects of a speci c security event. Cyber-attacks, mainly in the sphere of large-scale military networks, can have a lethal in uence on security; therefore, numerous tests and extensive research are required preparations. Today, cyber-space is known as the fth battlespace, following land, air, space, and sea; cyber warfare can affect the military strategies and activities that are associated with national security. Although the military is working to recognize and minimize cyber-attacks, cyber-attacks are consistently on the rise [6,7]. It is signi cant to observe the malicious threat that arises in distinct ways to respond effectively to it. The signi cance of cyber-attacks on the infrastructure that should be protected and security policies established. It is not only to analyze cyber threats but also to increase the possibility of more proactive reactions. Numerous studies have been carried out on cyber-attacks modelings, such as the attack graph, attack tree, and cyber kill chain modeling approach [8,9]. Notice that previous works on cyber-attack modeling were limited in a large-scale network environment, due to problems such as scalability. Nowadays, cyber threats do not stop with a single attack but come in complex forms that involve numerous kinds of cyberattacks. Besides, novel attacks are constantly emerging. To overcome these challenges, a novel approach to modeling that is exible enough to adapt to new attacks easily and systematically is required [10].
As mentioned above, misuse and anomaly ID methods have their limitations. Our proposed HIID approach combines the two approaches to overcome their respective shortcomings while maintaining their advantages, which involve improved performance compared to conventional techniques. To increase IDS learning ability and performance, we propose a better-quality ID system that consists of Spark MLlib and state-of-the-art DL approaches, such as LSTMAE. The key contributions of our research may be summarized as follows: • The development of HIIDS, which relies on Spark MLlib and state-of-the-art DL techniques, such as LSTMAE, which merges both shallow and deep networks to overwhelm their analytical overheads and exploit their bene ts. This HIIDS investigates how to solve the class imbalance problem that usually occurs in ISCX ID datasets.
• Further investigation of the packet capture le directly on Spark; prior studies did not evaluate the raw packet dataset.
• Comparison of the HIIDS with other conventional ML methods. The simulation results demonstrate that the HIIDS approach is highly appropriate for malicious traf c detection. It has higher attack detection accuracy and was found to correctly detect network misuses in 97.52% of the cases through 10-fold cross-validation.
The rest of this paper is organized as follows. The background of ID and related work are brie y reviewed in Section 2. A brief overview of our proposed HIIDS and a detailed description of the dataset that was used for classi cation are provided in Section 3. A simulation of our proposed framework with performance metrics is discussed in Section 4. The paper is concluded with a possible direction for future work described in Section 5.

Related Work
Over the last two decades, the application of machine learning (ML) and deep learning (DL) to intrusion detection (ID) systems has been suggested by several researchers. Therefore, various models have been developed for network intrusion detection (NID) using conventional ML techniques. Examples include K nearest neighbors (KNN) as suggested by Khammassi et al. [11] Logistic Regression (LR) as suggested by Moustafa et al. [12] Support vector machine (SVM) as suggested by Khan et al. [13] Random Forest (RF) as suggested by Farnaaz et al. [14] Decision Tree (DT) as suggested by Sindhu et al. [15] Naïve Bayes (NB) as suggested by Buczak et al. [16] and Arti cial Neural Networks (ANN) as suggested by Vincent et al. [17]. However, these prior techniques demonstrate inadequate classi cation performance with the maximum false alarm rate (FAR) and low attack detection rate (DR) in an ID system. Kim et al. [18] developed a hybrid system that incorporates misuse and anomaly using supervised ML classi ers SVM and DT, respectively, and assessed their hybrid approach using NSL_KDD older data. The authors claimed that the improved attack detection accuracy was owing to the hybrid ID system. Paulauskas et al. [19] developed a novel approach for ID using various weak learners; this is known as the ensemble approach. The weak learners have low malicious detection accuracy. There were some weak learners, such as J48, C5.0, Naïve Bayes, and rule-Based classi ers that were used by the authors. Zaman et al. [20] used a better-quality ID algorithm recognized as enhanced support vector decision function (ESVDF) and evaluated their proposed IDS using the DARPA dataset; the proposed IDS was found to be superior to other conventional ID approaches.  [27] MCA + EMD 87.29 ISCX-2012 Thi-Thu et al. [28] FS + DT 95.33 ISCX-2012 Mighan et al. [29] SAE + SVM 90.3 ISCX-2012 Wang et al. [30] HAST − IDS 96.6 ISCX-2012 Although the above ID approaches have demonstrated decent accuracy up to a certain level, certain improvements, such as decreasing the number of FAR and increasing the ID accuracy, are necessary. In this regard, DL is a powerful technique. DL is a branch of ML that has become progressively dominant in various elds, such as speech recognition and natural language processing (NLP). DL's popularity is due to its two fundamental characteristics: (a) hierarchical features representations and (b) handling of long-term dependencies of sequential patterns. Today, state-of-the-art DL approaches that are used for NID includes auto-encoders (AE), deep belief networks (DBNs), deep neural networks (DNNs), and restricted Boltzmann machines (RBMs) as well as variants of these approaches. An overview of the state-of-the-art approaches is presented in Tab. 1.
DL has shown that its attack detection accuracy in the ID domain effectively exceeds conventional approaches [31]. Erfani et al. [32] developed a novel tactic that joined one-class linear SVM with the DBN for ID, evaluating it with various benchmark ID data. Fiore et al. [33] proposed a new technique, discriminative RBM, to learn compressed attributes from data attributes; these compressed attributes are then used for binary classi cation purposes into softmax classi er for benign and malicious network behaviors. Wang et al. [34] presented a DL-based IDS based on AE for detecting network traf c from the raw dataset and achieved a very high ID performance. Javaid et al. [35] used the DNN technique for anomaly detection. Their evaluation based on DNN DL found that the DNN technique is a novel and effective approach for ID in a softwarede ned network (SDNs). Yin et al. [36] introduced a neural network (NN) DL-based NIDS. This DL-based ID was tested using the NSL_KDD dataset; it was found that the DL-based IDS outperformed conventional ML-based classi cation techniques. Khan et al. [37] presented the hybrid DL approach for ID and applied it to real-time ID data. Their simulation outcomes showed that the hybrid DL-based IDS was superior in terms of attack classi cation accuracy and performance. Alrawashdeh et al. [38] proposed a DL-based IDS using the DBN of RBM with four and one hidden layers for attribute reduction purposes; the weights of the DBN were restructured during ne-tuning, and attack classi cation was accomplished using an LR classi er. The developed methodology was evaluated on the benchmark KDD99 data and attained an attack classi cation accuracy of up to 97.9% with a FAR of 0.5%. However, the attack classication accuracy as evaluated using this ancient data is not suf cient to show that this a robust approach for NID. Shone et al. [39] proposed a non-symmetric deep AE-based ID and evaluated the proposed framework with the benchmark KDD99 dataset, achieving an attack classi cation accuracy of 97.87% and a FAR of 2.15%. In [40], the authors aimed for a Deep Neural Network (DNN) of 100 hidden units. To improve performance, they utilized a GPU and the KDD99 dataset. The authors proposed that the models of both recurrent neural network (RNN) and long short-term memory (LSTM) are better for enhancing the attack detection accuracy. These ID systems based on DL techniques were found to be superior to traditional approaches; the authors also presented various ideas by joining DL and ML techniques, with the primary goal of developing an ef cient and robust ID system. Wang et al. [41] developed a novel approach for ID by combining fuzzy clustering and ANN; they tested a novel hybrid approach on the KDD99 dataset and demonstrated that their hybrid FC-ANN approach outperforms traditional ML approaches in terms of ID. Mukkamala et al. [42] used a hybrid approach by combining the SVM and ANN; they evaluated this approach on the benchmark KDD99 dataset. Here, SVM and ANN were used for classi cation tasks and data patterns, respectively. Various researchers have used the ISCX-2012 ID dataset to conduct suitable system validation. However, there is still much room for enhancements, such as improving attack detection accuracy and reducing FAR [43][44][45][46][47][48]. ID research has been carried out by various scholars for developing both the ABS and SBS using separate classi cation methods. These methods fail to afford the ef cient possibility of attack detection, so a hybrid ID system is an important research challenge. MLbased techniques have been mostly used by scientists and engineers to develop an ABS, which can make a model by comparing normal with abnormal behavior and then attempting to classify whether upcoming new packets are "attack" or "normal." DL is enormously valuable for the ID system because it automatically extracts features of the speci c problem without requiring robust preceding knowledge. The main downside of using the DL model for the ID domain is the extent of the training; obtaining the right model is time-consuming.
The research community has drawn substantial attention to the issue of class imbalance [49]. The problem of class imbalance is created by insuf cient data distribution; one class contains most samples, while others contain comparatively few. The classi cation problem becomes more complicated as data dimensionality increases due to unbounded data values and unbalanced classes. Bedi et al. [50] utilized numerous ML approaches to deal with the class imbalance issue. Thabtah et al. [51] also evaluated various approaches to the class imbalance problem. Most data samples are targeted by most of the algorithms while missing the minority data samples. As a result, minority samples appear irregularly but constantly. The main algorithms for solving the unbalanced data problem are data preprocessing and feature selection techniques, and every approach has both bene ts and shortcomings. The ID dataset has a high-dimensional imbalance problem including missing features of interest, missing feature values, or the sole existence of cumulative data. The data appear to be noisy, containing errors and outliers, and unpredictable, comprising discrepancies in codes or names. We used over-sampling to resolve the problem of the imbalance; this involved enlarging the number of instances in the minority class by arbitrarily replicating them to increase the presence of the minority class in the sample. Although this procedure has some risk of over tting, no information was lost, and the over-sampling approach was found to outperform the under-sampling alternative.
With the accelerated growth of big data, DL approaches have ourished and have been widely utilized in numerous domains. In contrast to previous studies, we took a hybrid approachthe Anomaly-Misuse ID method-to two-stage classi cation to overwhelm the condition face by separate classi cation methods. We used Spark MLlib and the LSTMAE DL approach for ID, on the well-known real-time contemporary dataset ISCX-2012. Fig. 1 presents the anticipated ID framework. It comprises two learning stages. For this HIID, we planned to construct a two-stage ID system, in such a way that Spark MLlib as an anomaly in Stage-1 and LSTMAE as misuse in Stage-2.

Proposed Approach
These two stages of ID framework are ef cient in terms of computational complexity while using full features datasets and offer a higher accuracy with a low probability of FAR.
Stage-2 Misuse detection using state-of-the-art deep learning approaches such as LSTMAE.

Overview of the Proposed IDS
The hybrid framework concentrates on resolving real-time ID problems, using enormous data analysis models (Apache Spark and Apache Hadoop) and AI (ML and DL). Controlling such an issue is a complex task due to space and time restrictions. Big data is enormous and consistently increasing in volume but requires prohibitive amounts of power, specialized resources, and a computational device that can effectively handle the data. The hybrid ID framework overcomes these problems by using the MLlib with LSTMAE. The main structure of the HIIDS existing here, forms the source of the experiment, to use Spark MLlib and deep learning.
Numerous ML techniques were used due to the huge volume of data. We selected the competent Spark to implement a logistic regression (LR) and extreme gradient boosting (XGB) classi ers. Initially, preprocessing data was delivered through these machine learning classi ers to produce regression models that present the opportunity of all data. Generally, this is the binary learning phase. In this hybrid ID approach, the NIDS using both anomaly and misuse techniques. The proposed hybrid ID architecture contains a data preprocessing module, a Spark MLlib classi cation component integrating the anomaly detection module (Stage-1) of the proposed hybrid IDS with misuse detection, and DL classi cation (Stage-2), followed by the alarm module.
Stage-1 utilized Spark MLlib to perceive anomalies that may be intrusions, and Stage-2 utilized the LSTMAE DL model, which further classi es attacks in the event they occur. The details of the proposed Spark MLlib and LSTMAE model are shown in Fig. 1. The architecture of the hybrid IDS is as shown in Fig. 1; initially network traf c was arranged and preprocessed. During preprocessing, all necessary conversions were made for both Stage-1 Spark MLlib and Stage-2 LSTMAE-based modules of HIIDS; both stages had their supported data formats. For our hybrid ID experiment, we used 1,512,000 network traf c packets attained from ISCX-2012 datasets to demonstrate the effectiveness of the proposed HIIDS.

Datasets
Choosing a suitable ID dataset plays a signi cant role in testing the ID system; therefore, the simulation of the proposed HIID approach was carefully deliberated.

Explanation of the ID Dataset
There are various standard datasets, and some of them comprise in exible, outdated, and irreproducible attacks. To overcome these shortcomings and create further up-to-date traf c patterns the ISCX-2012 data was created by the Canadian institution of cybersecurity [52]. It contains various types of ID data to assess anomaly ID approaches. The ISCX-2012 data shows real network activities and includes numerous attack scenarios. Additionally, it is shared as a complete network capture with completely internal network traf c to assess payloads for network packet analysis. The ISCX 2012 ID data comprises both malicious and normal traf c actions for seven consecutive days. The data was created by pro les containing abstract representations of network traces' particular behaviors and activities.
Communication between the destination and source host over HTTP can be represented by the sending and receiving of packets, endpoint attributes, and other similar features. This illustration produces a unique pro le. It produces realistic network traf c for POP3, SSH, FTP, HTTP, SMTP, and IMAP protocols.
The ISCX-2012 contains dual distinct pro les to make network traf c activities and states. The multi-stage or abnormal states of abnormal attacks are identi ed by an α pro le, while feature characterization and mathematical dissemination of the method are done with the β pro le. For example, the β pro le can comprise network traf c packet size distributions in the explicit patterns and time distribution request of the protocol, whereas the α pro le is created based on the sophisticated preceding attacks of a distinct day. The α pro le consists of four kinds of attack scenarios.
(1) Internal in ltration of the network traf c A vulnerable application program, such as Adobe Acrobat Reader, generally takes the advantage of the internal in ltration of a network. A backdoor can be performed on the victim's machine after successful penetration and will execute several malicious attacks on the victim's network. To detect these kinds of malicious threats, mostly applied Nmap and port scan.
(2) HTTP DOS attacks The attacker causes a network resource to be unavailable for a particular time. This is typically done by overwhelming a network resource with super uous requests to overwork the network and impede the ful llment of some or all legitimate requests. To collect these kinds of DoS attacks, mostly utilized the Slow HTTP test, Hulk, Slow loris, and Goldeneye.
(3) DDOS using IRC botnet These types of attacks generally occur when various networks ood the bandwidth or various resources of a particular victim. Therefore, a DDoS attack is often the result of various infected networks (for example, a botnet), which ood the target network by creating massive network traf c. These types of attacks have utilized LOIC for UDP, TCP, and HTTP.
(4) Brute force SSH This is the most common type of attack that can be used not only to crack passwords but also to determine secret content and pages of several web applications. These types of attacks have been launched via FTP and SSH Patator tools.
The full ISCX-2012 dataset is shown in Tab. 2. It can be observed that each attack state was realistic for a single day, while two daysn consisted of normal traf c. The variety of normal behavior and the complication of malicious attack states in the network have been previously described [52].

Data Preparation and Feature Engineering
The dump network traf c was initially preprocessed and prepared, as shown in Fig. 1. The ISCX-2012 dataset was evaluated, and after preprocessing, it was composed of seven consecutive days with the systematic and practical circumstances re ecting network attacks. The data were labeled for malicious and benign streams for a full of 68,792, and 2,381,532 records in the own class. The abnormal attacks were detected in the initial traf c data and were divided into two classes: benign/normal and abnormal/malicious. Furthermore, several multi-stage malicious intrusion scenarios were executed to generate various attack traces (e.g., HTTP, DoS, brute force SSH, in ltration from the interior, DDoS via an IRC botnet). The detailed descriptions of training and testing data distributions are presented in Tabs. 3 and 4. The core idea of this research was to evaluate the reliability of the hybrid system against anomalies and the unknown, via the misuse approach. Tab. 4 presents the testing and training network traf c data for misuse attack detection using a state-of-the-art DL approach, such as an LSTMAE.

Implementation Details
The dominance of the HIIDS is evaluated through experiments applying the ISCX-2012 ID datasets via normal and attack classi cations: false positive, false negative, true positive, attack detection precision, and error rate. To show the ef cacy of our suggested ID system, we executed the rst stage in Scala by Spark MLlib for anomaly detection; the second stage was executed for misuse detection; the DL approach was executed in Java with Deeplearning4j. The simulation was done on a 64-bit cluster computer with 32 cores, 32 GB RAM, and Ubuntu version 14.04 OS. The software stack contained Java (JDK) 1.8, Spark v2.3.0, Deeplearning4j 1.0.0. alpha, and Scala 2.11.8. The deep learning was trained on an RTX 2080 Ti GPU with cuDNN, and CUDA facilitated the pipeline speed.
To measure the HIIDS performance, we rst split the dataset into train and test datasets. To form an ef cient HIID framework, we utilized the training data and analyzed our hybrid approach with testing data. The block diagram of our anticipated HIID is presented in Fig. 1. The ISCX-2012 with complete, original features are utilized to demonstrate the dominance of our proposed hybrid approach. The network traf c mixed with malicious and normal pass through spark MLlib Stage-1 which categorized data into malicious and normal classes. Stage-2 LSTMAE was modeled with malicious traf c; malicious traf c was further categorized into 4 analogous attacks. The hybrid approach overcomes the computational complexity while applying comprehensive features to the ISCX-2012 dataset with higher ID accuracy and low FAR. 80% of the data with 10fold cross-validation was utilized for training purposes, and the model was evaluated with a 20% held-out dataset.

Stage-1: The Anomaly-Based Detection Module
Apache Spark is a competent big data processing engine for detecting cybersecurity attacks. Spark MLlib is the most ef cient big data analytics library currently available, executing over 55 ML algorithms [53,54]. Spark MLlib is most suitable for ML tasks and is 10 times faster than Hadoop-based big data processing tools for iterative tasks. MLlib of spark evolution was initiated in 2012 as a portion of an ML-based project, and in 2013 it became an effective open-source library for ML tasks. Spark MLlib contains several ML algorithms for instance classi cation, clustering algorithms, and regression and dimensionality reductions that are crucial to the development of classic ML real-time applications; its mechanisms have been established by several scholars to progress high dimensional data analytics worldwide.
MLlib-based anomaly attack detection at Stage-1 was rst modeled based on an established training set, which contains both normal and malicious traf c. The test data that contain unknown, regular, and malicious traf c are used to validate the anomaly module of IDS. The attack observed on original traf c data were divided into two classes: Abnormal (malicious) and normal. Abnormal network traf c behavior is known as anomaly traf c. Detection of this kind of abnormal network traf c was passed through the Stage-2 that LSTMAE, where the misuse attack detection technique did further attack detection and classi cation.

Stage-2: A Misuse Detection Module
LSTMAE was used in this stage to de ne the misuse of network traf c and goals of further classifying the anomalous traf c according to speci c policies. An overview of the misuse detection module using an LSTMAE is given in Fig. 2. LSTM is an upgraded version of the RNN, which was introduced in [55,56] to ef ciently address vanishing and exploding gradient issues. All hidden layers of RNN are substituted with memory blocks that comprise a memory cell intended to reserve information, with three important gates that play dynamic roles in LSTM (Input, Output, forget gate) [57]. The most powerful feature of LSTM lies in its capability to capture long dependencies and learn competently from variable amount sequences.

Figure 2: The micro-overview of LSTMAE
Research has shown that LSTM demonstrates high con dence and effectiveness for resolving issues of video classi cation [58], sentiment analysis [59], emotion recognition [60], and abnormal activities [61].
In this module, LSTMAE is used as a misused attack detection technique. LSTMAE misuse attack detection techniques aim to further categorize the abnormal data from Stage-1 among equivalent classi cation: DOS, Scan, HTTP, and R2L. While misuse ID uses the LSTMAE, the technique was initially trained in the abnormal traf c to create a model that provides the baseline pro le for abnormal traf c. A test set is an input to the training model that tests whether the training model performance is malicious (abnormal) or normal. An alarm goes off when a match is found. More internal information can be effectively obtained with LSTMAE, compared with other hand-crafted techniques.

Experimental Evaluations
A detailed description of the experimental results will be discussed in this section. Since the dominance of the proposed HIID is sensibly analyzed, this can only be realized throughout experiments applying the ISCX 2012 ID datasets via normal and attack classi cation, false positive, false negative, true positive, attack detection accuracy, and error rate.

Performance Metrics
The elements of the confusion matrix that assist in representing the expected and predicted classi cation are given in Tab. 5. The outcome of classifying is predicated among twoclass issues such that correctly and incorrectly. Four essential states must be computed in the confusion matrix.
• True Positive (TP). It presents that model is accurate as normal and predicts positive and it is represented by x.
• False-negative (FN). It represented the wrong prediction and denoted by y. It identi es instances that are malicious in certainty, as normal and the model inaccurately predicts negative.
• False-positive (FP). It presents a model that mistakenly predicts positive and, the number of detected attacks is normal. It is represented by z.
• True negative (TN). It is represented by t and speci es instances that are correctly observed as an attack predicts negative. From the above-mentioned conditions of the confusion matrix, we can compute the performance of the system as follows. The two most essential and general parameters for the evolution of the ID system are TPR or DR and FAR. The percentage of intrusion instances recognized by the ID model is known as DR, while the amount of misclassi ed normal instances is known as FAR.
We claim that the HIIDS is superior to conventional IDS, as it increases DR and decreases FAR.

Evaluation of the Hybrid IDS
Tab. 6 presents the overall performance of several classi ers. The results of the random search are described in this section. As presented in the table, the classical LR model gave an F1score accuracy of ∼83%, whereas tree-based ML classi ers managed to considerably increase the accuracy to 88%.
However, the most signi cant improvement that we observed was with state-of-the-art DL approaches such as LSTMAE, which correctly identi ed misuse for up to 97.0% of cases. This improvement was due to the temporal feature's extraction with LSTM and the extraction of more important internal information by the AE.

Overall Analysis
Tab. 7 summarizes the results of the current approach for the ISCX-2012 data. These datasets were produced later than the KDD and DARPA data, so only a few corresponding tentative results exist. Therefore, using the existing simulation results, the best outcomes for each stage are de ned by FAR and accuracy. It is evident that the proposed ID system performs well, both in terms of accuracy and FAR related to state-of-the-art techniques. This is owing to the Spark MLlib and LSTMAE approach. It is essential to observe that the comparisons are for just reference, as several researchers have utilized diverse volumes of data distributions, sampling techniques, and preprocessing methods. Therefore, a simple evaluation for metrics, such as testing and training time, is generally not suitable. Although the proposed ID system attained enhanced performance for the considered evaluation metrics, it cannot be fascinated that the proposed approach fully outclassed other methods. It is possible to attain an extraordinary level of network security with the HIID approach, which is vigorous, fast, simple, and highly applicable to real-time scenarios.

Conclusion and Future Work
In this article, the HIIDS was developed using the Spark MLlib and LSTMAE deep learning approach, which is an ef cient cybersecurity method. We trained the HIIDS using an ISCX-2012 dataset. We implemented the HIIDS using several robust classi cation algorithms, such as LR and XGB, for anomaly detection at Stage1 and the LSTMAE deep learning technique for misuse detection at Stage 2. The proposed HIIDS, based on DL classi cation, combines the bene ts of both Signature-based (SB) and Anomaly-based (AB) approaches, reducing computational complexity and increasing ID accuracy and DR.
Both conventional ML and LSTMAE deep learning models were evaluated using well-known classi cation metrics, such as F1 score, Precision, Recall, DR, and accuracy of classi cation.
We believe that our approach can be expanded to other domains in the future; misuses and anomalies can be recognized in several real-time image data, emphasis on exploring deep learning as a features extraction mechanism to learn knowledgeable data illustrations in case of other anomaly detection issues in modern real-time datasets.

Con icts of Interest:
The authors declare that they have no con icts of interest to report regarding the present study.