IoT-based intrusion detection system using convolution neural networks

In the Information and Communication Technology age, connected objects generate massive amounts of data traffic, which enables data analysis to uncover previously hidden trends and detect unusual network-load. We identify five core design principles to consider when designing a deep learning-empowered intrusion detection system (IDS). We proposed the Temporal Convolution Neural Network (TCNN), an intelligent model for IoT-IDS that aggregates convolution neural network (CNN) and generic convolution, based on these concepts. To handle unbalanced datasets, TCNN is accumulated with synthetic minority oversampling technique with nominal continuity. It is also used in conjunction with effective feature engineering techniques like attribute transformation and reduction. The presented model is compared to two traditional machine learning algorithms, random forest (RF) and logistic regression (LR), as well as LSTM and CNN deep learning techniques, using the Bot-IoT data repository. The outcomes of the experiments depicts that TCNN maintains a strong balance of efficacy and performance. It is better as compared to other deep learning IDSs, with a multi-class traffic detection accuracy of 99.9986 percent and a training period that is very close to CNN.


INTRODUCTION
The Internet of Things (IoT) paradigm is comprised of intelligent sensors linked by the Internet, including home appliances, phones, cars, and computers (Bhatia & Sood, 2016;Bhatia & Sood, 2017a). This form of network is becoming an increasingly important part of our daily lives, with implementations as diverse as smart grids, cities, homes, agriculture, and smart transportation (Bhatia & Sood, 2017b;. Specifically, an IoT ecosystem is made up of web-enabled smart devices that gather, send, and act on data from their surroundings using embedded systems such as CPUs, sensors, and communication hardware. By connecting to an IoT gateway or other edge device, IoT devices may exchange sensor data that is either routed to the cloud for analysis or examined locally, Bhatia, Sood & Kaur (2019) and . These gadgets may occasionally interact with one another and act on the information they receive (Bhatia, Sood & Manocha, 2020). Although individuals may engage with the devices to set them up, give them instructions, or retrieve data, the gadgets conduct the majority of the work without human participation .

Research area
While the IoT will make life easier for humans, the aspect of data protection is of great concern (Thakkar & Lohiya, 2021; Mahboub, Ahmed & Saeed, 2021). IoT platform has been a popular target for cybercriminals as it faces significant risks. According to a survey from Palo Alto Networks' Unit 42, 98 percent of intelligent system data is not encrypted, and 42 percent are having IoT sensor flaws (Gill, Saxena & Sharma, 2020;Einy, Oz & Navaei, 2021;Heal & Kunreuther, 2005). Adversaries may use the compromised computers to enter an IoT botnet and carry out effective and massive attacks. As an illustration (Antonakakis et al., 2017), the first IoT botnet was capable of exploiting compromised CCTV camcorders with generic credentials to unleash a DDoS attack over servers in October 2016. As a result of this attack, Internet service in certain areas of the United States was disrupted. Mozi, an IoT botnet discovered in April 2020, was found to be capable of launching multiple DDoS attacks. IDS have been commonly utilized for the detection of malicious network traffic to cope with the threat, particularly when security fails IoT device-end point. When threats on IoT devices become complex and stealthy, IDS must adapt to keep up with the changing security threats.

Research gaps
IoT networks produce high-dimensional, multimodal, and temporal data due to heterogeneous structures. It is possible to uncover previously unseen trends, expose latent similarities, and obtain new insights by using big data analytics on such data. Artificial intelligence is becoming more prevalent in big data processing. Deep learning (DL) methods, in particular, have shown their ability to work with heterogeneous data. It can also analyze dynamic and large-scale data to gain information, spot data dependencies, and learn from past attack patterns to distinguish current and unknown attack patterns. Heavyweight functions like big data processing and constructing learning models must be offloaded to fog and cloud servers because IoT computers are space-restricted and have minimal storage and computing capacities. As a result, computation offloading will help minimize task execution latency and save resources in battery-powered and handheld IoT computers, but it also raises security issues. Many deep learning methods for IDS have been suggested, with some of them focusing directly on IoT . However, there is still the large number of research gaps that have been identified from the previous solutions. Some of these include 1. Limited work has been done for incorporating deep learning techniques for IDS focusing on temporal aspects of the data. 2. Minimal work has been performed for heterogeneous data elements for the detection of attacks. 3. The presented approach does not incorporate the energy efficiency aspect of the IDS. 4. Even though researchers have explored predictive aspects of IDS, however, the limited focus is led on the temporal variability of the IDS.

Fundamentals
The term ''deep learning'' refers to the aggregation of several layers. The input layer is the first layer, which is processed to generate output via the final layer. In between, hidden layers are also added. Each layer is made up of a group of units known as neurons. The input layer's size is determined by the input data's dimension, while the output layer is made up of N nodes, which correspond to the N categories in a categorization task. Figure 1 shows a convolutional neural network with multiple layers. Below are the 3 major categories of layers: 1. Convolutional layer: It adds a series of filters to the input results, also known as convolutional kernels. Each filter creates a function map by sliding over the input data. The final production of the convolution layer is obtained by adding all of the generated attribute maps together. 2. Pooling layer: It performs subsampling over the feature maps to reduce feature reduction. The most popular pooling strategies are average pooling and maximum pooling. 3. Fully connected layer: This layer accumulates the outputs of the connected layers uni-formally for generating output.

State-of-the-art contributions
The current paper suggests five architecture principles to consider when designing an accurate and efficient deep learning IDS for IoT in this article and presents TCNN, a convolutions-based version of CNN. The beneficial aspect of CNN includes minimal dependence on pre-processing, therefore decreasing the need for human effort for developing its functionalities. Moreover, it is easy to understand and fast to implement. Furthermore, it has the highest accuracy among all algorithms. Data balancing and effective function engineering were integrated with TCNN. The following are the paper's key contributions in detail: 1. For the creation of IoT-IDS, the current paper defines 5 core design concepts, including handling over-fitting, data-set balance, function engineering, algorithm optimization, and testing on IoT dataset. 2. The current paper builds and test the Temporal Convolution Neural Network (TCNN), a DL platform to detect IoT intrusion. Convolution neural network (CNN) and causal convolution are combined in TCNN. 3. The current paper combines TCNN with synthetic minority oversampling techniquenominal continuous to address the problem of an imbalanced data-set (SMOTE-NC). 4. The current paper makes use of effective function engineering, which includes the following: (1) Feature space reduction: It aids in memory utilization reduction.
(2) Function transformation: This converts warped data to a Gaussian-like distribution by applying log transformation and regular scaler to continuous numerical functions. Label-encoding, which replaces a categorical column with a special integer value, is often used on categorical features.

Paper structure
The article structure is laid out as follows. The core architecture concepts for DL-IDS for IoTs are presented in Section 3. The next section provides an outline of similar tasks. The architecture and implementation of TCNN are described in Section 4 and 5, respectively. Section 6 brings the paper to close-by outlining prospective study directions.

LITERATURE SURVEY
Many areas in cybersecurity have used DL, involving virus identification and IDS. The current paper provide an overview of IoT-IDS networks in this section. A hybrid of recurrent neural network (RNN) and convolutional neural network (CNN), was introduced by Lopez- Martin et al. (2017) (CNN). Authors introduced layers like max-pooling, batch normalization, and dropout to cope with overfitting. To boost the model's usefulness, authors just looked at a subset of functions. However, temproal aspects of the data is not considered for effectiveness. On the KDD 99Cup dataset, Putchala (2017) used the Gated Recurrent Unit (GRU) algorithm. As a function filtering tool, Author used the random forest classifier. Minimizing the loss function yields the best possible output outcomes. Bidirectional Long Short-Term Memory Recurrent Neural Network (BLSTM RNN) was proposed by Roy & Cheung (2018). Authors translated categorical functions to numeric values using function normalization. The limiting aspect of the presented model is that energy effectiveness of IDS is not discussed. On the NSL-KDD dataset, Diro & Chilamkurti (2018) used a DNN model. Using stochastic gradient descent, the DNN's loss function is minimized. Certain fog devices are in charge of DL-algorithm preparation. For updating, the local parameters are transmitted to fog-node. It encourages the right criteria to be shared and helps to prevent local overfitting. However, no aspect of prediction is discussed in the proposed model.
On the CICIDS2017 dataset, Roopak, Tian & Chambers (2019) used 4 separate categorizations of MLP, LSTM, 1 Dimensional CNN, and CNN and LSTM. The authors have duplicated documents to balance the dataset. However, no specification is presented over the balancing approach. This problem is solved by using phases like max-pooling and dropout in the model. However, the computational complexity of the presented model is very high.
The Deep Belief Network is utilized to build a deep neural network in Thamilarasu & Chawla (2019), which is then extended to an IoT simulation dataset. Each layer of the DNN model is given a cost function, which is then optimized. The limiting aspect of the presented model is that it doesnot addressees the issue of data balancing.
On the NSL-KDD data, Otoum, Liu & Nayak (2019) suggested the Stacked-Deep Polynomial Network. Authors used the Spider Monkey Optimization (SMO) algorithm (Bansal et al., 2014) to pick the best functions. The L2 regularization strategy is combined with the loss function to prevent over-fitting. However, data balancing is not discussed in the presented model.
On 2 generic data instances and the BoTIoT dataset, Ferrag & Maglaras (2019) utilized RNN with a reduced backpropagation technique. Before feeding the features to RNN-BPTT, normalization was performed.
Roopak, Tian & Chambers (2020) suggested and tested a sequential architecture that combined CNN and LSTM on the CISIDS2017 dataset. Authors used a multi-objective optimization algorithm called a non-dominated sorting genetic algorithm (NSGA) to pick the best functions. Authors used max-pooling among CNN to LSTM modules for preventing over-fitting. However, computational complexity is very high for the presented model. The BoT-IoT dataset was created by Koroniotis et al. (2019), who incorporated it to validate the prediction models. Authors used attribute normalization to scale the data within the range [0, 1] and the calculated correlation between data attributes for selection. Aldhaheri et al. (2020) suggested a DeepDCA system by incorporating Self Normalizing Neural Network and Dendritic Cell Algorithm. To choose the collection of features to feed to the BoT-IoT dataset, they used Information Gain as an attribute identification mechanism. About the fact that the authors reported findings with a balanced dataset, there is little detail about the balancing process. The author incorporated a loss function for weight modification of DNN layers for model optimization.
In the Bot-IoT dataset, Soe, Santosa & Hartanto (2019) suggested an Artificial Neural Network (ANN) for detecting DDoS vulnerabilities. Authors used the SMOTE technique to align the data set. Authors also applied function normalization to the input data before feeding it to the ANN.
On the BoT-IoT dataset, Ge et al. (2019) used feed-forward neural networks. The data instances are balanced using the algorithmic technique, rather than by oversampling, by assigning class weights to the training results. Authors modified weights using the Adam optimizer and a sparse-cross entropy loss module to refine the formula. Authors used various regularization methods, including L1, and L2 with dropout, to cope with overfitting. Authors utilized one-hot encoding to encode categorical attributes as numerical.  Antonakakis et al. (2017) (2018) suggested detecting malicious behaviors in industrial IoT using a collaborative deep autoencoder and DNN. Evaluation of lossfunction, which allows weight modification and minimization of the dissimilarity between the real and expected results, yields the desired parameters. Table 1 describes and contrasts the IDS systems in terms of the 5 techniques as described earlier. Unlike previous works, which use algorithmic-level data balancing, the current model uses the SMOTE-NC algorithm, which can accommodate both continuous and categorical functions, on the Bot-IoT dataset. To achieve successful IDS, the current paper use over-fitting and optimization strategies. For memory utilization and training time, feature space reduction is utilized along with feature transformation to achieve effective IDS.

BACKGROUND STUDIES
DL-based IoT-IDS solutions have the goal of generating accurate models and efficient (Bhatia, Sood & Manocha, 2020). However, each model makes design decisions that can restrict its ability to achieve the goal. Some IoT DL-IDSs, for example, ignore the over-fitting issue, implements over-unbalanced data, or fail to use function engineering, all of which harm memory consumption, accuracy, and computing time. Furthermore, some IDSs do not refine the prediction model, and others are assessed using obsolete data that does not represent IoT traffic. As a result of the above findings, deep learning-specific solutions for IoT should adhere to the following aspects: 1. Over-fitting : It occurs when a methodology acquires a strong match on trained data but fails to generalize well to unknown data. Overfitting in deep learning can be avoided using the following techniques: (1) Using regularisation, which increases the expense of the model's loss function for high weights.
(2) Using dropout layers, which set functions to 0 and delete them at random. 2. Data disparity: Data imbalance corresponds to a dataset's disproportionate allocation between groups. When a model is conditioned on an unbalanced sample, it becomes skewed, favoring the dominant groups and failing to detect the minority. The model's effectiveness would be increased by balancing the dataset. 3. Feature engineering : It enables a reduction in the deep learning workflow's memory and time costs. It also allows for optimizing the model's accuracy by removing unnecessary features and adding feature transformation to increase the learning model's accuracy. 4. Model optimization: Model optimization aims to minimize the size of a reduction function, which calculates the disparity between the expected and real outputs. This is accomplished by iteratively changing the model's weights. The model's effectiveness can be increased by using an optimal technique.

PROPOSED MODEL
The proposed TCNN combines CNN architecture with causal padding to produce causal convolutions. Figure 1 displays time-series input data being subjected to 1-Dimensional general convolution having 3 kernel-size. Conspicuously, it is deduced that output is convoluted with elements of the previous layer at a specific time. As a result, the data's temporal order is preserved, without overlapping historical data. The layers have 0 paddings with 1 kernel size to make the same length as the input to TCNN.

Architecture in general
The proposed TCNN framework's overall architecture is depicted in Fig. 2, and its implementation is outlined in Section 5. The following stages make up the proposed architecture: 1. Dataset balancing : An unbalanced dataset will lead to inaccurate performance. To address this problem, the current paper employs the SMOTE-NC technique to generate synthetic samples of minimal groups and can manage mixed data with categorical attributes. 2. Initial Attribute reduction: The current paper enhances the data instances, reducing the attribute vectors by eliminating redundant attributes and transferring memoryintensive attributes to smaller data instances. 3. Dataset splitting: To avoid overfitting, the dataset is divided into 3 subsets: preparation, evaluation, and checking. 4. Second Attribute Transformation: The current paper use attribute transformation over the trained data in this process. On the continuous numerical features, a log transformation with scaling is applied. Also, categorical attributes are encoded to replace every class column with a numerical estimate. The validation and checking subsets are then subjected to this transformation process. 5. Trained Optimization: TCNN model is constructed in this step, and its attributes are optimized using the Adam optimizer to validate data. It includes (i) Get gradients concerning stochastic objective at time-step t. (ii) Update biased first-moment estimate. (iii) Update biased second raw moment estimate. (iv) Compute bias-corrected firstmoment estimate. (v) Compute bias-corrected second raw moment estimate.(vi) Update parameters. 6. Categorization: On the testing subset, the created TCNN model is used to assign every tested instance to either standard or a special category for the attack.

TCNN model optimized training
2 1-Dimensional generic convolution layers, 2 interconnected layers, and a softmax layer for application of softmax functions make up the proposed TCNN's training and optimization process (Fig. 3). Global maximum pooling, batch normalization, and dropout are utilized to avoid over-fitting. To change weights and refine the cross-entropy loss function, the current paper uses the Adam optimizer. It incorporates the benefits of 2 stochastic gradient descent algorithms namely Root Mean Square Propagation (RMSP) and Adaptive Gradient Algorithm. The proposed TCNN architecture's training and optimization process are comprised of the following: 1. Initial 1-Dimensional generic convolution layer: with 64 filters and a filter scale of 3, it comprise the input vectors. 2. The 2 nd 1-Dimensional generic convolution layer employs 130 filters with a 4 filter dimension. Until pooling, the model learns more advanced functionality in the second layer. 3. 1-Dimensional global maximum pooling layer: This layer substitutes data that is filtered with its maximum value. By taking the highest value, it avoids over-fitting of the studied functions. 4. Batch normalization layer: Before moving on to the next layer, it normalizes the data from the previous layer. 5. Fully connected dense network: This layer has 130 intermediate nodes and a 29% dropout value. 6. Fully connected inter-connection with soft-max activation: It generates 5 units for multiclass grouping, 1 for each of the 5 traffic groups.
The topmost fully connected layers of the model estimate the outcome probability, as p(y|u 1:n ) = p(N y (h)), where N y is the irregular event. The gate is modified to reflect the admission of IDS events as follows: i t = 1 m t σ (W i X t + U i h t −1 + b i ) where m t represents the type of the event. In this proposed study, the value of m t will always be equal to 1, and σ represents the sigmoid function of a vector. Moreover, the output can be generated Based on the aforementioned steps, the computational complexity can be determined based on the number of layers (w), epochs(x), nodes at each layer(y) and input features(z). Specifically, the time complexity can be determined as O(xz * ((w-1)y 2 )). In general, it is O(n 2 * (n-1)n 3 ).

EXPERIMENTAL IMPLEMENTATION
Intel Quad-core i7-8449U processor has been utilized with 16 GB RAM and a 256 GB hard drive to introduce the detection learning models. To build deep learning networks, the current paper uses the Python 3.6 programming language and TensorFlow. Furthermore, Scikit-learn, Keras API, Panda, and Inmblearn are among the libraries used.

Data instances
Bot-IoT instances is an IoT data repository launched in 2018 by the Cyber Center of New South Wales University. Legitimate and malicious traffic is created by the virtualization of intelligent systems such as smart fridges, remotely triggered garage doors, motioncontrolled lights, and intelligent thermostats. The dataset contains over 73,000,000 instances, each of which is represented by 42 features. Each record is either classified as regular or as an assault. The data instances are also categorized into 4 classes including DoS, reconnaissance, DDoS, and intelligence stealing, with every class, is subdivided as shown in Tables 2 and 3.

Dataset balancing
Dataset balancing is the process of balancing several datasets. There are 9,429 standard samples and 73,000,000 vulnerable instances in the repository. There are 489 regular samples and 3,459,125 vulnerable instances. More than 98 percent of the instances fall into the DDoS and DoS groups. As a result, the learning algorithm will correctly estimate the majority classes but miss the minority classes, indicating that the model is biased. To address this problem, various re-sampling techniques are presented, including random oversampling, which replicates the same samples of minority classes at random, and oversampling by creating synthetic samples of minority classes using techniques including synthetic minority oversampling technique, and synthetic minority oversampling technique. The current paper uses the SMOTE-NC methodology in this study because it can handle mixed data instances with continuous attributes. In the training subset, the minority groups, such as common and stealing, are increased to 99,98,989 samples.

Attribute minimization
The vital goal is to create an effective IDS for use in the IoT environment. As a result, it is critical to increasing the detection models' performance by attribute reduction, as well  as memory consumption optimization with minimized computing complexity. A total of 2.9 GB of memory is used while all of the functions are used. Feature space reduction reduces computational overhead while still speeding up preparation and identification.
The following steps are added to the dataset, and the memory usage is successfully reduced to 668 MB, a 77 percent reduction. 1. Data type conversion to categorical class: The number of attributes and corresponding data formats are encoded for each form. The memory-intensive features flags, proto, saddr, sport, daddr, dport, state, category, and subcategory are encoded as elements.
Object attributes are translated to category datatype since it is more powerful. 2. Int64 to Int32 data type conversion: The dataset's integer properties, as depicted in Tables 4 and 5, are accumulated in the Int64 (8-byte) data types by default. These features do not surpass the capability of the Int32 (4-byte) sort, according to current investigation. As a result, all the values of the Int64 class are converted into the Int32 type, which consumes 50% storage as compared to the Int64 type. 3. Removing superfluous features: The current paper removed certain superfluous features from the dataset, such as (1) ''pkSeqID'': it serves the same purpose as the index that is created automatically.

Transformation of attribute
The current paper define the transformation of mathematical and categorical elements. The transition is only extended to the training subset after the data instances have been broken for train, validation, and testing subsets. The validation and testing subsets are then subjected to the same transformation. Transformation of numerical features: The data includes 32 mathematical attributes, both specific and seamless estimates. There are 2 distinct functions, namely ''spkts'' and ''dpkts'', each of which has a finite number of values. As a result, no feature engineering is needed. The graphical structure of the 4 attributes is shown in Fig. 4. The continuous attributes are not necessarily distributed, which typically affects the output of linear models. To do this, the log-based modification and regular scaler are added to the seamless attributes to achieve a Gaussian distribution: 1. Log transformation: The current feature value y = log 10 y, where y is the attributes original value. 2. Standard scaler: This function calculates the average and standard deviation of a range.
The characteristics are then transformed into a Gaussian distribution.

Splitting a dataset
The most popular methods for separating data sets are traditional splitting and crossvalidation. It is primarily utilized in machine learning to avoid over-fitting. Cross-validation raises the training cost when a massive data repository is utilized for deep learning. The data-set is divided into 3 subsets in this study, using the traditional 3-way split: preparation, evaluation, and research. Also, where over-fitting occurs, regularization is used to deal with it. A stratified split is often utilized to guarantee that each split contains a part of each category.

Deep learning techniques
TensorFlow with Keras API is used to build all deep learning models. Preprocessing, templates, layers, and optimizers are among the Keras programs included. In both versions, the same activation functions are used. The ReLu activation function is used to analyze cognitive interactions between given input and output. Softmax is utilized as the output activation module, and a generalized logistic regression-based activation function is used as the input layer activation function. In softmax, the frequency of production units is equal to the number of attack groups plus the standard class. Table 6 depicts the deep learning architectures of TCNN as well as hyperparameters. Overfitting is addressed using methods such as global limit pooling, batch normalization, and dropout. The Adam optimizer is chosen to change the weights because it outperforms other optimizers such as SGD and AdaGrad.

Assessment
The current paper equates TCNN's output to that of 5 machine/deep learning algorithms of Random Forest (RF), Long Short Term Memory (LSTM), Convolution Neural Network (CNN), Decision Tree (DT), and Logistic Regression (LR).

Metrics for performance estimation
The following metrics are used to test the multiclass detection models: 1. Effective Measure: In this, assessment is done on how well the identification model distinguishes between various types of network traffic. The current paper use the following metrics to do this:  Table 7 shows the effects of evaluating RF and LR on initial and re-balanced data-sets. All of the experiments had almost identical training and testing ratings, indicating that there is no over-fitting. In terms of LR, SMOTE-NC oversampling improves accuracy, recall, and F1-score, implying that it is better at identifying minority groups. Oversampling, on the other hand, has little bearing on the RF's efficacy.

Deep learning model evaluation
To achieve the best efficiency, The current paper runs a sequence of experiments with various hyper-parameter measures (as an instance batch size, learning rate, number of units, and number of layers in each layer). The optimizer learning rates are investigated. When the learning rate is 0.001, the best results are obtained. Also measured are various epoch counts (10,15,25,55, and 100) as well as different batch sizes of 105, 512, 1024, and 2048. The current paper can see that elevation of the frequency of epochs causes the learning mechanism to slow down. Similarly, reducing the batch size has little effect on results. The batch size and number of epochs for TCNN are set to 25 and 2048. Table 8 depicts the TCNN's multi-class classification accuracy and log failure during the preparation and testing processes. TCNN achieves enhanced efficiency in the first epochs, implying that 15 epochs are sufficient. Furthermore, there is no evidence of overfitting in the training and validation findings. Table 7 depicts the LSTM and CNN log failure data. In terms of log loss, The current paper can see that TCNN is better than LSTM and CNN. As some accuracy outcomes surpass 99.99 percent, the current paper can see that DL models outperform RF and LR. The precision findings are very similar, but in terms of efficacy measures, TCNN marginally outperforms LSTM and CNN. It can be depicted that DL models do well even though data balancing is not used. It is observed that a small and very minor reduction of efficacy for LSTM and TCNN when SMOTE-NC oversampling is used. The efficacy of CNN, on the other hand, improves marginally by using SMOTE-NC oversampling. CNN, therefore, takes less time to practice than TCNN and LSTM. Since it is the nearest rival to CNN in terms of preparation time and accuracy, TCNN has a fair trade-off between efficacy and performance. Moreover, the energy conservation analysis is depicted in Fig. 4 for the proposed model.

Evaluation of related work
Table compares the success of the current model to that of related approaches validated on the Bot-IoT data. In terms of precision, accuracy, F1-score, recall, training time, and categorization task, the distinction is made.