ABCNN‑IDS: Attention‑Based Convolutional Neural Network for Intrusion Detection in IoT Networks

This paper proposes an attention-based convolutional neural network (ABCNN) for intrusion detection in the Internet of Things (IoT). The proposed ABCNN employs an attention mechanism that aids the learning process for low-instance classes, while the Convolutional Neural Network (CNN) in the ABCNN framework converges toward the most important parameters and effectively detects malicious activities. Furthermore, the mutual information technique is employed during the pre-processing stage to select the most significant features from the datasets, thereby improving the effectiveness of the ABCNN model. To assess the effectiveness of the ABCNN approach, we utilized the Edge-IIoTset, IoTID20, ToN_IoT, and CIC-IDS2017 datasets. The performance of the proposed architecture was assessed using precision, recall, F1-score, and accuracy, and was compared with multiple ML and DL methods. The proposed model achieved an average accuracy of 99.81% across the utilized datasets, together with 98.02% precision, 98.18% recall, and 98.08% F1-score, outperforming the other ML and DL models.


Introduction
The Internet of Things (IoT) envisions a connected network of various intelligent objects in our surroundings, capable of gathering, processing, and transmitting information [1]. In recent years, the IoT has had a significant impact on various industries, including agriculture, medicine, transportation, automobiles, and water monitoring [2][3][4][5]. Businesses increasingly rely on digital technology, and the number of IoT devices has grown accordingly, from 15.42 billion in 2015 to 35.8 billion in 2021 [6][7][8]. IoT devices often have limited computational power and memory, making it challenging to implement robust security measures [9,10]. As businesses deploy more IoT devices, the risk of vulnerabilities being targeted and exploited increases [11]. As shown in Fig. 1, the IoT is projected to reach 75.44 billion devices by 2025, resulting in a data output of 79 zettabytes [12]. The IoT has been recognized as a crucial factor in digitization for societal transformation [13,14].
Many IoT devices gather, save, and handle sensitive data, while their diverse configuration and openness make them an attractive target for attackers [15][16][17]. Ensuring confidentiality is crucial for the successful implementation of IoT networks. To identify malicious activity, an intrusion detection system (IDS) is necessary to monitor IoT network operations [18][19][20][21]. IoT networks often involve a large number of heterogeneous devices, each with its own communication protocols and data formats [22,23]. Traditional IDS solutions may struggle to handle the diversity and complexity of IoT network traffic, making it difficult to identify abnormal behavior specific to IoT devices [24,25]. Numerous researchers have worked on IDS development, leveraging the power of ML and DL algorithms [26][27][28]. ML and DL methods find extensive applications in diverse domains including agriculture, medicine, and transportation [29][30][31][32]. DL, a subset of ML, is particularly useful for addressing problems involving high-dimensional and intricate data. Moreover, DL enables systematic training of nonlinear models on big datasets.
An imbalanced or inadequate dataset may degrade the performance of current IDSs. For instance, consider a security dataset in which the disparity between the high-instance and low-instance classes is substantial. An intrusion detection model trained on such data tends to focus primarily on the high-instance classes while disregarding, or learning only slowly from, the low-instance classes. As a result, an IoT network relying on this model may fail to detect attacks that were underrepresented in the training data. Furthermore, a significant challenge in IDS design is feature engineering: to enhance the effectiveness of existing systems, it is essential to extract the most significant features. To address these issues, this paper proposes an attention-based convolutional neural network (ABCNN) for intrusion detection in IoT networks. The proposed ABCNN employs an attention mechanism that computes attention values for each input attribute, which aids the learning process for low-instance classes. The Convolutional Neural Network (CNN) employed in the ABCNN framework, in turn, converges toward the most important parameters and effectively detects malicious activities [33]. Furthermore, this study utilizes preprocessing techniques such as feature filtering, normalization, and stratified splitting. The mutual information technique is applied during pre-processing to select the most significant features from the dataset. The proposed architecture was evaluated using the Edge-IIoTset, IoTID20, ToN_IoT, and CIC-IDS2017 datasets. The performance of the proposed method was measured using several metrics, including precision, recall, F1-score, and accuracy.

Fig. 1 Projected growth of IoT devices from 2018 to 2025 [12]

The main contributions of this article are:
- A novel deep learning technique, the attention-based convolutional neural network (ABCNN), is proposed for intrusion detection in IoT networks. The attention layer computes the attention value for each input, and the CNN is utilized to predict the network's behavior on high-attention features.
- The mutual information method is employed to select the most significant features. This method calculates the mutual information between each attribute and the target variable based on entropy.
- A series of experiments was conducted to demonstrate the effectiveness of the proposed approach in comparison with several other ML and DL methods. All preprocessing steps were identical for the proposed model and the models it was compared against.
The rest of this article is organized as follows: Section 2 presents recent research on intrusion detection in IoT. Section 3 covers the mathematical modeling, overall architecture flow, and experimental methodology. Section 4 provides a concise discussion of the experimental results obtained from the proposed model. Finally, Section 5 presents a brief conclusion.

Related Work
The proliferation of IoT technology has led to a significant increase in the connectivity of smart devices to the internet. However, this interconnectedness also opens up opportunities for attackers to exploit IoT networks and carry out malicious activities. In response, numerous researchers have put forth models aimed at identifying and mitigating such malicious activities in IoT networks. Altunay et al. [34] proposed a hybrid DL model that incorporates both CNN and long short-term memory (LSTM) for the detection of intrusions in IoT networks. They evaluated the model using the UNSW-NB15 and X-IIoTID datasets for both binary and multi-class classification. Their results compare the hybrid model with standalone CNN and LSTM models and show that the hybrid CNN and LSTM model outperforms both. Wu et al. [35] adopted the Geometric Graph Alignment (GGA) approach to handle the variations in geometry between different domains, thus improving the transfer of intrusion knowledge. In this method, each intrusion domain was represented as a graph, with the vertices and edges corresponding to intrusion categories and their interrelationships. To assess the performance of the GGA approach, the authors employed five publicly available datasets: NSL-KDD, UNSW-NB15, CIC-IDS2017, UNSW-BoT-IoT, and UNSW-TONIoT. Their proposed model achieved an accuracy of 71.72%, which outperformed the other approaches in their comparative analysis.
Javadpour et al. [36] introduced a multi-agent-based model designed for the detection and prevention of cyberattacks in the Cloud Internet of Things (CIoT) environment. These agents utilize association rules to identify intrusions. The performance of the multi-agent-based model was assessed using the KDD Cup 99 and NSL-KDD datasets, achieving an accuracy of 71.12%. Thakkar et al. [37] presented a bagging method based on deep neural networks (DNN) to detect intrusions in IoT networks. Their primary emphasis was on addressing the challenge of unbalanced datasets. The evaluation of their bagging model involved the NSL-KDD, UNSW-NB15, CIC-IDS2017, and BoT-IoT datasets, and the approach yielded an average accuracy of 98.22% across all the datasets. Alghanam et al. [38] introduced the pigeon-inspired optimization local search (LS-PIO) method for detecting intrusions in IoT networks. Their method was evaluated using four public datasets, namely BoT-IoT, UNSW-NB15, NSL-KDD, and KDDCUP-99, and achieved an average accuracy of 96.58% across all the datasets used in the evaluation.
Saba et al. [39] implemented a CNN-based approach for anomaly-based intrusion detection in IoT networks. Their method was trained and evaluated using two distinct datasets: the network intrusion detection (NID) dataset and the Botnet (BoT-IoT) dataset. The CNN model achieved an average accuracy of 96.18% across the two datasets. Eme et al. [40] proposed a hybrid model called BGH that combines bi-LSTM and gated recurrent units (GRU) to detect eight known IoT network attacks. The model was trained and evaluated using two widely recognized IoT network traffic datasets, CIC-IDS-2018 and BoT-IoT, and achieved an average accuracy of 99.38% across the two datasets.
Sharma et al. [41] adopted a deep neural network (DNN) approach for detecting anomalies in IoT networks. They employed a feature filtering technique to extract the most important features from the dataset. To evaluate the performance of their model, they utilized the UNSW-NB15 dataset, achieving 84% accuracy on the imbalanced data. After balancing the data with generative adversarial networks (GANs), the accuracy improved to 99%. El-Ghamry et al. [42] proposed a CNN-based intrusion detection system specifically designed for agriculture IoT networks. They preprocessed the data, selected relevant features, and transformed the data into colored images. The authors employed a CNN to analyze the images and identify malicious activities within the networks. Using the NSL-KDD dataset, their model achieved 99% accuracy. A short overview of the literature is presented in Table 1. The reviewed studies show that many works have focused mainly on a select few classes because of the highly imbalanced datasets. As a result, when dealing with a greater number of attack classes, these systems often encounter difficulties in achieving precise detection. In contrast, this paper presents a novel method, ABCNN, which improves the effectiveness of current models for both smaller and larger sets of attack classes.

The Proposed Attention-Based CNN
This study proposes an attention-based convolutional neural network (ABCNN) model for detecting malicious attacks. The model consists of an attention layer and convolutional neural network (CNN) layers (Fig. 2). The attention layer calculates the attention value of each input attribute/element, while the CNN layers focus on the importance of each attribute and predict the behavior of the network. The basic architecture of a CNN for intrusion detection usually comprises one or more convolutional layers, pooling layers, and fully connected layers [43,44]. For the proposed model, we utilized one convolutional layer, one max-pooling layer, and three fully connected layers. This decision is based on the results reported in Tables 4, 5, 6, and 7, which show that the selected configuration performs best among all tested configurations on all the datasets.
The proposed model uses the attention layer to calculate a set of attention weights that indicate the importance of each key element for the current query. This is achieved by computing a similarity score between the query and each key element and then applying a softmax function to obtain a set of normalized attention weights. Once the attention weights have been calculated, they are used to weight the values, which are summed to obtain the output of the attention layer. The attention weights $a_{ij}$ for each query $i$ and key $j$ in the sets of queries $Q$, keys $K$, and values $V$ are calculated using Eq. (1).

$$a_{ij} = \mathrm{softmax}\left(\frac{Q_i K_j^{T}}{\sqrt{d_k}}\right) \tag{1}$$
Here, $Q_i$ represents the $i$-th query, $K_j$ represents the $j$-th key, $d_k$ represents the dimensionality of the key vectors, and softmax refers to the softmax function used for computing the attention weights. Next, the attention weights are used to weight the corresponding values, and the resulting weighted values are summed to obtain the output ($X$) of the attention layer. This process is described mathematically by Eq. (2).

$$X_i = \sum_{j} a_{ij} V_j \tag{2}$$
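As an illustration of the attention computation described by Eqs. (1) and (2), the following is a minimal pure-Python sketch of scaled dot-product attention. The function names and the toy `Q`, `K`, and `V` values are hypothetical and chosen only for demonstration; how the proposed model derives queries, keys, and values from the input attributes is not specified here and is left out.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(queries, keys, values):
    """Return (outputs, weights) for lists of vectors given as lists of floats."""
    d_k = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # Similarity of this query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        a = softmax(scores)  # attention weights a_ij over all keys j
        # Weighted sum of the value vectors gives the output X_i.
        x = [sum(a_j * v[c] for a_j, v in zip(a, values)) for c in range(len(values[0]))]
        weights.append(a)
        outputs.append(x)
    return outputs, weights

# Toy example (illustrative values only).
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
X, A = scaled_dot_product_attention(Q, K, V)
```

Each row of `A` sums to one, so every output in `X` is a convex combination of the value vectors; keys that align more strongly with a query receive a larger share of the weight.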
Here, $X_i$ represents the $i$-th output, $V_j$ represents the $j$-th value, and the sum is taken over all keys $j$. The output of the attention layer is the input to the CNN layer. Eqs. (3) and (4) describe the 1D convolutional layer.

$$x_u = b_u + \sum_{i} \mathrm{conv1D}(w_{iu}, s_i) \tag{3}$$

$$y_u = f(x_u) \tag{4}$$

where $x_u$ represents the input in the 1D convolutional layer, and the outputs of the previous layer's neurons are denoted by $s_i$. The kernel from $i$ to $u$ is represented by $w_{iu}$, and the bias value of the neuron in the convolutional layer is denoted by $b_u$. The ReLU activation function, represented by $f(\cdot)$, is used in the convolutional layers; its mathematical form is presented in Eq. (5).

$$f(x) = \max(0, x) \tag{5}$$

The output of the 1D convolutional layer is denoted by $y_u$, which becomes the input to the max-pooling layer, as indicated in Eq. (6). During pooling, the maximum values from the output of the convolutional layer within the region $\Re$ are selected, and the result is denoted by $s_u$.

$$s_u = \max_{i \in \Re} y_i \tag{6}$$

Once the max-pooling layer is applied, the flatten method is utilized to convert the output of the final pooling layer into a one-dimensional array, which is subsequently fed as input to the fully connected layers. In the fully connected layers, the ReLU activation function is employed. Finally, the last fully connected layer utilizes the softmax function, as given in Eq. (7), to produce the final result, where $z_c$ denotes the input to output neuron $c$ and $C$ is the number of classes.

$$\mathrm{softmax}(z_c) = \frac{e^{z_c}}{\sum_{j=1}^{C} e^{z_j}} \tag{7}$$

The hyperparameters used in the proposed approach are listed in Table 2.

Fig. 2 Proposed ABCNN model architecture
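The layer sequence described above (1D convolution with ReLU, max pooling, flattening, and a softmax output) can be traced by hand on a tiny input. The sketch below is a simplified forward pass in pure Python, not the trained model: it uses a single kernel and a single dense layer instead of three, and all weights, the kernel, and the feature vector are made-up illustrative values.

```python
import math

def conv1d_relu(signal, kernel, bias):
    """Valid 1D convolution (cross-correlation form) followed by ReLU."""
    width = len(kernel)
    outputs = []
    for start in range(len(signal) - width + 1):
        x_u = bias + sum(w * s for w, s in zip(kernel, signal[start:start + width]))
        outputs.append(max(0.0, x_u))  # ReLU: f(x) = max(0, x)
    return outputs

def max_pool1d(signal, size):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [max(signal[i:i + size]) for i in range(0, len(signal) - size + 1, size)]

def dense(inputs, weights, biases):
    """Fully connected layer without activation: one weight row per output neuron."""
    return [b + sum(w * x for w, x in zip(row, inputs)) for row, b in zip(weights, biases)]

def softmax(logits):
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical normalized feature vector and hand-picked parameters.
features = [0.2, 0.8, 0.5, 0.1, 0.9, 0.4]
conv_out = conv1d_relu(features, kernel=[1.0, -1.0, 0.5], bias=0.0)
pooled = max_pool1d(conv_out, size=2)  # already one-dimensional, so flattening is a no-op
logits = dense(pooled, weights=[[1.0, -1.0], [-1.0, 1.0]], biases=[0.0, 0.0])
probabilities = softmax(logits)
predicted_class = probabilities.index(max(probabilities))
```

In a framework such as Keras the same sequence would normally be expressed with built-in layers; the explicit loops here are only meant to make each equation's role visible.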

Performance Evaluation
This section provides details of the preliminary measures required to implement the approach, such as data cleaning, feature filtering, and normalization. Furthermore, it presents the test results of the proposed ABCNN architecture for identifying malicious attacks in IoT networks. Various experiments were conducted on the proposed model to identify the optimal configuration, with varying model layers, batch sizes, and optimization functions. In these experiments, a sparse categorical cross-entropy loss function was utilized for loss calculation, and the model was trained over six epochs. To evaluate the effectiveness of the proposed model, a five-fold cross-validation method was employed.
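For readers unfamiliar with the loss mentioned above, the following minimal sketch shows what sparse categorical cross-entropy computes: the mean negative log-probability that the model assigns to the true class, with labels given as integer class ids rather than one-hot vectors. The function name, the clipping constant, and the example probabilities are illustrative assumptions, not values from the experiments.

```python
import math

def sparse_categorical_cross_entropy(probabilities, labels, eps=1e-12):
    """Mean of -log p(true class); `labels` holds integer class ids."""
    losses = [-math.log(max(p[y], eps)) for p, y in zip(probabilities, labels)]
    return sum(losses) / len(losses)

# Three samples, three classes (illustrative softmax outputs).
probs = [[0.7, 0.2, 0.1],
         [0.1, 0.8, 0.1],
         [0.3, 0.3, 0.4]]
labels = [0, 1, 2]
loss = sparse_categorical_cross_entropy(probs, labels)
```

The third sample contributes most to the loss because the model assigns only 0.4 to its true class, which is the behavior that pushes training toward poorly predicted classes.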

Datasets
In this study, we utilized the Edge-IIoTset, IoTID20, ToN_IoT, and CIC-IDS2017 datasets, which are well-established and frequently used by researchers in the area of ML-based IDSs. The Edge-IIoTset dataset includes IoT and IIoT network communication instances obtained from a real-world testbed implementation of seven interconnected layers, including cloud computing, network functions virtualization, blockchain network, fog computing, software-defined networking, and edge computing, as well as IoT and IIoT perception layers [45]. The data was generated by more than ten different types of devices, such as water and soil measuring, temperature, humidity, and other IoT devices. Edge-IIoTset contains fourteen attacks related to IoT and IIoT communication protocols, which are divided into five categories: DoS/DDoS, Information gathering, Man in the middle, Injection, and Malware attacks. Data was collected from network packets using pcap files, which were then converted to CSV using the Zeek and TShark tools. The dataset contains 2,219,201 samples to evaluate the DL methods. The IoTID20 dataset was created specifically for detecting cyberattacks in IoT networks. It was generated using home-connected smart devices, such as SKT NGU and EZVIZ Wi-Fi cameras [46]. The significant benefit of this dataset lies in its incorporation of contemporary communication data and fresh insights into network interference detection. In total, the dataset encompasses 83 distinct features related to IoT networks [47]. The ToN_IoT dataset was derived from a real-time IoT network at UNSW Canberra in Australia. It encompasses seven distinct categories of cyberattacks targeting IoT networks, each documented within its respective file [48]. The CIC-IDS2017 dataset was generated using real-time network data, spread across eight distinct files. These files cover a five-day period, capturing both regular and attack-related traffic data [49]. This dataset was created by the Canadian Institute for Cybersecurity [50]. Table 3 provides summaries of the utilized datasets.

Data Cleaning
Data cleaning is the first step in the preprocessing phase; it addresses null values and converts categorical attributes to numeric attributes. The Edge-IIoTset dataset contains no missing values. The datasets used contain categorical attributes with a variety of categories. To transform categorical data into numeric form, we first considered a one-hot encoder; however, this technique requires a large amount of memory and introduces significant latency [51]. We therefore used the label encoder mechanism for the transformation task. The label encoder assigns a unique numeric value to each label based on alphabetical order and does not require any additional memory.
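The label-encoding step described above can be sketched in a few lines. This is a minimal pure-Python illustration; the helper names and the `protocol` column are hypothetical, and in practice a library encoder such as scikit-learn's `LabelEncoder` would typically be used.

```python
def fit_label_encoder(values):
    """Map each distinct label to an integer, assigned in sorted (alphabetical) order."""
    return {label: index for index, label in enumerate(sorted(set(values)))}

def transform(values, mapping):
    """Replace every label with its integer code."""
    return [mapping[v] for v in values]

# Hypothetical categorical column.
protocol = ["tcp", "udp", "icmp", "tcp", "udp"]
mapping = fit_label_encoder(protocol)
encoded = transform(protocol, mapping)
# mapping -> {"icmp": 0, "tcp": 1, "udp": 2}
# encoded -> [1, 2, 0, 1, 2]
```

Unlike one-hot encoding, the column count stays the same, which is the memory advantage noted above; the trade-off is that the integer codes impose an arbitrary ordering on the categories.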

Features Filtering
Each dataset includes a set of attributes. If a dataset contains insignificant attributes that do not influence the output, it is advisable to remove them. Such features can lead to overfitting or underfitting and increase the computation time, reducing the effectiveness of the framework. Feature selection eliminates insignificant attributes from a dataset and retains only the essential ones. The primary goal of feature filtering is to avoid overfitting and underfitting, enhance efficiency, and decrease the model's training and response times.
In this experiment, we employed the mutual information method to select the most significant features for training the models. Mutual information is a fundamental concept in information theory that quantifies the shared information between two random variables. It measures the extent to which knowing the value of one variable reduces the uncertainty about the other variable. Here, it is used to identify relevant features that contribute to the detection of malicious activities: the method calculates the mutual information between each attribute and the target variable based on entropy. Mathematically, the mutual information is expressed in Eq. (8), where $I(X; Y)$ denotes the mutual information between $X$ and $Y$, $p(x, y)$ represents the joint probability mass function of $X$ and $Y$, and $p(x)$ and $p(y)$ correspond to the marginal probability mass functions of $X$ and $Y$, respectively.

$$I(X; Y) = \sum_{x \in X} \sum_{y \in Y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)} \tag{8}$$

Only the features with a mutual information score greater than the threshold of 0.1 were selected. The Edge-IIoTset and IoTID20 datasets contain a total of 61 and 83 features, respectively. Of these, 29 features were selected from Edge-IIoTset and 55 features from IoTID20.
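To make the selection rule concrete, the sketch below computes mutual information for discrete features directly from the probability-mass-function definition and keeps features scoring above 0.1. It is an illustration under stated assumptions: the natural logarithm (nats) is assumed, the feature names and toy data are hypothetical, and continuous features would require discretization or an estimator such as scikit-learn's `mutual_info_classif`, which is not reproduced here.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """I(X; Y) in nats for two equally long sequences of discrete values."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        mi += p_xy * math.log(p_xy / ((px[x] / n) * (py[y] / n)))
    return mi

def select_features(columns, target, threshold=0.1):
    """Return (names of features scoring above threshold, all scores)."""
    scores = {name: mutual_information(col, target) for name, col in columns.items()}
    selected = [name for name, score in scores.items() if score > threshold]
    return selected, scores

# Toy data: "flag" mirrors the label exactly, "noise" is independent of it.
target = [0, 0, 1, 1, 0, 0, 1, 1]
columns = {
    "flag": [0, 0, 1, 1, 0, 0, 1, 1],
    "noise": [0, 1, 0, 1, 0, 1, 0, 1],
}
selected, scores = select_features(columns, target)
```

Here `flag` scores ln 2 (it determines the label completely) and `noise` scores 0, so only `flag` passes the threshold.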

Normalization
The features in the selected datasets span very different ranges: some features take very large values, while others take very small ones. To address this issue, we used min-max normalization to scale all attributes to the range between 0 and 1 using Eq. (9),

$$x_{norm} = \frac{x - x_{min}}{x_{max} - x_{min}} \tag{9}$$

where $x$ is the original value to be normalized, $x_{min}$ and $x_{max}$ are the minimum and maximum values of the feature, respectively, and $x_{norm}$ is the normalized value of $x$. After normalization, the data were split into 80% training and 20% test sets using a stratified split. The 80/20 distribution was chosen to match 5-fold cross-validation, in which each fold uses 20% of the data for testing and 80% for training, so that the folds together cover the entire dataset for robust performance validation. The stratified method preserves the class proportions in both the training and test sets.
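A minimal pure-Python sketch of the two steps just described, min-max scaling of one feature column and a stratified 80/20 split of sample indices, is given below. The function names, the seed, and the toy `packet_sizes` and `labels` data are hypothetical; library routines (for example scikit-learn's `MinMaxScaler` and `train_test_split` with `stratify`) would normally be used on real data.

```python
import random
from collections import defaultdict

def min_max_normalize(column):
    """Scale values to [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_indices, test_indices), sampling each class separately."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    rng = random.Random(seed)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)

# Hypothetical feature column and class labels.
packet_sizes = [40, 1500, 770, 40, 1135]
scaled = min_max_normalize(packet_sizes)  # -> [0.0, 1.0, 0.5, 0.0, 0.75]

labels = ["normal"] * 10 + ["attack"] * 5
train_idx, test_idx = stratified_split(labels)
```

With 10 normal and 5 attack samples, the test set receives 2 normal and 1 attack sample, so the 2:1 class ratio is the same in both partitions.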

Experimental Setup
The experiments on the proposed model were implemented in Python 3.11, which is widely used among scientists for experimentation. The Keras package from the TensorFlow library was employed, as it is a convenient deep-learning framework. The code was developed in Jupyter Notebook, which provides results after each code cell is executed. The experiments were conducted on a Windows 10 operating system, using a laptop with an 8th-generation i5 processor at 1.8 GHz and 24 GB of RAM. The large size of the datasets necessitated the high RAM capacity.

Evaluation Measure
In this study, we utilize multiple metrics to evaluate the effectiveness of our proposed classification approach. These metrics include accuracy (ACC), macro precision (MP), macro recall (MR), and F1-score, which are based on true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). Precision measures the proportion of the instances returned by the model that are correct. It is obtained from the confusion matrix produced when testing the model and is calculated from the TP and FP values. This experiment uses the MP, which is calculated using Eq. (10),

$$MP = \frac{1}{k} \sum_{i=1}^{k} \frac{TP_i}{TP_i + FP_i} \tag{10}$$

where $k$ is the number of classes, $TP_i$ represents the true positives for class $i$, and $FP_i$ represents the false positives for class $i$. Recall measures the proportion of the relevant instances in the data that the model returns. It is likewise obtained from the confusion matrix and is calculated from the TP and FN values. This experiment uses the MR, which is calculated using Eq. (11),

$$MR = \frac{1}{k} \sum_{i=1}^{k} \frac{TP_i}{TP_i + FN_i} \tag{11}$$

where $FN_i$ represents the false negatives for class $i$. The F1-score combines precision and recall into a single value and is calculated using Eq. (12).

$$F1 = \frac{2 \times MP \times MR}{MP + MR} \tag{12}$$

Accuracy is a metric used to evaluate the performance of a model in accurately detecting attacks. Eq. (13) is employed to calculate accuracy, where $TN_i$ represents the true negatives for class $i$.

$$ACC = \frac{1}{k} \sum_{i=1}^{k} \frac{TP_i + TN_i}{TP_i + TN_i + FP_i + FN_i} \tag{13}$$
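The metrics described in this section can all be derived from a multi-class confusion matrix. The sketch below is one plausible reading of Eqs. (10) to (13), not a confirmed reproduction of the evaluation code: it assumes the F1-score is the harmonic mean of MP and MR and that accuracy is averaged over the per-class TP and TN counts. The function name and the 3-class confusion matrix are hypothetical.

```python
def macro_metrics(confusion):
    """confusion[t][p] = number of samples of true class t predicted as class p."""
    k = len(confusion)
    total = sum(sum(row) for row in confusion)
    precisions, recalls, accuracies = [], [], []
    for i in range(k):
        tp = confusion[i][i]
        fp = sum(confusion[t][i] for t in range(k)) - tp  # predicted i, but wrong
        fn = sum(confusion[i]) - tp                       # true i, but missed
        tn = total - tp - fp - fn
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
        accuracies.append((tp + tn) / total)
    mp = sum(precisions) / k
    mr = sum(recalls) / k
    f1 = 2 * mp * mr / (mp + mr) if mp + mr else 0.0  # assumed harmonic-mean form
    return {"macro_precision": mp, "macro_recall": mr,
            "f1": f1, "accuracy": sum(accuracies) / k}

# Hypothetical 3-class confusion matrix (rows: true class, columns: predicted class).
confusion = [[90, 10, 0],
             [5, 40, 5],
             [0, 2, 8]]
metrics = macro_metrics(confusion)
```

Because macro averaging gives every class the same weight, the small third class (10 samples) influences MP and MR as much as the large first class (100 samples), which is why these metrics are informative on imbalanced intrusion datasets.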

Proposed ABCNN Layers Comparison
The experiments were carried out with varying numbers of one-dimensional convolutional (Conv1D) and dense layers, and the results were compared to find the optimal layer configuration. In this experiment, the Adam optimization function with a batch size of 32 was employed. Figs. 3, 4, 5, and 6 depict the training and validation performance, providing an assessment of the model in terms of overfitting and underfitting. These figures show that the model generalizes well to unseen data and does not exhibit signs of overfitting. The performance evaluation on all the utilized datasets is recorded in Tables 4, 5, 6, and 7, which present the accuracy and other performance metrics for the different layer configurations. The results indicate that the best performance was achieved using one Conv1D layer and three fully connected layers in the proposed ABCNN approach. Furthermore, all evaluation metrics were highest for this configuration. It consistently performed better than every other layer configuration tested, indicating that it is a robust and reliable choice for this task.

Proposed ABCNN Results on Different Optimization Functions
A series of experiments was conducted to evaluate various optimization functions and determine the best one for the proposed model. The experiments used the optimized layer configuration of the proposed model, as explained previously, with a batch size of 32. The training accuracy and loss of the proposed model with different optimizers are compared in Figs. 7, 8, 9 and 10. The results for all the utilized datasets are summarized in Tables 8, 9, 10 and 11. The findings reveal that the ABCNN model achieved its best performance with the Adam optimization function, outperforming the other functions across the evaluated datasets.

Proposed ABCNN Results on Different Batch Sizes
As previously mentioned, various experiments were conducted to determine the optimal batch size for the proposed model. For this experiment, the proposed model's optimal layer configuration was utilized with the Adam optimization function. The results for the proposed ABCNN model using different batch sizes on all the utilized datasets are presented in Tables 12, 13, 14 and 15. The findings indicate that the ABCNN model performed better with a batch size of 32 than with the other batch sizes across the evaluated datasets.

Performance Comparison with Other ML and DL Methods
The proposed ABCNN model was evaluated to determine its effectiveness in comparison with other ML and DL approaches, namely CNN, LSTM, GRU, Naive Bayes (NB), and Support Vector Machines (SVM). All models were evaluated under the same experimental conditions, including the preprocessing steps and the dataset split into 80% training and 20% testing sets. Tables 16, 17, 18, and 19 provide a detailed analysis of the performance of the proposed ABCNN model in network behavior classification on all the utilized datasets. The results show that the proposed ABCNN model outperformed the other ML and DL approaches (CNN, LSTM, GRU, NB, and SVM) overall. Therefore, the proposed ABCNN model can be considered an effective approach for achieving high performance in the given task.
Table 16 shows that the M-Precision of the CNN approach is better than that of the proposed approach. Similarly, Table 19 shows that the M-Precision of the GRU approach is better than that of the proposed approach. The high precision and low recall of CNN and GRU suggest that when these models predict a positive class, they are very likely correct (high precision), but they miss a significant number of positive cases (low recall). This could be because CNN and GRU are too conservative in their predictions, possibly due to being too sensitive to certain features that do not generalize well.

Conclusion
In this study, we have proposed an attention-based convolutional neural network (ABCNN) for intrusion detection in IoT networks. Moreover, we have utilized the mutual information technique during the pre-processing stage to select the most relevant features from the dataset. To evaluate the effectiveness of our proposed ABCNN approach, we have employed the Edge-IIoTset, IoTID20, ToN_IoT, and CIC-IDS2017 datasets. The performance of our approach has been compared with other Intrusion Detection Systems (IDS) based on both ML and DL techniques. The results demonstrate that our proposed approach achieves an average accuracy of 99.81% across the datasets. Additionally, precision, recall, and F1-score reach 98.02%, 98.18%, and 98.08%, respectively, surpassing the performance of the other models. The various IDS datasets contain several classes of attacks. As future work, these classes should be combined into a single dataset, and the proposed model should be trained on it with adjusted hyperparameters, because the model has the capability to learn a wider range of attack classes, eliminating the need for multiple models for detecting different cyberattacks. Like other ML and 1D-DL models, the proposed model requires a feature selection method to eliminate features that negatively affect performance. Additionally, as the number of layers increases, so does the detection time.

Fig. 3
Training and validation performance of the proposed ABCNN model on Edge-IIoTset dataset

Fig. 6
Training and validation performance of the proposed ABCNN model on CIC-IDS2017 dataset

Table 1
Literature overview

Table 2
The utilized hyperparameters

Table 3
Details of the utilized datasets for each class

In contrast to the high-precision, low-recall behavior of CNN and GRU in Tables 16 and 19, the low precision and high recall of the proposed ABCNN indicate that the model identifies most of the positive cases (high recall) but also makes many false-positive errors (low precision). This might happen if the model generalizes broadly and treats many features as indicators of the positive class.

Table 4

Table 16
Performance comparison with several ML and DL approaches on Edge-IIoTset dataset

Table 17
Performance comparison with several ML and DL approaches on IoTID20 dataset

Table 18
Performance comparison with several ML and DL approaches on ToN_IoT dataset

Table 19
Performance comparison with several ML and DL approaches on CIC-IDS2017 dataset