Intrusion Detection in IoT Using Deep Residual Networks with Attention Mechanisms

Abstract: Connected devices in IoT systems usually have low computing and storage capacity and lack uniform standards and protocols, making them easy targets for cyberattacks. Implementing security measures like cryptographic authentication, access control


Introduction
With rapid advances in next-generation information technologies such as 5G, artificial intelligence, and big data, IoT has entered a new era of development. However, connected devices in IoT systems often have low computing and storage capacity and lack uniform standards and protocols, making them easy targets for cyberattacks. In addition, IoT device manufacturers often do not implement strict security measures during production, so current IoT devices have many vulnerabilities, which in turn create many potential security threats in the IoT systems built from them [1]. For example, the Mirai botnet can execute large-scale distributed denial-of-service attacks via IoT devices [2], and because IoT devices are usually connected through wireless networks, intruders can obtain private information from communication channels by eavesdropping [3]. These cyberattacks not only disrupt the availability of device functionality but may also steal important information about the device and its users. Therefore, designing a security approach that can effectively address the cyberattacks faced by IoT has become a focus of research [4].
Implementing security measures such as cryptographic authentication, access control, and firewalls for IoT devices does not fully address the cyberattack problem or guarantee absolute security in the IoT environment [5]. To improve the defense capability of IoT systems, many studies have focused on machine learning-based or deep learning-based intrusion detection methods [6]. An intrusion detection system (IDS) [7] is a network security device that continuously monitors network traffic and issues alerts or takes proactive measures when suspicious activities are detected. Unlike other network security measures, an IDS is a proactive security protection technology. IDS solutions can be classified into three approaches: signature-based, anomaly-based, and hybrid [6]. In general, signature-based approaches are effective against known attacks but cannot detect unknown ones. Moreover, given the heterogeneity, dynamics, and complexity of IoT networks, they are less efficient for IoT because they require continuous human intervention to extract attack patterns and update the IDS model. Anomaly-based approaches, in contrast, are effective against unknown attacks, which makes them advantageous in IoT: they can detect zero-day attacks with minimal human intervention. The hybrid approach combines signature-based and anomaly-based techniques. Because signature-based methods are of limited use in IoT environments, anomaly-based intrusion detection systems play a crucial role in IoT security.
Some existing anomaly intrusion detection systems use traditional machine learning techniques to build IDS models [8]. These traditional techniques, however, encounter challenges with the high speed and volume of data generated by IoT devices. They rely extensively on feature engineering to extract representative features from the unstructured data, leading to performance degradation when applied to large-scale and high-dimensional datasets. Consequently, several studies have shifted focus toward employing deep learning techniques to develop more effective solutions [9].
Deep learning automates feature extraction and classification by organizing multilayer neural networks into architectures that are trained with large amounts of data and computation. The advantage of this approach is that it can perform feature learning without human intervention and can achieve good results on large or complex datasets. Its powerful data processing and feature learning capabilities, together with its ability to detect unknown attacks, can provide advanced solutions for IoT intrusion detection [9][10][11][12]. Researchers have therefore applied deep learning to IoT intrusion detection and achieved good results [9][10][11][12][13][14].
However, some current deep learning-based IoT intrusion detection methods have several obvious problems. First, most current methods are evaluated on older data sources such as NSL-KDD and KDD-99 [9][10][11][12][13], which do not contain up-to-date attack data against the IoT. Second, the detection models in a significant number of methods have a complex structure [11][12][13][14], which limits their application in the IoT environment. Third, some methods ignore the fact that IoT traffic data has both spatial and temporal features [12], which leads to inadequate feature extraction and thus affects detection performance. Moreover, some models cannot process the heterogeneous data found in the IoT environment, which weakens model generalization [15]; that is, the models do not adapt well to new data and have poor detection performance on new datasets. Finally, the limited amount of labeled data available for training deep learning models in the IoT environment seriously affects the detection performance of the classifier.

In our work, we propose an improved residual network structure that consists of three residual modules, each containing a CONV-LSTM sub-network, and couples each module with an attention module, which greatly reduces the complexity of the model structure. The evaluation results on the ToN_IoT and UNSW-NB15 datasets show that our model performs better than existing models. The contributions of our work are as follows:

• Considering the spatiotemporal characteristics of IoT traffic data, we proposed an improved residual network structure that avoids the performance impact of extracting only a single feature.

• We introduced an attention mechanism in the model to compute weights representing the importance of different features to help the model focus on the most important features.

The rest of this paper is organized as follows. Section 2 describes the intrusion detection-related work. Section 3 describes the mathematical modeling of the proposed model. Section 4 presents the experimental results and discussion. Finally, Section 5 summarizes the contributions and results of this study.

Related Work
In recent years, Hasan et al. [16] studied attack detection models in IoT sensors using machine learning methods. They compared the performance of several machine learning techniques in predicting attacks and violations in IoT systems. A robust algorithm was developed for detecting IoT cyberattacks, particularly focusing on virtual environments. The proposed system demonstrated better detection accuracy compared to existing models. Ravi et al. [17] proposed DDoS attack mitigation and learning-driven detection in IoT through an SDN-Cloud architecture. This approach aims at detecting DDoS attacks launched against IoT servers by malicious wireless IoT devices. Zhang et al. [18] proposed an effective method for classifying network traffic, which uses principal component analysis (PCA) to remove irrelevant features and Gaussian Naive Bayes as a classifier. However, machine learning methods have certain limitations. First, their performance is heavily dependent on the robustness of the feature engineering techniques employed. Second, their effectiveness diminishes when applied to large-scale and high-dimensional data. Lastly, their learning capability is insufficient to effectively handle unknown attacks in the IoT environment.
The most important advantage of deep learning over traditional machine learning is its superior performance on large datasets. IoT systems usually generate a large amount of data, which is complex and diverse and contains a variety of intrusive behaviors. The ability of deep learning to automatically model complex feature sets from sample data makes it particularly relevant in IoT security applications. For example, Mohamed et al. [19] proposed the use of DeepIFS integrated with gated recurrent units and a multi-headed attention mechanism to detect intrusions. Liu et al. [20] introduced a federated learning approach for collaborative and decentralized training on edge devices. They utilized LSTM to capture temporal representations and employed attention-enhanced convolutional neural networks to learn important spatial information.

The study of recurrent neural networks (RNNs) and their variants is important for improving the security of IoT systems, especially against time-series-based threats. For example, Yan et al. [21] used a variational autoencoder (VAE) as a network baseline and then learned potential temporal representations of the input time series for intrusion detection by RNNs. They also used FNN to parameterize the mean and variance of each time window to provide a non-smooth architecture that can operate in the absence of persistent noise. Gao et al. [22] introduced the LSTM-GaussianNB architecture for evaluating the probability of outliers in IoT data. These research works further demonstrate the suitability of recurrent neural networks as an IoT intrusion detection model. Some researchers have therefore gone a step further and combined recurrent neural networks with other methods to achieve better detection in the field of IoT intrusion detection. For example, Parra et al. [23] proposed a distributed cloud-based approach based on CNN and LSTM to detect and mitigate phishing and botnet attacks on client devices, which outperformed a single LSTM model.
In the last two years, Khan et al. [24] proposed an efficient model called XSRU-IoMT for the effective and timely detection of complex attack traffic in medical IoT. However, the authors tested it on only a single dataset, which was not sufficient to demonstrate the effectiveness of the model. Wu et al. [25] proposed LuNet, a hierarchical CNN-RNN neural network, but the generalization ability of the model was poor. Later, Wu et al. proposed Densely-ResNet [15], a dense residual network based on LuNet, for secure detection across edge, cloud, and fog layers. Although the generalization ability of the model was improved, it remained somewhat inadequate, and the structure of the model was too complex. Latif et al. [26] stated that their proposed dense random neural network (DnRaNN) could improve the generalization performance of the model, but they did not provide enough experiments to show that the generalization ability was actually improved.

Therefore, to further improve detection accuracy while maintaining good generalization ability, this paper proposes an intrusion detection model based on temporal convolutional residual modules. It consists of three residual modules, each containing a CONV-LSTM subnetwork, and couples each module with an attention module, which greatly reduces the complexity of the model structure. The model can effectively learn the spatiotemporal representation of IoT data and obtain high accuracy while maintaining good generalization performance. All experiments are conducted on the ToN_IoT dataset and the UNSW-NB15 dataset.

Proposed Model

Model Overview
To achieve intrusion detection in the IoT environment, researchers have introduced convolutional neural networks (CNNs) [27]. However, CNNs are usually used for feature extraction in static environments and lack a mechanism for storing long-term dependencies, so they are not well suited to modeling sequential data in IoT. Some researchers have therefore applied RNNs to IoT intrusion detection to exploit the sequential relationships in IoT traffic data. RNNs integrate a temporal layer to capture sequential data, learning multifaceted changes through the hidden units of recurrent units. These hidden units are modified based on the data provided to the network and are continuously updated. RNNs are used for IoT security because they handle sequential data efficiently. The study of RNNs and their variants is important for improving the security of IoT systems, especially against time-series-based threats [14]. However, RNNs suffer from the vanishing or exploding gradient problem, in which gradients become too small or too large during training, leading to unsatisfactory predictions for IoT intrusion detection.
This paper studies models that combine deep learning with attention mechanisms [28] to detect intrusions from IoT traffic data. To match the characteristics of IoT data, this paper combines CNN and LSTM units, redesigns and improves the ResNet architecture [29], and applies an attention mechanism to the extracted high-dimensional features. The attention mechanism assigns weights to the different features, which avoids the problem of important information being poorly expressed because of the excessive dimensionality of the features.
The ResNet model proposed in this paper consists of three main pairs of modules. Each module pair contains a residual module and a ResBlock-CBAM module. A skip connection is used between every two module pairs. After the summation operation is performed on the output of the last module pair, the result is classified by the classification layer. Together these form the main structure of the deep residual network model used in this paper. The overall architecture of the model is shown in Figure 1. This network model mitigates disadvantages of traditional deep learning models, such as vanishing and exploding gradients, and has a strong generalization capability.
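The data flow through the module pairs can be sketched in a few lines of plain Python. This is only an illustrative sketch under stated assumptions: `forward_with_skips`, `add`, and `toy_pair` are hypothetical names, the vectors are flat lists rather than feature maps, and `toy_pair` is a placeholder for a residual module followed by a ResBlock-CBAM module, not the actual layers of the proposed model.

```python
from typing import Callable, List

Vector = List[float]
Module = Callable[[Vector], Vector]


def add(a: Vector, b: Vector) -> Vector:
    """Element-wise sum of two equally sized vectors."""
    return [x + y for x, y in zip(a, b)]


def forward_with_skips(x: Vector, module_pairs: List[Module]) -> Vector:
    """Chain module pairs so that each pair receives the original input
    plus the outputs of all earlier pairs (the accumulated skip path)."""
    carried = x
    for pair in module_pairs:
        # Skip connection: add the pair's output to what it was given.
        carried = add(carried, pair(carried))
    return carried


def toy_pair(v: Vector) -> Vector:
    """Hypothetical stand-in for one residual + ResBlock-CBAM pair."""
    return [2.0 * t for t in v]


# Three module pairs, as in the overview; the result would then be
# passed to the classification layer.
features = forward_with_skips([1.0, -1.0], [toy_pair, toy_pair, toy_pair])
```

Because every pair adds to the carried signal instead of replacing it, the input remains reachable at every depth, which is the property that skip connections rely on to ease gradient flow.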

Residual Network Architecture
The specific structure of the residual module proposed in this paper is shown in Figure 2. Each residual module consists of a CNN unit and an LSTM unit. The CNN unit mainly consists of three convolutional layers: the first has a 1 × 1 kernel, which reduces the dimensionality of the input feature map; the second has a 3 × 3 kernel, which is the core part of the CNN structure and performs feature extraction and nonlinear transformation; the third has a 1 × 1 kernel, which restores the dimensionality of the feature map to its original size. The proposed model extracts the features of the input data through the convolutional and pooling layers, and the resulting feature map is then transformed and input to the LSTM unit.

Assuming that the input datum is $x$, the convolution layer extracts the spatial features of the given data by performing the convolution operation on the input to obtain the output:

$$y_i = f(w_i * x + b_i)$$

where $f(\cdot)$ is the activation function, $*$ denotes the convolution operation, $w_i$ is the convolution kernel weight, and $b_i$ is the corresponding bias. The rectified linear unit (ReLU) is used as the activation function:

$$f(x) = \max(0, x)$$

The first convolution operation in the residual unit uses a 1 × 1 convolution kernel, which reduces the number of channels of the input feature map to C/4, so the output feature map has size H × W × C/4. Batch normalization (BN) and ReLU activation are then applied to the output feature map:

$$\mathrm{BN}(x_i) = \lambda x'_i + \varphi$$

where $\lambda$ and $\varphi$ are learnable parameters and $x'_i$ is calculated as follows:

$$\mu_\beta = \frac{1}{m}\sum_{i=1}^{m} x_i, \qquad \sigma_\beta^2 = \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu_\beta)^2, \qquad x'_i = \frac{x_i - \mu_\beta}{\sqrt{\sigma_\beta^2 + \varepsilon}}$$

In these equations, $\mu_\beta$ represents the mean of the mini-batch, which is the average value of the β-th feature map element in a mini-batch; $\sigma_\beta^2$ represents the variance of the mini-batch, which is the variance of the β-th feature map element in a mini-batch; $x_i$ represents the i-th element of the input feature map; $x'_i$ represents the i-th element of the normalized input feature map; $m$ represents the size of the mini-batch in batch normalization, that is, the number of elements in the input feature map; and $\varepsilon$ represents a small constant used to prevent division by zero.
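As a small worked illustration of the batch normalization and ReLU steps described above, the following sketch normalizes a mini-batch of scalar activations. It is a minimal sketch, assuming scalar inputs and the hypothetical function names `batch_norm` and `relu`; the parameters `lam`, `phi`, and `eps` stand for λ, φ, and ε, and in a trained network λ and φ would be learned rather than fixed.

```python
import math
from typing import List


def batch_norm(batch: List[float], lam: float = 1.0, phi: float = 0.0,
               eps: float = 1e-5) -> List[float]:
    """Normalize a mini-batch to zero mean and unit variance, then apply
    the scale (lam) and shift (phi) parameters."""
    m = len(batch)
    mu = sum(batch) / m                              # mini-batch mean
    var = sum((x - mu) ** 2 for x in batch) / m      # mini-batch variance
    normalized = [(x - mu) / math.sqrt(var + eps) for x in batch]
    return [lam * x + phi for x in normalized]


def relu(values: List[float]) -> List[float]:
    """Rectified linear unit: negative values are clipped to zero."""
    return [max(0.0, v) for v in values]


activations = relu(batch_norm([1.0, 2.0, 3.0, 4.0]))
```

With the default λ = 1 and φ = 0, the two inputs below the batch mean become negative after normalization and are therefore set to zero by the ReLU, while the two inputs above the mean stay positive.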
The second and third convolution operations use 3 × 3 and 1 × 1 convolution kernels, respectively, and each is followed by BN and ReLU activation. The pooling layer then downsamples the output of the previous layer to compress its features and reduce their dimensionality, applying the downsampling function down(·) over windows of size $M_l$, where $M_l$ denotes the size of the l-th pooling layer.
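The downsampling step can be illustrated with a one-dimensional max pooling sketch. This is an assumption-laden toy: `max_pool_1d` is a hypothetical helper, it uses non-overlapping windows and drops any incomplete final window, and it operates on a flat list rather than on the feature maps used by the model.

```python
from typing import List


def max_pool_1d(values: List[float], size: int) -> List[float]:
    """Keep the maximum of each non-overlapping window of `size` elements.
    A trailing window with fewer than `size` elements is dropped."""
    return [
        max(values[start:start + size])
        for start in range(0, len(values) - size + 1, size)
    ]


pooled = max_pool_1d([1.0, 3.0, 2.0, 5.0, 4.0, 0.0], size=2)
```

Here six inputs are reduced to three, so the feature dimension is halved while the largest response in each window is preserved.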
The maximum pooling layer is a common downsampling technique in convolutional neural networks, which serves to reduce the feature dimension and improve the computational efficiency and generalization ability of the model. The specific roles of the maximum pooling layer are as follows:

• Feature compression: the maximum pooling layer reduces the size of the feature map, which reduces the computation of the model, while retaining the main features of the original data to avoid overfitting.

• Feature invariance: the maximum pooling layer improves feature invariance, i.e., the features remain unchanged when the input changes slightly, thus improving the model's robustness and generalization ability.

• Feature selection: the maximum pooling layer selects the most important features in the data, i.e., it keeps the maximum value and ignores other values, thus improving the expressiveness and performance of the model.

In summary, the maximum pooling layer optimizes the performance and efficiency of the convolutional neural network by compressing the feature map, improving feature invariance, and selecting the most important features. The pooling operation converts the feature map $M$ to $Z = [z_1, z_2, z_3, \ldots, z_C]$ of size 1 × 1 × C, where each element $z_c$ is obtained by pooling the c-th channel of $M$ over its spatial positions.

After local features are extracted by the CNN, the LSTM captures the long-distance dependencies among them. The LSTM unit transforms its input features as follows. The input vector $x_t$ at the current moment is fed to the LSTM cell. The output $i_t$ of the input gate is computed by the sigmoid function and determines how strongly the input affects the state $C_t$; $i_t$ ranges between 0 and 1. The output $f_t$ of the forget gate is also computed by the sigmoid function and determines which information in the previous state $C_{t-1}$ should be forgotten; $f_t$ ranges between 0 and 1.

The state is then updated. The current state $C_t$ is obtained by adding the element-wise product of $i_t$ and the candidate state computed from the current input to the element-wise product of $f_t$ and the previous state $C_{t-1}$. The output $o_t$ of the output gate is computed by the sigmoid function and determines which information in the current state $C_t$ is output; $o_t$ ranges between 0 and 1. The hidden state $h_t$ of the current moment is the element-wise product of $o_t$ and $\tanh(C_t)$. Finally, $h_t$ is passed to the next layer for calculation.
In this way, the input features can be better represented by the transformation of the LSTM cells, thus improving the performance of the model.
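The gate computations can be made concrete with a single time step of a standard LSTM cell, written in plain Python. This sketch follows the textbook LSTM formulation, which is an assumption about the cell the paper uses; the parameter names (`W_i`, `W_f`, `W_o`, `W_g` and the matching biases) and the helper functions are hypothetical, and the all-zero weights in the example are chosen only to keep the arithmetic easy to follow.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]


def sigmoid(v: float) -> float:
    return 1.0 / (1.0 + math.exp(-v))


def affine(weights: List[Vector], bias: Vector, v: Vector) -> Vector:
    """Compute W v + b for a weight matrix given as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + b
            for row, b in zip(weights, bias)]


def lstm_step(x: Vector, h_prev: Vector, c_prev: Vector,
              p: Dict[str, list]) -> Tuple[Vector, Vector]:
    """One step of a standard LSTM cell; returns (h_t, C_t)."""
    z = h_prev + x  # concatenate previous hidden state and current input
    i = [sigmoid(a) for a in affine(p["W_i"], p["b_i"], z)]    # input gate
    f = [sigmoid(a) for a in affine(p["W_f"], p["b_f"], z)]    # forget gate
    o = [sigmoid(a) for a in affine(p["W_o"], p["b_o"], z)]    # output gate
    g = [math.tanh(a) for a in affine(p["W_g"], p["b_g"], z)]  # candidate
    c = [ft * cp + it * gt for ft, cp, it, gt in zip(f, c_prev, i, g)]
    h = [ot * math.tanh(ct) for ot, ct in zip(o, c)]
    return h, c


# Toy parameters: 2 hidden units, 1 input feature, all weights zero.
zeros = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
params = {name: zeros for name in ("W_i", "W_f", "W_o", "W_g")}
params.update({name: [0.0, 0.0] for name in ("b_i", "b_f", "b_o", "b_g")})

h_t, c_t = lstm_step([0.7], [0.0, 0.0], [1.0, -2.0], params)
```

With zero weights every gate outputs 0.5 and the candidate state is 0, so the cell simply halves the previous state; trained weights would instead let the gates decide, per element, how much of the past to keep and how much of the new input to admit.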
The dropout layer is used to avoid overfitting during training and to improve the generalization ability of the model. ReLU is then chosen as the activation function to mitigate the vanishing gradient problem and to speed up training.

Convolutional Block Attention Mechanism
In the IoT environment, network traffic data often contain features of differing importance for intrusion detection. Therefore, this section introduces the CBAM module [30] into the model to help it distinguish the importance of features, so that it captures the important ones and improves its performance. The CBAM module is shown in Figure 3, and the structures of the channel attention module and the spatial attention module are shown in Figures 4 and 5, respectively. The specific computational procedure of the CBAM module is described as follows.

The specific structure of the channel attention module is shown in Figure 4. For a given feature map $F$, the channel attention module first performs Global Average Pooling (GAP) to obtain the average value of each channel, and then performs feature mapping through two fully connected layers to obtain the weight coefficient of each channel. Finally, the weight coefficients are applied to the feature map to emphasize the important channels and suppress the unimportant ones. The whole process is as follows:

$$F' = M_c(F) \otimes F, \qquad F'' = M_s(F') \otimes F'$$

where $M_c$ is the channel attention map, $M_s$ is the spatial attention map, $F'$ and $F''$ are the feature maps after channel and spatial attention, respectively, and ⊗ denotes element-by-element multiplication. Since each channel of the feature map extracts some level of feature information, the channel attention mechanism focuses on which feature information is more important. To compute channel attention efficiently, the channel attention module compresses the spatial dimension of the input feature map. Rather than a single pooling method, it uses both average pooling and maximum pooling, and this dual pooling has been shown to have stronger representational power [30]. The specific computational procedure is as follows:

$$M_c(F) = \sigma\big(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))\big)$$

The specific structure of the spatial attention module is shown in Figure 5. The feature maps processed by the channel attention module first pass through two convolution layers to obtain the weight coefficient of each position. Here, the weight coefficients are calculated by considering the response of each position on different spatial scales and the similarity between positions. Finally, the weight coefficients are applied to the feature map to emphasize the important positions and suppress the unimportant ones, which improves the generalization ability of the model. The spatial attention module focuses on which location information is meaningful, and it complements the channel attention module, as shown below:

$$M_s(F') = \sigma\big(\mathrm{Conv}([\mathrm{AvgPool}(F'); \mathrm{MaxPool}(F')])\big)$$

where σ(·) denotes the sigmoid function, MaxPool(·) denotes maximum pooling, AvgPool(·) denotes average pooling, MLP(·) denotes the multilayer perceptron, and Conv(·) denotes the 3D convolutional layer.

Finally, the model uses the Softmax layer to calculate the final traffic class from the outputs of the residual module and the CBAM module. The outputs of the residual and CBAM modules and the raw inputs are combined and fed into the fully connected layer. The fully connected layer then multiplies the weight matrix with the input vector and adds the bias as follows:

$$z_j = w_j^{\top} X + b_j$$

where $X$ is the input of the fully connected layer, $w_j$ is the weight of the j-th class of features, and $b_j$ is the bias term. The feedforward layer represents the captured spatiotemporal features as a linear representation suitable for predicting the final category labels using the Softmax operation. In this process, each category is assigned a probability score, and the category with the highest probability is taken as the final model prediction, as shown below:

$$p_j = \frac{e^{z_j}}{\sum_{k} e^{z_k}}$$

where $z_j$ denotes the output of the feedforward layer and $p_j$ denotes the probability score of the j-th category. The model is trained to minimize the cross-entropy loss according to Equation (16):

$$L = -\sum_{i} y_i \log \hat{y}_i \qquad (16)$$

where $y_i$ denotes the actual label and $\hat{y}_i$ denotes the label predicted by the model.
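The classification head described above (fully connected layer, Softmax, and cross-entropy loss) can be sketched in plain Python. The function names `dense`, `softmax`, and `cross_entropy` and the toy weights are illustrative assumptions; the max-shift in `softmax` and the small `eps` in the loss are common numerical safeguards and are not taken from the paper.

```python
import math
from typing import List


def dense(x: List[float], weights: List[List[float]],
          biases: List[float]) -> List[float]:
    """Fully connected layer: one weighted sum plus bias per class."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, biases)]


def softmax(z: List[float]) -> List[float]:
    """Turn class scores into probabilities that sum to 1."""
    shift = max(z)  # subtracting the max avoids overflow in exp
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(y_true: List[float], y_prob: List[float],
                  eps: float = 1e-12) -> float:
    """Cross-entropy between a one-hot label and predicted probabilities."""
    return -sum(y * math.log(p + eps) for y, p in zip(y_true, y_prob))


scores = dense([1.0, 2.0], [[0.5, 0.5], [0.5, 0.5]], [0.0, 0.0])
probs = softmax(scores)
predicted_class = probs.index(max(probs))  # class with highest probability
loss = cross_entropy([1.0, 0.0], probs)
```

In this toy case both classes receive the same score, so each gets probability 0.5 and the loss equals ln 2, the value expected from an uninformative two-class prediction.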
To maximize feature reuse, this study uses skip connections to add the output of the previous parameter layer to its subsequent parameter layers, which maintains local originality throughout the learning phase. In addition, each skip-connected pair of residual and attention modules is connected to all subsequent pairs of residual and attention modules to maintain global originality throughout the learning phase. The proposed network structure has several significant advantages that improve its generalization performance.
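To make the attention computation of this section more tangible, the sketch below applies channel attention followed by spatial attention to a tiny feature map. It is a simplified illustration under several assumptions: each channel is flattened to a list, the shared MLP weights `w1` and `w2` are hypothetical, and the spatial convolution is reduced to a single pair of weights (`w_avg`, `w_max`) acting per position, which is equivalent to a 1 × 1 kernel rather than the larger kernel normally used in CBAM [30].

```python
import math
from typing import List

FeatureMap = List[List[float]]  # C channels, each flattened to H*W values


def sigmoid(v: float) -> float:
    return 1.0 / (1.0 + math.exp(-v))


def shared_mlp(v: List[float], w1: List[List[float]],
               w2: List[List[float]]) -> List[float]:
    """Two-layer perceptron with ReLU, shared by both pooled descriptors."""
    hidden = [max(0.0, sum(w * x for w, x in zip(row, v))) for row in w1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]


def channel_attention(fmap: FeatureMap, w1, w2) -> List[float]:
    """One weight per channel from average- and max-pooled descriptors."""
    avg = [sum(ch) / len(ch) for ch in fmap]
    mx = [max(ch) for ch in fmap]
    return [sigmoid(a + m)
            for a, m in zip(shared_mlp(avg, w1, w2), shared_mlp(mx, w1, w2))]


def spatial_attention(fmap: FeatureMap, w_avg: float,
                      w_max: float) -> List[float]:
    """One weight per position from channel-wise average and maximum."""
    weights = []
    for pos in range(len(fmap[0])):
        column = [ch[pos] for ch in fmap]
        mean = sum(column) / len(column)
        weights.append(sigmoid(w_avg * mean + w_max * max(column)))
    return weights


def cbam(fmap: FeatureMap, w1, w2, w_avg: float, w_max: float) -> FeatureMap:
    """Channel attention first, then spatial attention on the result."""
    mc = channel_attention(fmap, w1, w2)
    refined = [[m * v for v in ch] for m, ch in zip(mc, fmap)]
    ms = spatial_attention(refined, w_avg, w_max)
    return [[s * v for s, v in zip(ms, ch)] for ch in refined]


fmap = [[1.0, 3.0], [0.0, 0.0]]   # 2 channels, 2 positions each
w1 = [[1.0, 0.0]]                 # hidden layer: 1 unit over 2 channels
w2 = [[1.0], [0.0]]               # output layer: 2 channels from 1 unit
channel_weights = channel_attention(fmap, w1, w2)
output = cbam(fmap, w1, w2, w_avg=0.0, w_max=0.0)
```

The active first channel receives a weight close to 1, while the silent second channel receives the neutral weight 0.5, which shows how the channel weights rank channels by their pooled response.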

Dataset Description
Generalization ability is an important metric to assess the performance of a model. A good model should have strong generalization ability, meaning it can perform well on new datasets. Therefore, in this study, the model's generalization performance is tested using different datasets.

To evaluate the performance of the proposed model, we used ToN_IoT [13], a real dataset created from a large-scale IoT system developed by the Cyber IoT Laboratory at ADFA, New South Wales, Australia. This dataset contains normal category data and nine categories of attack data: password, scan, ransomware, backdoor, denial of service, distributed denial of service, MITM, injection, and XSS attacks. The total number of data instances in this dataset is shown in Table 1. We trained our model using the ToN_IoT dataset to ensure it learns from a comprehensive set of attacks and normal behavior data.

To test the generalization ability of our model, we then used the UNSW-NB15 [31] dataset. The raw network packets for the UNSW-NB15 dataset were created by the IXIA PerfectStorm tool in the Cyber Range Lab of the Australian Centre for Cyber Security, which generates a mix of modern normal activity and synthetic contemporary attack behavior. The dataset contains a total of 49 features and 9 common attacks. Normal traffic accounts for 88% of the dataset and attack traffic for 12%. The basic information of this dataset is shown in Table 2.

We use several metrics to evaluate the performance of the proposed model, namely accuracy, precision, recall, and F1-score, because they are widely used to evaluate deep learning algorithms. The formulas are as follows:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \mathrm{Precision} = \frac{TP}{TP + FP}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN}, \qquad \mathrm{F1\text{-}score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$

where TP is true positive, the number of positive samples correctly identified as positive; FP is false positive, the number of negative samples incorrectly identified as positive; TN is true negative, the number of negative samples correctly identified as negative; and FN is false negative, the number of positive samples incorrectly identified as negative.
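The four evaluation metrics can be computed directly from the confusion counts, as in the sketch below. The function name `classification_metrics` and the example counts are made up for illustration and are not results from the paper; returning 0.0 when a denominator is zero is a convention chosen here, not one stated in the source.

```python
from typing import Dict


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Accuracy, precision, recall, and F1-score from confusion counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts for an imbalanced test set (13 attacks, 87 normal).
metrics = classification_metrics(tp=8, fp=2, tn=85, fn=5)
```

With these counts accuracy is 0.93 even though recall is only about 0.62, which illustrates why precision, recall, and F1-score are reported alongside accuracy on imbalanced data such as UNSW-NB15.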

Hyperparameter Settings
Hyperparameters are parameters that are set manually before or during training. To ensure the best performance of the proposed deep learning model, suitable hyperparameters were determined through extensive experimentation. The hyperparameters involved in the experimental process include the learning rate, the number of epochs, and the batch size. The details are shown in Table 3. We divided the ToN_IoT dataset into a training set and a test set at a ratio of 80% to 20%.

In the first stage, we set the learning rate to 0.01, fixed the number of epochs at 100, and varied the batch size over 16, 32, 64, 128, 256, 512, and 1024. We recorded the results for each batch size in Table 3. A detailed performance comparison of these parameters is shown in Figure 6a. The best test accuracy at this learning rate is 96.82%, at a batch size of 64. The lowest accuracy is 90.74%, at a batch size of 1024. For the other batch sizes, the accuracy and other performance scores are above 90%, demonstrating the model's robustness to changes in batch size.

In the second stage, we set the learning rate to 0.001, kept the number of epochs at 100, and again varied the batch size over 16, 32, 64, 128, 256, 512, and 1024. The results are recorded in Table 3, and a detailed performance comparison is shown in Figure 6b. The best test accuracy is 99.55%, at a batch size of 512. At batch sizes of 64, 128, 256, and 1024, the accuracy of the model was greater than 99%. At the lower batch sizes of 16 and 32, the accuracy and other performance scores decreased but were still close to 99%. This indicates that a learning rate of 0.001 with a batch size of 512 provides the best balance between training stability and performance.
Finally, we set the learning rate to 0.0001 and kept the other parameters the same as in the first two stages. The detailed performance comparison of these parameters is shown in Figure 6c. The best test accuracy is 99.24%, at a batch size of 32. For batch sizes of 16, 64, and 256, the accuracy of the model is above 98%. For the other batch sizes, the accuracy and other performance scores decreased but remained above 90%.

Comparing the results of the three stages shows that the proposed model achieves its best results with a learning rate of 0.001, 100 epochs, and a batch size of 512. Accuracy remained above 90% in every tested configuration, which indicates that the proposed model is robust to these hyperparameter choices.
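The selection of the final configuration amounts to taking the maximum over the recorded accuracies. The sketch below does this using only the accuracy values quoted in the text; the dictionary `reported_accuracy` and the function `best_configuration` are hypothetical names, and the remaining entries of the full grid would have to be taken from Table 3.

```python
from typing import Dict, Tuple

Config = Tuple[float, int]  # (learning rate, batch size)

# Test accuracies (%) quoted in the text; epochs fixed at 100 throughout.
reported_accuracy: Dict[Config, float] = {
    (0.01, 64): 96.82,
    (0.01, 1024): 90.74,
    (0.001, 512): 99.55,
    (0.0001, 32): 99.24,
}


def best_configuration(results: Dict[Config, float]) -> Config:
    """Return the (learning rate, batch size) pair with the highest accuracy."""
    return max(results, key=results.get)


learning_rate, batch_size = best_configuration(reported_accuracy)
```

On these values the selection returns a learning rate of 0.001 with a batch size of 512, matching the configuration chosen in the experiments.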

Comparison with State-of-the-Art Methods
To further analyze the effectiveness of the proposed model, we compared it with some of the best current methods and evaluated the performance differences. The results are shown in Figure 7. As can be seen from the figure, the proposed model has the best performance on the ToN_IoT dataset, outperforming existing models in terms of accuracy, precision, recall, and F1-score. Specifically, accuracy increased from 99.20% with DnRaNN to 99.55% with the proposed model, an increase of 0.35 percentage points. Compared to XSRU-IoMT, accuracy increased from 99.38% to 99.55%, an increase of 0.17 percentage points. Compared to Densely-ResNet, accuracy increased from 99.43% to 99.55%, an increase of 0.12 percentage points. These results verify the validity of our proposed model.

In addition, we compared the generalization ability of the model with several deep learning methods on the UNSW-NB15 dataset, as shown in Table 4. Our proposed LSTM-ResNet model achieved accuracy, precision, recall, and F1-score values of 89.23%, 88.83%, 87.77%, and 88.25%, respectively. This represents a significant improvement in performance metrics compared to other models. For instance, the accuracy of our model is 15.3% higher than that of the Densely-ResNet model, which is the second-best performing model in our comparison.
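The accuracy gains on ToN_IoT quoted above are simple differences in percentage points, which can be checked with a few lines of Python. The variable names are illustrative; the accuracy values are those stated in the text.

```python
from typing import Dict

proposed_accuracy = 99.55  # proposed model on ToN_IoT (%)

baseline_accuracy: Dict[str, float] = {
    "DnRaNN": 99.20,
    "XSRU-IoMT": 99.38,
    "Densely-ResNet": 99.43,
}

# Gain of the proposed model over each baseline, in percentage points.
gains = {
    name: round(proposed_accuracy - accuracy, 2)
    for name, accuracy in baseline_accuracy.items()
}
```

Rounding to two decimals removes floating-point noise, so the gains come out as 0.35, 0.17, and 0.12 percentage points for DnRaNN, XSRU-IoMT, and Densely-ResNet, respectively.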
The severe class imbalance problem in the UNSW-NB15 benchmark often leads to poor model generalization performance.However, our proposed model demonstrates robust performance despite this challenge.As shown in Table 4, the LSTM-ResNet model not only achieves the highest accuracy but also excels in precision, recall, and F1-score compared to LSTM, LuNet, Densely-ResNet, and CNN models.This indicates that our model has superior generalization capability and can effectively handle imbalanced data.These results highlight the robustness and reliability of our method in diverse scenarios, confirming its effectiveness in real-world applications.
Moreover, the attention mechanism incorporated in our model helps it focus on important features, further enhancing its performance and robustness.This makes the LSTM-ResNet model particularly effective in identifying and classifying network intrusions, as evidenced by its leading performance across multiple evaluation metrics.

Conclusions
In this paper, we propose a deep learning model based on the temporal convolution residual module and an attention mechanism for IoT anomaly detection, and we analyze the model in depth on ToN_IoT, a recent publicly available IoT dataset. In addition, the proposed model is evaluated on the UNSW-NB15 dataset, and its generalization performance is compared with several deep learning methods. The proposed model achieves the highest detection accuracy among the compared methods on the UNSW-NB15 benchmark while maintaining a low false positive rate. The evaluation results confirm the effectiveness and stability of the proposed model, and it is recommended for use in other intrusion detection tasks in the future.

Figure 6. Performance at different learning rates.

Figure 7. Performance comparison with other models.

• Higher detection accuracy. In terms of detection accuracy, the algorithm proposed in this paper outperforms some current state-of-the-art methods.

• Stronger generalization ability. With a certain amount of data, this paper improves the model's expressiveness by increasing the network width and optimizing the loss function to reach the global optimum.

Table 3. Performance variation at different parameters.

Table 4. Comparison of model generalization performance on the UNSW-NB15 dataset.