Intrusion Detection Model Based on Improved Transformer

Abstract: This paper proposes an enhanced Transformer-based intrusion detection model to tackle three challenges of current intrusion detection models: lengthy training time, inaccurate detection of overlapping classes, and poor performance in multi-class classification. Specifically, the proposed model includes the following: (i) A data processing strategy that initially reduces the data dimension using a stacked auto-encoder to speed up training. In addition, a novel under-sampling method based on the KNN principle is introduced, along with the Borderline-SMOTE over-sampling method, for hybrid data sampling that balances the dataset while addressing the issue of low detection accuracy in overlapping data classes. (ii) An improved position encoding method for the Transformer model that effectively learns the dependencies between features by embedding the position information of features, resulting in better classification accuracy. (iii) A two-stage learning strategy in which the model first performs rough binary prediction (determining whether it is an illegal intrusion) and then inputs the prediction value and original features together for further multi-class prediction (predicting the intrusion category), addressing the issue of low accuracy in multi-class classification. Experimental results on the official NSL-KDD test set demonstrate that the proposed model achieves an accuracy of 88.7% and an F1-score of 88.2% in binary classification and an accuracy of 84.1% and an F1-score of 83.8% in multi-class classification. Compared to existing intrusion detection models, our model exhibits higher accuracy and F1-score and trains faster.


Introduction
The rapid growth of the internet has brought significant convenience, but it has also led to an increasing number of network security problems. In today's world, security is of paramount concern as intruders have become more sophisticated with the advancement of technology [1]. Hackers employ various techniques to bypass firewalls, enabling them to infiltrate network systems and cause damage to the internal infrastructure or collect individuals' private information. Given the rising threats posed by intruders, network intrusion detection has emerged as a critical research direction in network security.
Intrusion detection systems can be divided into two categories: network-based intrusion detection systems (NIDSs) and host-based intrusion detection systems (HIDSs), depending on the type of intrusion behavior being monitored [2]. NIDSs monitor local network traffic by examining data packets to detect intrusion behavior, while HIDSs analyze multiple sources of information collected on the local host, such as system data, log files, and disk resources. Traditional intrusion detection techniques include methods such as entropy-based approaches and redundancy optimization. Entropy-based approaches are used to detect anomalies, such as DDoS attacks in IEEE 802.16-based networks, by calculating the entropy of network traffic [3]. This method analyzes statistical and entropy-based features of incoming traffic to determine whether an attack has occurred. However, this approach has some limitations, such as being less effective when dealing with encrypted traffic or low traffic volume. Redundancy optimization is another commonly used technique in intrusion detection, which improves the accuracy and reliability of detection by performing the same detection algorithm multiple times. The most widely used technique is Triple [...]
1. This paper proposes a data processing strategy to address the challenges of high-dimensional features and class imbalance in intrusion detection datasets. Specifically, two techniques are proposed: (i) A stacked auto-encoder is used to reduce the dimensionality of the dataset by encoding the data features based on their original distribution. This not only accelerates model training but also preserves the information content of the data features. (ii) A new under-sampling method based on the K-nearest neighbors (KNN) algorithm is proposed to under-sample normal samples, while Borderline-SMOTE is used to over-sample abnormal samples. This hybrid sampling approach balances the dataset while mitigating class overlap issues, thereby improving the detection performance of the model.
2. To enhance the classification performance of the model, this paper proposes an improved position encoding method for the Transformer. By incorporating positional information from the features in the intrusion detection dataset, the model can capture dependencies among the features, thereby enhancing its detection capability.
3. To enhance the model's capability to handle multiple classes, this paper proposes a two-stage learning strategy. This strategy involves an initial coarse binary prediction, followed by inputting this prediction, along with the original features, into a multi-class classification model. This results in more accurate multi-class predictions.
This section provides an overview of intrusion detection, including its background and current research status using machine learning and deep learning techniques. It also highlights three key challenges in intrusion detection and describes how this paper addresses these issues. The subsequent sections are structured as follows:
1. Section 2, Related Work, presents the current research status of intrusion detection in addressing the aforementioned challenges and discusses the limitations of existing research. It also examines the use of a Transformer in intrusion detection and identifies its shortcomings. Finally, it introduces the proposed improvements in this paper to address these limitations.
2. Section 3, Materials and Methods, presents the proposed model and its various modules and improvements.
3. Section 4, Results, presents the experimental results and discusses their implications in light of the research contributions.
4. Section 5, Discussion, provides an in-depth analysis of the proposed model in this paper, including its strengths, limitations, and practical implications. It also compares the model to existing methods, identifies potential areas for future research, and emphasizes its contributions to intrusion detection.
5. Section 6, Conclusions, presents the overall conclusions of this paper.

Related Work
Despite significant progress in intrusion detection using machine learning and deep learning, three key challenges remain: long model training times, imbalanced datasets, and poor performance in multi-class classification. While many researchers have proposed improvements to address these challenges, the impact of these improvements has been relatively modest.
Regarding class imbalance, intrusion detection datasets are often heavily imbalanced, which may cause algorithms to favor predicting the more numerous class and thus lower detection accuracy. To address this issue, researchers often perform sampling on the dataset to balance it. Jiang et al. [16] suggested a detection framework that combines deep hierarchical networks with hybrid sampling techniques. In particular, they employed the one-sided selection algorithm and the SMOTE technique to perform under-sampling and over-sampling, respectively, in order to balance the dataset. Zhang et al. [17] proposed another technique for processing unbalanced datasets, which combines SMOTE over-sampling with clustering-based under-sampling using a Gaussian mixture model. Their intrusion detection framework effectively addresses the class imbalance problem and improves detection accuracy. Yan et al. [18] employed an enhanced local adaptive synthetic minority over-sampling technique to address dataset imbalance and utilized an RNN for detecting various types of traffic anomalies, leading to improved accuracy in the detection process. However, these sampling methods are designed to achieve a dataset that is balanced in terms of quantity, without considering the issue of class overlap in the intrusion detection dataset. Samples located in the overlapping regions of classes are much harder to classify, resulting in a lower detection rate for the models.
Regarding slow model training, the increase in feature dimensionality and sample quantity in intrusion detection datasets greatly extends the training time. To address this issue, researchers often utilize dimensionality reduction methods to speed up the model training process. Zhou et al. [19] achieved good accuracy by combining an auto-encoder and a residual network. They reconstructed the network using an auto-encoder to perform feature extraction, and then used the extracted features to train a designed residual network. Similarly, Liu et al. [20] employed Principal Component Analysis (PCA) to reduce dimensionality, extracting a subset of principal component features that contain maximum information. The processed data was then fed to recurrent neural networks for classification, resulting in a high accuracy rate. However, these dimensionality reduction methods do not take into account the loss of information caused by dimensionality reduction, which in turn decreases the model's classification ability. Moreover, slow model training is not necessarily only due to the increase in the size and dimensionality of the dataset, as deep learning models with a deeper hierarchy and a larger number of trainable parameters can also train slowly.
Regarding the low multi-class classification ability of intrusion detection models, researchers usually improve the model's multi-class detection ability through optimization of the model itself. Hassan et al. [21] proposed a hybrid intrusion detection model by combining convolutional neural networks (CNNs) and long short-term memory (LSTM) networks. The CNN is used to extract deep-level features, while the LSTM captures long-term dependencies between these features. To prevent overfitting, the weight matrix in the LSTM network is regularized using drop-connect. This hybrid model achieves high accuracy in intrusion detection classification. Guo et al. [22] proposed a method for detecting attacks without any prior knowledge by combining Sub-Space Clustering (SSC) and One-Class Support Vector Machine (OCSVM). Rehman et al. [23] proposed a combined model that employs convolutional neural networks (CNNs) and attention-based gated recurrent units for detecting both single and hybrid attacks, resulting in improved attack detection performance. Yuqing et al. [24] introduced a novel method for traffic analysis by converting data traffic into pixel points in bytes. The resulting images are then processed using a CNN through operations such as convolution and pooling to obtain classification results. Their approach achieved high accuracy in both binary and multi-class classification problems. However, the improvement in multi-class detection capability offered by these methods is limited and requires further enhancement.
In recent years, the Transformer model proposed by Vaswani et al. [25] has shown superior performance in parallel training compared to other deep learning models, significantly improving the training efficiency and reducing the training time. Consequently, researchers have begun to apply Transformers to the field of intrusion detection. In [26], a Robust Transformer-based Intrusion Detection System (RTIDS) was proposed, which used position embedding to associate sequence information between features and stacked the encoder and decoder variants of the Transformer to learn low-dimensional feature representations from high-dimensional raw data. The self-attention mechanism was applied to facilitate the classification of network traffic types. However, the encoding of the input features into low-dimensional representations by the encoder becomes ineffective when the low-dimensional features are fed into the decoder, as the data is transformed back into high dimensions. In [27], an improved Vision Transformer (ViT)-based intrusion detection model was proposed, which was combined with a sliding window mechanism to enhance ViT's local feature modeling ability. A hierarchical focal loss function was adopted to improve the classification performance and mitigate the problem of imbalanced data. However, the use of focal loss alone is insufficient to solve the issue of imbalanced datasets, even with weight modifications, as it does not address the quantitative imbalance of the data. In [28], a combination of convolutional neural networks (CNNs) and a Transformer was proposed for intrusion detection, which captured both the global and local correlations between packets. However, the model did not show significant improvement in multi-class classification. In [29], Transformer-based transfer learning was used to learn network feature representations, and a hybrid CNN-Long Short-Term Memory (CNN-LSTM) model was applied to detect different types of attacks from deep features, with the synthetic minority over-sampling technique (SMOTE) used to balance the anomalous traffic. However, SMOTE may lead to overlapping of classes, making it difficult for the model to classify the synthesized data, and the use of CNN-LSTM as the base model resulted in longer training time.
This paper proposes an improved Transformer-based intrusion detection model to tackle the challenges of imbalanced data, slow model training, and poor multi-class classification performance. The model incorporates a stacked auto-encoder to reduce dimensionality while preserving the original features. We introduce a hybrid sampling method that uses an under-sampling method based on the KNN principle for normal samples and the Borderline-SMOTE technique to over-sample abnormal samples. This method balances the dataset and addresses the issue of class overlap in intrusion detection datasets. We also propose an enhanced Transformer position encoding method that embeds positional information, improving classification accuracy by allowing the model to learn feature dependencies. Because the Transformer is trained in parallel, it speeds up the training process. Furthermore, we propose a two-stage learning strategy that involves rough binary classification prediction followed by multi-class detection with the original features and prediction values. Our experimental results show that our proposed model effectively addresses these challenges and achieves faster model training speed and good detection accuracy.

Materials and Methods
Model Construction
Figure 1 illustrates the structure of our enhanced intrusion detection model based on a Transformer, which can be divided into three parts as follows:
The first part is the data processing strategy, which involves the numerical and normalization transformation of input data, dimension reduction of the dataset through the encoding layer of the stacked auto-encoder (SAE), under-sampling of normal samples using KNN, and the use of a hybrid sampling method consisting of Borderline-SMOTE for over-sampling of abnormal samples to obtain a balanced dataset, while mitigating the problem of data class overlap.
The second part is the first stage of learning, where the balanced dataset is embedded with positional information using an improved position encoding method, features are extracted using the Transformer encoder, and the binary classification model is learned and trained using the softmax function.
The third part is the second stage of learning, where the binary classification model of the first stage is used to make binary predictions on the balanced dataset, the predicted values are merged with the original features to form a new dataset, which is then embedded with positional information using the improved position encoding method, features are extracted using the Transformer encoder, and the multi-class model is learned and trained using softmax.
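The data flow of the two learning stages can be illustrated with a minimal Python sketch. It is not the paper's implementation: `two_stage_predict`, `toy_binary`, and `toy_multiclass` are hypothetical names, the two toy rules merely stand in for the trained Transformer classifiers, and appending the binary prediction as the last feature is an assumption, since the text only states that the prediction is merged with the original features.

```python
from typing import Callable, List, Sequence, Tuple


def two_stage_predict(
    features: Sequence[float],
    binary_model: Callable[[Sequence[float]], int],
    multiclass_model: Callable[[Sequence[float]], str],
) -> Tuple[int, str]:
    """Run the coarse binary stage, then the multi-class stage on augmented features."""
    # Stage 1: coarse prediction (0 = normal, 1 = intrusion).
    binary_pred = binary_model(features)
    # Merge the prediction with the original features (appended last; an assumption).
    augmented: List[float] = list(features) + [float(binary_pred)]
    # Stage 2: the multi-class model sees the original features plus the prediction.
    return binary_pred, multiclass_model(augmented)


# Toy stand-ins for the two trained classifiers (hypothetical rules).
def toy_binary(x: Sequence[float]) -> int:
    return 1 if sum(x) > 1.0 else 0


def toy_multiclass(x: Sequence[float]) -> str:
    if x[-1] == 0:
        return "normal"
    return "dos" if x[0] > 0.5 else "probe"


print(two_stage_predict([0.9, 0.4], toy_binary, toy_multiclass))
```

In this arrangement the second stage can rely on the first-stage decision as an additional input feature instead of having to separate normal from abnormal traffic again.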
Compared to other intrusion detection models, our model has the following advantages:
1. Because SAE can reconstruct its input features, we propose to utilize its encoding layer for reducing the dimensionality of the input features. This ensures that the same amount of information content is preserved in the features despite the reduction in dimensionality, thereby expediting the model training process and reducing computational overhead.
2. This paper proposes a novel under-sampling method based on the characteristics of KNN, combined with the Borderline-SMOTE over-sampling method to achieve hybrid sampling. This method not only balances the dataset but also alleviates the problem of class overlap, thereby improving the classification performance of the model.
3. This paper proposes an improved method for position encoding in the Transformer model, which enhances the model's ability to capture dependencies between features.

Data Preprocessing Strategy
This subsection provides a detailed explanation of the data processing approach proposed in this paper. To address the issue of high dimensionality in the dataset, which often results in long sampling and training times, we propose the utilization of stacked autoencoders for data feature dimensionality reduction. We also introduce a hybrid sampling approach that involves under-sampling the normal samples and over-sampling the abnormal samples in the training set. In addition, we present a novel under-sampling method that leverages the properties of KNN in conjunction with the Borderline-SMOTE oversampling method for hybrid sampling. This approach effectively resolves the challenges of class overlap and class imbalance in the dataset.

Numericalization and Normalization
Intrusion detection datasets often contain character-based features, which cannot be fed directly to the model because it accepts only numerical input. Therefore, character-based features are encoded into numeric values by using one-hot encoding. However, the resulting dataset may contain both discrete and continuous features, leading to significant differences in the scale of individual feature values. This, in turn, can cause the gradient to disperse during backpropagation, slowing down the learning process and reducing the model's ability to extract deeper features. To address this issue, we normalize the dataset after one-hot encoding. In this paper, we utilize min-max normalization to rescale the data and map it to the range [0, 1]. The calculation formula is shown as follows:
$$x^{*} = \frac{x - x_{min}}{x_{max} - x_{min}}$$
In the formula, $x$ represents a specific data value, $x_{min}$ is the minimum value of the column feature, $x_{max}$ is the maximum value of the column feature, and $x^{*}$ is the resulting normalized data value.
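As a minimal illustration of these two steps, the following Python sketch one-hot encodes a character-based column and applies min-max normalization to a numeric column. The function names and the sample values are illustrative assumptions, and mapping a constant column to 0 is a choice made here to avoid division by zero, not something specified in the paper.

```python
from typing import List, Sequence, Tuple


def one_hot(values: Sequence[str]) -> Tuple[List[List[float]], List[str]]:
    """One-hot encode a column of strings; categories are sorted for a stable order."""
    categories = sorted(set(values))
    rows = [[1.0 if v == c else 0.0 for c in categories] for v in values]
    return rows, categories


def min_max_normalize(column: Sequence[float]) -> List[float]:
    """Rescale a numeric column to [0, 1] using (x - x_min) / (x_max - x_min)."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # Constant column: assumed to map to 0 to avoid division by zero.
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]


rows, categories = one_hot(["tcp", "udp", "tcp"])
print(categories, rows)
print(min_max_normalize([2, 4, 6]))
```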

Dimensionality Reduction
After numericalization and normalization, the dimensionality of the dataset increases, which lengthens training and makes the data sparse, leading to a decrease in model accuracy. Therefore, dimensionality reduction becomes necessary. This paper proposes to use the encoding layer of the stacked auto-encoder for dimensionality reduction, because the auto-encoder is trained to reconstruct the original distribution of the data features. This helps preserve the information contained in the data features after reduction.
The stacked auto-encoder (SAE) is an unsupervised learning model that utilizes multiple layers of pre-trained auto-encoders. The training process involves a layer-by-layer greedy training strategy, where one layer is trained at a time, and training only starts for the next layer after the current layer is trained. This approach initializes each layer with a reasonable value, leading to faster convergence and better accuracy.
The principle of SAE is to utilize the input data $X$ as a reference to guide the neural network to learn a mapping that reconstructs the data as $X_R$, where $X_R$ is an approximation of $X$. Hence, the feature $h_2$ resulting from the dimensionality reduction of the encoding layer of SAE needs to retain all the information contained in the original feature $X$, enabling the reconstructed $X_R$ to approximate $X$. Building upon this, this paper proposes to apply the encoding layer of SAE for dimensionality reduction of the input features, as illustrated in Figure 2.
The encoding layer consists of $f_1$ and $f_2$, which map the input $X$ to $h_2$, while the decoding layer consists of $g_1$ and $g_2$, which reconstruct $h_2$ to $X_R$. The calculation formulas are shown as follows:
$$h_2 = f_2(f_1(X)), \qquad X_R = g_2(g_1(h_2))$$
Using the encoding layer of the stacked auto-encoder (SAE) for dimensionality reduction offers several benefits. Firstly, it provides control over the dimensionality reduction features, as each layer is initialized with a reasonable value after separate training. Secondly, intrusion detection classification tasks typically involve a large number of neurons in the neural network and more trainable parameters, and using the encoding layer of SAE for dimensionality reduction can reduce the complexity of the problem and facilitate the task. Additionally, because SAE is trained to reconstruct the data features, the amount of information contained in the data features remains unchanged after dimensionality reduction.
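The greedy layer-by-layer training and the use of only the encoding layers for dimensionality reduction can be sketched as follows. This is a minimal NumPy sketch under stated assumptions, not the authors' implementation: sigmoid activations, a mean squared reconstruction loss, plain batch gradient descent, and the layer sizes, learning rate, and epoch count are all illustrative choices, and the function names are hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def train_autoencoder(X, hidden_dim, epochs=200, lr=0.5, seed=0):
    """Train one auto-encoder layer on X and return its encoder weights and bias."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W_enc = rng.normal(0.0, 0.1, size=(d, hidden_dim))
    b_enc = np.zeros(hidden_dim)
    W_dec = rng.normal(0.0, 0.1, size=(hidden_dim, d))
    b_dec = np.zeros(d)
    for _ in range(epochs):
        H = sigmoid(X @ W_enc + b_enc)        # encode
        X_rec = sigmoid(H @ W_dec + b_dec)    # decode (reconstruction)
        # Backpropagate the mean squared reconstruction error.
        d_out = (X_rec - X) * X_rec * (1.0 - X_rec) / n
        d_hid = (d_out @ W_dec.T) * H * (1.0 - H)
        W_dec -= lr * (H.T @ d_out)
        b_dec -= lr * d_out.sum(axis=0)
        W_enc -= lr * (X.T @ d_hid)
        b_enc -= lr * d_hid.sum(axis=0)
    return W_enc, b_enc


def fit_stacked_encoder(X, layer_dims):
    """Greedy layer-wise training: each layer is trained on the previous layer's codes."""
    layers, H = [], X
    for dim in layer_dims:
        W, b = train_autoencoder(H, dim)
        layers.append((W, b))
        H = sigmoid(H @ W + b)
    return layers


def encode(X, layers):
    """Apply only the encoding layers (f1, f2, ...) to obtain the reduced features."""
    H = X
    for W, b in layers:
        H = sigmoid(H @ W + b)
    return H


X = np.random.default_rng(1).random((40, 8))   # toy data already scaled to [0, 1]
layers = fit_stacked_encoder(X, [6, 3])         # two encoding layers: 8 -> 6 -> 3
print(encode(X, layers).shape)
```

The decoder weights are discarded after each layer is trained, since only the encoder is needed to produce the low-dimensional features passed to the later stages.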

Hybrid Sampling
Current class imbalance problems are generally solved through under-sampling or over-sampling methods, but existing sampling methods aim to obtain a balanced dataset in terms of quantity, without considering class overlap. Intrusion detection datasets exhibit a clear class overlap problem, which can be defined as the intersection between normal and abnormal samples. In the overlapping region, samples have similar feature attributes even though they belong to different categories. Because of this similarity, the model finds it difficult to classify samples in the overlapping region, leading to a decrease in classification accuracy.
In existing under-sampling methods, random under-sampling deletes random samples, which leads to the loss of useful information even if the dataset is balanced in terms of quantity. In existing over-sampling methods, random over-sampling randomly selects samples for over-sampling, and SMOTE over-sampling randomly selects minority class samples for over-sampling. These over-sampling methods have a common feature, which is to randomly select samples for over-sampling without considering the class information of neighboring samples. This can lead to overlap between over-sampled samples and samples from different classes, making it difficult for the model to classify them. Based on the characteristics of KNN, this paper proposes a new under-sampling method. This method is used to under-sample normal samples, while Borderline-SMOTE is used to over-sample abnormal samples, effectively solving the problems of class imbalance and class overlap.
The under-sampling method proposed in this paper is based on the KNN principle, which can identify and remove normal samples that overlap with abnormal samples. The KNN principle determines the class of a test sample based on the classes of its K-nearest neighbors. Specifically, this method checks whether a normal sample is a class-overlapping sample by evaluating the number of normal and abnormal samples among its K-nearest neighbors. If the number of normal samples is less than that of abnormal samples, the sample is considered a class-overlapping sample and is removed. Algorithm 1 presents the under-sampling method based on KNN.
The over-sampling method used in this paper is Borderline-SMOTE, which is an improvement upon the SMOTE algorithm. The SMOTE over-sampling principle involves obtaining the K-nearest neighbors of each minority class sample $x$ by calculating the distance between it and other minority class samples and then creating a new sample point $X_{new}$ by randomly selecting a sample $x_i$ from its K-nearest neighbors and interpolating between $x$ and $x_i$. The calculation formula is shown as follows:
$$X_{new} = x + \mathrm{rand}(0, 1) \times (x_i - x)$$
where $\mathrm{rand}(0, 1)$ is a random number between 0 and 1.
SMOTE adopts a global approach by randomly synthesizing new samples for all instances of the minority class, regardless of their distribution or proximity to other samples. However, this method can generate redundant instances and lose useful information, leading to reduced classification accuracy. This is because SMOTE does not consider the class distribution of the nearest neighbors when producing synthetic instances. As a result, the generated synthetic samples may overlap with samples of other classes, resulting in suboptimal classification results. To address this issue, we propose using Borderline-SMOTE for over-sampling. Borderline-SMOTE is a method for generating synthetic samples in the minority class located near the decision boundary between the minority and majority classes. It utilizes a targeted over-sampling approach, where synthetic samples are only created for the "borderline" minority class samples.
Unlike other over-sampling methods that randomly over-sample minority class samples, Borderline-SMOTE first divides the minority class samples into three categories: Safe, Danger, and Noise, as shown in Figure 3. Here, A represents the Safe sample, B represents the Danger sample, and C represents the Noise sample. The steps to determine the type are shown in Algorithm 2. Finally, only the samples labeled as Danger are over-sampled.

Algorithm 2. Borderline-SMOTE Categorization
Input: All abnormal samples in the training set
Output: Three types of samples
Procedure:
(1) All minority samples are labeled as samples to be tested.
(2) Calculate the distance between each sample to be tested and all other samples.

After the minority class samples are divided into three categories, Safe class samples can be easily learned and classified by the model because their neighbors are mostly from the same class. Noise class samples, on the other hand, have neighbors exclusively from the majority class and can be considered outliers; over-sampling these samples may increase noise and negatively affect the model performance. Danger class samples mostly have neighbors from the majority class, causing class overlap and making it difficult for the model to learn. Therefore, Borderline-SMOTE only over-samples Danger class samples to increase the number of minority samples in the overlapping region and improve the model's ability to distinguish minority samples in this area.
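The Safe, Danger, and Noise categorization and the interpolation step can be sketched in Python as follows. This is an illustration rather than the paper's Algorithm 2: the thresholds follow the commonly used Borderline-SMOTE rule (all neighbors from the majority class gives Noise, at least half gives Danger, otherwise Safe), which is assumed here to match the description above, and the function names, the Euclidean distance, and the neighbor count `m` are likewise assumptions.

```python
import math
import random


def categorize_minority(samples, labels, m=5, minority=1):
    """Label each minority sample as 'safe', 'danger' or 'noise' from its m nearest neighbors."""
    categories = {}
    for i, (x, y) in enumerate(zip(samples, labels)):
        if y != minority:
            continue
        # Indices of all other samples, ordered by Euclidean distance to x.
        order = sorted(
            (j for j in range(len(samples)) if j != i),
            key=lambda j: math.dist(x, samples[j]),
        )
        neighbors = order[:m]
        n_major = sum(1 for j in neighbors if labels[j] != minority)
        if n_major == len(neighbors):
            categories[i] = "noise"      # surrounded only by majority samples
        elif 2 * n_major >= len(neighbors):
            categories[i] = "danger"     # at least half of the neighbors are majority
        else:
            categories[i] = "safe"
    return categories


def synthesize(x, neighbor, rng=random):
    """SMOTE interpolation: x + rand(0, 1) * (neighbor - x)."""
    gap = rng.random()
    return [a + gap * (b - a) for a, b in zip(x, neighbor)]


samples = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5),
           (10, 10), (10, 11), (11, 10), (11, 11),
           (20, 20), (20, 21), (20.5, 20.5), (21, 21)]
labels = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1]
print(categorize_minority(samples, labels, m=3))
```

Only the samples labeled `danger` would then be passed to `synthesize` together with one of their minority-class neighbors.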
In intrusion detection datasets, normal samples are often more abundant than abnormal ones, which belong to the minority class. To address the issues of class imbalance and class overlap, a novel under-sampling method based on the characteristics of KNN is proposed in this paper. The method under-samples normal samples while using Borderline-SMOTE to over-sample abnormal samples. This hybrid sampling approach balances the dataset by reducing the number of normal samples and increasing the number of abnormal samples. Moreover, because the KNN under-sampling method only removes samples that belong to overlapping classes, and Borderline-SMOTE only over-samples Danger class samples, the combination of these two methods can effectively address class overlap issues.
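The KNN-based under-sampling rule described in this subsection can be sketched as follows. It is a minimal sketch under stated assumptions, not the paper's Algorithm 1: Euclidean distance, a brute-force neighbor search, the value of `k`, and the name `knn_undersample` are illustrative choices.

```python
import math


def knn_undersample(samples, labels, k=5, normal=0):
    """Return the indices to keep after removing class-overlapping normal samples."""
    keep = []
    for i, (x, y) in enumerate(zip(samples, labels)):
        if y != normal:
            keep.append(i)               # abnormal samples are never removed
            continue
        # Distances from x to every other sample, nearest first.
        dists = sorted(
            (math.dist(x, other), j)
            for j, other in enumerate(samples)
            if j != i
        )
        neighbor_labels = [labels[j] for _, j in dists[:k]]
        n_normal = sum(1 for lab in neighbor_labels if lab == normal)
        n_abnormal = len(neighbor_labels) - n_normal
        # Remove the sample only if normal neighbors are outnumbered by abnormal ones.
        if n_normal >= n_abnormal:
            keep.append(i)
    return keep


samples = [(0, 0), (0, 1), (1, 0), (1, 1), (5.5, 5.5),
           (5, 5), (5, 6), (6, 5), (6, 6)]
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1]     # 0 = normal, 1 = abnormal
print(knn_undersample(samples, labels, k=3))
```

In this toy example the normal sample at (5.5, 5.5) lies inside the abnormal cluster, so it is the only one removed.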

Improved Transformer
The Transformer was initially applied to natural language processing tasks. It abandoned the traditional recurrent neural network (RNN) and convolutional neural network (CNN) structures and relied solely on the attention mechanism to perform machine translation tasks, achieving excellent results. Compared to RNN-based sequential neural networks, the Transformer trains more efficiently. RNN training is iterative and sequential, resulting in particularly lengthy training times. In contrast, Transformer training is parallel, allowing all features to be processed simultaneously, which dramatically increases computational efficiency and reduces model training time. Therefore, in this paper, a Transformer is used as the base model to learn and extract features, thereby accelerating model training.
The Transformer is composed of an encoder and a decoder. Unlike sequence-to-sequence tasks such as machine translation, network intrusion detection does not require decoding, so only the encoder is needed to learn and extract features, which can be combined with softmax for classification. The Transformer classification structure is shown in Figure 4. The Transformer encoder consists of two sub-layers, a multi-head attention mechanism and a feed-forward neural network, with residual modules and normalization modules in each sub-layer. The multi-head attention mechanism produces multiple subspaces that attend to different aspects of the information, thus enhancing the model's performance.
The positional encoding of the original Transformer is defined as

$$PE_{(pos,2i)} = \sin\left(pos / 10000^{2i/d_x}\right)$$
$$PE_{(pos,2i+1)} = \cos\left(pos / 10000^{2i/d_x}\right)$$

In these equations, the variable pos represents the position of a word within a text sequence sample, which is a value ranging from 0 to the maximum sequence length. d_x denotes the dimension of the text encoding, while i is the index of an element within the encoding vector of a word, ranging from 0 to d_x. The positional encoding function has a period that varies from 2π to 10,000 × 2π, and each position in the encoding dimension is assigned a different combination of values of the sine and cosine functions with varying periods. This method generates a distinct positional pattern for each position, which allows the model to capture the relationships between positions and the temporal features of natural language. By incorporating such information, the model can better capture the semantic meaning and structure of the text data, ultimately enhancing its performance in various natural language processing tasks.
In the context of natural language processing, machine translation involves encoding text-based input features. However, in intrusion detection tasks, the input features are typically numeric and do not require encoding. Figure 5 illustrates the fundamental difference between these two tasks. Encoding numeric features may alter the inherent information in the data and affect the amount of information that can be extracted by the model. Consequently, the model's ability to learn effectively from the features can be diminished, thereby reducing the classification performance.
It is crucial to carefully consider whether positional encoding is necessary for a particular type of data before applying it in a machine learning model. When applying positional encoding to text samples, encoding the text is a necessary step.
In the positional encoding formula, the variable pos represents the index of each word in the text sequence, while i represents the index of each element in the encoded vector of the word. The positional encoding method embeds the relationship between the feature vectors of each word in the text sequence, which represents the relationship between each word. However, for intrusion detection samples, text encoding is not required, as intrusion detection samples mostly consist of numerical features. If the position is encoded according to the text samples, the variable pos in the positional encoding formula would represent the index of each sample, while i would represent the feature index of the sample. In this case, the positional encoding method would embed the relationship between each sample. However, in intrusion detection, the samples are independent of each other, and the embedded information would be irrelevant and invalid.
Although the samples in intrusion detection datasets are independent from each other, there is a correlation among their features. By embedding position codes that represent the position information of the features, the model can learn the dependency between feature positions and improve the learning performance. In other words, the model can capture the relationships between features in different positions, which can enhance its ability to learn and generalize.
Based on the above analysis, this paper proposes an improved position encoding method for the Transformer that sets pos to 1 and lets i represent the positional index of each feature within a segment of the sample. With pos held constant, only i varies, so the encoding generated by combining sine and cosine functions with different periods represents the positional information associated with i. As i represents the positional index of the feature, the proposed positional encoding captures the positional information of each feature, allowing the model to effectively learn the dependencies between feature positions. The improved positional encoding formulas are as follows:

$$PE_{(pos,2i)} = \sin\left(1 / 10000^{2i/d_x}\right) \tag{8}$$
$$PE_{(pos,2i+1)} = \cos\left(1 / 10000^{2i/d_x}\right)$$

In these equations, d_x represents the feature dimension of the sample and i represents the index of a feature within a segment of the sample, with i ranging from 0 to d_x.
The proposed position encoding method in this paper is more suitable for intrusion detection tasks than the previous approach. While the samples in intrusion detection datasets are independent, their features are often correlated. By enhancing the position encoding to include feature position information, the Transformer model can capture the positional dependencies between sample features, which increases the information available to the model and enhances its classification performance. The proposed method is therefore more appropriate for intrusion detection tasks, where feature correlations play an important role in identifying anomalies.
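The improved encoding can be illustrated with a minimal sketch. It assumes that the indices 2i and 2i+1 run over the d_x feature dimensions and that the encoding is added element-wise to the feature vector, as in the original Transformer; the function names `feature_position_encoding` and `embed_positions` are hypothetical and not taken from the paper.

```python
import math

def feature_position_encoding(d_x, pos=1.0):
    """Sinusoidal encoding over feature indices with pos fixed (pos = 1 in the improved method)."""
    pe = []
    for k in range(d_x):
        i = k // 2  # dimensions 2i and 2i+1 share the same frequency
        angle = pos / (10000 ** (2 * i / d_x))
        # even dimensions use sine, odd dimensions use cosine
        pe.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
    return pe

def embed_positions(sample, pe):
    """Assumption: the encoding is added element-wise to the numeric features."""
    return [x + p for x, p in zip(sample, pe)]

pe = feature_position_encoding(4)
encoded = embed_positions([0.0, 0.0, 0.0, 0.0], pe)
```

Because pos is constant, the same encoding vector is applied to every sample, so it carries information only about feature positions and not about the order of samples.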

Two-Stage Learning Strategy
This paper proposes a two-stage learning strategy that performs a rough binary classification before multi-class classification. It is particularly suitable for the intrusion detection dataset, which is highly imbalanced and requires the model to effectively learn from the limited number of negative samples. The first stage of binary classification enables the model to better distinguish between normal and abnormal samples, and the predicted results of this stage are then used as additional features for the second stage of multiclassification. This approach provides the multi-classification model with more information and helps it to better classify different types of attacks.
After the data processing strategy is completed, the proposed model can be divided into two parts according to the two-stage learning strategy. The first part involves embedding position information into balanced datasets by using an improved position encoding method. The Transformer encoder is then utilized to extract features and softmax is used to train the first-stage binary classification model. In the second part, the binary classification model trained in the first stage is used to predict the balanced dataset, and the predicted values are combined with the original features to form a new dataset. The improved position encoding method is again employed to embed position information, and the Transformer encoder is utilized to extract features. Finally, softmax is used to train the second-stage multi-class classification model.
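The hand-off between the two stages can be sketched as follows. This is only an illustration of how the first-stage prediction might be combined with the original features: `build_stage_two_inputs` and `toy_stage_one` are hypothetical names, the toy rule stands in for the trained Transformer binary classifier, and appending the prediction as one extra feature is an assumption about how the values are combined.

```python
def build_stage_two_inputs(features, stage_one_predict):
    """Append the stage-one binary prediction to each sample's original features."""
    augmented = []
    for x in features:
        prediction = stage_one_predict(x)  # rough binary prediction (normal vs. intrusion)
        augmented.append(list(x) + [prediction])
    return augmented

def toy_stage_one(x):
    """Stand-in for the trained first-stage model; not the paper's classifier."""
    return 1.0 if sum(x) > 1.0 else 0.0

samples = [[0.2, 0.1, 0.3], [0.9, 0.8, 0.4]]
stage_two_inputs = build_stage_two_inputs(samples, toy_stage_one)
```

The second-stage multi-class model would then be trained on `stage_two_inputs`, whose dimension is one larger than that of the original feature vectors.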

Loss Function
Focal loss [30] is a loss function that was originally proposed for object detection tasks with highly imbalanced datasets. In this paper, it is applied to handle the highly imbalanced intrusion detection dataset. The focal loss function adjusts the weights of positive and negative samples to enable the model to prioritize difficult-to-classify samples, typically those belonging to the minority class in imbalanced datasets. This helps to alleviate the issue of data imbalance and improves the model's ability to accurately classify both positive and negative samples. The application of the focal loss function enables the model to better handle the data imbalance and improve its classification ability.
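As a minimal sketch, the binary form of focal loss, FL(p_t) = −α_t (1 − p_t)^γ log(p_t), can be written as below. The defaults γ = 2 and α = 0.25 are the values commonly quoted from the focal loss paper [30] and are an assumption here; this paper does not state the values it used, and the function name `binary_focal_loss` is hypothetical.

```python
import math

def binary_focal_loss(y_true, p, gamma=2.0, alpha=0.25, eps=1e-7):
    """Focal loss for one sample; p is the predicted probability of the positive class."""
    p = min(max(p, eps), 1.0 - eps)              # clip to avoid log(0)
    p_t = p if y_true == 1 else 1.0 - p          # probability assigned to the true class
    alpha_t = alpha if y_true == 1 else 1.0 - alpha
    # (1 - p_t)^gamma shrinks the loss of well-classified (easy) samples
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

easy = binary_focal_loss(1, 0.9)   # confidently correct prediction
hard = binary_focal_loss(1, 0.1)   # confidently wrong prediction
```

With γ = 0 the modulating factor disappears and the expression reduces to α-weighted cross-entropy, which makes the role of γ easy to check in isolation.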

Experimental Environment and Datasets
The experimental hardware environment used in this paper was equipped with an Intel Core i5-10300H 64-bit processor, 16 GB of RAM, and a GTX 1660 Ti graphics card. The experimental platform employed the TensorFlow 2.2.0 and Keras 2.3.1 frameworks, and Python 3.7 was used for the implementation.
The NSL-KDD dataset [31] is a commonly used dataset in intrusion detection research. It is an improved version of the KDD-CUP-99 dataset, with duplicate and redundant records removed. The dataset contains both normal and anomalous network traffic and is divided into training and testing subsets. The training set consists of 125,973 samples, while the test set consists of 22,543 samples.
The distribution of normal and anomalous samples in the training set of the NSL-KDD dataset is highly imbalanced, with only a small proportion of samples being abnormal. The distribution is presented in Figure 6. To address this issue, the hybrid sampling strategy of KNN under-sampling and Borderline-SMOTE over-sampling is used to balance the dataset. After hybrid sampling, the ratio between normal and abnormal samples in the dataset was balanced at 1:1. The distribution of normal and abnormal samples after hybrid sampling is shown in Figure 7.

Assessment Indicators
There are generally four evaluation metrics for models: accuracy, precision, recall, and F1-score. Their formulas are shown below:

$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$$
$$Precision = \frac{TP}{TP + FP}$$
$$Recall = \frac{TP}{TP + FN}$$
$$F1 = \frac{2 \times Precision \times Recall}{Precision + Recall} \tag{13}$$

TP represents the number of true positives, FP represents the number of false positives, TN represents the number of true negatives, and FN represents the number of false negatives. However, because precision and recall often conflict with each other, accuracy and F1-score are used as the main evaluation criteria in this study. The larger the values of accuracy and F1-score, the better the performance of the model. Additionally, this study added model training time as a metric to evaluate the speed of model training.
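The four metrics can be computed directly from the confusion-matrix counts, as in the short sketch below. The function name `classification_metrics` is hypothetical and the counts are made-up illustrative values, not results from this paper.

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall, and F1-score from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Illustrative counts only
metrics = classification_metrics(tp=80, fp=10, tn=95, fn=15)
```

The guards against zero denominators are a convenience for degenerate cases, for example when a model predicts no positives at all.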

Experimental Results and Discussion
To fully validate the effectiveness of the proposed model, several experiments were designed in this study. In Section 4.3.1, different dimensionality reduction methods were compared to verify the superiority of SAE. Section 4.3.2 conducted experiments with different dimensionality reduction levels to analyze the impact of SAE on model accuracy. Section 4.3.3 tested different sampling methods to compare the hybrid sampling method proposed in this study with other sampling methods. In Section 4.3.4, performance analysis and comparative experiments were conducted to evaluate the intrusion detection capability of the proposed model in binary and multi-class classification, and it was compared with existing models. Additionally, to test the robustness of the model, further performance testing was conducted using the UNSW-NB15 dataset. Section 4.3.5 conducted three ablation experiments to accurately evaluate the effectiveness of each module in the proposed model.

Comparison Experiments of Different Dimensionality Reduction Methods
To validate the effectiveness and applicability of the feature dimensionality reduction method employed in this paper, we conducted comparative experiments under the same experimental conditions, comparing the method used in this paper (SAE) with existing dimensionality reduction methods such as PCA [20] and AE [19]. We then used the proposed overall model for classification and selected the highest accuracy and F1-score obtained with each dimensionality reduction method for comparison. The comparative results are shown in Table 1. According to Table 1, the proposed model using SAE for dimensionality reduction achieved an accuracy of 88.7% in binary classification with an F1-score of 88.2%, which is a 3.6% and 0.9% improvement over using PCA and AE, respectively. In multi-class classification, the accuracy reached 84.1% with an F1-score of 83.8%, which is a 2.1% and 0.6% improvement over using PCA and AE, respectively. The reasons are as follows. PCA relies mainly on variance when reducing data, and non-principal components with low variance may still contain important information on sample differences, so information contained in the features is lost during dimensionality reduction. AE trains a single-layer encoder directly, and excessive reduction in one step can result in a loss of information in the reduced features. In contrast, SAE uses greedy layer-wise training and initializes the parameters for each layer, which gives control over the reduced features. The layer-wise dimensionality reduction largely preserves the information contained in the data features, allowing the model to obtain the most information and achieve the highest accuracy and F1-score in classification.

Experiments on Different Dimensionality Reduction Levels
After numerical and standardization preprocessing, the NSL-KDD dataset had 122 feature dimensions. To prepare the dataset for training the Transformer encoder classifier, a stacked auto-encoder was employed to reduce this dimension further. The encoding layers of the SAE were used for dimensionality reduction, and feature subsets consisting of different numbers of features were input into the overall model for binary and multi-class classification. The impact of varying degrees of dimensionality reduction on model accuracy was evaluated, and the results are shown in Figure 8. From Figure 8, it is observed that the accuracy increased rapidly at first as the number of selected features increased and eventually stabilized. The highest accuracy was achieved when the number of features was approximately 35. The reason is that when the feature dimension is reduced to a minimum, the amount of information contained in the features is not sufficient for the model to learn effectively, resulting in the lowest accuracy. As the feature dimension increases, the amount of information contained in the features also increases, leading to an improvement in model accuracy. Once the feature dimension reaches 35, the accuracy levels off.

Comparison Experiments of Different Sampling Methods
To address the issues of dataset imbalance and class overlap, a hybrid sampling method (KNN-based under-sampling and Borderline-SMOTE over-sampling) was used in this paper to process the dataset. To verify the effectiveness of the proposed method, two comparative experiments were set up in this section using different sampling methods:

1. Under the same experimental conditions, five different single sampling methods were used to handle imbalanced datasets: random over-sampling, SMOTE, Borderline-SMOTE, random under-sampling, and KNN-based under-sampling. Binary and multi-class experiments were then conducted using the proposed model, and the results are shown in Table 2.

2. Under the same experimental conditions, the above five sampling methods were combined with one another into hybrid sampling approaches to handle imbalanced datasets. Binary and multi-class classification experiments were then performed using the proposed model, and the results are shown in Table 3.

According to Table 2, Borderline-SMOTE achieved an accuracy of 84.5% and an F1-score of 84.1% in binary classification. In comparison to random over-sampling and SMOTE, Borderline-SMOTE improved the accuracy by 2.6% and 1.2%, respectively. In multi-class classification, Borderline-SMOTE achieved an accuracy of 82.4% and an F1-score of 81.8%, improving the accuracy by 1.6% and 0.8% in comparison to random over-sampling and SMOTE, respectively. Among the under-sampling methods, KNN-based under-sampling achieved an accuracy of 83% and an F1-score of 82.5% in binary classification, improving the accuracy by 2.6% compared to random under-sampling. In multi-class classification, KNN-based under-sampling achieved an accuracy of 81.9% and an F1-score of 81.4%, with an accuracy improvement of 1.2% in comparison to random under-sampling. The reason is that, among the over-sampling methods, random over-sampling and SMOTE sample the data points randomly, while Borderline-SMOTE over-samples only the Danger class samples. Similarly, among the under-sampling methods, KNN-based under-sampling selectively removes overlapping samples that are prone to cause misclassification by the model. Both approaches balance the dataset while also alleviating the class overlap problem, thereby enhancing the separability of the data and improving the model's performance.
According to Table 3, the proposed hybrid sampling algorithm (KNN-based under-sampling and Borderline-SMOTE over-sampling) outperformed the other combinations of sampling algorithms, with the highest accuracy and F1-score on both binary and multi-class classification tasks. Analysis indicates that the combination of these two techniques not only balances the dataset in terms of quantity but also mitigates class overlap to the greatest extent, because the two techniques complement each other's effects on overlapping samples. This improves the separability of the dataset and thus enhances the model's detection ability.
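To make the notion of Danger samples concrete, the sketch below flags minority samples using the criterion from the original Borderline-SMOTE formulation: a minority sample is in Danger when at least half, but not all, of its m nearest neighbours belong to the majority class. The function names, the value m = 3, and the toy points are hypothetical; this is not the paper's implementation and does not cover its KNN-based under-sampling step.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def danger_samples(minority, majority, m=5):
    """Indices of minority samples with m/2 <= (majority neighbours) < m."""
    labelled = [(p, 0) for p in minority] + [(p, 1) for p in majority]
    danger = []
    for idx, x in enumerate(minority):
        # distances to every other sample; position idx in `labelled` is x itself
        others = [(euclidean(x, p), is_major)
                  for j, (p, is_major) in enumerate(labelled) if j != idx]
        others.sort(key=lambda t: t[0])
        n_major = sum(is_major for _, is_major in others[:m])
        if m / 2 <= n_major < m:
            danger.append(idx)
    return danger

# Toy data: a clean minority cluster, two borderline points, and one isolated point
minority = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
            (5.0, 5.0), (5.0, 5.3), (9.0, 9.0)]
majority = [(5.1, 5.0), (5.0, 5.1), (9.1, 9.0), (9.0, 9.1), (9.1, 9.1)]
danger = danger_samples(minority, majority, m=3)
```

In this toy set, the points inside the clean cluster have only minority neighbours and are treated as safe, while the isolated point surrounded entirely by majority samples is treated as noise; only the two borderline points would be used as seeds for synthetic over-sampling.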

Performance Analysis and Comparison Experiments
To validate the detection capability of the model, this study first selected recently proposed intrusion detection models that have shown good performance on the NSL-KDD dataset, including CNN [32], CNN-LSTM [32], CBA-CLSVE [33], SSC-OCSVM [22], CNN-GRU [34], and FCNN-SE [35]. Additionally, several intrusion detection models using a Transformer as the base model were selected for comparison, namely RTIDS [26], VIT [27], and CNN-Transformer [28]. To ensure the validity of the experiments, the NSL-KDD dataset was not re-partitioned; instead, under the same experimental conditions, the official training and testing sets specified by NSL-KDD were used for the binary and multi-class comparison experiments. The binary classification results, compared with other models, are presented in Table 4. In Table 4, the proposed model outperforms the other state-of-the-art models in both accuracy and F1-score, achieving an accuracy of 88.7% and an F1-score of 88.2%. These results indicate the effectiveness of the proposed model and suggest its potential for practical applications. Furthermore, the proposed model also trained faster than the other models, indicating its efficiency and scalability. Table 5 presents the results of the multi-class classification experiment and compares them with other models. In Table 5, the proposed model outperforms the other models in the multi-class classification task, achieving an accuracy of 84.1% and an F1-score of 83.8%. This accuracy is higher than that of the other models by 5.3%, 4.4%, 0.9%, 2.6%, 4.1%, 1.1%, 0.6%, 1.3%, and 1%, respectively. Furthermore, the proposed model trains faster than the other models. These results suggest that the proposed model offers a clear improvement over existing models for the multi-class classification task.
To further demonstrate the robustness of our model, we conducted additional tests on the UNSW-NB15 dataset and compared it with the models mentioned earlier under the same experimental conditions (by splitting the UNSW-NB15 dataset into training and testing sets with a ratio of 7:3). The results of our multi-class classification experiment on the testing set are shown in Table 6.
The experimental results show that our model has a better detection capability on the UNSW-NB15 dataset compared to other models, indicating the robustness of our proposed model. In the multi-classification task, our model achieves an accuracy of 87.5% and an F1-score of 87.3%, outperforming other models by 4.6%, 4.9%, 0.8%, 2.6%, 3.2%, 1.1%, 0.2%, 1.9%, and 0.5% in accuracy, respectively. Additionally, the training speed of our model is faster. The three experiments conducted above demonstrate that the proposed model outperforms existing intrusion detection models and has a faster model training speed, as well as a certain level of robustness. Moreover, compared to recently proposed Transformer-based models, the proposed model in this paper has higher accuracy and a faster model training speed. These results confirm the effectiveness and efficiency of the proposed model in detecting network intrusions, indicating its potential application in real-world scenarios.

Ablation Experiment
In order to verify the effect of each module in the proposed model on the overall performance, we conducted three ablation experiments on the model:

1. We conducted ablation experiments to assess the impact of the proposed improved position encoding on the classification ability of the Transformer model. The results are presented in Table 7.

2. We conducted ablation studies on the binary classification model proposed in this paper, including the data processing strategy and the first stage of learning, in order to verify the effectiveness of each module in the model. The results of the experimental analysis are listed in Table 8.

3. We conducted ablation experiments on the multi-classification model. There are three parts of the model that have an impact on the classification effect, namely improved positional encoding, hybrid sampling, and a two-stage learning strategy. The effect of each module was verified by comparing the accuracy and F1-score before and after the addition of the module. The results of the experimental analysis are shown in Table 9.

The results presented in Table 7 demonstrate a significant improvement in the accuracy and F1-score of the Transformer model after the proposed positional encoding method was implemented, as compared to its performance before the improvement. This improvement was observed in both binary and multi-classification tasks, indicating the effectiveness of the proposed method across different types of classification problems. These results highlight the importance of properly encoding positional information in the Transformer model and suggest that the proposed method can be a valuable addition to the existing approaches for enhancing the performance of Transformer-based models.

Table 8 presents the results of our experiments examining the impact of each module on the performance of the binary classification model. As can be seen from the results, the removal of any module leads to a significant decrease in the accuracy and F1-score of the model, highlighting the critical role that each module plays in achieving high performance. Moreover, the synergy between all modules is observed when all modules are present, resulting in the highest accuracy and F1-score. These findings underscore the importance of a holistic design approach for the binary classification model, where each component is meticulously designed and optimized to achieve optimal performance.
In summary, the results of our experiments indicate that the performance of the binary classification model is highly dependent on the effective integration and cooperation of all its constituent modules.
In Table 9, it is evident that the absence of any module in the total multi-classification model leads to a significant decrease in both accuracy and F1-score. However, the model exhibits the highest levels of accuracy and F1-score when all modules are present. These results emphasize the critical role of each module in the overall performance of the multiclassification model and highlight the importance of a comprehensive design approach that optimizes the interaction of all modules to achieve optimal performance.

Discussion
To address the three issues of slow model training, imbalanced datasets, and poor multi-class detection performance in intrusion detection models, this paper proposes an improved Transformer model. Through multiple experiments, we have demonstrated the effectiveness of the proposed model in addressing these issues, not only improving the model's training speed but also enhancing its classification detection ability. Despite the notable improvements in detection accuracy attained by the proposed model, there remain certain limitations that warrant attention. Specifically, the multi-class detection capability of the model, although enhanced, still requires further development. Future research endeavors may include the exploration of alternative strategies to enhance the binary classification performance of the proposed model. This may involve investigating diverse model architectures, such as incorporating attention mechanisms or exploring novel loss functions that are better equipped to capture the unique characteristics of the dataset. Additionally, efforts could be made to improve the interpretability of the model by analyzing the attention weights of the Transformer model and identifying significant features for intrusion detection. These approaches may culminate in further improvements in the overall multi-classification detection performance and can have far-reaching implications beyond the scope of intrusion detection. Overall, this study contributes to the knowledge system by proposing a new approach and demonstrating its effectiveness in