A hybrid parallel deep learning model for efficient intrusion detection based on metric learning

With the rapid development of network technology, a variety of new malicious attacks appear while attack methods are constantly updated. As attackers exploit the vulnerabilities of popular third-party components to invade target websites, further improving the classification accuracy of malicious network traffic is key to improving the performance of abnormal traffic detection. Existing intrusion detection systems may suffer from incomplete feature extraction and low classification accuracy. Thus, this paper proposes an efficient hybrid parallel deep learning model (HPM) for intrusion detection based on metric learning. First, HPM constructs two parallel CNN architectures and fuses the spatial features obtained through full convolution. Secondly, the temporal information of the fused features is parsed separately using two parallel LSTMs. Finally, the extracted spatial-temporal features are fed into the CosMargin classifier for classification after global convolution and global pooling. In addition, this paper proposes an improved traffic feature extraction method, which not only reduces redundant features but also speeds up the convergence of the network. In our experiments, HPM achieves 99% detection accuracy for each malicious class, a 5%–10% improvement over other models, which demonstrates the superiority of the proposed model.


Introduction
With the rapid development of technology and the widespread use of the Internet, the network penetrates every corner of human society, and network security has attracted more and more attention. According to the Visual Networking Index (VNI) published by Cisco (2020), Internet-connected devices are expected to reach 28.5 billion by 2022, so the cost of network maintenance will increase significantly. In terms of industrial control system security, the operation and production of industrial machine tools around the world are closely tied to the network. Once these systems are attacked, a large amount of information about critical cyber assets and networked control systems will be leaked, which will undoubtedly pose hidden dangers to the security of the national industrial sector.
To secure user data, a large number of schemes have been applied in various industries. Zhang et al. (2020) proposed a hierarchically structured intrusion detection model for wireless sensor networks, which uses a kernel extreme learning machine classification algorithm that follows the Mercer property to synthesise multiple kernel functions. Han et al. (2020) implemented a new attribute encryption scheme combining revocation and white-box tracing. Li et al. (2022) proposed a blockchain-based asymmetric encryption mechanism to achieve conditional anonymity and data privacy. Xiao et al. (2020) proposed a multi-keyword searchable encryption scheme to achieve efficient lookup in encrypted data. However, each of the aforementioned schemes suits only a single scenario, whereas real networks face a variety of malicious attacks; intrusion detection can therefore serve as an effective defence against many kinds of attacks.
So far, to efficiently protect the operation of network systems, machine learning (ML) and deep learning (DL) have been applied to network intrusion detection systems (NIDS) by many researchers. However, several flawed assumptions are inherent in the traditional ML/DL algorithms used in NIDS. One of the main flaws is that both are highly dependent on feature design and suffer from high false alarm rates, giving poor performance in practical applications (Rudd et al., 2016). Researchers have made numerous efforts to improve the identification and classification performance of malicious traffic (Jiang et al., 2020; Li et al., 2019; Li, Han, Yin et al., 2021; Wei & Wang, 2021; Xiao et al., 2019), but they have concentrated too much on the overall accuracy of malicious network flow classification and neglected the per-class accuracy of imbalanced samples.
Despite constant iteration by many researchers, imbalanced data feature learning is still a long-standing problem in machine learning and deep learning, where excessive skew in the data may lead models to favour the classes with high data volumes (Uhm & Pak, 2021). For example, DDoS and PortScan attacks involve huge numbers of packets, while the APT attack (Zhao et al., 2015) is a comprehensive attack that integrates multiple common attack modes and is highly covert. Thus, the two types of attack samples differ greatly in collection difficulty and dataset size. Generally, researchers extract specific features from network traffic with the aim of facilitating the discrimination of benign and malicious traffic. However, there is still no universal standard for effective feature selection. For example, Anderson and McGrew (2016, october) and Lopez-Martin et al. (2017) extracted hundreds-dimensional and six-dimensional features respectively for malicious traffic discrimination; the former achieved an accuracy of 95.7% and the latter 96%. However, in detecting imbalanced malicious traffic, manually designed features struggle to achieve a good detection rate for classes with low data volumes.
A. Research contribution
Inspired by the issues mentioned above, it is necessary to design an efficient intrusion detection scheme that combines manual feature design with metric learning. An improved hybrid parallel deep learning model (HPM) based on metric learning is proposed to address these issues.
The main contributions of this paper can be summarised as follows: (1) We propose a traffic original-feature extraction algorithm to construct unique datasets and perform the corresponding training for network intrusion detection based on real network traffic. Using this method, we constructed two datasets from public network traffic data sources and conducted two types of experiments on them. (2) We propose a hybrid parallel deep learning model (HPM) for efficient intrusion detection based on metric learning. A new classifier, named CosMargin, is introduced to replace the traditional SoftMax classifier (Liu & Zhang, 2020) in HPM, which utilises the original features of network traffic to enhance the classification performance on imbalanced data. This classifier uses a subtractive angular margin loss to learn the features of the original traffic and classify anomalous traffic in cosine space.
(3) For HPM, we put forward two improved versions and perform an experimental comparison to verify the reliability of our proposed model. We also empirically demonstrate the effect of traffic length on performance and further analyse the better parameter settings behind our method.

B. Organisation
The remainder of this paper is organised as follows. A brief review of related work is presented in Section 2. In Section 3, we describe the dataset and data preprocessing algorithm along with the proposed intrusion detection system in detail. In Section 4, we first introduce the experimental environment and parameters, and then analyse the results from the traditional models and the model embedded with metric learning. Finally, we conclude and outline future directions in Section 5.

Related work
Anomaly detection of network traffic is an important tool for maintaining cybersecurity in the age of big data. By now, a large number of researchers have applied machine learning to traffic detection. The network intrusion detection system (NIDS) was first proposed by Co (1980) in 1980 to achieve high-efficiency defence against various types of network attacks. He et al. (2020) proposed a hash chain-based OTP algorithm that avoids the time synchronisation problem. A backpropagation neural network and genetic algorithm-based network intrusion detection system with wide adaptability to real environmental data has also been proposed (n.d.). Li, Tian et al. (2021) proposed a blockchain-based generic shared auditing mechanism that enables data users to share their auditing procedures with others. Zhang et al. (2021) put forward a novel blockchain-based privacy-preserving framework for online social networks, which combines blockchain with public-key cryptography techniques to achieve data protection. To further increase security in access control and authentication, Xiao et al. (2021) proposed a scheme with PUF and compact PUF identity authentication models to implement secure mutual authentication between tag and server. Pan et al. (2015) and Mohamad Tahir et al. (2015) evaluated the performance of combining K-means and SVM in detecting attack flows. Although these two works used the same approach, the method is applied differently: Pan built a comprehensive intrusion detection system that monitors user activity in real time on multiple target systems connected through Ethernet. Cui et al. (2020) proposed a novel shared-data auditing scheme that combines existing certificateless signing and encryption methods with the fog architecture, reducing a large number of time-consuming operations. Fuqun (2015) proposed a Least Squares Support Vector Machine (LSSVM) for network intrusion detection.
Compared with the traditional LSSVM method, the proposed LSSVM can find better individual parameters, but the improved algorithm has some defects, such as poor sparsity and an easily ill-conditioned kernel function. Malialis et al. (2015) put forward an approach that incorporates task decomposition, team rewards, and a form of reward shaping called difference rewards, providing a decentralised coordinated response to the DDoS problem. He et al. (2021) proposed a lookup algorithm for NDN interest forwarding, which selects feature prefixes instead of lengths to filter out interest packets. Ning et al. (2021) proposed a novel network coding approach, which can rapidly identify and isolate malicious nodes from the network as early as possible. Xie et al. (2021) developed an Inference-Based Adaptive Attack Tolerance (IBAAT) system that formulates cyber attack and defence as a dynamic partially observable Markov process based on dynamic Bayesian inference. Chadza et al. (2020) used hidden Markov models for feature learning in network intrusion detection and analysed the detection and prediction accuracy of a series of training and initialisation algorithms. However, Chadza focused on the multi-state transitions of Markov chains without considering the association with other packet states, so it remains difficult to obtain the optimal model parameters.
Traditional machine learning has achieved excellent results in network anomaly behaviour detection. However, with the sustained growth of traffic feature dimensionality and the presence of noisy data, traditional machine learning-based detection methods are confronted with problems such as low accuracy of feature extraction and poor robustness, which degrade traffic attack detection performance to some extent.
Deep learning (Liu, Wang, Liu et al., 2017; Zhang et al., 2018) is a branch of machine learning. It trains a target feature model by continuously learning and converging, converting sequences of a linear or non-linear model into specific data structures. Tian et al. (2020) proposed an intrusion detection method based on an improved deep belief network. By introducing a combined sparse penalty term into the likelihood of the forward unsupervised pre-training stage of the model, the sparse state of the hidden-layer neurons is induced to avoid feature homogenisation and network overfitting, and the detection performance is improved. However, the parameter selection of the model is uncertain, which affects the detection accuracy to a certain degree. Another work proposed a novel network intrusion detection model utilising convolutional neural networks, which uses a CNN to select traffic features from the raw dataset automatically. However, this scheme uses a single CNN model, which cannot comprehensively and effectively obtain the deep spatial features of the traffic dataset, and the experiments indicate that its FAR and DR are higher than those of traditional machine learning algorithms. Roy and Cheung (2018) presented a novel deep learning technique for detecting attacks within IoT networks using a bi-directional Long Short-Term Memory Recurrent Neural Network. While the bi-directional LSTM used by Roy achieved 95% accuracy for traffic classification, it neglects the spatial characteristics of the traffic. Shone et al. (2018) proposed a non-symmetric deep autoencoder (NDAE) for unsupervised feature learning. Different from a typical autoencoder, the approach provides non-symmetric data dimensionality reduction and combines both deep and shallow learning techniques to reduce analytical overheads.
Although the unsupervised learning approach helps reduce the overhead of data labelling, constructing deep autoencoders by stacking encoders incurs a large computational overhead. Ni et al. (2017) and Alrawashdeh and Purdy (2016, december) proposed an anomaly detection method based on a deep Boltzmann machine, which can improve the detection rate of traffic attacks by extracting their essential features through learning from high-dimensional traffic data. However, the proposed method has poor feature extraction ability, and its detection system is easily affected by noise. Yin et al. (2017) constructed a recurrent neural network (RNN)-IDS that exceeds the performance of machine learning in both binary and multi-class classification. Bi et al. (2020) proposed a design scheme to calculate the K maximum-probability attack paths for a given set of target nodes, which effectively reduces the computational cost. Zhang et al. (2016, december) parsed binary executable files into opcode sequences and transformed them into images, using CNN models to determine the security of binary executable files. Liu et al. (2019) presented an adaptive mobile traffic classification method that relies on the data distribution instead of the classification error rate to classify network flows. Zeng et al. (2019) proposed an anomaly detection method based on CNN, LSTM, and stacked autoencoders (SAE), which can extract traffic features with high accuracy by learning the traffic data layer by layer; however, the robustness of the extracted features is poor, and the detection accuracy decreases when the measured data is corrupted. Zhang et al. (2019) proposed a parallel cross convolutional neural network, which uses feature fusion to improve the detection results on highly imbalanced abnormal data.
Currently, many classification algorithms rely on a distance metric over the data. To deal with the features of different datasets, different measurement functions must be designed, which consumes considerable time and effort. The prime contribution of metric learning is to perform similarity measurement for different data (e.g. different images) while preserving the distance structure of the original data.
In industrial Ethernet, real-time performance, scalability, and security are important indicators of whether a system can operate stably, allowing industrial control systems to detect network attacks or attack attempts as soon as possible, pre-empt further attack activities, and keep damage to a minimum. Our proposed HPM model achieves more than 99.8% on both tested datasets, meeting security requirements in industry. In terms of real-time performance, HPM takes only 6 s per run, while being able to handle thousands of traffic records at a time with the metric learning of the cosine margin function. In terms of scalability, HPM currently has no capacity to detect novel attacks, which is the next step in our future work.

Methodology
In this section, we introduce the dataset, the data preprocessing algorithm, and the concept of metric learning in Sections 3.1 and 3.2, and present the proposed intrusion detection system in Section 3.3.

Data preprocessing
Deep learning has many open-source datasets in the field of intrusion detection, but most of them have the following disadvantages. (1) These datasets lack category diversity and sample completeness, and a few of them are quite old; the KDD99 dataset, for instance, cannot cover current network attacks. (2) Many of them contain only header information but lack payload information, which cannot reflect attack trends well. Therefore, we choose ISCX 2012 (Shiravi et al., 2012) and CIC-IDS 2017 (Sharafaldin et al., 2018, january), which include complete traffic information, for our experiments.
Conventional IDPSs can be applied to extract a variety of features from these headers. For example, CICFlowMeter can extract more than 80 network traffic features. There are two major drawbacks associated with methods that extract features from packet headers based on fixed rules. One deficiency is that they use only the information stored in headers and ignore the payload. The other drawback is that if a single packet is considered as the detection object, the correlation between packets is ignored. Therefore, we propose a novel dataset construction method for ISCX 2012 and CIC-IDS 2017 and consider the data flow as the detection object. To reduce computational complexity and compact the data distribution, the following three rules are applied to sample the data flows:

A. Flow Splitting
Based on the five-tuple information, the labels provided by the dataset were compared with the original PCAP file, and all malicious traffic was extracted separately and stored in CSV files named by date.
Unlike the method of Abdulhammed et al. (2019), our algorithm splits the continuous PCAP traffic into multiple discrete traffic units and extracts from each discrete unit the information belonging to each flow. Since the length of a data flow is not fixed, the first M packets of a data flow are used to represent it as a whole; likewise, since the length of a packet is not fixed, the first N bytes of a packet are used to represent it as a whole. Thus, we first determine whether a packet belongs to malicious traffic; if so, we store its five-tuple, original data, and label in the corresponding position, otherwise it is not processed. All malicious traffic is stored in the CSV file once the data traversal is complete.
For the extracted abnormal traffic, we divide it into data flows according to the five-tuple information. Only M packets are extracted for each data flow; for a data flow that exceeds M packets, we regard the additional packets as a new data flow. The five-tuple does not include timestamp information as a flow feature, yet the packets in a flow span a large period and contain much valid information, so it is not reasonable to ignore the timestamp information. In our research, we usually set M to 5 and N to 96.
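The flow-splitting rule described above can be sketched in a few lines of Python. This is only an illustrative sketch with toy data: the function and variable names are our own, not the paper's implementation.

```python
from collections import defaultdict

M = 5   # packets kept per flow unit (the paper's usual setting)
N = 96  # bytes kept per packet (the paper's usual setting)

def split_into_flows(packets, m=M):
    """Group packets by their five-tuple and cut each group into
    flow units of at most m packets; packets beyond m start a new unit."""
    groups = defaultdict(list)
    for five_tuple, payload in packets:
        groups[five_tuple].append(payload)
    flows = []
    for five_tuple, plist in groups.items():
        for start in range(0, len(plist), m):
            flows.append((five_tuple, plist[start:start + m]))
    return flows

# toy packets: (five_tuple, raw_bytes); 7 packets of one flow
pkts = [(("10.0.0.1", "10.0.0.2", 1234, 80, "TCP"), b"\x01" * 120)] * 7
flows = split_into_flows(pkts)
print(len(flows))        # 7 packets with M = 5 -> 2 flow units
print(len(flows[0][1]))  # the first unit holds 5 packets
```

With M = 5, a flow of 7 packets yields one full unit of 5 packets and a second unit holding the remaining 2.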

B. Traffic Clean
The purpose of this step is to erase interfering information in network traffic packets. We optimise the data by removing duplicate and empty files, as they contribute nothing and only interfere with classification performance.

C. Traffic Tailor
If the length of a cropped packet is less than N bytes, we pad it with zeros up to N bytes. If the length of a cropped packet is greater than N bytes, we truncate the extra part to keep only N bytes. If the number of packets in a network flow is less than 5, we fill the end of the flow with the data of the existing packets up to 480 bytes, rather than introducing zero elements, which makes the data distribution more compact and reduces redundancy.
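The tailoring rule can be sketched as follows. This is a hedged illustration: `tailor` and its arguments are our own names, assuming the paper's M = 5 and N = 96 so that a flow unit is 480 bytes.

```python
N_BYTES = 96
M_PKTS = 5

def tailor(flow_packets, n=N_BYTES, m=M_PKTS):
    """Fix each packet to n bytes (zero-pad or truncate); then, if the
    flow has fewer than m packets, repeat existing packets instead of
    appending all-zero ones, keeping the data distribution compact."""
    fixed = [(p + b"\x00" * n)[:n] for p in flow_packets]
    i = 0
    while len(fixed) < m:
        fixed.append(fixed[i % len(flow_packets)])  # cycle existing packets
        i += 1
    return b"".join(fixed)  # m * n = 480 bytes for the paper's settings

sample = tailor([b"\xab" * 100, b"\xcd" * 10])  # one long, one short packet
print(len(sample))  # 480
```

The long packet is truncated to 96 bytes, the short one is zero-padded, and the two real packets are cycled to fill the five-packet quota.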
After the above steps, a data flow is sampled according to these rules so that its main temporal and spatial structures are kept unchanged. The original PCAP file is compressed from 60 GB to an acceptable data size. The data preprocessing steps are illustrated in Figure 1.

Metric learning
Softmax is currently the most widely used classification loss function in deep learning. Its formula can be expressed as:
$$L_s = -\frac{1}{m}\sum_{i=1}^{m}\log\frac{e^{W_{y_i}^{T}x_i+b_{y_i}}}{\sum_{j=1}^{n}e^{W_j^{T}x_i+b_j}} \tag{1}$$
where $x_i$ denotes the deep feature of the $i$-th sample, which belongs to class $y_i$; $m$ denotes the batch size; $n$ denotes the number of classes in the dataset; $W_j$ and $b_j$ denote the weight and bias of the $j$-th class respectively; and $e^{W_j^{T}x_i+b_j}$ denotes the output of the fully connected layer. As the loss $L_s$ decreases, backpropagation updates the parameters and increases the proportion of $e^{W_{y_i}^{T}x_i+b_{y_i}}$, so that more samples of each class fall within the corresponding decision boundary.
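For illustration, the standard softmax cross-entropy loss can be sketched in NumPy. This is a minimal sketch; the function name and shapes are our own conventions, not the paper's code.

```python
import numpy as np

def softmax_loss(X, y, W, b):
    """Softmax cross-entropy over a batch.
    X: (m, d) deep features, y: (m,) class labels,
    W: (d, n) class weights, b: (n,) biases."""
    logits = X @ W + b                           # W_j^T x_i + b_j
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(logits)
    probs = exp / exp.sum(axis=1, keepdims=True)
    m = X.shape[0]
    return -np.log(probs[np.arange(m), y]).mean()

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W = rng.normal(size=(8, 3))
loss = softmax_loss(X, np.array([0, 1, 2, 0]), W, np.zeros(3))
print(loss > 0)  # True: cross-entropy is non-negative
```

With all-zero features and weights every class gets probability 1/n, so the loss reduces to log(n), a handy sanity check.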
However, the Softmax function mainly considers whether samples can be correctly classified, but lacks constraints on intra-class and inter-class distances; i.e. Softmax encourages feature separation between different classes, but not fine-grained feature segmentation, which leaves the model poorly able to handle large intra-class diversity and high inter-class similarity. At CVPR 2017, 2018, and 2019, SphereFace, CosFace, and ArcFace (Liu, Wen, Yu et al., 2017, june; Wang et al., 2018, june; Deng et al., 2019, june) were respectively proposed to maximise the classification boundary in angular and cosine space and to adjust the classification margin to prevent overfitting.
To remove the influence caused by the radial difference of features, Softmax needs to be converted into a margin-loss formulation. First, we set the bias $b_j = 0$ and use the cosine theorem to express the inner product of weights and inputs as $W_j^{T}x_i = \|W_j\|\,\|x_i\|\cos\theta_j$, where $\theta_j$ is the angle between the weight $W_j$ and the input $x_i$. Following Wang et al., $W_j$ is fixed by $l_2$ normalisation so that $\|W_j\| = 1$, and the embedding feature $x_i$ is also regularised by $l_2$ normalisation so that $\|x_i\| = 1$. Since this norm is too small for the training loss, i.e. the value of the softmax becomes relatively small, we re-scale $\|x_i\|$ to $s$.
In the end, we introduce a cosine margin constraint, so that the category to which the current sample belongs still belongs to that category after subtracting $m$, namely $\cos(\theta_1) - m > \cos(\theta_2)$. Finally, we obtain the cosine margin function used in this paper:
$$L_{cos} = -\frac{1}{m}\sum_{i=1}^{m}\log\frac{e^{s(\cos\theta_{y_i}-m)}}{e^{s(\cos\theta_{y_i}-m)}+\sum_{j\neq y_i}e^{s\cos\theta_j}} \tag{2}$$
Figure 2 illustrates the classification effects on 10 categories of network traffic under the Softmax and CosMargin loss functions. In the hypersphere, when only softmax is used for feature classification, although each class can be distinguished, the space between classes is too small, some features fall into other classes, and the classification effect is not obvious. With CosMargin, features are distinguished by angle and the angular distance between features can be better controlled, i.e. high cohesion and low coupling.
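A subtractive cosine-margin (CosFace-style) loss of this kind can be sketched in NumPy as follows. The scale s and margin m values below are illustrative assumptions, not the paper's tuned settings, and the names are our own.

```python
import numpy as np

def cosmargin_loss(X, y, W, s=30.0, m=0.35):
    """CosFace-style loss: l2-normalise features and class weights,
    subtract margin m from the target-class cosine, re-scale by s,
    then apply softmax cross-entropy."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Wn = W / np.linalg.norm(W, axis=0, keepdims=True)
    cos = Xn @ Wn                       # cos(theta_j) for every class
    idx = np.arange(len(y))
    cos[idx, y] -= m                    # subtractive cosine margin
    logits = s * cos
    logits -= logits.max(axis=1, keepdims=True)
    exp = np.exp(logits)
    probs = exp / exp.sum(axis=1, keepdims=True)
    return -np.log(probs[idx, y]).mean()

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 8))
W = rng.normal(size=(8, 4))
y = np.array([0, 1, 2, 3, 0, 1])
print(cosmargin_loss(X, y, W))
```

Because the margin lowers only the target-class logit, the loss with m > 0 is strictly larger than with m = 0 on the same batch, which is exactly the extra pressure that tightens intra-class features.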

System overview
Abnormal traffic differs from benign traffic: it generally manifests as explosive growth, within a short period, of parameters such as packet length, packet count, and TCP synchronisation. For example, in the CIC-IDS 2017 dataset, the Bwd Packet Length Std feature of Distributed Denial of Service (DDoS) attacks differs greatly from that of the other traffic categories: in DDoS this characteristic takes relatively large values, while for benign traffic and other abnormal traffic it is mostly zero. Therefore, the distinguishing features differ across categories of abnormal traffic. A novel deep learning model for traffic classification is designed, namely HPM. HPM utilises a hybrid parallel structure as the training module instead of a single model, and a double LSTM as the temporal feature extraction module. Such a combination can filter the beneficial local and global features from the processed data flow and estimate the future behaviour and occurrence probability of the data flow from a set of time series. A discriminative classification function, CosMargin, is introduced, which indirectly imposes a margin boundary on the feature layer by modifying the Softmax formula and discriminates benign/malicious traffic accurately. Through continuous feature learning in HPM, benign traffic can be accurately filtered out by the model.
Our model is mainly composed of a spatial with time-series feature prediction module and an attack classification module, as shown in Figure 3. Among them, the spatial with time-series feature prediction module merges features extracted in different dimensions, in which the top branch uses a fully convolutional neural network (FCN) and the bottom branch uses a traditional convolutional neural network. Besides, the attack classification module is composed of a CosMargin function. The introduction of the CosMargin function aims to map the features in the Euclidean space into the feature space of the hypersphere, improve the feature discrimination ability of the model and meet the NIDS's prediction requirements for network security.
The proposed model is composed of three steps. Firstly, data preprocessing is performed on the input raw traffic file, and fixed-length network traffic features are extracted. Subsequently, the data flow features are passed into the spatial with time-series feature prediction module. After processing by the top and bottom CNN layers, features of different levels are extracted for cross-layer aggregation, and the results of the upper and lower CNN layers are combined to reconstruct a feature vector. Among them, the upper CNN uses full convolution and does not use pooling after convolution. Finally, the fused features are input into the double-layer LSTM to predict the change of the network flow in the next period. The specific modules are introduced later.
To verify the reliability of our proposed model, we propose two improved versions and conduct experimental comparisons. The first improved version is the Hybrid Parallel deep learning Model with a Single LSTM (HPMSL). The features are extracted and fused by the FCN of the top branch and the CNN of the bottom branch. After feature fusion, the reconstructed feature vector is fed to a single-layer LSTM for temporal feature learning, instead of the hybrid LSTM. The second improved version is the Hybrid Parallel deep learning Model that utilises Element sum (HPMEM). The network architecture of HPMEM is consistent with HPM, but in each feature mixing step the corresponding elements are added instead of channel-cascaded. The element addition operation does not double the number of channels, which can effectively reduce the number of model parameters. Besides, since the number of channels output by the bottom CNN is half that of the top FCN, the channel counts must be unified: the missing part of the bottom channels is filled with zeros to match the dimensions of the top branch. Figure 4 shows the structure of the two improved versions.
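The two fusion styles can be contrasted in a short NumPy sketch. This is an assumption-laden illustration (toy channel counts, our own function names), not the paper's implementation: channel cascading stacks the branches' channels, while element-sum zero-pads the narrower bottom branch and adds element-wise.

```python
import numpy as np

def fuse_concat(top, bottom):
    """HPM-style channel cascade: stack the channels of both branches."""
    return np.concatenate([top, bottom], axis=-1)

def fuse_elementsum(top, bottom):
    """HPMEM-style fusion: zero-pad the narrower bottom branch to the
    top branch's channel count, then add element-wise."""
    pad = top.shape[-1] - bottom.shape[-1]
    bottom = np.pad(bottom, [(0, 0)] * (bottom.ndim - 1) + [(0, pad)])
    return top + bottom

top = np.ones((4, 64))     # top FCN output: 64 channels (toy numbers)
bottom = np.ones((4, 32))  # bottom CNN output: half the channels
print(fuse_concat(top, bottom).shape)      # (4, 96)
print(fuse_elementsum(top, bottom).shape)  # (4, 64)
```

Concatenation grows the channel dimension (64 + 32 = 96 here), whereas element-sum keeps the top branch's 64 channels, which is why HPMEM has fewer downstream parameters.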

Semantic information extraction
Fully convolutional networks (FCN) were first proposed by Long et al. (2015, june) at CVPR 2015 as the first network architecture to use CNN structures in the field of image segmentation, which can achieve pixel-level image classification for images containing complex scenes and solve the semantic-level image segmentation problem. There is information overlap in packets with adjacent timestamps, and the overlapping region affects the classification effect of malicious traffic, so rich geometric information needs to be extracted in the shallow network as a way to improve the classification accuracy.
In HPM, to better obtain the semantic and geometric information of features from the traffic of adjacent timestamps, the top branch abandons pooling layers and uses convolutional layers for down-sampling instead. The input is convolved four times by conv1, conv2, conv3, and conv4, with each convolutional layer extracting features from the data. To accelerate the convergence of the model and shorten its learning cycle, ReLU is used instead of activation functions such as Sigmoid: ReLU significantly reduces the amount of computation in the learning process, so the output of each convolution passes through the ReLU function to reduce inter-parameter dependencies and alleviate overfitting.
The formula for each element in the feature map after convolution is as follows:
$$y_{i,j} = \sigma\left(\sum_{k=0}^{c-1} w_{j,k}\,x_{i+k} + b_j\right)$$
where $i$ represents the position of the feature-map element after convolution ($0 \le i \le l-c$), $j$ represents the index of the convolution kernel in this layer, $c$ represents the length of the convolution kernel, $l$ represents the length of the feature map, $b_j$ is the bias, and $\sigma(\cdot)$ is the ReLU function. Compared with traditional convolution followed by pooling, the four-step convolution of the top branch has three advantages.
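The per-element convolution just described can be illustrated with a direct NumPy implementation. This is a minimal sketch under our own naming; a real layer would of course be vectorised.

```python
import numpy as np

def conv1d_relu(x, kernels, b):
    """Valid 1D convolution followed by ReLU: each output element is
    relu(sum_k w[j, k] * x[i + k] + b[j])."""
    c = kernels.shape[1]                 # kernel length
    L = len(x) - c + 1                   # output length l - c + 1
    y = np.empty((L, kernels.shape[0]))
    for j, w in enumerate(kernels):
        for i in range(L):
            y[i, j] = max(0.0, np.dot(w, x[i:i + c]) + b[j])
    return y

x = np.array([1.0, 2.0, 3.0, 4.0])
k = np.array([[-1.0, 1.0]])              # one difference-taking kernel
print(conv1d_relu(x, k, np.zeros(1)))    # each window yields 1.0
```

With a length-4 input and a length-2 kernel, the output has 4 - 2 + 1 = 3 positions, matching the formula's index range.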

A. Abundance
Due to the increase of the convolutional operation, the semantic information obtained from the same network traffic packet will be much more abundant.

B. Correlation
Since the traffic features are not spatially correlated, abandoning pooling preserves the local features of the traffic well.

C. Robustness
Finally, the full convolutional network (FCN) can flexibly control the network parameters to reduce the network complexity and prevent the risk of overfitting.
Compared with traditional CNN-based methods, FCNs can greatly reduce the training parameters and shorten the training time. Figure 5 shows the process of the fully convolutional network in the top branch.

Redundant reduction
For imbalanced network traffic data, too much redundant information will affect the accuracy of model classification. Since the convolution kernel has a large number of overlapping receptive fields on the feature map while sliding convolution is performed, too many redundant features learning should be avoided to be fed into the deep learning network.
A large amount of redundant features causes the model to lean towards instances with huge data volumes during detection and reduces the detection metrics for malicious traffic with low data volumes. The pooling operation reduces redundant features by outputting a single value as the feature of a small receptive-field region, which curtails computation while suppressing noise. Each packet in a data flow carries different information, and malicious traffic with the same label contains only a small amount of information, so it is necessary to consider the invariance of different packets in the same flow. Data flows with more packets are input in multiple batches; data flows with fewer packets are padded using the features of existing packets, making full use of the translation, rotation, and scaling invariance of the pooling layer.
Hence, in the bottom branch, we adopt a traditional convolutional neural network (CNN) paired with max pooling as down-sampling, so that both the size of the feature map and the number of parameters for network training can be reduced. The formula for each element of the feature map after the pooling layer is as follows:
$$p_i = \max(y_i) \quad \text{or} \quad p_i = \mathrm{avg}(y_i) \tag{3}$$
where $i$ represents the $i$-th feature map, and $\max(\cdot)$, $\mathrm{avg}(\cdot)$ take the maximum and average values of the convolved feature maps, respectively.
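Max pooling over a 1D feature map can be sketched as follows (an illustrative implementation with our own names; window size and stride are toy values):

```python
import numpy as np

def max_pool1d(fmap, size=2, stride=2):
    """Non-overlapping max pooling over a 1D feature map: each window
    collapses to its maximum, shrinking the map and later parameters."""
    out = [fmap[i:i + size].max()
           for i in range(0, len(fmap) - size + 1, stride)]
    return np.array(out)

fm = np.array([1.0, 3.0, 2.0, 5.0, 4.0, 0.0])
print(max_pool1d(fm))  # [3. 5. 4.]
```

A length-6 map with window 2 and stride 2 halves to three values, keeping only the strongest response per region.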

Time-series feature extraction
After the abnormal traffic is subjected to the multiple convolution, pooling, and feature fusion operations of the two branches, the features of the top branch contain the rich semantic and geometric features of the anomalous traffic, while the redundant features of the traffic packets are discarded in the bottom branch. Classical models such as CNN-LSTM only splice the feature map obtained from the CNN in a single layer after convolution, without considering the impact of convolution-kernel randomness on the network traffic. We instead fuse the feature maps obtained from the top and bottom branches and then feed the fused features, together with those of the top and bottom branches, into the parallel LSTMs separately to reduce the compression of timing information during convolution.
Since the traffic data are changing at each moment and there is some connection between packets with adjacent timestamps (e.g. belonging to the same data flow or sent by the same user), HPM will fuse the temporal features obtained by the parallel LSTM (Hochreiter & Schmidhuber, 1997) at the tail of the training part to obtain the final result and estimate the future change behaviour.

Feature fusion
The Feature Pyramid Network (FPN) (Lin et al., 2017) was widely used in image recognition before being applied to intrusion detection; for example, a feature pyramid model combined with AdaBoost has been used to extract features at different scales for classification in target recognition. Convolutional neural networks obtain feature maps at different scales after convolution: as the network deepens, the image resolution keeps decreasing, the receptive fields keep expanding, the extracted features become increasingly abstract, and the semantic information becomes richer.
After stratification, an image can be divided into two levels: the lower layer with less feature semantic information but accurate target location, and the upper layer with rich feature semantic information but rough target location.
FPN combines up-sampled top-level features with bottom-level features, which effectively improves the detection rate of small targets. In this paper, the multi-scale feature fusion of FPN is applied within HPM, which contains four layers of feature fusion operations.
The channel cascade operation is a horizontal splicing process on one-dimensional arrays that doubles the number of channels. It involves no addition or multiplication between feature matrices and no alteration of the feature mappings, thereby minimising the impact of the fusion process on model training. Subsequently, the first fused feature maps y t_concat 1 and y b_concat 1 pass through a 3 × 3 sliding convolution window, and the output features undergo channel concatenation again to obtain the fusion matrices y t_concat 2 and y b_concat 2. After the second channel concatenation, the down-sampling of formula (3) is performed to reduce the size of the feature maps and obtain the vector matrices y t 4 and y b 4. After multiple convolutions and cascades, the top branch network contains abundant semantic information and retains the local features of the traffic, while the bottom branch network discards redundant features and reduces the impact of useless features on the accuracy of abnormal traffic classification.
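The channel cascade can be illustrated in a few lines (the array shapes are hypothetical):

```python
import numpy as np

# Hypothetical branch outputs: (channels, length) maps of 1-D traffic features.
y_top = np.random.rand(32, 20)
y_bottom = np.random.rand(32, 20)

# Channel cascade: splice along the channel axis. No element-wise addition
# or multiplication takes place and each feature map is left untouched;
# only the channel count doubles.
y_concat = np.concatenate([y_top, y_bottom], axis=0)
assert y_concat.shape == (64, 20)
# The original maps survive verbatim inside the concatenation:
assert np.array_equal(y_concat[:32], y_top)
```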
However, packets with the same timestamp are usually continuous and uninterrupted in practical anomaly attacks, so it is necessary to learn the timing features between packets and flows with LSTM. After the third cascade operation, the two feature vectors are fed into the two-layer LSTMs to obtain the temporal features h t 1 and h b 1, respectively. Finally, the vectors containing temporal and spatial features undergo a fifth channel cascade operation to obtain y concat, and a global convolution operation fuses the features of each channel to further increase the nonlinearity of the network. After five rounds of feature fusion, a global average pooling layer is employed to diminish the size of the feature map and reduce feature redundancy, which keeps the feature-learning process from deviating towards the anomalous flow categories with a huge data volume.
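As a small illustration of the global average pooling step (the array sizes are hypothetical):

```python
import numpy as np

# Global average pooling: collapse each channel of a (channels, length)
# spatial-temporal feature map to a single scalar, shrinking the map
# regardless of its length and discarding redundant positional detail.
y_fused = np.random.rand(64, 10)   # hypothetical fused feature map
gap = y_fused.mean(axis=1)
assert gap.shape == (64,)          # one value per channel
```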
The output of the network passes through the fully connected layer and the CosMargin layer to produce classification results for the multi-category imbalanced malicious flow data. After each convolution, we use batch normalisation (BN) to speed up the convergence of the network model and alleviate gradient dispersion in the deep network, making the deep network model easier and more stable to train.

CosMargin classifier
Softmax converts features into probabilities after a linear combination, and cross-entropy is then used to calculate the training loss. However, in the field of intrusion detection, using Softmax carries some prerequisite assumptions and limitations: the essence of training on different attacks is to find features with good generalisation performance, while the purpose of Softmax is merely to ensure that each class is classified correctly, and the two are not equivalent.
Therefore, to achieve a better generalisation performance, we introduce CosMargin, which indirectly implements a margin boundary on the feature layer by modifying the Softmax formula, so that the final feature obtained by the network is more discriminative.
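The paper does not spell out the modified formula here, but a CosFace-style cosine-margin softmax, which matches the description of a margin imposed indirectly on the feature layer, can be sketched as follows (the scale s and margin m values are illustrative, not the paper's settings):

```python
import numpy as np

def cos_margin_logits(feature, class_weights, label, s=30.0, m=0.35):
    """Return scaled logits with a margin m subtracted from the cosine
    similarity of the true class. Feeding these logits to the ordinary
    softmax cross-entropy enforces the angular margin between classes."""
    f = feature / np.linalg.norm(feature)                       # unit-norm feature
    w = class_weights / np.linalg.norm(class_weights, axis=1, keepdims=True)
    cos_theta = w @ f                                           # per-class cosine
    logits = cos_theta.copy()
    logits[label] -= m                                          # margin on true class only
    return s * logits

feature = np.array([0.6, 0.8])
weights = np.eye(2)                       # two hypothetical class centres
with_margin = cos_margin_logits(feature, weights, label=0)
without_margin = cos_margin_logits(feature, weights, label=0, m=0.0)
# The margin strictly lowers the true-class logit, so minimising the loss
# pushes same-class features closer to their centre (more discriminative).
assert with_margin[0] < without_margin[0]
```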

Experiments and performance analysis
In this section, we first introduce the experimental environment and parameter settings in Section 4.1. As evaluation indicators, we adopt five common metrics in Section 4.2. To verify the effectiveness of HPM, we designed two sets of comparative experiments in Section 4.3 based on CIC-IDS 2017 and ISCX 2012, using three classical models, CNN, CNN_LSTM, and Double_BiLSTM, as control groups. In addition, to verify the robustness and generality of HPM, we constructed two improved network models of HPM, named HPMSL and HPMEM, to validate the improvement in classification accuracy brought by the CosMargin function; for this comparison we designed two reference models, HPM+CosMargin and HPM+SoftMax. Finally, we discuss the impact on overall accuracy of different packet lengths and packet capacities of the data flow. The experimental results show that when the data flow is taken as 5 packets and the packet interception length lies in the range 90–100, the accuracy and running time of HPM outperform the traditional models while taking the running efficiency of the experiment into account.

Experimental environment and parameter settings
Python 3.8 is used as the programming language and PyTorch as the deep learning framework; the preprocessing and concatenation of data are implemented with NumPy. To evaluate the performance of HPM, we designed multiple sets of comparison experiments measuring the accuracy, recall, precision, FAR, Macro-f1, and Weighted-f1 of the models. All experiments are executed on two datasets, CIC-IDS 2017 and ISCX 2012. The devices used in the experiments are detailed in Table 1, and the distribution of all attack flows in our experiments is shown in Tables 2 and 3. To train the HPM, we use the Adam optimiser to accelerate the convergence of the network, with the momentum factor fixed to 0.9 and weight decay applied to prevent overfitting. Since the convergence speed depends on the learning rate, we set the learning rate to 0.001 for the first 8 epochs, decay it to 0.0001 for the next 3 epochs, and set it to 0.00001 for the last two epochs. No augmentation is used during the training and testing phases, so as to verify the performance of the proposed model itself.
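For concreteness, the staged schedule can be written as a small helper (a sketch; the function name is ours, and in PyTorch such a schedule would typically be wired up via `torch.optim.lr_scheduler.LambdaLR`):

```python
def learning_rate(epoch):
    """Staged schedule described above (0-indexed epochs, 13 in total):
    1e-3 for the first 8 epochs, 1e-4 for the next 3, 1e-5 for the last 2."""
    if epoch < 8:
        return 1e-3
    if epoch < 11:
        return 1e-4
    return 1e-5

schedule = [learning_rate(e) for e in range(13)]
```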

Indicators
To evaluate the experimental results, we adopt the following five common threshold-independent metrics:

Recall = TP / (TP + FN)
Precision = TP / (TP + FP)
F1_i = 2 × Precision_i × Recall_i / (Precision_i + Recall_i)
Macro-f1 = (1/N) Σ_i F1_i
Weighted-f1 = Σ_i w_i F1_i
FAR = FP / (FP + TN)

where N is the number of malicious attack types and w_i is the weight of class i, i.e. its proportion of the total data volume. TP is the number of correctly identified positive samples, TN the number of correctly identified negative samples, FP the number of incorrectly identified positive samples, and FN the number of incorrectly identified negative samples. Recall reflects the ability of the classification model to identify positive samples; Precision reflects its ability to discriminate negative samples. Macro-f1 indicates the robustness of the classification model: its value is lower for models that perform well only on common classes and poorly on rare classes. Weighted-f1 multiplies each class's F1 value by that class's proportion of the total sample to account for label imbalance, and FAR indicates the model's positive-sample misidentification rate. During the comparison experiments, all correctly detected categories are regarded as positive classes and erroneously detected categories as negative classes. These five threshold-independent metrics are used to evaluate the performance of the multi-classification algorithm: the closer the values of the first four metrics are to 1 and the closer the detection error is to 0, the better the classification performance.
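A minimal sketch of the F1-based metrics under one-vs-rest counting (function name is ours; a library such as scikit-learn offers the equivalent via `f1_score` with `average='macro'`/`'weighted'`):

```python
import numpy as np

def macro_and_weighted_f1(y_true, y_pred, n_classes):
    """Per-class one-vs-rest precision/recall/F1, then the plain mean
    (Macro-f1) and support-weighted mean (Weighted-f1), matching the
    TP/FP/FN definitions above."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    f1s, weights = [], []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        f1s.append(f1)
        weights.append(np.mean(y_true == c))   # w_i: class share of all samples
    return float(np.mean(f1s)), float(np.dot(weights, f1s))

macro, weighted = macro_and_weighted_f1([0, 0, 1, 1], [0, 0, 1, 0], n_classes=2)
# Per-class F1 here is 0.8 (class 0) and 2/3 (class 1).
```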

Analysis on different dataset
In this work, we use HPM as the baseline and compare its performance with the remaining network models. First, we compare HPM with state-of-the-art deep-learning-based intrusion detection models, including CNN_LSTM, Double_BiLSTM, ITSN (Li, Han, Yin et al., 2021), PCCN, and their improved versions. As for the classification function, HPM uses CosMargin while the rest of the models use Softmax; the hyperparameters are kept consistent across all models to compare their discriminative ability for different classes of features after applying metric learning.
The comparison results on the CIC-IDS 2017 dataset are given in Tables 4 and 5. Table 4 shows that, for the individual categories, the accuracies of HPM+SoftMax are higher than those of CNN_LSTM, Double_BiLSTM, ITSN, and PCCN. Among the five evaluation metrics, only Macro-f1 is lower than those of the compared models. The recall, precision, Macro-f1, and Weighted-f1 are all above 0.9, and the FAR is only 0.0002, which shows that the proposed intrusion detection method obtains superior performance in identifying these instances among the traffic attacks.
When embedding metric learning into HPM, we found that all metrics improve, so it can be inferred that embedding metric learning has a positive impact on the model. In particular, within the classification of 12 malicious attack flows, the accuracies for botnet, DDoS, hulk, slowloris, heartbleed, and Portscan all reach 1.0, which means that HPM+CosMargin detects all instances of these types of abnormal attacks without misclassification. Figure 6 shows the curves of loss and accuracy for all comparisons as the number of epochs increases. Figure 7 visualises the embedded features learned by HPM+Softmax and HPM+CosMargin on the CIC-IDS 2017 dataset: on the left are the features trained with only the Softmax loss, and on the right the features trained with the CosMargin loss; dots in different colours represent different classes of attacks. It can be inferred that, compared with the SoftMax function, the CosMargin function enhances intra-class compactness and inter-class discrepancy.
To verify the generality of the proposed metric learning model, we conduct the same test on the ISCX 2012 dataset. Table 6 shows that embedding metric learning is conducive to better classifying abnormal network traffic. Figure 8 shows the trend of the loss and accuracy values for the classification of attack types for all models as the number of epochs increases. From Figure 8, we can see that the HPM+CosMargin model and its variants converge much faster than the other comparison models, which indicates that using the hybrid structure as the main body of the model is effective: it fully exploits the ability of the LSTM to control historical information in long sequences and the advantage of the CNN in obtaining feature representations of the information flow from local to global. When the number of epochs reaches 6, the classification performance of the model tends to be stable; at epoch 15, the loss and accuracy reach 0.0312 and 0.997, respectively.
After evaluating the classification effect, Table 7 shows the recall, precision, Macro-f1, and Weighted-f1 values of HPM for each attack and for overall attack classification on the ISCX 2012 dataset. The accuracy for the four categories of attacks, Infiltration, Http_DDoS, DDoS, and Brute Force SSH, is 99.98%, 99.83%, 99.95%, and 100%, respectively. Meanwhile, recall, precision, Macro-f1, and Weighted-f1 of each category are all above 90%. All experiments show that HPM+CosMargin performs well on both datasets, using a hybrid network to extract optimal feature representations automatically, with advantages in dealing with large-scale datasets. The proposed model therefore has better generalisability and cybersecurity event identification performance. Figure 9 shows the confusion matrices for CIC-IDS 2017 and ISCX 2012. The confusion matrix aggregates the prediction results of a classification model, revealing the distribution of predicted and true labels across all categories. If all the data are accurately classified, the matrix is diagonal; hence, the more elements lie off the diagonal, the worse the results. Figure 9 shows that, for both datasets, the vast majority of the data are distributed near the diagonal, with an almost negligible amount scattered elsewhere. The confusion matrices fully illustrate the effectiveness of our proposed hybrid model based on metric learning.
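The aggregation just described can be sketched in a few lines (function name is ours; `sklearn.metrics.confusion_matrix` is the standard equivalent):

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are true labels, columns are predicted labels; a perfect
    classifier places every count on the diagonal."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

cm = confusion_matrix([0, 0, 1, 2, 2], [0, 0, 1, 2, 1], n_classes=3)
# Off-diagonal mass (here the single 2→1 error) measures misclassification.
assert cm.trace() == 4 and cm.sum() == 5
```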

Analysis of the function of FCN
In this section, we discuss the effect of employing the FCN. Table 8 compares the error rates on the test set with convolution only versus convolution + pooling.
Model A is a deformation of HPM+CosMargin in which the full convolution of the top branch is replaced with a convolution + pooling operation. The table shows that Model A attains nearly equal accuracy on the training set but performs poorly on the test set. In terms of parameter counts, CNN is lighter than the other models because the parallel top/bottom branch structure doubles the parameters for the input data. However, the total number of parameters is below 0.5M for both HPM and its deformation, while HPM achieves higher test-set accuracy than Model A. This indicates that the FCN has the advantage of reducing redundant features and avoiding model overfitting.

Packet interception length discussion
In this section, packets are trimmed to a fixed length of 96 bytes per packet with 5 packets per flow. According to the experimental results, our proposed model has good generality and robustness. However, the amount of information varies from packet to packet, and it is extremely hard to find the best parameters from a single experiment. Table 9 presents the packet distribution of the ISCX 2012 dataset, from which it can be seen that the packet counts differ considerably. Over the 7 days there were a total of 68,910 abnormal data flows and 334,543 packets; a data flow contains a maximum of 42,800 packets and a minimum of only 1, with an average of 75 packets per flow. In the experiment, the packet number ranges from 1 to 10 and the length per packet is 96. Table 10 reports the accuracy on the ISCX 2012 dataset under different packet lengths (20 to 180) and packet numbers (1 to 6). There are two cases of accuracy errors, marked in the table; these error points arise from the small number of packets contained in the data flow and poor parameter randomisation. Figure 10 shows the precision, recall, F1-score, and FAR on ISCX 2012 under different packet numbers. When the flow contains 1 packet and the packet length is 20, the four evaluation metrics are 0.8565, 0.7475, 0.7614, and 0.0281, which are much lower than the average values. We attribute this to only one packet being intercepted per data flow, which wastes a great deal of data and leaves the features insufficiently learned, thereby degrading classification performance. It can also be seen that when packets are intercepted at 160 and 180 bytes, precision, recall, and F1-score all degrade to different degrees.
This is because the pruned packets are too long, introducing a large number of redundant features that interfere with the classification of abnormal traffic. In the experiments, when the packet number of the network traffic was 4 to 6 and the packet truncation length was 40 to 140, the metric curves smoothed out, with precision and F1-score above 99%. Compared with Table 6, there is no noticeable change in recall and precision, but FAR improves significantly. After comprehensive consideration, 5 is taken as the better number of packets per flow, and 96 as the better packet length.

Analysis of consuming time
To evaluate the time performance of the model, a time-consumption examination was carried out in the inference phase. For fairness, all models were run on the same machine and all the neural network models were tested with the same parameters. The test time for each model is shown in Table 11. The results in Table 11 show that the test time of the proposed HPM is slightly longer than that of the comparison models; however, HPM achieves better performance than the other commonly used hierarchical network models at the cost of only about 10% more test time. Compared with the previous algorithm, HPM not only achieves nearly the best performance with a large number of test samples but also saves about 30% of the test time, because the features extracted from the original flow already contain the information beneficial for network classification.

Conclusion
In this paper, we propose a hybrid parallel deep learning model based on metric learning (HPM) for efficient intrusion detection, identifying malicious and other anomalous behaviours in the network. The experimental results show that embedding metric learning in HPM achieves the highest classification performance and lower time consumption than other existing models. It is worth mentioning that HPM is also robust across different datasets. In addition, we discuss the impact of different numbers of packets per data flow and of the packet interception length, determining better parameters through a large number of replicate experiments.
However, while our model achieves extremely high classification accuracy, its performance rests on two prerequisite assumptions: a large number of labelled datasets and known attacks. Therefore, in future work, we will continue to build on supervised training and explore network intrusion detection in more areas, such as open-set classification of network traffic based on metric learning and semi-supervised training for intrusion detection, to address the need for labelled data and the inability to identify unknown attacks. Moreover, we will try to establish a systematic evaluation method and corresponding evaluation metrics for intrusion detection.

Disclosure statement
No potential conflict of interest was reported by the authors.

Funding
This work was supported by National Natural Science Foundation of China [grant numbers 61672338, 61873160] and Natural Science Foundation of Shanghai [grant number 21ZR1426500].