Deep Learning Anomaly Detection Based on Hierarchical Status-Connection Features in Networked Control Systems

As networked control systems continue to be widely used in large-scale industrial productions, industrial cyber-attacks have become an inevitable problem that can cause serious damage to critical infrastructures. In practice, industrial intrusion detection has been widely acknowledged to detect abnormal communication behaviors. However, unlike traditional IT systems, networked control systems have their own communication characteristics due to specific industrial communication protocols. Thus, simple cyber-attack modeling is inadequate and impractical for high-efficiency intrusion detection because the characteristics of network control systems are less considered. Based on the status information and transmission connection in industrial communication data payloads, which can properly express the characteristics of industrial control logic, this paper associates industrial communication features with transmission connection payload and status payload. Furthermore, transmission connection features include device address, context, time, and packet length, while status features cover measurement, input, distributed state, control state, and more. After designing a convolutional neural network (CNN) and a long short-term memory network (LSTM) to extract status features and transmission connection features from industrial communication data, this paper proposes a hierarchical deep learning anomaly detection approach, which can integrate the advantages of CNN and LSTM to achieve high-efficiency detection. The experimental results clearly show that the proposed approach, having the advantages of strong detection capability and low false alarm rate, is a superior means of anomaly detection when compared to its peers.


Introduction
Alternatively, the machine learning approach can be regarded as an efficient and effective measure since it can build an excellent model, which can reflect the transmission connection features and status features of the transaction payload. Furthermore, deep learning techniques have become a better choice to learn the inherent regular pattern and representation level of sample data, and they have achieved remarkable results in the fields of traditional network anomaly detection.

Deep Learning Techniques
At present, deep learning techniques are widely used in the industry: in the medical, industrial production, operation, and maintenance environments. Furthermore, deep learning techniques are used in academia: in the image, timing, and interpretable modeling environments. There are many methods, and each has its own advantages. In particular, CNNs and recurrent neural networks (RNNs), which can learn temporal and spatial features effectively, are the most commonly used models in deep learning techniques.
In common neural networks, the sum of each node in the first layer is weighted on the common neural network, and the initial value of the last layer can be regarded as a representation or function to learn the neural network from the input data [13]. In practice, a CNN has the ability of representation learning by improving the architecture of the common neural network. It can classify the input information according to hierarchical structure, and the input layer of a CNN can process multi-dimensional data. CNNs are able to learn spatial features and have achieved impressive results in many machine learning tasks [14,15]. Furthermore, CNNs have a good analysis effect on the status information of the communication payload, and many recent research results demonstrate its great potential. Reference Benkhelifa et al. [16] proposes a deep learning method based on Hybrid MLP/CNN (Multi-layer Perceptron/Chaotic Neural Network) neural network for anomaly intrusion detection. This method offers a better detection rate and a lower false alarm rate when detecting novel attacks. Reference Ponomarev et al. [17] generates Denial-of-Service (DoS) attack traffic with normal traffic that cannot be distinguished by some ordinary detection algorithms, and proposes DoS attack detection based on a CNN synthesizing the attack traces payload to improve the attack detection accuracy.
RNN is a fine algorithm that can process sequences of different lengths by using self-feedback, and it can be devoted to process time or connection series samples. Moreover, each layer in an RNN not only connects to the next layer but also outputs a hidden state for the current layer when processing next samples [18]. In practice, LSTM is used to solve the problem of the explosion and disappearance of the RNN gradient [19]. LSTM is a predictive operation model, which can carry transmission features and can use a recursive method based on a time-series back propagation predictive operation model. Additionally, it can combine real data for convergence to improve detection accuracy and detect delaying attacks. Reference Khan et al. [20] proposes that many known signatures from the attack traffic remain unidentifiable, and designs a scalable hybrid IDS (intrusion detection system) based on the convolutional-LSTM network to identify network misuses. Reference Amar et al. [21] proposes a weighted long short-term memory (WLSTM) algorithm to solve malicious behaviors in cloud computing database under high-dimensional and high-speed analysis requirements. More specifically, WLSTM realizes the attack detection of contextual malicious behaviors by considering past events, and minimizes the vanishing gradient.
In brief, CNNs can directly and comprehensively identify various state features in networked control systems and learn the existing features of each state to utilize corresponding disposal methods for different abnormal states. LSTM can effectively learn connection features from a long sequence and predict the operating state of each variable at a future time. By using the special structural features of LSTM network, it is possible to mine the data association degree between different features in networked control systems. This paper combines the advantages of CNN and LSTM in their respective fields and constructs a deep learning anomaly detection model for the abnormal behavior, abnormal state, and abnormal transmission mode of network control systems. We propose an anomaly detection method based on deep learning technology. The effectiveness of the method is verified by training and testing the actual data set.

Hierarchical Deep Learning Anomaly Detection Solution
In the production control process of networked control systems, network attacks may not only cause abnormal network traffic but also cause abnormal control logics. Based on those abnormalities, we propose a deep learning anomaly detection model, which integrates the advantages of CNN and LSTM by using hierarchical status-connection features of industrial communication data. Furthermore, this model specifically includes data preprocessing, extraction of status features by CNN, extraction of transmission connection features by LSTM, and the fully-connected LSTM to realize deep learning anomaly detection in networked control systems. The main framework of the proposed model is shown in Fig. 1. The detailed process can be described as follows.
1) Data preprocessing stage: Guarantees that the training data format meets the standard format requirements of CNN and LSTM. Ensures that a deep learning model can be properly trained.
2) Status feature extraction stage: Further extracts and processes the status features from the communication payload data by creating a CNN neural network.
3) Connection feature extraction stage: Further extracts and processes the connection features from the communication data by generating a LSTM recurrent neural network [22]. This network can learn from the feature representation and models the time dependence automatically.

4) Concatenation & Classify:
Classifies various communication behaviors in networked control systems by using a SoftMax layer, which takes the outputs of CNN and LSTM as inputs.

Data Preprocessing
The analyzed datasets are standard industrial control intrusion detection datasets established by the MSU Infrastructure Protection Center in 2014 [23]. Moreover, the original datasets are collected from the internal SCADA laboratory of Mississippi State University, using the MODBUS application layer protocol to achieve industrial control communication. Additionally, these datasets are generated by attacks on the natural gas pipeline system, which uses the MODBUS protocol for industrial control communication. Compared with the datasets used in IT intrusion detection, such as KDD-99 and NSL-KDD [24,25], these datasets mainly have two types of characteristics: connection payload characteristics and status payload characteristics. The connection payload characteristics describe the communication mode in the SCADA Figure 1: Main framework of hierarchical deep learning anomaly detection model system and can be used to extract transmission patterns for malicious industrial activity detection. The status payload characteristics describe the business status in the SCADA system, and they can be utilized to detect network attacks, which cause abnormal behaviors of some critical devices, such as programmable logic controllers and motion controllers.
In the data preprocessing step, we convert each piece of network payload into a matrix and encode all labels with one-hot encoding. First, in order to generate a suitable matrix, which can be processed by CNN and LSTM from the original data, we can expand each data line in the datasets with a random value following a random normal distribution. Consequently, the feature vector can be converted into an appropriate two-dimensional matrix. The main algorithm for converting data into a two-dimensional matrix is shown in Algorithm 1.
Second, for the eight categories of behaviors in the datasets, we can apply one-hot encoding to process each behavior. One-hot encoding has a low computational cost since the independent value of each category of behavior is not too much. Tab. 1 shows the processed results for the datasets after applying one-hot encoding.
where x belongs to the random normal distribution 8: End if 9: End for 10: Return D 0 End function Reconnaissance 10000000

Status Feature Extraction
In the next stage of our approach, we use a CNN to learn and extract data features from the twodimensional matrix, which has been generated in the data preprocessing stage, since the sparse connection between the convolutional layer and the pooling layer in a CNN can hide some unimportant information, reduce network training time, and accelerate network convergence.
To be specific, the proposed CNN in this paper consists of one input layer, three convolution layers of different sizes, two pooling layers, and one global pooling layer. Given the two-dimensional matrix generated by the data preprocessing step, the convolution layer extracts regional information of payload features by use of a sliding window and expresses the information abstractly. Next, the pooling operation will reduce the dimension of features, and the output of global pooling will represent all data vector information and transmit them to the connection layer. The main framework of the proposed CNN in this paper is shown in Fig. 2.
Convolution calculation: set the convolution kernel weight as w and bias as b, extract the feature map of input data in the form of sliding window, and generate new feature m i by activating the feature map with nonlinear activation function as follows: where f denotes the activation function, and item i:iþsÀ1 represents the data in the sliding window.
Pooling operation: the maximum over time polling operation is used to reduce the feature dimension of the generated features, and the maximum value m 0 is selected as the final data feature to compress the number of parameters and reduce over-fitting.
Activation Function: we use ReLU function as the activation function, which can help us make the model converge faster and keep it continuously stable. A general ReLU function is described as Eq. (3).
The specific steps involved in the extraction of status features via CNN are as follows; some methods or layers refer to Algorithm 2. Step 1: We input the 28 Â 28 matrix into the first convolution layer, which contains the c 1 filters. Here, the size of filter is s 1 Â s 2 , and the activation function is set to ReLU.
Step 2: The convolution layer can output the characteristic matrix whose size is 32 Â 28 Â 28, and we input this matrix into a two-dimensional maxpooling layer whose size is p 1 Â p 1 .
Step 3: The first maxpooling layer can output 32 matrices whose sizes are 14 Â 14 each, and we use zero-padding to extend these matrices, whose final size is 16 Â 16.
Step 4: We input these 32 matrices into the second convolution layers containing c 2 filters. Here, the size of filter is s 2 Â s 2 , and the activation function is set to ReLU.
Step 5: By using the leaky ReLU method, we get 64 matrices whose sizes are 8 Â 8 each, and input them into the second two-dimensional maxpooling layer whose size is p 2 Â p 2 .
Step 6: Through the computation in the second maxpooling layer, we input 64 matrices whose size is 2 Â 2 into the dropout layer.
Step 7: The dropout layer outputs 64 matrices whose size is 2 Â 2; we use zero-padding to extend these matrices whose final sizes are 4 Â 4 each.
Step 8: We input these 64 matrices into the third volume cumulant layer containing c 3 filters. Here, the size of each filter is s 3 Â s 3 , and the activation function is set to ReLU. Step 9: By using the leaky ReLU method, we get 128 matrices whose sizes are 4 Â 4 each, and input them into the two-dimensional global maxpooling layer whose size is p 3 Â p 3 .
Step 10: The status feature vector V 1 is the output of the global maxpooling layer, whose length is 128.

Connection Feature Extraction
In this step, we use LSTM to extract the transmission connection features from industrial communication data in networked control systems, since it can help us resolve the problems of gradient disappearance and gradient explosion effectively. LSTM introduces the concept of cell status, which determines the reserved and forgotten statuses using the forgetting gate, the input gate, and the output gate in the RNN training process.
The corresponding pseudo-code is shown in Algorithm 2, and the specific steps of the algorithm are listed as follows: Step 1: We first input the 28 Â 28 matrix into an LSTM layer containing l 1 cells.
Step 2: A vector of length 50 can be computed by the LSTM layer, and we can input this vector into the dropout d 1 layer.
Step 3: A vector of length 50 can be computed by the dropout d 1 layer, and we can input this vector into the dense layer, which also represents the full connection layer.
Step 4: A vector of length 1024 can be computed by the dense layer, and we can input this vector into the dropout d 2 layer. After that, we can get one new vector whose length is 1024, and this vector can be regarded as the connection feature vector V 2 .

Concatenation and Classification
By connecting the status-connection features in the target datasets, which are extracted by CNN and LSTM, we can improve the feature accuracy in the dropout layer setting from the last step. We use SoftMax as the classifier in the final output layer to calculate the possibility of each network payload category. After that, we can further classify different industrial communication behaviors. The main framework integrating CNN and LSTM is shown in Fig. 3, and the detailed steps to realize the integration of CNN and LSTM and the classification of industrial communication behaviors are described below: Step 1: We can get the vector V 3 by connecting the status features V 1 extracted by CNN and the connection features V 2 extracted by LSTM, where Step 2: We can get the vector V 4 by inputting the vector V 3 into the full dense connection layer.
Step 3: We perform the classification by inputting the vector V 4 into the activation function SoftMax, and output the classifications of different behaviors.

Evaluation and Analysis
To evaluate our deep learning anomaly detection approach, we use the standard intrusion detection datasets and extract the hierarchical status-connection features by analyzing the transmission and payload contents in these datasets. Furthermore, we not only give the performance discussion on the optimal evaluation results but also compare our approach with other existing detection approaches.

Experimental Environment
We designed a Python program to perform all experiments with our deep learning anomaly detection approach. The basic framework was built by using Keras 2.0 and Tensorflow 2.0. Additionally, all procedures were run on the same PC with 64 GB RAM, Intel Xeon e5-2620v4 2.10 GHz CPU, Nvidia Geforce RTX2080S GPU, and Windows Server 2016 OS.

Evaluation Indicator
The main purpose of the proposed approach is to accurately detect various abnormal industrial communication behaviors in networked control systems. To this end, we use the authoritative evaluation indicators: AC (accuracy), DR (detection rate), and FAR (false alarm rate) [26]. To be more specific, AC refers to the ratio of all correctly classified samples, which may be normal samples or malware. The DR is used to evaluate the system's performance with respect to its malware traffic detection. The FAR is used to evaluate the misclassifications of normal traffic. The main calculation formulas are listed as follows.
Here, the true positive (TP) represents the number of attack samples, which are classified into the attack category. The true negative (TN) represents the number of normal samples, which are classified into the normal class. The false positive (FP) represents the number of normal samples, which are classified into the attack category, and the number of attack samples, which are classified into the normal category. The false negative (FN) is the number of samples that are failed to be classified into target category.

Experimental Parameter Setting
To construct an optimal model, we created different structures of convolution layer and LSTM layer. Based on the two-dimensional matrix generated in the preprocessing process, we performed the experimental tests. Specifically, when the maximum number of training times was set to be 20, the model had stabilized. The accuracy and loss comparison of different structures in the training are shown in Fig. 4.
Through the debugging of multiple experiments, we produced the most effective model, which contains three convolution layers, two pooling layers, one global pooling layer, one LSTM layer, and two full connection layers. Additionally, the SoftMax layer is considered as the final classification function. The settings of all parameters are shown in Tab. 2.

Comparison and Analysis
Based on the experimental parameter setting in Tab. 2, we were able to obtain an optimal anomaly detection model. When the training epoch was set to 30, the detection accuracy in the training process was basically stable at 99.50%. The change of loss and detection accuracy in the training process is depicted in Fig. 5.
After we obtained the optimal anomaly detection model, we used the test datasets to evaluate the actual detection performance. The confusion matrix diagram is shown in Fig. 6, which shows the classification of the data sets by our model during the evaluation process.
From the confusion matrix, it can be seen that our proposed model has a high anomaly detection efficiency. Additionally, Tab. 3 shows the evaluation results for each behavior, including Normal, NMRI, CMRI, MSCI, MPCI, MFCI, Dos, and Reconn. AC, DR, and FAR can reach 99.8%, 98.9%, and 0.082%, respectively.  To further explain the superiority of our approach, we compare the detection performance with some existing anomaly detection approaches under the same test datasets. Furthermore, in order to better evaluate the superiority of the algorithm, we compare the overall evaluation index of the proposed algorithm with other published algorithms in unified datasets. As shown in Fig. 6, we can see that the classification efficiency of deep learning algorithms such as CNN and LSTM is higher than machine learning methods such as SVM.
Additionally, we construct a two-layer CNN model and apply it to the same datasets, and the accuracy can reach 98.5% [16,27]. Moreover, we also compare with other algorithms: the work in Yu et al. [28] uses the LSTM algorithm to detect the anomaly behavior, and the accuracy can be improved to 96.5%; the work in Liu et al. [29] improves the SVM algorithm to those same datasets, and the accuracy can be improved to   [30] improves the C4.5 algorithm, and the accuracy can be improved to 91.30%. The accuracy of the different schemes is shown in Fig. 7. Therefore, our proposed approach has a better performance.

Conclusion and Future Work
To detect industrial cyberattacks in networked control systems, we propose a deep learning anomaly detection approach based on hierarchical status-connection features. In the view of industrial control logic, the proposed approach generalizes industrial communication data into two types of features: transmission connection features and status features. According to the characteristics of CNN and LSTM, we use a CNN to extract status features and LSTM to extract transmission connection features from industrial communication data in networked control systems. Furthermore, the proposed approach integrates the  advantages of CNN and LSTM to achieve the high-efficiency anomaly detection. Based on the actual datasets, all experimental results show that the proposed approach, which has the advantages of strong detection capability and low false alarm rate, is a more feasible means of anomaly detection by comparing with other anomaly detection algorithms.
In future work, we will not only perform a comprehensive feature extraction from other fields of industrial communication data but also optimize and improve its detection efficiency to realize large-scale applications.