An Anomaly Detection Method of Industrial Data Based on Stacking Integration

Abstract: With the development of Internet technology, computing power has increased and machine learning has advanced rapidly. In industrial control systems, quality inspection and the safe production of process products have always been a concern. To address the low accuracy of anomaly detection for process data in industrial control systems, this paper proposes a machine-learning anomaly detection method based on stacking integration. Data are collected from the industrial site and processed by feature engineering. Principal component analysis (PCA) and the integrated rule tree method are adopted to reduce the dimension of the process data while retaining as much of the original feature information as possible. Random forest (RF), AdaBoost, XGBoost, and SVM are selected as the first-layer base learners, and logistic regression (LR) is used as the secondary learner to build the anomaly detection model based on the stacking integration method. Tennessee Eastman (TE) data are used to train the base learner models and the integrated model. Comparing the experimental results of the integrated model with those of each base learner shows that the constructed anomaly detection model effectively improves the accuracy of process data anomaly detection and effectively reduces its false alarm rate.


Introduction
As is well known, the industrial control system is the core of the entire industrial production process. It is used to raise the automation level of industrial production and to improve production efficiency and stability. At present, industrial control systems are widely used in important fields such as petrochemicals, smart grids, transportation, and manufacturing [1].
With the rapid development of the Industrial Internet in recent decades, industrial control systems and IT networks have been deeply integrated. The traditional industrial control system was relatively closed and independent, and was not easy to attack [2]. Once connected, the control system has become open and fragile: hackers can attack the industrial control system by attacking the network, and the security risks in cyberspace are increasing. There are large differences between industrial control systems and IT networks. Industrial control systems use a large number of private protocols, whereas IT networks have a unified protocol at each level. Due to production requirements, industrial control systems pursue real-time performance and accuracy, so data delays and false alarms can have serious consequences. The focus of IT network security, by contrast, is the confidentiality of information, followed by availability and integrity. Upgrading and replacing industrial control systems is expensive; unlike IT network security, they cannot simply be updated by downloading patches. Attacks on industrial control systems directly affect the industrial production environment and can deal a devastating blow to industrial production.
Since the 1990s, the Internet has grown exponentially, expanding explosively into various fields of the economy and society. It is now widely used in application fields such as industrial control systems, cloud computing, cloud services, and mobile payments, and the risks to network security are increasing accordingly [3][4]. Anomaly detection is an important tool for intrusion detection. By comparing and matching the behavior of the process data against a library of normal behavior, it determines whether the data are abnormal, which makes it effective at detecting unknown network attacks. The methods used for anomaly detection mainly include statistical learning, machine learning, and deep learning. Among them, machine learning is the most widely applied: it learns autonomously from the data continuously accumulated by industrial control systems, which can effectively reduce the false alarm rate of anomaly detection.
The idea of ensemble (integrated) learning was first proposed by Dasarathy et al. [5]. The Boosting integration method was proposed by Schapire et al. [6]; it promotes weak learners to a strong one, training each subsequent base learner after changing the weights of the misclassified samples so as to eventually achieve the optimal result. In 1992, Wolpert proposed the Stacked Generalization model (Stacking) [7]. The random forest algorithm is an extended branch of the Bagging algorithm [8]: a decision tree is selected as the simple learner, and random attribute selection is added to the training of each decision tree. Dietterich listed ensemble learning among the four main research directions in machine learning in AI Magazine [9]. In 2001, Zhou et al. put forward the idea of "selective integration" [10][11], pointing out that the number of base learners affects the prediction speed and increases the running time and storage space. Better results can be obtained by selecting only some of the base learners to build the integrated learner [12].
The industrial production environment is characterized by noise and high data dimensionality, and the constructed anomaly detection model should have high real-time performance. To reduce noise interference and prevent over-fitting, the model should also have good robustness [13]. This paper proposes an anomaly detection model based on a stacking integrated supervised learning algorithm. First, principal component analysis is used to reduce the dimensionality of the data. Then a two-layer learner is constructed to train on the data samples, which improves the accuracy of anomaly detection.

Figure 1: Flow chart of anomaly detection method
The flowchart of the process data anomaly detection method based on Stacking integration proposed in this paper is shown in Fig. 1. Sequence data produced in the industrial production process are collected. These data are matched and preprocessed, feature engineering is applied, the dimensionality is reduced by principal component analysis, and an anomaly detection model is constructed to detect the data and obtain the detection results. From the detection results, it is judged whether the industrial control system equipment is abnormal, so that dangerous conditions can be prevented. In summary, this paper proposes an anomaly detection model based on Stacking integration for industrial data. Through PCA dimensionality reduction, high-dimensional and complex process data can be reduced to the dimensions required by the algorithm. The Stacking integrated model composed of the Random Forest, SVM, XGBoost, and AdaBoost base learners improves the accuracy rate and reduces the false alarm rate.

Build an Anomaly Detection Method Based on Stacking Integration
The theoretical basis of ensemble learning is PAC theory [13] together with the theory of strong and weak learning [14]. A simple learner is chosen as the base learner, and through certain methods and strategies the base learners are integrated and converted into a strong learner [15]. There are three frameworks for ensemble learning: boosting, bagging, and stacking. The experiments in this article use the third, the Stacking integrated learning method.

Stacking Algorithm
The idea of the Stacking algorithm is to first train the base learners on the data set and to generate a new sample set from the base learners' predictions. The new sample set is then used to train the secondary learner. Stacking chooses heterogeneous algorithms as its base learners. Cross-validation and leave-one-out methods are generally used for training on the samples. The data set is first divided into a training set and a test set. Cross-validation is adopted to train the base models and obtain their predictions. The prediction result of each base model is used as a feature of the training data of the secondary learner. The average of the prediction results is taken as a feature of the new sample, on which the secondary learner makes predictions. The logistic regression algorithm is used as the secondary learner [16][17][18]. Seewald proposed that using different attributes in multi-response linear regression (MLR) can make the algorithm more time-efficient and effective [19][20]. Fig. 2 is a schematic diagram of the constructed Stacking algorithm.
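The cross-validated feature construction described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names (`kfold_indices`, `stack_features`, `make_threshold_fit`), the contiguous fold layout, and the toy threshold learner standing in for RF, AdaBoost, XGBoost, and SVM are all assumptions, as is the choice to average the fold models' predictions on the test set.

```python
from typing import Callable, List, Sequence, Tuple

Row = Sequence[float]
Predictor = Callable[[Row], float]
Fit = Callable[[List[Row], List[int]], Predictor]


def kfold_indices(n: int, k: int) -> List[List[int]]:
    """Split indices 0..n-1 into k contiguous, nearly equal folds."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def stack_features(fits: List[Fit], X: List[Row], y: List[int],
                   X_test: List[Row], k: int = 5
                   ) -> Tuple[List[List[float]], List[List[float]]]:
    """Build secondary-learner features from k-fold base-learner predictions."""
    n = len(X)
    meta_train = [[0.0] * len(fits) for _ in range(n)]
    meta_test = [[0.0] * len(fits) for _ in X_test]
    for j, fit in enumerate(fits):
        for held_out in kfold_indices(n, k):
            held = set(held_out)
            train = [i for i in range(n) if i not in held]
            model = fit([X[i] for i in train], [y[i] for i in train])
            for i in held_out:
                # Out-of-fold prediction: the model never saw sample i.
                meta_train[i][j] = model(X[i])
            for t, row in enumerate(X_test):
                # Test feature: average over the k fold models.
                meta_test[t][j] += model(row) / k
    return meta_train, meta_test


def make_threshold_fit(feature: int) -> Fit:
    """Hypothetical toy base learner: threshold one feature between class means.

    Assumes every training split contains both classes (labels 0 and 1).
    """
    def fit(X: List[Row], y: List[int]) -> Predictor:
        pos = [r[feature] for r, t in zip(X, y) if t == 1]
        neg = [r[feature] for r, t in zip(X, y) if t == 0]
        cut = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
        return lambda row: 1.0 if row[feature] > cut else 0.0
    return fit
```

Each column of `meta_train` holds one base learner's out-of-fold predictions, so the secondary learner is trained only on predictions made for samples the base learner did not see, which is the usual reason for using cross-validation at this stage.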

AdaBoost of the Base Learner Model
The AdaBoost algorithm works by changing the distribution of the sample data. The weight of each sample in the next round is determined by whether that sample was classified correctly in the previous round and by the overall accuracy of the previous classification. The reweighted data set is used to train the next base learner, and finally the base learners obtained from all rounds are combined into the final learner.
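To make the reweighting step concrete, the sketch below performs one round of the classic binary AdaBoost update with labels in {-1, +1}. It is a simplification chosen for illustration: the experiments in this paper use the multi-class SAMME.R variant, and the function name `adaboost_round` is hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def adaboost_round(weights: Sequence[float], y_true: Sequence[int],
                   y_pred: Sequence[int]) -> Tuple[float, List[float]]:
    """One binary AdaBoost round: learner weight and updated sample weights.

    Assumes labels and predictions are -1 or +1 and that `weights` sums to 1.
    """
    # Weighted error of the current base learner.
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    if not 0.0 < err < 0.5:
        raise ValueError("base learner must be better than chance and imperfect")
    # Learner weight: larger when the weighted error is smaller.
    alpha = 0.5 * math.log((1.0 - err) / err)
    # Misclassified samples (t * p == -1) are scaled up, correct ones down.
    raw = [w * math.exp(-alpha * t * p)
           for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(raw)
    return alpha, [r / total for r in raw]
```

After normalization the misclassified samples carry half of the total weight, so the next base learner is pushed toward the samples the previous one got wrong.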

Random Forest of Base Learner Model
The random forest algorithm is an extension of the bagging algorithm that uses decision trees as base learners to construct a bagging ensemble. The bootstrap method is used to generate multiple training sets, and a decision tree is constructed for each training set. Part of the features are randomly selected, and the best feature split is sought among the selected features. The random forest algorithm is relatively simple to implement and has a relatively low computational cost, yet its learning ability is very powerful. The main parameters of the random forest algorithm are the number of decision trees n, the depth of the tree d, the minimum node gain m, and so on. Adjusting these parameters improves the learning performance of the random forest.
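The two sources of randomness described above, bootstrap sampling of rows and random selection of candidate features, can be sketched as follows. The helper names are hypothetical, and the tree construction itself is omitted.

```python
import random
from typing import List


def bootstrap_sample(n_rows: int, rng: random.Random) -> List[int]:
    """Draw n_rows row indices with replacement (one bootstrap training set)."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]


def random_feature_subset(n_features: int, n_pick: int,
                          rng: random.Random) -> List[int]:
    """Pick n_pick distinct candidate features among which a split is searched."""
    return sorted(rng.sample(range(n_features), n_pick))
```

For example, with the 52-dimensional TE data one might call `random_feature_subset(52, 7, rng)`, where 7 (roughly the square root of 52) is an illustrative choice rather than a value reported in this paper.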

SVM of Base Learner Model
The idea of the support vector machine model is to find a hyperplane in the sample space of the data set that separates the data samples into classes. Finding the most suitable hyperplane is the crucial question. The linear equation of the separating hyperplane is:

$$W^{T}x + b = 0$$

where W is the normal vector, which determines the direction of the hyperplane, and b is the displacement term, which determines the distance between the hyperplane and the origin. These two parameters determine the hyperplane required for classification, so the hyperplane can be written as (W, b). The distance from a data sample point x to the hyperplane is:

$$r = \frac{|W^{T}x + b|}{\|W\|}$$

If the hyperplane (W, b) correctly classifies the samples $(x_i, y_i)$ in the data space, with labels $y_i \in \{+1, -1\}$, then:

$$\begin{cases} W^{T}x_i + b \ge +1, & y_i = +1 \\ W^{T}x_i + b \le -1, & y_i = -1 \end{cases}$$

Among the hyperplanes that satisfy this condition, the one with the largest margin is the optimal hyperplane.
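The geometric quantities in this section are easy to check numerically. The sketch below assumes the standard linear SVM formulation, with labels in {-1, +1} and the margin constraint written as y (W·x + b) ≥ 1; the function names are illustrative.

```python
import math
from typing import Sequence


def decision_value(w: Sequence[float], b: float, x: Sequence[float]) -> float:
    """Signed value W.x + b; its sign gives the predicted side of the hyperplane."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def distance_to_hyperplane(w: Sequence[float], b: float,
                           x: Sequence[float]) -> float:
    """Euclidean distance |W.x + b| / ||W|| from point x to the hyperplane (W, b)."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(decision_value(w, b, x)) / norm


def satisfies_margin(w: Sequence[float], b: float,
                     X: Sequence[Sequence[float]], y: Sequence[int]) -> bool:
    """True if every sample meets y * (W.x + b) >= 1 (labels are -1 or +1)."""
    return all(t * decision_value(w, b, x) >= 1.0 for x, t in zip(X, y))
```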

XGBoost of Base Learner Model
The XGBoost algorithm belongs to the Boosting family and is an improvement on GBDT. It uses regression trees as base learners, divides each regression tree into a structure part and a leaf-weight part, and adjusts the regression trees along the negative gradient direction. A regularization term in the objective reduces the risk of over-fitting. Data can be processed in parallel during operation, which shortens the running time of the algorithm. It gives good results on both classification and regression problems.

Logistic Regression of Secondary Learners
Logistic regression is mainly used in classification problems, where the goal is to find a suitable classification function that predicts the class of the input data. A loss function is then constructed to describe the deviation between the predicted result and the actual label, and the deviations over all samples are aggregated to evaluate the performance of the algorithm.
The logistic regression formula is expressed as:

$$h_{\theta}(x) = \frac{1}{1 + e^{-\theta^{T}x}}$$

where x is the input feature vector and θ is the parameter vector. The logistic regression algorithm trains quickly and has a simple form, which makes it convenient to adjust the output results.
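A short sketch of the classification function and a loss for logistic regression follows. It assumes the standard sigmoid form and uses the cross-entropy (log) loss as the deviation measure; the paper does not name its loss function, so that choice and the function names are assumptions.

```python
import math
from typing import Sequence


def sigmoid(z: float) -> float:
    """Numerically stable logistic function 1 / (1 + exp(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_proba(theta: Sequence[float], x: Sequence[float]) -> float:
    """Probability of the positive class for feature vector x."""
    return sigmoid(sum(t * v for t, v in zip(theta, x)))


def log_loss(y_true: Sequence[int], probs: Sequence[float],
             eps: float = 1e-12) -> float:
    """Mean cross-entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for t, p in zip(y_true, probs):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(y_true)
```

In a stacking setup, `x` would be the vector of base-learner predictions for one sample, so the fitted parameters act as weights on the base learners.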

Experimental Data Description
The data used in this article is the Tennessee Eastman (TE) process data set. The entire data set consists of a training set and a test set.
The TE data set consists of 22 different sets of simulation data. Each data sample contains 52 observed variables, that is, the data are 52-dimensional, and the data are divided into normal and abnormal process data. In this experiment, the normal operating condition data and the data of 4 abnormal operating conditions are selected as data samples. The data are shown in Fig. 3 below.

Figure 3: TE data set
The TE production process has a total of 52 nodes across the various equipment parameter settings, that is, the initial dimension of the TE process data is 52. These include factors such as feed volume, control valve flow, separator flow, and reactor temperature and pressure. Real-time simulated data of the production process were obtained using Matlab. The data description is shown in Tab. 1 below. The constructed data set has 52 dimensions and no labels, so the data need to be labeled according to the working conditions. The experimental data comprise 1 normal working condition and 4 abnormal working conditions. The normal working condition data are labeled 10 in the label column, and the 2 abnormal working conditions are labeled 1 and 2, respectively. The data of the various working conditions are then made uniform. Fig. 4 shows the visualization and uniform distribution of the processed data labels.

Data Missing Value Processing
Because of various disturbances in the industrial production process, some dimensions of the data may be lost. Data with missing dimensions are distorted and harm the training of the model. The data are therefore checked for missing values: samples with too many missing values are discarded, and samples with only a few missing values are filled in with the mean or with 0.
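The discard-then-fill rule can be sketched as below. Missing entries are represented by `None`, and the threshold `max_missing` is a hypothetical parameter, since the paper does not state how many missing values count as "too many".

```python
from typing import List, Optional


def clean_missing(rows: List[List[Optional[float]]],
                  max_missing: int) -> List[List[float]]:
    """Drop rows with more than max_missing gaps, then mean-fill the rest."""
    kept = [r for r in rows if sum(v is None for v in r) <= max_missing]
    n_cols = len(kept[0]) if kept else 0
    means = []
    for c in range(n_cols):
        present = [r[c] for r in kept if r[c] is not None]
        # Fall back to 0 when a column has no observed value at all.
        means.append(sum(present) / len(present) if present else 0.0)
    return [[means[c] if v is None else v for c, v in enumerate(r)]
            for r in kept]
```

Column means are computed only from the rows that survive the discard step, so heavily corrupted samples do not influence the fill values.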

Use Principal Component Analysis to Reduce Dimensionality of Data
Principal component analysis is a widely used dimensionality reduction algorithm. The idea of dimensionality reduction is to convert n-dimensional data into new k-dimensional data, whose k features are called principal components. PCA re-establishes the coordinate axes on the basis of the n-dimensional feature data and constructs new k-dimensional feature data: it finds a set of mutually orthogonal coordinate axes in the original data space, ordered by the variance of the data along each axis, so that similar features are merged. The data after dimensionality reduction are therefore entirely new features. The PCA dimensionality reduction steps are as follows:
1a) Perform mean normalization on the continuous raw data to ensure that the data magnitude of each dimension is the same.
1b) Find the covariance matrix of the features:

$$\Sigma = \frac{1}{m}\sum_{i=1}^{m} x^{(i)} \left(x^{(i)}\right)^{T}$$

where m is the number of samples and $x^{(i)}$ is the i-th normalized sample.
1c) Obtain the eigenvalues and eigenvectors by SVD.
1d) Sort the eigenvalues from largest to smallest.
1e) Select the k components with the highest variance.
The PCA method is used to reduce the dimensionality of the data, and loop iteration is used to determine the value of K. If K is too large, the dimensionality reduction effect is not obvious and the training time is too long. If K is too small, the data are distorted. The experimental results for different values of K are shown in Fig. 5.

Figure 5: Graph of K value
It can be seen from the curve in Fig. 5 that when K is 35, the reduction rate reaches 0.95 and the experimental effect is better.
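One common way to run such a loop over K is to pick the smallest K whose leading eigenvalues account for a target share of the total variance. The sketch below assumes that the 0.95 figure refers to this cumulative variance share, which the paper does not state explicitly; the function name `choose_k` is illustrative.

```python
from typing import Sequence


def choose_k(eigenvalues: Sequence[float], target: float = 0.95) -> int:
    """Smallest k whose top-k eigenvalues reach `target` of the total variance."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running = 0.0
    for k, value in enumerate(ordered, start=1):
        running += value
        if running / total >= target:
            return k
    return len(ordered)
```

Here `eigenvalues` would be the eigenvalues of the covariance matrix of the mean-normalized data.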

Feature Selection on Data
The integrated rule tree is used to perform feature selection on the reduced data. The idea of the rule tree model is to divide the data set into multiple small-scale data groups by selecting appropriate split points; the choice of a split point considers all the features of the data as well as the candidate split points of each single feature. The Gini coefficient and cross entropy are generally used to measure purity. The integrated rule tree generates different tree models by using different features and the randomness of sampling, which ensures the generalization performance of the results and prevents the model from over-fitting. The constructed tree model is robust to process data. The change in the Gini coefficient attributed to each feature across the different rule tree models is used as the measure of feature importance, from which the feature importance index map is generated.
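The purity measure and the per-split change that feeds the importance score can be illustrated as follows. This is a sketch of the standard Gini impurity and its decrease for a single threshold split; how the authors aggregate the decreases over trees is not specified, and the function names are assumptions.

```python
from collections import Counter
from typing import Sequence


def gini(labels: Sequence[int]) -> float:
    """Gini impurity 1 - sum(p_c^2) of a group of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_decrease(values: Sequence[float], labels: Sequence[int],
                  threshold: float) -> float:
    """Impurity reduction from splitting one feature at `threshold`."""
    left = [t for v, t in zip(values, labels) if v <= threshold]
    right = [t for v, t in zip(values, labels) if v > threshold]
    n = len(labels)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(labels) - weighted
```

Summing the decreases credited to a feature over all splits and all trees gives one plausible importance score: features that repeatedly produce purer groups rank higher.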
In this experiment, principal component analysis and integrated rule tree feature selection are combined to reduce the dimensionality of the data while retaining the important features, which better preserves the original characteristics of the data and improves the training of the model.

Model Training
The parameters of the base models are then adjusted. This paper combines loop iteration with cross-validation to adjust the parameters of each base model. With 10-fold cross-validation, the data are split into 10 different training and validation sets, which yield 10 results. The parameters that give the optimal result are selected as the parameters of the base learner model.
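The combination of loop iteration and cross-validation can be sketched as a loop over candidate parameter values, each scored on k folds. The callback `score`, which would train a base learner with the given parameter on the training indices and return its accuracy on the validation indices, is hypothetical, and ranking candidates by their mean fold score is an assumption.

```python
from typing import Callable, Iterator, List, Sequence, Tuple


def kfold_splits(n: int, k: int) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_idx, val_idx) pairs for k contiguous folds over n samples."""
    start = 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, val
        start += size


def select_parameter(candidates: Sequence[int], n_samples: int,
                     score: Callable[[int, List[int], List[int]], float],
                     k: int = 10) -> Tuple[int, float]:
    """Return the candidate with the best mean score over k folds."""
    best_param, best_mean = candidates[0], float("-inf")
    for param in candidates:
        scores = [score(param, tr, va) for tr, va in kfold_splits(n_samples, k)]
        mean = sum(scores) / len(scores)
        if mean > best_mean:
            best_param, best_mean = param, mean
    return best_param, best_mean
```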

Comparative Analysis of Model Results
Precision and deviation are used to evaluate model performance. The precision is calculated as shown in Tab. 2:

$$P = \frac{TP}{TP + FP}$$

where P represents the precision, TP represents the number of positive examples that are correctly predicted, and FP represents the number of negative examples that are incorrectly predicted as positive. TE data are used to train each base learner model and the integrated model, and the experimental results of the models are compared and analyzed.
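As a small worked example, the sketch below counts the confusion-matrix entries for one class and derives the precision from them. It also computes a false alarm rate as FP / (FP + TN); that definition is the common one but is an assumption here, because the paper does not give its formula, and treating label 1 as the positive class is likewise illustrative.

```python
from typing import Sequence, Tuple


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int],
                     positive: int = 1) -> Tuple[int, int, int, int]:
    """Return (TP, FP, TN, FN) treating `positive` as the positive class."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive:
            if t == positive:
                tp += 1
            else:
                fp += 1
        elif t == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def precision(tp: int, fp: int) -> float:
    """P = TP / (TP + FP); defined as 0 when nothing is predicted positive."""
    return tp / (tp + fp) if tp + fp else 0.0


def false_alarm_rate(fp: int, tn: int) -> float:
    """FP / (FP + TN): share of true negatives flagged as positive."""
    return fp / (fp + tn) if fp + tn else 0.0
```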
The random forest base learner is optimized with the grid search method. The main parameters of the random forest are the number of decision trees, n_estimators, and the maximum depth of the tree, max_depth; the number of decision trees is the parameter searched by the grid search method. The experimental results are shown in Fig. 6. As can be seen from Fig. 6, the number of decision trees ranges from 0 to 50. When the number of decision trees is set to 30, the accuracy of the random forest algorithm is the highest, at 0.971. In this setting, the maximum depth of the tree is 2 and the number of jobs is -1. The learning curve of the random forest is shown in Fig. 7. It shows that as the number of training samples increases, the cross-validation bias gradually decreases and stabilizes once the number of samples reaches 800.
The parameters of the AdaBoost base learner are optimized next. Because this paper addresses multi-class anomaly detection for process data, SAMME.R is kept as the default algorithm parameter and a single-level decision tree is selected as the weak learner; the number of trees is then determined by the grid search method. The experimental results are shown in Fig. 8.
The comparison of anomaly detection accuracy shows that the accuracy of the integrated anomaly detection model is effectively improved and the false alarm rate is greatly reduced. This paper builds a process data anomaly detection model based on supervised learning algorithms. After feature engineering on the process data, principal component analysis is used to reduce the dimensionality of the data, and the Stacking integrated anomaly detection model is then built, which effectively improves the accuracy of anomaly detection. The method can be applied to anomaly detection for industrial and manufacturing process data and can improve the detection effect.

Conclusion
This paper proposes an anomaly detection model based on Stacking integration for process data under production conditions. Through PCA dimensionality reduction, high-dimensional and complex process data can be reduced to the dimensions required by the algorithm. The Stacking integrated model composed of the random forest, SVM, XGBoost, and AdaBoost base learners improves the accuracy rate and reduces the false alarm rate. It can judge the operating status of industrial equipment according to whether the process data are abnormal, and helps staff take security measures more quickly when an abnormal situation occurs. The ensemble model still offers many directions worth studying; improving its accuracy and reducing its running time are the next research interests.

Conflicts of Interest:
The authors declare that they have no conflicts of interest to report regarding the present study.