A stacked autoencoder‐based convolutional and recurrent deep neural network for detecting cyberattacks in interconnected power control systems

Modern interconnected power grids are a critical target of many kinds of cyber-attacks, potentially affecting public safety and causing significant economic damage. In such a scenario, more effective detection and early-alerting tools are needed. This study introduces a novel anomaly detection architecture, empowered by modern machine learning techniques and specifically targeted at power control systems. It is based on stacked deep neural networks, which have proven capable of identifying and classifying attacks in a timely manner by autonomously eliciting knowledge about them. The proposed architecture leverages automatically extracted spatial and temporal dependency relations to mine meaningful insights from data coming from the target power systems, which can be used as new features for classifying attacks. It achieves very high performance when applied to real scenarios, outperforming state-of-the-art approaches.


| INTRODUCTION
With the increasing success of the Internet of Things (IoT) and automation technologies, most industrial facilities and public service utilities, such as manufacturing plants, electricity, water, and gas distribution infrastructures, transportation control and monitoring systems, and many others, have been interconnected through public and private cyber-infrastructures, providing advanced automation control and interoperation capabilities. Despite the obvious advantages in terms of technological advancement, such practices have nonnegligible effects on the attack surface exposed by modern cyber-physical infrastructures. We remark that nowadays such infrastructures include sophisticated field sensor or actuator devices as well as supervisory control and data acquisition (SCADA) systems, often involving legacy hardware or software solutions with limited resistance and robustness to external cyber-attacks or generic abuses.
In this scenario, the electrical power grid, characterized by a significant improvement in automation and remote control facilities, becomes one of the most critical targets for attacks, potentially affecting public safety and hence deserving homeland security-level attention. These infrastructures now include many smart control and metering devices, such as phasor measurement units (PMUs), digital fault recorders, and so on, providing a huge amount of data about the overall system's state and operations and supporting automation decisions and fault monitoring. Specific power system events, like transmission line faults, which can be spontaneous or originated by compromised control devices sending fraudulent control commands or measurement data, can start a reaction chain leading to cascading blackouts or critical equipment faults if proper reactions do not take place in a timely manner. More specifically, attacks against power distribution infrastructures operate by exploiting vulnerabilities in remote management or metering devices and corrupting control signals to introduce faults, simulate disturbances, or fraudulently perform wrong control actions. 1 Proper situational awareness capabilities are needed to analyze these phenomena and discriminate attacks from real disturbances or faults occurring during normal system operations. This is not an easy task, due to the overall system complexity, the huge variety of available attacks, and their unpredictability. Furthermore, some specific attacks can only be spotted by considering the occurrence of a combination of specific events over time; consequently, stateful and context-dependent analysis and inference capabilities are needed, exploring correlations between events/observations in both space and time.
Consequently, a new generation of anomaly detection systems is needed, capable of recognizing attacks and distinguishing them from disturbances, faults, or legitimate control actions, in order to generate timely alerts to operators as well as to automatically trigger the proper reactions when an attack is discovered. To provide the needed versatility and situational awareness, these systems must be empowered by the most advanced artificial intelligence (AI) and machine learning (ML) technologies, to significantly outperform the traditional solutions based on the mining and analysis of specific field-related features. These systems must be able to exploit spatial and temporal correlations among all the different observations characterizing the system activities by autonomously eliciting new field-related knowledge. Moreover, the elicited knowledge consists of new (presumably unknown) features capable of unambiguously describing the attacks and the normal power system operations, and of correctly classifying them.
To this purpose, we propose a complex deep neural network (DNN) framework able to mine meaningful information from the power system monitoring data regarding both the relations among the different observations and their correlations over time, represented as spatial and temporal dependencies, constituting more expressive and representative features to be used for attack detection. As described below, this is accomplished through the usage of two sparse autoencoders (SAEs) whose encoder-decoder functions are replaced with two DNNs, a convolutional and a recurrent one, respectively, providing the ability to capture the aforementioned correlations between their inputs. Although convolutional and recurrent networks have proven to be effective even without being used in combination with autoencoders, their performance tends to decrease if they are trained on an unbalanced data set. Furthermore, for a high number of hidden layers, such networks are affected by the well-known problem of the vanishing gradient. 2 On the contrary, the stacked autoencoder-based deep network solves many issues related to the training of networks enclosing a high number of layers and offers a powerful tool that can also be used in different fields of application. Indeed, since our approach does not preliminarily require any specific field-related knowledge about the system of interest or about the kind of devices involved, and generates all the knowledge needed for its operations at training time, it can ideally be used for any kind of industrial control system (ICS) producing a sufficiently expressive amount of time-sequenced system monitoring observations. The complexity of sophisticated (and hidden) dependencies between events occurring over time, as well as among the values of many apparently unrelated observations, which does not emerge with more traditional analysis techniques, can be easily captured by the combination of the stacked networks.
Excellent performance evaluation results demonstrated that the proposed contribution and ideas are extremely promising for further investigation and implementation in real-world production scenarios to improve the dependability of critical infrastructures.
The main contributions of the paper are summarized as follows:
• A stacked-autoencoder-based convolutional and recurrent neural network is used for detecting cyberattacks in interconnected power control systems.
• Meaningful information is mined from the power system monitoring data and used for inferring the operation status of the control system.
• The elicited knowledge is able to describe unambiguously the attacks and the normal power system operations.
• Such information is expressed by spatial and temporal features, which capture the correlations existing among the data and their evolution over time.
• A detailed mathematical description of the network architecture is provided to highlight the effect of each network component in mining the aforementioned features.
The remainder of the paper is organized as follows. In Section 2, the related works are shown. A detailed description of the proposed architecture is reported in Section 3. The experiments and their results are discussed in Section 4. Finally, concluding remarks are provided in Section 5.

| RELATED WORK
The reliable and timely detection of attacks and anomalous events is an extremely challenging issue and hence a hot topic in the IoT and ICS research arena. An anomaly detection scheme for power systems based on whitelisting legitimate transactions within control-layer communications has been presented in Reference [3]. However, this approach is not able to detect fraudulently injected valid sequences of commands that aim at tripping protection relays to cause a blackout. Other detection solutions are based on power system-related theoretical issues, such as the one proposed in Reference [4], which, by leveraging optimal power flow programming, allows the detection of attacks that cause wrong power flow dispatching through the alteration of service measurements. Similarly, the approach presented in Reference [5] is able to spot attacks based on the injection of malicious data for tampering with state variables in the power system, through proper weighted state estimation techniques. Also, the unsupervised detection solution described in Reference [6] is aimed at spotting false data injection attacks in power systems by analyzing historical data and recognizing state vectors that deviate from normal trends. Clearly, the above approaches lack generality, and their scope is limited to the attacks at which they are specifically targeted, without any chance of extension to other ones. Host-based detection approaches for intelligent electronic devices (IEDs) operating within smart grids have been proposed in References [7] and [8]. Unfortunately, these localized approaches are only able to identify attacks against individual devices, without a global and more articulated vision of the phenomena of interest.
To determine causal relationships between multiple different observations/information and temporal state transitions, the approach presented in Reference [9] leveraged a Bayesian network, resulting in a specification-based detection framework capable of detecting attacks and classifying several substation scenarios. Temporal state-based specifications and sequential patterns have also been used for classifying attacks and power disturbance events 10 with heterogeneous time-synchronized observations. 11 However, all the above approaches based on mining large amounts of data for determining the nature of power system events may not be suitable for timely spotting phenomena of interest from measurements coming from synchro-phasors, since all the data to be analyzed must be available in memory to perform pattern classification. However, the combination of more traditional data-oriented solutions, leveraging physics-based PMU features, with ML capabilities may result in a more promising approach. 12 Indeed, the use of ML technology is a quite effective approach for the timely detection of attacks against industrial power control systems. The authors in Reference [13] extensively explored ML potentialities in the classification of power system disturbances, by explicitly focusing on discriminating attacks on power grids. A kernel ML method has been used in Reference [14] for baselining SCADA systems with the aim of isolating intrusions and faults. However, due to the lack of attack data, this approach proved of limited effectiveness in discriminating attacks. An ML-based model leveraging autoencoders for detecting PMU manipulation attacks has been presented in Reference [15], whereas two ML-based schemes for detecting stealth attacks in smart power grids are reported in Reference [16]: the first is based on principal component analysis (PCA) and distributed SVMs, while the second, unsupervised, detects measurement deviations.
An artificial neural network (ANN)-based approach for baselining power systems at the substation level and generating alarms in the presence of anomalous outliers has been presented in Reference [17]. The right choice of features assumes paramount importance in guaranteeing the success of these approaches. A convolutional neural network (CNN) has been used in Reference [18], together with random forests, for learning classification features from the convolution and downsampling of massive smart metering data, with the purpose of detecting electricity theft events. Finally, an approach based on a stacked long short-term memory (LSTM) network and a softmax classifier has been proposed in Reference [19] for multilevel anomaly detection in ICS. Nevertheless, most of these works have shown a reduction in performance when applied to unbalanced data sets, and have also proven to be time-expensive in the training phase. On the contrary, the experimental results related to our proposal show that the achieved performance of the proposed architecture remains steady across the different application scenarios and training/testing data sets.

| THE ANOMALY DETECTION FRAMEWORK
In this section, the proposed anomaly detection architecture is presented in detail. It is essentially focused on building a DNN able to mine, from measurement data related to ICS activity, meaningful evidence to be used for detection (i.e., the features), in order to distinguish between normal and anomalous situations/behaviors associated with disturbance, control, and cyber-attack events. To accomplish this, the relations existing among the power system monitoring data (referred to as status variables from here on) and their correlations over space and time are extracted. We refer to them as spatial and temporal dependencies, respectively. More precisely, the aim of the spatial dependencies is to discover relevant hidden insights capable of representing useful relations among all the involved status variables. On the other hand, the temporal ones are used for learning higher-order dependencies that may exist among a set of status variables over time.
As widely described in the scientific literature, CNNs have proven to be the best candidate for extracting spatial insights from data, especially in the computer vision and image processing fields. 20 As shown below, the usage of several stacked convolutional layers in CNNs, used to create a hierarchy of progressively more abstract representations of the input data, constitutes their main strength for mining spatial dependencies. Similarly, recurrent neural networks (RNNs) are the most suitable neural network typology for modeling temporal dependencies in time-series data. 21 Ultimately, the CNN-RNN combination in a unified network is able to elicit temporal dependencies among the spatial relations extracted by a convolutional network. This type of combined network is already widely used in the state-of-the-art across several scientific and industrial fields 22 where modeling spatio-temporal information is crucial, such as financial applications, speech recognition, weather forecasting, and more.
Nevertheless, training may be rather time-expensive for high-dimensional data. Moreover, the well-known problem of the vanishing gradient 2 for networks enclosing numerous layers could cause a significant reduction in performance. To overcome these drawbacks, also related to the curse of dimensionality problem, 23 we used a CNN and an RNN as encoder/decoder stages in two distinct SAEs. Each SAE is trained separately; then, along with a softmax-based classifier network, they are stacked to obtain a unified network.
SAEs are mainly used for learning a compressed representation of a set of data, that is, for dimensionality reduction. Unlike PCA, SAEs are able to learn nonlinear transfer functions. This ability allows representing the input data in a more meaningful state space, which offers the possibility of extracting more representative features from the data, better suited for classification purposes. As depicted in Figure 1, the proposed detection architecture makes use of two SAEs, named CNN-SAE and RNN-SAE, respectively. The encoder-decoder function of the former encloses a CNN, and it is trained on status variable values arranged as a two-dimensional (2D) image-like matrix. The latter SAE uses an RNN for its encoder-decoder functions, and it is trained on the data coming from the latent space of the first SAE. Next, the encoder stages of both SAEs are stacked to form a unique network, whose output is fed to a classifying network that performs the final attack detection.
From the theoretical point of view, the proposed architecture first extracts compressed spatial relations from the system variables through the CNN-SAE; next, a compact representation of their behavior over time (the temporal dependencies) is derived by the RNN-SAE. Finally, a softmax classifier, using all the extracted elements as its input features, concludes the network. As reported in Reference [24], such an approach has proven to be effective also in network traffic classification. The mathematical characterization of each component is provided in the following, starting from its theoretical bases, to ease the understanding of the deepest architectural insights.

| Input adaptation
Usually, the raw data captured by an industrial power control system are provided as a set of time-series status variables arranged as a one-dimensional (1D) vector, each related to a given timestamp. Let d be such a dimension; the d-dimensional input needs to be arranged into a 2D matrix to be fed into the 2D CNN. Accordingly, let x ∈ ℝ^d be the raw input and let x^[γ] be the derived 2D matrix; then the input to the CNN can be expressed as follows:

x^[γ] = Γ(x),   Γ : ℝ^d → ℝ^(m×n),   m · n = d,  (1)

where the superscript notation *^[γ] refers to all the specific parameters related to the CNN network.

FIGURE 1 The proposed architecture. It includes an input adapter that serves as a dimensional transformer of the input data, two SAEs, and a softmax fully connected neural network. CNN-SAE uses a CNN for its encoder, while RNN-SAE makes use of a long short-term memory network. Both SAEs are trained separately, and then their encoders are stacked along with the input stage and the softmax network to form the final network. CNN, convolutional neural network; RNN, recurrent neural network; SAE, sparse autoencoder.
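As a concrete illustration of the input adapter, the following minimal plain-Python sketch performs the row-major rearrangement of a flat d-dimensional vector into an image-like matrix (the 16 × 8 arrangement of the 128 status variables used in the experiments is assumed; the function name is illustrative):

```python
def adapt_input(x, rows, cols):
    """Arrange a flat vector of d = rows * cols status variables
    into a 2D image-like matrix, row-major: x_gamma[i][j] = x[i*cols + j]."""
    if len(x) != rows * cols:
        raise ValueError("d must equal rows * cols")
    return [[x[i * cols + j] for j in range(cols)] for i in range(rows)]

# Example: 128 basic elements arranged as a 16 x 8 matrix
x = list(range(128))
x_gamma = adapt_input(x, 16, 8)
```

In a real pipeline this step is purely a reshape; no information is added or lost, only the neighborhood structure exploited by the 2D convolution is made explicit.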

| Sparse autoencoder
With reference to the general autoencoder (AE) architecture, Equation (2) expresses the overall input-output transfer function. As can be observed, the input (x^[α]) is fed to the hidden layer, whose output (h, the latent space) is used to reconstruct the input through the output layer (y). We use the same superscript notation *^[α] to refer to all the specific parameters related to the AE network:

h = σ(W^[α] x^[α] + b^[α]),   y = σ(W′^[α] h + b′^[α]),  (2)

where σ is the activation function.
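The transfer function of Equation (2) can be sketched numerically in plain Python (toy dimensions and weights, purely illustrative):

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def ae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Autoencoder transfer: input x -> hidden h (latent space) -> reconstruction y."""
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(W_enc, b_enc)]
    y = [sigmoid(sum(w * hi for w, hi in zip(row, h)) + b)
         for row, b in zip(W_dec, b_dec)]
    return h, y

# Toy example: 3 inputs compressed to a 2-unit latent space
x = [0.2, 0.7, 0.1]
W_enc = [[0.5, -0.3, 0.8], [-0.1, 0.4, 0.2]]
b_enc = [0.0, 0.0]
W_dec = [[0.3, -0.2], [0.1, 0.6], [-0.4, 0.5]]
b_dec = [0.0, 0.0, 0.0]
h, y = ae_forward(x, W_enc, b_enc, W_dec, b_dec)
```

Training minimizes the reconstruction error between x and y; the hidden vector h is the compressed representation discussed in the text.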
As is known, SAEs are used to obtain a compact representation of the data. 25 The sparsity property is obtained by training the network so that only a small number of hidden neurons are simultaneously activated. In the extreme case, in which only one neuron is active at a time, the activation of a hidden neuron is caused by a specific content (feature) in the input. Consequently, it is possible, in theory, to capture as many content-features as there are hidden neurons.
Indeed, according to Equation (3), the output of the lth hidden neuron (h_l) of an SAE is given by:

h_l = σ( Σ_{k=1}^{d} w_{l,k} x_k^[ϑ] + b_l^[ϑ] ).  (4)

Notice that here we have replaced the superscript notation *^[α] with *^[ϑ] to refer to the SAE.
Accordingly, the content (x̂) of the input (x^[ϑ]) capable of activating the lth hidden neuron can be expressed by:

x̂_k = w_{l,k} / √( Σ_{j=1}^{d} w_{l,j}² ).  (5)

That is, without considering the constant term associated with the denominator, the weights w_{l,k}, related to the lth hidden neuron, identify a specific features-vector representing a specific content enclosed in the raw data. In other words, the set of w_{l,k} for a given l corresponds to the features we were looking for.
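The feature-visualization argument above can be illustrated numerically: among all unit-norm inputs, the normalized weight vector of a hidden neuron is the one that maximizes its pre-activation. A minimal sketch (the weights are illustrative):

```python
import math

def activating_content(w_l):
    """Equation (5): the unit-norm input that maximally activates
    the l-th hidden neuron is its weight vector, normalized."""
    norm = math.sqrt(sum(w * w for w in w_l))
    return [w / norm for w in w_l]

dot = lambda a, b: sum(u * v for u, v in zip(a, b))

w_l = [3.0, 4.0]                 # weights of one hidden neuron
x_hat = activating_content(w_l)  # normalized weight vector
other = [1.0, 0.0]               # any other unit-norm input
```

By the Cauchy-Schwarz inequality, w_l . x_hat is larger than w_l . v for any other unit-norm v, which is exactly why the weight vector can be read as the "content" the neuron detects.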
Nevertheless, as shown below, several transformations can be applied to these weights by using the CNN and RNN as encoder-decoder functions. Such derived weights are able to extract more complex content from the data, and in particular to mine the spatial and temporal dependencies to be used as detection features.

| CNN-SAE
The convolution operation between an input matrix (x^[γ] ∈ ℝ^(m×n)) and N_f bidimensional filters, whose dimensions are P and Q (K^(f) ∈ ℝ^(P×Q)), is defined as follows:

Y^(f)_{i,j} = Σ_{p=1}^{P} Σ_{q=1}^{Q} K^(f)_{p,q} x^[γ]_{i+p−1, j+q−1},   f = 1, …, N_f,  (6)

with Y^(f)_{i,j} being the components of the convolutional operation. Note that the components of x^[γ], appearing in Equation (6), are directly related to the components of the 1D vector x according to the following relation:

x^[γ]_{i,j} = x_{(i−1)n + j}.  (7)

Letting ζ(i, j; p, q) = (i + p − 1, j + q − 1) be the index mapping associated with Equation (7), and replacing Equation (6) into Equation (4), we have:

h_l = σ( Σ_k w′_{l,k} x_k + b′_l ),  (8)

with:

w′_{l,k} = Σ_{(f,p,q): ζ(i,j;p,q)=k} w_{l,(f,i,j)} K^(f)_{p,q}.  (9)

As depicted, the weights derived by using a CNN as the encoder-decoder function in the SAE express a more complex relation involving the features extracted by the latent space of a pure SAE. Indeed, as can be observed from Equation (9), the new features w′_{l,k} are expressed as a linear combination of the basic ones, w_{l,k}, of Equation (4). As a consequence, we can assert that the features of Equation (9) are able to capture the aforementioned spatial dependencies from the input data.
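The convolution in Equation (6) can be sketched in plain Python for a single filter with valid padding (dimensions and values are illustrative):

```python
def conv2d_valid(x, k):
    """Y[i][j] = sum_{p,q} k[p][q] * x[i+p][j+q]  (0-based, valid padding,
    single filter, no bias or activation)."""
    m, n = len(x), len(x[0])
    P, Q = len(k), len(k[0])
    return [[sum(k[p][q] * x[i + p][j + q] for p in range(P) for q in range(Q))
             for j in range(n - Q + 1)]
            for i in range(m - P + 1)]

# Toy example: 3x3 input, 2x2 diagonal filter
x = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]
k = [[1, 0],
     [0, 1]]
y = conv2d_valid(x, k)  # 2x2 output
```

Each output component is a fixed linear combination of a local patch of the input, which is precisely the structure that lets Equation (9) express the derived features as linear combinations of the basic SAE weights.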

| RNN-SAE
RNNs are networks widely used for mining temporal dependencies among data by maintaining a memory of information over time. Nevertheless, as already mentioned, due to the vanishing gradient problem, the training of RNNs could be difficult, and in some cases impossible. LSTM and gated recurrent unit (GRU) networks are powerful alternatives capable of overcoming this issue. 26 A single LSTM unit includes a memory cell maintaining information over sufficiently long time intervals. A set of gates is used to control the information flow and to make decisions about what data should be put into memory, and when they need to be forgotten. GRUs are similar to LSTMs, but they use a simplified structure comprising fewer gates to control the flow of information. As depicted in Figure 2, an LSTM unit includes different gates, which can be described by the following equations:

f_t = σ(W_f · [h_{t−1}, x_t] + b_f),  (10)
i_t = σ(W_i · [h_{t−1}, x_t] + b_i),  (11)
o_t = σ(W_o · [h_{t−1}, x_t] + b_o),  (12)
C̃_t = tanh(W_c · [h_{t−1}, x_t] + b_c),  (13)
C_t = f_t ⊗ C_{t−1} + i_t ⊗ C̃_t,  (14)
h_t = o_t ⊗ tanh(C_t),  (15)

where ⊗ indicates the element-wise product; W_f, W_i, W_c, W_o are matrices that act as linear transformations on the data; and C_t and h_t are, respectively, the state of the memory and the output at a given timestamp t. Note that the output (h_t) is time-dependent and is related to the state that the cell assumed at the previous timestamps.
LSTM accepts as input a time-series of data of a given length T, and the final output is produced only after all the input data have been examined. Consequently, the equations from (10) to (15) are continuously and recursively updated for T timestamps.
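The recursion just described can be sketched in plain Python in scalar form (one-dimensional state, illustrative weights, biases omitted; real LSTM layers apply the same update with matrices over vectors):

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def lstm_step(x_t, h_prev, c_prev, w):
    """One LSTM update (scalar form of Equations 10-15). Each entry of w holds
    the gate weights applied to h_prev and x_t, respectively."""
    f = sigmoid(w["f"][0] * h_prev + w["f"][1] * x_t)   # forget gate
    i = sigmoid(w["i"][0] * h_prev + w["i"][1] * x_t)   # input gate
    o = sigmoid(w["o"][0] * h_prev + w["o"][1] * x_t)   # output gate
    c_tilde = math.tanh(w["c"][0] * h_prev + w["c"][1] * x_t)
    c = f * c_prev + i * c_tilde                        # memory state (Eq. 14)
    h = o * math.tanh(c)                                # cell output (Eq. 15)
    return h, c

w = {"f": (0.1, 0.5), "i": (0.2, 0.4), "c": (0.3, 0.6), "o": (0.1, 0.7)}
h, c = 0.0, 0.0                 # null initial state
for x_t in [0.5, -0.2, 0.8]:    # T = 3 timestamps
    h, c = lstm_step(x_t, h, c, w)
```

After the loop, h is the final output y_T, produced only once all T inputs have been examined, exactly as stated above.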
Let t ∈ {1, …, T} be the timestamps at which the input data are considered, and let x(t) denote the input at timestamp t; then the output (y_T) at timestamp T can be expressed by:

y_k^T = Ψ_k^T(X),   X = (x(1), …, x(T)),  (16)

where Ψ^T expresses the recursive updating of the equations from (10) to (15) over T timestamps. Feeding this output to the SAE hidden layer, as in Equation (4), yields:

h_l = σ( Σ_k w_{l,k} Ψ_k^T(X) + b_l ).  (17)

Comparing Equation (17) with Equation (4), it can be noted that the derived features gather the behavior of the input over time (T timestamps) in a compact form.
A first-order Taylor approximation of Ψ^T can help to better understand how the derived features are related to the basic ones provided by the pure SAE of Equation (4).
To accomplish this, let Ψ_k^T(X) be the kth component of Ψ^T, with X the set of the inputs x(t) for t = 1, …, T. Then, according to Taylor's theorem, the first-order expansion around the null initial hypothesis X̄ = (0, …, 0) is:

Ψ_k^T(X) ≈ Ψ_k^T(X̄) + ∇Ψ_k^T(X̄) · (X − X̄).  (18)

Note that at the initial stage of operation the cell outputs expressed by Equations (14) and (15) are null, so Ψ_k^T(X̄) = 0, and by considering that Ψ_k^T(X) is a multivariable scalar-valued function, we have:

∇Ψ_k^T(X̄) · X = Σ_{t=1}^{T} Σ_{u=1}^{d} a_{k,u}(t) x_u(t),   a_{k,u}(t) = ∂Ψ_k^T/∂x_u(t) |_{X=X̄},  (19)

which, replaced in Equation (18), yields:

Ψ_k^T(X) ≈ Σ_{t=1}^{T} Σ_{u=1}^{d} a_{k,u}(t) x_u(t).  (20)

Moreover, remarking that x(t) is a d-dimensional column vector, then:

y_k^T ≈ Σ_{t=1}^{T} a_k(t)ᵀ x(t).  (21)

By replacing x_k^[α] of Equation (4) with y_k^T, it yields:

h_l = σ( Σ_k w_{l,k} Σ_{t=1}^{T} Σ_{u=1}^{d} a_{k,u}(t) x_u(t) + b_l ),  (22)

which can be rewritten as follows:

h_l = σ( Σ_{t=1}^{T} Σ_{u=1}^{d} w′_{l,u}(t) x_u(t) + b_l ),   w′_{l,u}(t) = Σ_k w_{l,k} a_{k,u}(t).  (23)

As depicted, the derived features (w′_{l,u}(t)) express a complex form of the content enclosed in the data observed at different timestamps t. Also, their summation (on k) groups into a compact result all the behaviors of the input over T timestamps. Ultimately, Equation (23) shows that the features derived through the RNN-SAE capture the aforementioned temporal dependencies.

| Stacked CNN-RNN-SAE
Once the two SAEs have been separately trained, each on the input provided by the previous stage, the input adapter (see Section 3.1) and the encoder functions of both CNN-SAE (Section 3.3) and RNN-SAE (Section 3.4) are stacked to form a unique network.
With reference to Equations (1), (6), and (7), any component Y^(f)_{i,j}(t) of the convolutional encoder output at time t can be expressed as:

Y^(f)_{i,j}(t) = σ( Σ_{p=1}^{P} Σ_{q=1}^{Q} K^(f)_{p,q} x_{ζ(i,j;p,q)}(t) + b^(f) ).  (24)

Next, according to Equation (16), any component k of the LSTM-cell output after T iterations can be expressed as:

y_k^T = Ψ_k^T(Y),   Y = (Y(1), …, Y(T)),  (25)

so that the output of the lth hidden neuron of the stacked network is:

h_l = σ( Σ_k w^[ϱ]_{l,k} Ψ_k^T(Y) + b^[ϱ]_l ).  (26)

Here we use the superscript notation *^[ϱ] to refer to all the specific parameters related to the LSTM network. Expanding Equation (26) in terms of the raw input yields the overall transfer function of the stacked encoders:

h_l = σ( Σ_k w^[ϱ]_{l,k} Ψ_k^T( σ(W′x(1) + b′), …, σ(W′x(T) + b′) ) + b^[ϱ]_l ).  (27)

As depicted from these equations, the derived features are produced by a combination of the weights associated with the encoders of both the CNN-SAE and the RNN-SAE, that is, w′_{u,ν} and w^[ϱ]_{l,k}, respectively.

As above, a first-order Taylor expansion can help to understand the meaning of Equation (27). By substituting Y_u(t) of Equation (24), with flattened index u = (f, i, j), into Equation (25), for a sigmoid activation function it yields:

y_k^T = Ψ_k^T( σ(W′x(1) + b′), …, σ(W′x(T) + b′) ),  (28)

with the entries of W′ given by:

w′_{u,ν} = K^(f)_{p,q}   for u = (f, i, j) and ν = ζ(i, j; p, q),  (29)

and x(t) being the column vector of the input at time t. Thus, if we assume X̄ = (0, …, 0)′ and evaluate the gradient of Ψ_k^T for X = X̄, then:

a_{k,u}(t) = ∂Ψ_k^T/∂Y_u(t) |_{X=X̄},   u = 1, …, s,  (30)

where s is the number of hidden neurons, first defined in Section 3.2. Linearizing the sigmoid around the bias:

σ( Σ_ν w′_{u,ν} x_ν(t) + b′_u ) ≈ σ(b′_u) + σ′(b′_u) Σ_ν w′_{u,ν} x_ν(t),  (31)

and applying the first-order expansion of Ψ_k^T to its arguments:

Ψ_k^T(Y) ≈ Σ_{t=1}^{T} Σ_{u} a_{k,u}(t) Y_u(t),  (32)

the Taylor series expansion of Ψ_k^T can be expressed as follows:

Ψ_k^T(Y) ≈ Σ_{t=1}^{T} Σ_{u} a_{k,u}(t) [ σ(b′_u) + σ′(b′_u) Σ_ν w′_{u,ν} x_ν(t) ].  (33)

Finally, by substituting Equation (33) into Equation (26), and then into Equation (27), it yields:

h_l = σ( Σ_k w^[ϱ]_{l,k} Σ_{t=1}^{T} Σ_{u} a_{k,u}(t) σ′(b′_u) Σ_ν w′_{u,ν} x_ν(t) + b″_l ),  (34)

which can be rewritten as follows:

h_l = σ( Σ_{t=1}^{T} Σ_{ν} w″_{l,ν}(t) x_ν(t) + b″_l ),   w″_{l,ν}(t) = Σ_k Σ_u w^[ϱ]_{l,k} a_{k,u}(t) σ′(b′_u) w′_{u,ν}.  (35)

By substituting w′_{u,ν} of Equation (29) into Equation (35), it is possible to observe that the temporal dependency features at timestamp t are given by:

w″_{l,ν}(t) = Σ_k Σ_u w^[ϱ]_{l,k} a_{k,u}(t) σ′(b′_u) K^(f)_{p,q}.  (36)

As depicted, the derived features (w″_{l,ν}(t)) combine the weights of the CNN encoder (through the filters K^(f)) with the temporal behavior captured by the LSTM over T timestamps, thus expressing the spatio-temporal dependencies used by the final classifier.

| EXPERIMENTAL RESULTS
To evaluate the effectiveness of the proposed approach, in this section, the results obtained over a realistic testbed are shown and compared with the state-of-the-art.

| Power system testbed and attack scenarios
The data used in our experiments refer to a power system testbed framework provided by Mississippi State University and Oak Ridge National Laboratory, 27 enclosing several real-world operating scenarios. As depicted in Figure 3, it includes two power generators (G1 and G2) and four IEDs (R1 through R4), each controlling an assigned breaker (BR1 through BR4). The IEDs are able to detect faults and consequently trip the breakers, but they are not equipped with the intelligence to distinguish whether the faults are actually valid or fake. The control panel includes all the tools needed for monitoring and controlling the power system.
The provided framework generates 37 total scenarios, which include 8 Natural events, 1 No-event, and 28 Attack events. Each event comprises thousands of monitoring observations captured by four PMUs working at 120 samples/s, one control panel logging system, and a Snort intrusion detection system (IDS). Each PMU provides 29 status variables, so the set of all PMUs provides 116 (29 × 4) variables, which become the features that directly describe the power system behavior, while the Snort IDS and the logs provide 12 additional elements to be used for spotting anomalous events in the overall power control system. This results in 128 basic elements describing the system activity, to be used for detecting and recognizing attacks, plus the class (or supervisory signal), consisting in a label reporting the involved attack type or the absence of anomalies (normal activity). For a more detailed description of these items, we refer the reader to Reference [27]. These basic elements are the starting point for mining the more meaningful space- or time-related features that are useful in detecting attacks.
All data are arranged into three schemes:
• Binary. Only two classes of operations are considered, that is, Attack and Normal, which include 28 and 9 events, respectively. Table 1 shows the numeric distribution of the items for each class.
• Three-Class. All the scenarios are grouped into three classes, that is, Attack (28 events), Natural (8 events), and No-Events (1 event). Table 2 shows the details of the numeric distribution.
• Multiclass. Each of the 37 scenarios is considered as a separate class.

As shown below, we tested our approach for each of these schemes and compared it with some ML-based solutions characterizing the state-of-the-art.
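The three schemes amount to different groupings of the same 37 scenario labels. A plain-Python sketch of the mapping (the numeric scenario identifiers are illustrative; only the 28/8/1 split is taken from the testbed description):

```python
# Illustrative scenario identifiers: 28 attack events, 8 natural events, 1 no-event
ATTACK_EVENTS = set(range(1, 29))
NATURAL_EVENTS = set(range(29, 37))
NO_EVENT = {37}

def to_binary(scenario):
    """Binary scheme: Attack vs. Normal (natural events + no-event)."""
    return "Attack" if scenario in ATTACK_EVENTS else "Normal"

def to_three_class(scenario):
    """Three-class scheme: Attack, Natural, NoEvents."""
    if scenario in ATTACK_EVENTS:
        return "Attack"
    return "Natural" if scenario in NATURAL_EVENTS else "NoEvents"

def to_multiclass(scenario):
    """Multiclass scheme: every scenario is its own class."""
    return scenario
```

The same observations are thus relabeled three times, and the classifier is retrained once per scheme.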

| Experimental setting and network training
The proposed solution and all the experiments were implemented by using the Python language and the Keras-Tensorflow library running on a Desktop PC equipped with an Intel i7 CPU operating at 2.00 GHz frequency, 16 GB RAM, and an NVIDIA GeForce MX150 GPU with 4 GB of memory. According to Equation (7), the 128 basic elements to be used for detection were arranged in a (16 × 8)-dimensional matrix.
The CNN-SAE was implemented by using two Conv2D layers including 14 and 23 (6 × 3)-dimensional filters, respectively. Padding = same and a ReLU activation function were used. We remark that the inverse layer configuration, that is, 23 and 14 filters, was used for the decoder. An Adam optimizer and an MSE loss function were used for training.
From the CNN-SAE, a new net (CNN-SAE-Encoder) was extracted comprising the input adapter and the encoder function of CNN-SAE. The resulting features, representing spatial relations among status variables, were fed into this network and its output (cf. Equation 8) was used to train the second SAE, that is, the RNN-SAE that will extract temporal dependencies.
The RNN-SAE was implemented by using two LSTM layers of 16 and 36 cells, respectively. A window of 10 timesteps with a stride of 1 was used, which corresponds to roughly 83 ms for a PMU working at 120 samples/s. The reconstruction over time of the input was obtained by using a TimeDistributed layer. Also in this case, an Adam optimizer and an MSE loss function were used for the training.
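The 10-timestep input with stride 1 can be sketched as a sliding window over the time-ordered feature vectors (the function name is illustrative):

```python
def sliding_windows(series, timesteps=10, stride=1):
    """Build overlapping windows of `timesteps` consecutive samples.
    At 120 samples/s, 10 samples span roughly 83 ms of PMU data."""
    return [series[i:i + timesteps]
            for i in range(0, len(series) - timesteps + 1, stride)]

samples = list(range(100))          # 100 time-ordered samples (placeholders)
windows = sliding_windows(samples)  # overlapping windows of length 10
```

Each window becomes one LSTM input sequence of length T = 10; the stride of 1 means consecutive windows overlap by 9 samples.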
The encoder function of the RNN-SAE was stacked with the CNN-SAE-Encoder, and its output (cf. Equations 27 and 35) was fed into a fully connected softmax neural network. The softmax network was used to classify the scenarios, and it includes two dense layers of 64 and 24 neurons, respectively. Also, to avoid overfitting, a dropout layer with a rate of 0.5 was added after each dense layer of the softmax network. As is known, dropout is a regularization process that probabilistically removes some inputs during training, yielding the effect of training layers with a different number of nodes and thus avoiding overfitting. A ReLU activation function was used, and the training was performed by using the Adam optimizer and the categorical cross-entropy loss function. Finally, the whole stacked network was further trained for fine-tuning. Note that all the networks were trained for a very low number of epochs, that is, 100. This means that the time spent on training was around 1 min and 20 s.

| Evaluation metrics and results
To test the performance of the proposed stacked network architecture, all the data sets associated with the aforementioned schemes were first divided by class; then, the data of each class were divided into two parts, namely the training (70%) and testing (30%) sets. As usual, all data sets were rescaled through min-max normalization to avoid biases. Moreover, the exploratory data analysis (EDA) showed no outliers or missing values; therefore, no further preprocessing was required. Well-known metrics derived from the multiclass confusion matrix 28 were used for the evaluation of the results obtained on the testing sets, that is: Accuracy (Acc), Sensitivity (Sens), Specificity (Spec), Precision (Prec), Area under the ROC curve (AUC), and F_Measure (Fmea).
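The preprocessing steps just described (per-class 70/30 split and min-max rescaling) can be sketched in plain Python (function names and toy values are illustrative):

```python
def min_max_rescale(values):
    """Rescale a feature column to [0, 1] to avoid scale biases."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]

def split_per_class(items_by_class, train_frac=0.7):
    """Split each class separately into training/testing sets,
    preserving the per-class proportions."""
    train, test = [], []
    for items in items_by_class.values():
        cut = int(len(items) * train_frac)
        train.extend(items[:cut])
        test.extend(items[cut:])
    return train, test

col = [2.0, 4.0, 6.0, 10.0]
scaled = min_max_rescale(col)
train, test = split_per_class({"Attack": list(range(10)),
                               "Normal": list(range(10, 20))})
```

Splitting per class, rather than over the whole pool, keeps the class proportions identical in the training and testing sets.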
More precisely, the metrics are defined as:

Acc = (TP + TN) / (TP + TN + FP + FN),
Sens = TP / (TP + FN),
Spec = TN / (TN + FP),
Prec = TP / (TP + FP),
Fmea = 2 · Prec · Sens / (Prec + Sens),

where TPs denote true positives and are the scenarios correctly classified, FPs denote false positives and are the scenarios incorrectly classified, FNs denote false negatives and are the scenarios incorrectly rejected, and TNs denote true negatives and are the scenarios correctly rejected. The AUC is computed as the area under the ROC curve.
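A minimal sketch of these per-class metrics computed from the confusion counts:

```python
def metrics(tp, fp, fn, tn):
    """Standard metrics derived from the binary confusion matrix."""
    acc = (tp + tn) / (tp + fp + fn + tn)
    sens = tp / (tp + fn)                    # sensitivity / recall
    spec = tn / (tn + fp)                    # specificity
    prec = tp / (tp + fp)                    # precision
    fmea = 2 * prec * sens / (prec + sens)   # F-measure
    return acc, sens, spec, prec, fmea

acc, sens, spec, prec, fmea = metrics(tp=90, fp=10, fn=10, tn=90)
```

In the multiclass setting these quantities are computed one class at a time (one-vs-rest) and then averaged, as done in the following sections.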
For each metric, the results are reported as the average over the different classes. All the metrics were computed using the scikit-learn Python library. 29 The results for each scheme are provided in the following sections.

| Binary class
Figure 4 depicts, for each data set (data1-data15), the performance metrics averaged over all the classes. As shown, for most of the data sets the metrics reached 100%. The metrics associated with the remaining data sets achieved values only slightly below 100%. Indeed, the worst case is represented by data11, which achieved 98.91%, 97.73%, 99.29%, 97.73%, and 98.48% for Accuracy, Sensitivity, Specificity, Precision, AUC, and F_measure, respectively.

| Three-class
Also in this case, for most of the data sets (9 out of 15, i.e., data1, 5, 7, 8, 9, 10, 11, 13, and 15) the metrics reached the maximum value of 100%, while for the remaining data sets the metrics achieved values just below the maximum. Indeed, as depicted in Figure 5, both data2 and data6 achieved 99.91% Accuracy and 99.87% F_measure. Slightly lower values were obtained on the data12 and data14 data sets, with Accuracies of 99.74% and 99.46%, respectively. The worst cases are associated with data4 and data3, with Accuracies of 99.17% and 99.08%, respectively.

| Multiclass
As shown in Figure 6, in the multiclass scenario, even though the metrics achieved high (near-perfect) values overall, only data15 reached 100% on all metrics. Nevertheless, the metrics associated with data sets 3 to 14 achieved values very close to those of data15, that is, greater than 99%. Unlike data15, however, the Sensitivity associated with the other data sets was lower than in both the binary and three-class schemes. Still, all the metrics assumed high values; indeed, even the worst case, corresponding to data6, achieved performance higher than 90%.

| Comparison and discussion
As shown in the previous sections, our approach achieved very high performance in all the scenarios. In addition, we remark that the training procedure was very fast (80 s on average, despite operating with limited computing resources) and that, in all the experiments, the accuracy evaluated on the training sets reached 100%. To give further proof of the effectiveness of the proposed architecture, Figures 7-9 show a comparison with the ML-based state of the art. In particular, we compared our approach with seven different algorithms presented in Reference [27], that is, OneR, 30 NNge, 31 Random Forests, 32 Naive Bayes, 33 SVM, 34 JRipper, 35 and Adaboost. 36 The results refer to the average Accuracy computed using the 10-fold cross-validation method. Accordingly, for each scheme (i.e., binary, three-class, and multiclass), every data set was partitioned into 10 subsets containing random instances from each category. Each subset was in turn employed as a testing set for performance evaluation, while the remaining subsets were used as the training set. All the considered approaches were tested on the same data sets; no other particular setting was used in the comparison. 27 As depicted, our approach outperformed the others in all the testbed evaluation schemes. In particular, for both the three-class and multiclass schemes, our proposal outperformed the best ML-based algorithms by around 10% on average.

FIGURE 7: Comparison with the state of the art, binary-class scheme. Our approach outperformed the best state-of-the-art solution (i.e., Adaboost) by around 5% on average.
FIGURE 8: Comparison with the state of the art, three-class scheme. Our proposal outperformed the best state-of-the-art solution (i.e., Adaboost) by around 10% on average.
For the binary-class scheme, although the performance achieved by some ML-based algorithms is very high, our approach still outperformed the best state-of-the-art solution (i.e., Adaboost) by around 5% on average. Finally, it should be noted that, differently from most of the available state-of-the-art solutions, the results of our approach remain steady across the specific evaluation schemes and data sets used. Besides, to highlight how the proposed neural network configuration, and in particular the usage of the SAEs, influences the classification performance, we compared our findings with those obtained by using only a CNN or an LSTM network followed by a fully connected softmax network (referred to as CNN-NN and LSTM-NN, respectively). To ensure a fair comparison, all the configurations were evaluated under the same operating conditions, that is, the same training/testing data set pair for each fold of the 10-fold cross-validation approach. Tables 3 and 4 report the evaluation metrics averaged across all data sets (i.e., Data1-Data15) and classes. As depicted, for both network configurations the metrics achieved lower values than our proposal (see Table 5). The best performance between the two was achieved by the CNN-NN configuration (Table 3). Although both Sensitivity and Specificity achieved a good score of 89%, they were around 10% lower than those obtained by our proposal, which achieved 100% performance.

FIGURE 9: Comparison with the state of the art, multiclass scheme. Our proposal outperformed the best state-of-the-art solution (i.e., Adaboost) by around 10% on average.
TABLE 3: Evaluation metrics averaged across all data sets and classes for the CNN-NN configuration.
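The 10-fold protocol described above (stratified folds drawn from every category, each fold used once as the testing set, accuracies averaged) can be sketched as follows. The nearest-centroid classifier and the synthetic two-class data are placeholder assumptions for illustration, not the models or data from the paper:

```python
import numpy as np

def stratified_folds(y, k=10, seed=0):
    """Assign each instance to one of k folds, drawing from every class."""
    rng = np.random.default_rng(seed)
    folds = np.empty(len(y), dtype=int)
    for c in np.unique(y):
        idx = np.flatnonzero(y == c)
        rng.shuffle(idx)
        folds[idx] = np.arange(len(idx)) % k
    return folds

def cross_val_accuracy(X, y, k=10):
    """Average testing accuracy over k stratified folds."""
    folds = stratified_folds(y, k)
    accs = []
    for f in range(k):
        tr, te = folds != f, folds == f
        # Placeholder model: classify by the nearest training-class centroid.
        cents = {c: X[tr & (y == c)].mean(axis=0) for c in np.unique(y)}
        classes = sorted(cents)
        dists = np.stack([np.linalg.norm(X[te] - cents[c], axis=1)
                          for c in classes])
        pred = np.array(classes)[dists.argmin(axis=0)]
        accs.append((pred == y[te]).mean())
    return float(np.mean(accs))

# Two well-separated synthetic classes, 50 instances each.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(3, 0.3, (50, 2))])
y = np.repeat([0, 1], 50)
acc = cross_val_accuracy(X, y)
```

Because every instance appears in exactly one testing fold, the averaged accuracy uses each scenario for evaluation exactly once, which is what makes the comparison across classifiers fair.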

| CONCLUSIONS AND FUTURE WORKS
In this paper, a stacked autoencoder-based convolutional and recurrent DNN is presented for detecting cyber-attacks in power control systems interconnected over the IoT. More specifically, the proposed architecture leverages two SAEs whose encoder-decoder functions are replaced with a convolutional (CNN-SAE) and a recurrent (RNN-SAE) neural network, respectively. The encoder functions are combined and fed into a softmax-based classifier for inferring the operating status of the control system. Such a detection architecture can be particularly useful for supporting early alerting and automatic reaction systems in situations in which reaction time and precision are crucial. Indeed, extensive performance evaluation experiments on realistic power system data resulted in almost perfect classification outcomes in both binary- and multi-class scenarios. The proposed architectural framework also exhibits good adaptability to other kinds of ICS applications without requiring preliminary field knowledge, since autoencoders with convolutional and recurrent encoding proved effective at autonomously mining new, more meaningful features from data. In particular, they allowed the extraction of hard-to-detect correlations among data (spatial dependencies) and among data over time (temporal dependencies). Besides, because each autoencoder is trained separately, the architecture mitigates the vanishing gradient problem that affects large neural networks. Also, the compact representation (latent space) offered by the autoencoders helped overcome the training problems related to unbalanced data sets. Finally, the proposed architecture has proven to have a very low training latency: the training procedure required only 80 s on average despite using very limited computing resources, while for all the experiments the accuracy evaluated on the training sets reached 100%.
The excellent results, obtained by the proposed approach without using preliminary knowledge of the application field, encourage us to continue the investigation in this direction. In the future, we intend to evaluate the performance of the proposal on scenarios drawn from other application fields, not necessarily related to power systems. To this end, we intend to generalize the proposed architecture and then implement a web-based service capable of offering an anomaly detection framework usable online over the Internet.