Generative Adversarial Networks With AdaBoost Ensemble Learning for Anomaly Detection in High-Speed Train Automatic Doors

Due to the scarcity of abnormal condition data in components of transportation systems, only normal condition data are typically used to train models for anomaly detection. One of the main challenges is properly representing the data distribution, which is typically non-smooth, high-dimensional and supported on a manifold. This work develops an anomaly detection model based on an Auto-Encoder (AE) formed by the generator of a Generative Adversarial Network (GAN) and an auxiliary encoder to capture the sophisticated data structure. The reconstruction error of the AE is then used as the anomaly score to detect anomalies. Additionally, an adaptive noise is added to the data to ease the GAN optimization, an AdaBoost-based ensemble learning scheme is used to improve detection performance, and a new approach for setting the hyperparameters of the AE-GAN model is developed, based on the derivation of a lower bound of the Jensen-Shannon divergence between the generator and normal condition data distributions. The method has been applied to synthetic data and to real data collected from automatic doors of high-speed trains.

in transportation systems. These latter applications are made possible by the fact that sensors measure a variety of signals for the control and monitoring of critical components of transportation systems; monitoring the behavior of these components can improve the efficiency of operation and reduce the cost of maintenance. Anomaly detection approaches are typically categorized as supervised, unsupervised and one-class classification [5]. Supervised methods require the availability of a sufficient number of signal measurements labelled with the information on the component health state, i.e. normal or anomalous. They typically face two problems: imbalanced datasets, since abnormal condition data are typically rare, and the variability of the operating conditions, which causes major modifications to the data distributions. In [6], noise-filtering and under-sampling methods are combined to address the issue of imbalanced data. In [7], the 'TrAdaBoost' method is proposed, which uses a domain transform method to extend the classical AdaBoost method to the situation in which data distributions change due to variations of operating conditions. Unsupervised methods do not need labelled data, but they typically assume that i) a sufficient number of patterns collected in both normal and anomalous conditions is available, and ii) anomalous condition patterns are sufficiently dissimilar to normal condition patterns to allow discriminating them [8]. On the other hand, in many industrial applications anomalous conditions are rare, and changes in operating and environmental conditions cause variations of the measured signals that are larger than those caused by the onset of a component degradation, at least at the early stages after its occurrence. For this reason, this work considers detection methods based on one-class classification [9], which are trained on a dataset containing only normal condition patterns.
Examples of classification methods applied to one-class classification problems are: Support Vector Machines (SVMs) [10], nearest neighbor-based methods [11], statistical-based models [12] and Deep Learning (DL) based methods [13]. One-Class SVM (OC-SVM) defines a kernel to identify the region that fits the distribution of the normal condition data. Then, if a test pattern falls outside the learned region, it is declared anomalous. A method which integrates Support Vector Data Description with deep feature extraction (Deep-SVDD) for anomaly detection has been developed in [14] and applied to image benchmark datasets. Nearest neighbor-based methods use properly defined measures of dissimilarity among patterns and assume that normal condition data are located in dense neighborhoods, whereas anomalies are far from their closest neighbors [15]. For example, the Auto Associative Kernel Regression (AAKR) method has been used to detect anomalous conditions in an energy production plant in [16]. The method is based on the reconstruction of the test pattern as a weighted sum of normal condition patterns, where the weights are proportional to the similarity of the patterns to the test pattern. Two similarity measures based on the Euclidean distance have been introduced in [16] and [17]. Then, if the reconstruction error exceeds an alarm threshold, the test pattern is identified as abnormal. Statistics-based methods, such as Gaussian Mixture Models (GMMs), construct probabilistic models describing the normal condition patterns. Then, an anomaly is detected if the likelihood of occurrence of the test pattern is lower than a predefined threshold [12]. For example, in [18] a novel multiscale drift detection test is proposed to solve the classification problem when data distributions change over time.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
In [19], a novel outlier detection strategy based on a Set-Membership Filtering model is developed to identify measurements corrupted by outliers. A deep generative model stacked with multiple GMM-layers has been proposed to detect abnormal events in video surveillance [20]. Deep learning-based anomaly detection methods have recently gained a lot of attention due to their ability to effectively learn the characteristics of complex data, such as multivariate time series, and to the flexibility of designing problem-specific loss functions. In [21], a pairwise Gaussian loss function is developed to address the problem of intra-class compactness and successfully applied to synthetic data. In [22], a full-center loss function is used to improve the separability of features in fraud detection. In [23], a method combining auto-encoders, which extract nonlinear features, and backpropagation to obtain a fault diagnosis index is developed and applied to a benchmark case. These deep learning-based methods assume that small reconstruction errors are achieved for normal condition data, whereas large reconstruction errors are obtained for anomalous condition patterns [24]. However, detecting anomalies using conventional deep learning methods, such as RNNs, Auto-Encoders and hybrid DNNs, can be challenging due to the long-term time dependency and cross correlation among time series [25].
A Generative Adversarial Network (GAN) is a deep learning method which consists of a generator and a discriminator, where the generator is trained to reproduce the training data distribution and the discriminator provides the probability that a new pattern comes from the same training data distribution [26]. GANs have been shown to be able to learn datasets with complex structure, e.g., spheres or torus manifolds [27], and to reproduce real data distributions for data augmentation [28]. GAN-based anomaly detection techniques were first proposed for medical image analysis [29]. In the transport field, a data augmentation method has been developed specifically for synthesizing anomalies of the minority classes in lane detection. A GAN is used to learn the distribution of anomalous condition patterns and generate synthetic anomalies, which are then used to train a supervised anomaly detection model [3]. A limitation of the method is that it cannot be used when abnormal condition patterns are completely missing. In [24], a deep one-class classifier formed by an auto-encoder and a discriminator trained in an adversarial way is developed. In [30], a GAN and a Variational Auto-Encoder (VAE) are combined to synthesize auxiliary positive patterns in a problem of predicting lung cancer.
In this context, the objective of the present work is to develop a methodology for detecting anomalies in the behaviour of components of critical transportation systems using the measurements collected from them. To this aim, we develop an Auto-Encoder aided GAN (AE-GAN) which allows associating an anomaly score to the multidimensional signal time series. The GAN is trained to obtain a generator which reproduces the distribution of normal condition patterns, i.e. time slices of multivariate time series. Then, the encoder and the trained generator form an Auto-Encoder, which is trained to minimize the reconstruction errors of normal condition patterns. Finally, to improve the detection performance, an ensemble of anomaly detectors is developed by adapting the AdaBoost ensemble learning scheme. A test pattern is identified as anomalous if the reconstruction error of the ensemble of Auto-Encoders is larger than a certain threshold. Two different AE-GAN variants are considered in this work: variant a) sets up a dedicated AE-GAN for every time slice of the multivariate time series, whereas variant b) sets up a single universal AE-GAN to be applied to all time slices.
A synthetic case study concerning three complex distributions of normal condition patterns, namely the Cone, Two Spheres and Bowl distributions, is used to verify the performance of the AE-GAN. Then, the developed method is applied to a real-world industrial case study concerning automatic doors of high-speed trains. Notice that failures of automatic doors are a cause of unavailability of high-speed trains that has recently attracted the attention of the main stakeholders of the transportation systems. The performance of the proposed method has been compared to six state-of-the-art anomaly detection techniques, including OC-SVM, AAKR, GMM and the deep learning-based AE, VAE-β and Deep-SVDD algorithms.
The contributions of this work are: 1) a combined framework of AE and GAN is developed to identify anomalous patterns in the situation, common in transportation systems, in which abnormal condition data are not available and normal condition data are characterized by complex distributions, e.g. with manifold support; 2) the lower bound of the Jensen-Shannon (JS) divergence is used to guide the setting of the AE-GAN hyper-parameters; 3) an adaptive noise is added to the input data in case of non-smooth data distributions; 4) the AdaBoost algorithm is used to extend the AE-GAN to the treatment of multidimensional long-term time series data.
The remainder of the paper is organized as follows: Section II states the problem and illustrates the work objectives; Section III introduces the background and preliminaries of the proposed methodology; Section IV specializes the proposed methodology of anomaly detection to long-term multivariate time series; Section V introduces the numerical synthetic case study with three complex distributions and the real-world industrial case study of the automatic doors of high-speed trains, and then discusses the results obtained; finally, some conclusions and remarks are given in Section VI.

II. PROBLEM STATEMENT
We consider $N_{nor}$ components operating in normal conditions. For each component, $N_f$ features related to its health condition are measured during operation. The $N_f \times L$ matrix $X^r$, $r = 1, \ldots, N_{nor}$, contains the time series of length $L$ collected during component operation in normal conditions. The aim of this work is to build an anomaly detection model to identify the normal/abnormal health state of a test component given the $N_f$-dimensional time series $X^{test}$ measured during its operation. Notice that the situation in which only normal condition data are available is common in many applications of transportation systems involving, for example, safety-critical or newly designed components.

A. Generative Adversarial Networks
Let $X \subseteq \mathbb{R}^{N_x}$ be the space of the training data, whose distribution is $p_{data}$. A GAN consists of a generator and a discriminator, where the generator is a multilayer perceptron aiming at generating patterns from the same distribution as the training data and the discriminator is a multilayer perceptron aiming at providing the probability that a test pattern $x$ comes from the same data distribution [26].
The generator $G(z; \theta_G): Z \to X$, with associated parameters $\theta_G$, maps a latent variable $z$ from the latent space $Z \subseteq \mathbb{R}^{N_z}$ to the data space of the patterns $X \subseteq \mathbb{R}^{N_x}$. The entries of the latent variable $z \in Z$ are mutually independent and follow a standard Gaussian distribution $\mathcal{N}(0, 1)$. The discriminator $D(x; \theta_D): X \to [0, 1]$, with associated parameters $\theta_D$, discriminates whether a test pattern $x$ belongs to the training data (true) or is generated by the generator (fake) by estimating the probability that $x$ comes from the true data distribution $p_{data}$. The generator $G$ is trained to approximate $p_{data}$, whereas the discriminator $D$ is trained to distinguish the training patterns from the patterns generated by $G$. Mathematically, the GAN is trained by conducting a minmax optimization with loss function $F(\theta_D, \theta_G)$:
$$\min_{\theta_G} \max_{\theta_D} F(\theta_D, \theta_G) = \mathbb{E}_{x \sim p_{data}(x)}\left[\log D(x; \theta_D)\right] + \mathbb{E}_{z \sim p_z(z)}\left[\log\left(1 - D(G(z; \theta_G); \theta_D)\right)\right] \quad (1)$$
where $p_z(z)$ is the prior probability distribution function of the latent variable $z$.
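As an illustration of the quantity being optimized, the following minimal Python sketch estimates the standard GAN value function of [26] from discriminator outputs by replacing the two expectations with sample means. The function name `gan_value` and the clipping constant `eps` are assumptions introduced here for the example and are not part of the original method.

```python
import math

def gan_value(d_real, d_fake, eps=1e-12):
    """Monte Carlo estimate of F = E[log D(x)] + E[log(1 - D(G(z)))].

    d_real: discriminator outputs D(x) on training patterns.
    d_fake: discriminator outputs D(G(z)) on generated patterns.
    eps (assumed) clips probabilities away from 0 and 1 so log stays finite.
    """
    clip = lambda p: min(max(p, eps), 1.0 - eps)
    real_term = sum(math.log(clip(p)) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - clip(p)) for p in d_fake) / len(d_fake)
    return real_term + fake_term

# A discriminator that cannot tell real from fake outputs 0.5 everywhere,
# which gives F = -2 ln 2.
undecided = gan_value([0.5] * 4, [0.5] * 4)
# A confident and correct discriminator pushes F towards its maximum of 0.
confident = gan_value([0.99] * 4, [0.01] * 4)
```

The discriminator update increases this value, whereas the generator update decreases it, which is the minmax structure described above.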

B. Auto-Encoders
An Auto-Encoder is a neural network composed of an encoder and a generator, trained to replicate its input data [31]. The encoder maps the data space $X$ into the latent space $Z$, whereas the generator reconstructs the input data from the latent variable $z$. A typical form of an encoder $E$ is a composition of a nonlinear activation function $f_E$ and an affine transformation:
$$z = E(x; \theta_E) = f_E(W_E x + b_E)$$
where the parameters $\theta_E = \{W_E, b_E\}$ are the weight matrix $W_E$ of size $N_z \times N_x$ and the offset vector $b_E$ of dimension $N_z$. The generator $G$ maps back the resulting latent variable $z$ into the reconstructed $N_x$-dimensional vector $\hat{x}$. Its typical form is similar to $E$:
$$\hat{x} = G(z; \theta_G) = f_G(W_G z + b_G)$$
where the parameters $\theta_G = \{W_G, b_G\}$ are the weight matrix $W_G$ of size $N_x \times N_z$ and the offset vector $b_G$ of dimension $N_x$; $f_G$ is a nonlinear activation function, e.g. $\tanh(\cdot)$. The Auto-Encoder is trained by minimizing the reconstruction error $L_{rec}$, which quantifies the expected distance between the input vector $x$ and its reconstruction $\hat{x}$:
$$L_{rec} = \mathbb{E}_{x \sim p_{data}(x)} \left\| x - G(E(x; \theta_E); \theta_G) \right\|^2$$
where $\|\cdot\|$ denotes the L2 norm.
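To make the encoder/generator composition concrete, the sketch below evaluates a single-layer encoder and generator with tanh activations and returns the squared L2 reconstruction error of one pattern. The weights are hypothetical and untrained, the helper names are assumptions made for this example, and the use of the squared norm is also an assumption.

```python
import math

def affine_tanh(W, b, v):
    """Return tanh(W v + b) for a weight matrix W (list of rows) and offset b."""
    return [math.tanh(sum(w * x for w, x in zip(row, v)) + bi)
            for row, bi in zip(W, b)]

def reconstruction_error(x, W_E, b_E, W_G, b_G):
    """Squared L2 distance between x and its reconstruction G(E(x))."""
    z = affine_tanh(W_E, b_E, x)        # encoder: data space -> latent space
    x_hat = affine_tanh(W_G, b_G, z)    # generator: latent space -> data space
    return sum((a - c) ** 2 for a, c in zip(x, x_hat))

# Toy sizes N_x = 3 and N_z = 2, with hypothetical (untrained) weights.
W_E = [[0.5, 0.0, 0.0], [0.0, 0.5, 0.0]]     # N_z x N_x
b_E = [0.0, 0.0]
W_G = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]   # N_x x N_z
b_G = [0.0, 0.0, 0.0]

err_origin = reconstruction_error([0.0, 0.0, 0.0], W_E, b_E, W_G, b_G)
err_third_axis = reconstruction_error([0.0, 0.0, 1.0], W_E, b_E, W_G, b_G)
```

With these toy weights the third coordinate is discarded by the encoder, so a pattern that differs from the origin only along that coordinate cannot be reconstructed and its error equals the squared size of that coordinate.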

C. AdaBoost Ensemble Learning
AdaBoost is an ensemble learning algorithm which constructs a classifier as a linear combination of several weak classifiers [32]. In practice, the output of the boosted classifier is the weighted sum of the outputs of the weak classifiers:
$$H(x) = \sum_{t=1}^{T} \alpha_t h_t(x)$$
where $h_t$ denotes the $t$-th base classifier, $\alpha_t$ the weight assigned to $h_t$ and $T$ the number of base classifiers. AdaBoost is adaptive in the sense that subsequent weak learners are tweaked in favor of those instances misclassified by previous classifiers.
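The weighted combination of weak classifiers can be sketched in a few lines of Python. The three threshold stumps and their weights below are hypothetical values chosen only for illustration; taking the sign of the weighted sum as the final label is the usual AdaBoost convention and is assumed here.

```python
def boosted_output(x, weak_classifiers, alphas):
    """Weighted sum of weak-classifier outputs: sum_t alpha_t * h_t(x)."""
    return sum(a * h(x) for h, a in zip(weak_classifiers, alphas))

# Three hypothetical threshold stumps returning +1 or -1.
stumps = [lambda x: 1 if x > 0.2 else -1,
          lambda x: 1 if x > 0.5 else -1,
          lambda x: 1 if x > 0.8 else -1]
alphas = [0.5, 1.0, 0.25]

score = boosted_output(0.6, stumps, alphas)   # 0.5 + 1.0 - 0.25
label = 1 if score >= 0 else -1               # assumed sign convention
```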

D. Adam Optimization
The Adam optimization algorithm is an extension of the stochastic gradient descent algorithm [33]. It combines the advantages of a) the Adaptive Gradient Algorithm (AdaGrad), which uses a per-parameter learning rate to improve its performance on problems with sparse gradients, and b) Root Mean Square Propagation (RMSProp), which iteratively adapts the parameter learning rates on the basis of how quickly the gradients of the weights are changing, to improve its performance in non-stationary problems.
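For reference, one Adam update following the standard formulation of [33] can be sketched as below. The function name `adam_step` and the toy quadratic objective are assumptions made for this example; the default values of `lr`, `beta1` and `beta2` mirror the settings reported later for the synthetic case study, and `eps` is the usual numerical-stability constant.

```python
import math

def adam_step(theta, grad, m, v, t, lr=0.0002, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a list of parameters; returns (theta, m, v)."""
    new_theta, new_m, new_v = [], [], []
    for p, g, mi, vi in zip(theta, grad, m, v):
        mi = beta1 * mi + (1 - beta1) * g        # first-moment estimate
        vi = beta2 * vi + (1 - beta2) * g * g    # second-moment estimate
        m_hat = mi / (1 - beta1 ** t)            # bias corrections
        v_hat = vi / (1 - beta2 ** t)
        new_theta.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(mi)
        new_v.append(vi)
    return new_theta, new_m, new_v

# Toy problem: minimise f(theta) = theta^2 from theta = 1 (gradient 2 * theta).
theta, m, v = [1.0], [0.0], [0.0]
for t in range(1, 2001):
    theta, m, v = adam_step(theta, [2 * theta[0]], m, v, t, lr=0.01)
```

In the first iteration the bias-corrected ratio has unit magnitude, so the parameter moves by approximately the learning rate regardless of the gradient scale; this is the per-parameter step normalisation inherited from AdaGrad and RMSProp.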

A. Base Anomaly Detector With Auto Encoder Aided Generative Adversarial Networks (AE-GAN)
The proposed anomaly detector is based on the use of a GAN to reconstruct the expected signal behavior in normal conditions, from which the reconstruction error can be obtained and used to discriminate normal from abnormal condition patterns. Its development requires: 1) training the GAN model on normal condition patterns for reproducing the distribution of the normal condition data and 2) training the AE to reconstruct the expected signal behavior in normal conditions.
In 1), the GAN is trained to minimize the Jensen-Shannon divergence $JSD(p_G \| p_{X_{nor}})$, where $X_{nor}$ denotes the set of patterns collected from components operating in normal conditions, $p_{X_{nor}}$ their probability distribution and $p_G$ the probability distribution of the generated patterns. Notice that if the GAN generator were perfectly trained, then $JSD(p_G \| p_{X_{nor}})$ would converge to 0 [34]. Let $\theta_G = \{W_G, b_G\}$ and $\theta_D = \{W_D, b_D\}$ denote the generator and discriminator parameters, respectively. Similarly to the Auto-Encoder (Section III-B), the discriminator $D$ is formulated as:
$$D(x; \theta_D) = f_D(W_D x + b_D)$$
where $W_D$ is the weight matrix, $b_D$ is an offset vector of dimensionality $N_z$ and $f_D$ is the nonlinear activation function, e.g. $f_D = \mathrm{sigmoid}(\cdot)$. For the purpose of anomaly detection, we set $p_{data} = p_{X_{nor}}$ and $p_z$ as a Gaussian distribution $\mathcal{N}(0, 1)$ of independent variables, and we address the minmax problem of Equation (1).
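Before the optimization of the generator parameter $\theta_G$, the discriminator parameter $\theta_D^*(\theta_G)$ is set by using a gradient optimization method based on Adam (Section III):
$$\theta_D^{(k+1)} = \theta_D^{(k)} + \eta \cdot \mathrm{Adam}\left(\nabla_{\theta_D} F(\theta_D^{(k)}, \theta_G); \beta_1, \beta_2\right) \quad (3)$$
where the update term $\mathrm{Adam}(\cdot; \beta_1, \beta_2)$ is determined by the gradient of the loss function $F$ with respect to $\theta_D$, $\beta_1$ and $\beta_2$ are the control parameters of Adam [33], $\eta$ is the learning rate, $\theta_D^{(k)}$ is the optimization result at the previous $k$-th gradient iteration step, and $\theta_D^*(\theta_G)$ is the discriminator parameter obtained at the end of these iterations. The generator parameter is also optimized based on Adam (Section III):
$$\theta_G \leftarrow \theta_G - \eta \cdot \mathrm{Adam}\left(\nabla_{\theta_G} F(\theta_D^*(\theta_G), \theta_G); \beta_1, \beta_2\right) \quad (5)$$
where $\mathrm{Adam}(\nabla_{\theta_G} F(\theta_D^*(\theta_G), \theta_G); \beta_1, \beta_2)$ is determined by the gradient of the loss function $F$ with respect to $\theta_G$. Note that for each updating step of $\theta_G$ (Equation (5)), there are $k$ updating steps of $\theta_D^{(k)}$ (Equation (3)), because $\theta_D^{(k)}$ depends on $\theta_G$.
In 2), to obtain the reconstruction $\hat{x}$ of data $x$, it is necessary to query its latent variable $z \in Z$ and then to map $z$ into the data space $X_{nor}$ by using the generator, $\hat{x} = G(z)$. According to [35], the search for $z_{optimal}$ is treated as an optimization task, i.e. $\min_z \|x - G(z; \theta_G^*)\|^2$. In this work, an auxiliary auto-encoder is proposed for efficiently minimizing the reconstruction error:
$$\theta_E^* = \arg\min_{\theta_E} \mathbb{E}_{x \sim p_{X_{nor}}(x)} \left\| x - G(E(x; \theta_E); \theta_G^*) \right\|^2 \quad (6)$$
from which we obtain
$$\hat{x} = G(E(x; \theta_E^*); \theta_G^*)$$
where $\theta_E^*$ is the optimal parameter of encoder $E$. The encoder parameter $\theta_E$ is optimized by Adam (Section III):
$$\theta_E \leftarrow \theta_E - \eta \cdot \mathrm{Adam}\left(\nabla_{\theta_E} L_{rec}(\theta_E); \beta_1, \beta_2\right)$$
where $\mathrm{Adam}(\nabla_{\theta_E} L_{rec}(\theta_E); \beta_1, \beta_2)$ is determined by the gradient of the loss function $L_{rec}$ with respect to $\theta_E$. The above gradient-based optimization is typically applied using a small learning rate and multiple iteration steps, which leads to a large number of epochs $N_{epoch}$ [33]. Note that Adam is chosen for the GAN and the AE because of its competitive performance in the data reconstruction experiments presented in [36] with respect to other possible gradient-based optimization methods (e.g. Stochastic Gradient Descent (SGD), AdaGrad, RMSProp).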
The anomaly score function of pattern $x$ is:
$$A(x) = \left\| x - G(E(x; \theta_E^*); \theta_G^*) \right\|^2$$
If $A(x)$ is larger than a threshold value $A_{threshold}$, set considering normal condition data, then $x$ is anomalous; otherwise, $x$ is normal. The rationale behind the use of a threshold for detecting anomalies is that the distribution of abnormal condition data is expected to be significantly different from that of normal condition data [15]. If the generator were optimal, the optimal latent variable $z_{optimal}$ corresponding to a pattern $x$ in normal condition, which provides zero reconstruction error, could be identified. In contrast, if $x$ is a pattern collected in abnormal conditions, we would expect a large reconstruction error, because the generator, which is trained to reproduce only normal condition data, cannot cover the region of the space corresponding to anomalous data. However, the search for $z_{optimal}$ may suffer from a large computational burden, as one needs to perform the optimization task for each $x$ [37]. Therefore, inspired by [37], we propose to use an auxiliary encoder $E$ to replace the optimization task and efficiently query the latent variable. Figure 1 shows the overall structure of the developed anomaly detection model.
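The scoring and thresholding logic can be sketched as follows. The `encoder` and `generator` callables are hypothetical stand-ins for a trained AE-GAN, and setting the threshold to the largest score observed on the normal condition training patterns is only one possible choice, assumed here because the same rule is stated for the baseline methods in the synthetic case study.

```python
def anomaly_score(x, encoder, generator):
    """Squared L2 reconstruction error A(x) = ||x - G(E(x))||^2."""
    x_hat = generator(encoder(x))
    return sum((a - b) ** 2 for a, b in zip(x, x_hat))

def fit_threshold(normal_patterns, encoder, generator):
    """Largest anomaly score observed on normal condition training patterns."""
    return max(anomaly_score(x, encoder, generator) for x in normal_patterns)

def is_anomalous(x, threshold, encoder, generator):
    return anomaly_score(x, encoder, generator) > threshold

# Hypothetical stand-ins for a trained model: normal data lie near the line
# x2 = x1, the encoder projects onto it and the generator maps back.
encoder = lambda x: [(x[0] + x[1]) / 2.0]
generator = lambda z: [z[0], z[0]]

train = [[0.0, 0.1], [1.0, 0.9], [2.0, 2.1]]
threshold = fit_threshold(train, encoder, generator)
flags = [is_anomalous(x, threshold, encoder, generator)
         for x in ([1.5, 1.5], [1.0, -1.0])]
```

The first test pattern lies on the learned line and is reconstructed exactly, whereas the second lies far from it and exceeds the threshold.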

B. AE-GAN Hyper-Parameter Optimization
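Although $JSD(p_G \| p_{X_{nor}})$ can be used as the actual objective to optimize the GAN architecture, its true value cannot be obtained during GAN training [28]. The discriminator $D$ provides in output the probability $D(x)$ that $x$ belongs to the distribution of the normal condition data. The label $y(x)$ associated to $x$ is "1" if $x \sim p_{X_{nor}}(x)$ and "0" if $x \sim p_G(x)$; then, $D$ is trained to minimize the cross entropy loss:
$$L_{BC} = -\mathbb{E}_{x \sim p_{X_{nor}}(x)}\left[\log D(x)\right] - \mathbb{E}_{x \sim p_G(x)}\left[\log(1 - D(x))\right] \quad (10)$$
Notice that $-L_{BC}$ is equivalent to the GAN loss $F(\theta_D, \theta_G)$. According to [28], Equation (10) can be written as
$$L_{BC} = -\int_x \left[ p_{X_{nor}}(x) \log D(x) + p_G(x) \log(1 - D(x)) \right] dx \quad (11)$$
For a given generator $G$, the identification of the optimal discriminator $D^*(\cdot)$ that minimizes $L_{BC}$ is equivalent to solving the differential equation:
$$\frac{\partial}{\partial D(x)} \left[ p_{X_{nor}}(x) \log D(x) + p_G(x) \log(1 - D(x)) \right] = \frac{p_{X_{nor}}(x)}{D(x)} - \frac{p_G(x)}{1 - D(x)} = 0$$
whose solution is:
$$D^*(x) = \frac{p_{X_{nor}}(x)}{p_{X_{nor}}(x) + p_G(x)} \quad (13)$$
Then, the JS divergence between $p_G$ and $p_{X_{nor}}$ is:
$$JSD(p_G \| p_{X_{nor}}) = \frac{1}{2} KL\left(p_G \,\Big\|\, \frac{p_G + p_{X_{nor}}}{2}\right) + \frac{1}{2} KL\left(p_{X_{nor}} \,\Big\|\, \frac{p_G + p_{X_{nor}}}{2}\right) \quad (14)$$
By combining Equations (13) and (14), we obtain:
$$JSD(p_G \| p_{X_{nor}}) = \ln 2 - \frac{1}{2} L_{BC}(D^*)$$
and, since $L_{BC}(D) \geq L_{BC}(D^*)$ for any discriminator $D$, therefore:
$$JSD_{LB}(p_G \| p_{X_{nor}}) = \ln 2 - \frac{1}{2} L_{BC}(D) \leq JSD(p_G \| p_{X_{nor}})$$
In this work, we use $JSD_{LB}$ to monitor the convergence of the GAN training process. In particular, the JS Divergence (JSD) is bounded in the range $[0, \ln 2]$, and when $JSD_{LB}$ becomes close to 0 a good generator $G$ that is able to reproduce the 'true' distribution is obtained. Thus, the GAN hyperparameters can be further optimized and the performance of the generator can be quantified without the use of abnormal patterns.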
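A minimal sketch of how such a lower bound could be monitored from discriminator outputs is given below. It assumes that the cross entropy is the sum of the two class-wise mean log losses, so that its negative coincides with the GAN value function and the bound reads ln 2 minus half of the cross entropy; if the loss were instead averaged over the pooled labelled set, the factor 1/2 would have to be adapted. The function name and the clipping constant are also assumptions.

```python
import math

def jsd_lower_bound(d_real, d_fake, eps=1e-12):
    """ln 2 - L_BC / 2, with L_BC the summed class-wise cross entropy of D.

    d_real: D(x) on normal condition patterns (label 1).
    d_fake: D(x) on generated patterns (label 0).
    """
    clip = lambda p: min(max(p, eps), 1.0 - eps)
    l_bc = (-sum(math.log(clip(p)) for p in d_real) / len(d_real)
            - sum(math.log(1.0 - clip(p)) for p in d_fake) / len(d_fake))
    return math.log(2.0) - 0.5 * l_bc

# Discriminator fooled by the generator: the bound is 0.
fooled = jsd_lower_bound([0.5, 0.5], [0.5, 0.5])
# Discriminator separating the two sets almost perfectly: the bound nears ln 2.
separated = jsd_lower_bound([0.999, 0.999], [0.001, 0.001])
```

Tracking this value over the training epochs for different hyper-parameter settings requires only normal condition and generated patterns, which is what makes it usable when no abnormal data are available.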
The GAN hyper-parameters include the number of hidden neurons, the number of hidden layers, the size of the latent space of the generator, $N_z$, the number of discriminator iteration steps per generator iteration, $k$, and the number of epochs, $N_{epoch}$. Notice that the encoder module of the AE-GAN shares the same network architecture with the discriminator module, and that encoder, generator and discriminator are all multilayer perceptrons with the same number of hidden layers, each with the same number of hidden neurons.

C. Ensembled Anomaly Detector by AdaBoost Algorithm
The ensembled anomaly detection method based on AE-GAN is shown in Figure 2. The two main challenges encountered in the real industrial applications are: 1) the densities of data distributions are not smooth and 2) high dimensionality.
With respect to 1), we have found that training the GAN on distributions whose densities are not smooth prevents $JSD_{LB}(p_G \| p_{X_{nor}})$ from converging (Figure 8). To tackle this challenge, we add a normally distributed adaptive noise $\epsilon_k(i)$ to the $k$-th feature at the $i$-th time stamp of the $r$-th healthy component, $X_k^r(i)$, and we derive:
$$\tilde{X}_k^r(i) = X_k^r(i) + \epsilon_k(i), \quad \epsilon_k(i) \sim \mathcal{N}\left(0, \sigma_k(i)^2\right), \quad \sigma_k(i) = \gamma \cdot \mathrm{std}\left(\{X_k^r(i)\}_{r=1,\ldots,N_{nor}}\right) + \delta$$
where the standard deviation $\sigma_k(i)$ is a variable that changes according to the standard deviation of $\{X_k^r(i)\}_{r=1,\ldots,N_{nor}}$, $\gamma \in (0, +\infty)$ is a scaling factor and $\delta \in (0, +\infty)$ is a bias term that ensures $\sigma_k(i) > 0$, because $X_k^r(i)$ can be constant for $r = 1, \ldots, N_{nor}$.
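The adaptive noise can be sketched as follows, under the assumption that the noise standard deviation is an affine function of the across-component standard deviation of each feature at each time stamp (scaling factor plus bias). The default values of `gamma` and `delta`, the use of the population standard deviation and the nested-list data layout are hypothetical choices made for this example.

```python
import random
import statistics

def adaptive_sigma(values, gamma, delta):
    """sigma = gamma * std(values) + delta, strictly positive since delta > 0."""
    return gamma * statistics.pstdev(values) + delta

def add_adaptive_noise(X, gamma=0.1, delta=1e-3, rng=None):
    """X[r][k][i]: feature k at time stamp i of healthy component r."""
    rng = rng or random.Random(0)
    n_feat, length = len(X[0]), len(X[0][0])
    noisy = [[row[:] for row in comp] for comp in X]   # copy, X stays intact
    for k in range(n_feat):
        for i in range(length):
            sigma = adaptive_sigma([comp[k][i] for comp in X], gamma, delta)
            for comp in noisy:
                comp[k][i] += rng.gauss(0.0, sigma)
    return noisy

# Two components, one feature, two time stamps; the first time stamp is
# constant across components, so only the bias term delta contributes there.
X = [[[1.0, 2.0]], [[1.0, 4.0]]]
X_noisy = add_adaptive_noise(X)
```

Consistently with the training and test protocol described in this section, such a function would be applied to the training data only.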
The rationale for adding an adaptive noise to the data is that if the probability distributions $p_G$ and $p_{X_{nor}}$ are supported on disjoint manifolds, then the optimal discriminator $D(x; \theta_D^*(\theta_G))$ is equal to 1 for any true data $x \in X_{nor}$ and to 0 for any generated data $G(z)$. Therefore, the GAN loss $F(\theta_D^*(\theta_G), \theta_G)$ will be zero [38] and, as a consequence, the value of $JSD(p_G \| p_{X_{nor}})$ is equal to $\ln(2) \approx 0.69$. By adding an adaptive noise $\mathcal{N}(0, \sigma^2)$ to the data distribution $p_{X_{nor}}$, we obtain that $p_G$ and $p_{X_{nor}}$ are not disjoint and the gradient of $JSD(p_G \| p_{X_{nor}})$ with respect to $\theta_G$ does not vanish during the training of the GAN.
With respect to 2), many industrial applications are characterized by long-term multivariate time series with a large number of features $N_f$ and long sequences of length $L$, which makes the data dimensionality $N_x$ much larger than the number of patterns. As a consequence, the computation becomes infeasible. To deal with this problem, we propose to use non-overlapping sliding time windows, which allow splitting the multivariate time series and treating each time window as a separate pattern for anomaly detection. Then, the AdaBoost algorithm is adopted to aggregate the anomaly detection results of the time windows. In particular, this work has developed two variants: variant a) trains an individual AE-GAN for each time window and variant b) trains one universal AE-GAN for all time windows. Before the training of the AE-GANs and the testing for anomaly detection, the training and test data are linearly normalized in the range $[-1, 1]$ according to [28]. In the training phase, the adaptive noise is added to the data, whereas in the test phase it is not.
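The windowing and normalization steps can be illustrated with the short sketch below. The function names are hypothetical, the normalization bounds `lo` and `hi` are assumed to be known (e.g. taken from the training data), and raising an error when the series length is not a multiple of the window length is a design choice of this example rather than a statement about the original implementation.

```python
def split_windows(series, window_length):
    """Split an N_f x L series (list of feature rows) into non-overlapping windows."""
    length = len(series[0])
    if length % window_length != 0:
        raise ValueError("L must be a multiple of the window length")
    return [[row[s:s + window_length] for row in series]
            for s in range(0, length, window_length)]

def normalize(values, lo, hi):
    """Linearly map values from [lo, hi] to [-1, 1]."""
    return [2.0 * (v - lo) / (hi - lo) - 1.0 for v in values]

series = [[0, 1, 2, 3, 4, 5],          # feature 1 over L = 6 time stamps
          [10, 11, 12, 13, 14, 15]]    # feature 2
windows = split_windows(series, window_length=2)   # N_m = 3 windows
scaled = normalize([0, 5, 10], lo=0, hi=10)
```

Each element of `windows` is an $N_f \times L_W$ block that is treated as a separate pattern, so the input dimensionality seen by a single AE-GAN drops from $N_f \cdot L$ to $N_f \cdot L_W$.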
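Variant a). The set of normal condition patterns of the $m$-th time window is:
$$X_{nor,m} = \{x^r(m)\}_{r=1,\ldots,N_{nor}} \quad (18)$$
where $x^r(m)$ denotes the data in the $m$-th time window collected from the healthy components $r = 1, \ldots, N_{nor}$. The proposed AE-GAN is trained on the data set $X_{nor,m}$ for each $m$-th time window by applying Equations (3), (5) and (6); then, the optimal generator $G(z; \theta_{G_m}^*)$ and encoder $E(x; \theta_{E_m}^*)$ for the $m$-th time window can be obtained. Finally, the anomaly score function of the $m$-th time window is:
$$A_m(x(m)) = \left\| x(m) - G(E(x(m); \theta_{E_m}^*); \theta_{G_m}^*) \right\|^2$$
The AE-GAN anomaly detector with variant a) is illustrated in Figure 3a, where $A_m$ and $A_{m+1}$ denote the anomaly scores of the $m$-th and $(m+1)$-th windows, respectively.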

Variant b). Let $x(m)$ denote the vertical concatenation, represented by the symbol ';', of the normalized time index in the range $[0, 1]$ and the generic data of the $m$-th time window. Note that the dimension of the column vector $x(m)$ is $N_f \cdot L_W + 1$. This variant constructs the matrix $X_{nor}$ of size $N_{nor} \times N_m$ containing all the normal condition patterns, independently of their time window, whose generic element $(r, m)$ is $x^r(m)$, i.e. the data of the $m$-th window from the healthy components $r = 1, \ldots, N_{nor}$. The proposed AE-GAN is trained on the data distribution $X_{nor}$ by applying Equations (3), (5) and (6). Then, the universal optimal generator $G(z; \theta_G^*)$ and encoder $E(x; \theta_E^*)$ for all time windows can be obtained and, finally, the universal anomaly score function $A_b(x(m))$ is:
$$A_b(x(m)) = \left\| x(m) - G(E(x(m); \theta_E^*); \theta_G^*) \right\|^2$$
The training of the AE-GAN anomaly detector with variant b) is illustrated in Figure 3b.
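The construction of a variant b) pattern can be sketched as below. The way the time index is normalized (window position divided by the number of windows minus one) and the row-wise flattening of the window are assumptions made for this example, since only the range $[0, 1]$ and the resulting dimension are stated in the text.

```python
def variant_b_pattern(window, m, n_windows):
    """Concatenate a normalised time index with the flattened window data.

    window: N_f x L_W block (list of feature rows) of the m-th time window.
    m: zero-based window position; n_windows: total number of windows N_m.
    """
    # Assumed normalisation: first window -> 0, last window -> 1.
    time_index = m / (n_windows - 1) if n_windows > 1 else 0.0
    flat = [v for row in window for v in row]    # N_f * L_W values
    return [time_index] + flat

# Second of three windows, with N_f = 2 features and L_W = 2 time stamps.
pattern = variant_b_pattern([[2, 3], [12, 13]], m=1, n_windows=3)
```

The resulting vector has $N_f \cdot L_W + 1$ entries, which lets a single network condition its reconstruction on the position of the window within the time series.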

Algorithm listing (AdaBoost ensemble of AE-GANs), recoverable steps only: initialization with weights set to $1/N_v$ and the initial error rate of each window $m = 1, \ldots, N_m$ set to $1/2$; step 5A: construct set $X_{nor,m}$ by using Equation (18) and assign it to $X_{nor}$; step 6: update weights $w$.
… term to keep stability in the AdaBoost ensemble learning, in particular avoiding $h_m(A) = 0$.
Considering an AE-GAN network with $N_{neu}$ hidden neurons, the number of weight connections between the input $x$ and the first layer of the Encoder $E$ is $N_{neu} \times N_f \times L_W$, and the number of weight connections between the last layer of the Generator $G$ and the output $G(z)$ is $N_f \times L_W \times N_{neu}$. As a consequence, the computational complexity of a single AE-GAN is $2\,O(N_f \cdot L_W \cdot N_{neu}) + O(C)$, where $C$ is a constant representing the computational complexity associated to the input and output weight connections. When all the $N_m$ time windows are considered, the total computational complexity is $2\,O(N_m \cdot N_f \cdot L_W \cdot N_{neu}) + O(N_m \cdot C)$. Note that $L_W \cdot N_m$ is equal to $L$ and is, therefore, independent of the length of the time window $L_W$. The second term linearly increases with the number of windows $N_m$, but the network computational complexity $O(C)$ is expected to be small when the time windows become small and $N_m$ large.

A. Protocol and Setting
Performance Metrics. The metrics of accuracy, precision, recall, balanced F-score and the Area Under the receiver operating characteristic Curve (AUC) are used to evaluate the performance of the proposed method with respect to the anomaly detection task. Specifically, the number of normal condition patterns correctly classified is indicated as true positive ($tp$), the number of abnormal condition patterns correctly classified as true negative ($tn$), the number of abnormal condition patterns misclassified as normal as false positive ($fp$), and the number of normal condition patterns misclassified as abnormal as false negative ($fn$). Accuracy is defined as the fraction of correctly classified patterns among all patterns:
$$Accuracy = \frac{tp + tn}{tp + tn + fp + fn}$$
Precision, recall and balanced F-score are computed from the same counts:
$$Precision = \frac{tp}{tp + fp}, \quad Recall = \frac{tp}{tp + fn}, \quad F\text{-}score = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}$$
The Receiver Operating Characteristic (ROC) curve is obtained by plotting the recall against the false positive rate ($FPR$), $FPR = \frac{fp}{fp + tn}$. The Area Under the ROC Curve (AUC) is calculated by using an average of a number of trapezoidal approximations [40].
The range of the performance metrics Accuracy, Precision, Recall, F-score and AUC for anomaly detection is [0, 1], and larger value means better performance.
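The metrics can be computed from the four counts as in the following sketch, which keeps the convention used in this work of treating normal condition patterns as the positive class. The counts in the usage example are hypothetical and were chosen only to match the size of the synthetic test set (643 normal and 642 abnormal patterns); the zero-division guards are an implementation choice of this example.

```python
def detection_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, F-score and FPR from confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f_score": f_score, "fpr": fpr}

def auc_trapezoid(fpr, recall):
    """Area under the ROC curve by the trapezoidal rule over sorted points."""
    pts = sorted(zip(fpr, recall))
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(pts, pts[1:]))

# Hypothetical counts for a test set of 643 normal and 642 abnormal patterns.
metrics = detection_metrics(tp=600, tn=620, fp=22, fn=43)
perfect_auc = auc_trapezoid([0.0, 0.0, 1.0], [0.0, 1.0, 1.0])
chance_auc = auc_trapezoid([0.0, 1.0], [0.0, 1.0])
```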
Methods considered for the results comparison. The proposed methods are compared with the following state-of-the-art anomaly detection methods: OC-SVM [41], AAKR [17], GMM [42], AE [43], AnoGAN [35], VAE-β [44] and Deep-SVDD [45].
OC-SVM [41] estimates the support of a high-dimensional distribution, and a hypersphere is defined to distinguish normal from abnormal patterns. The score function is generated by measuring the "distance" between the test pattern and the margin of the hypersphere of the normal condition data distribution. The larger the score, the higher the probability that the pattern is abnormal.
AAKR [17] reconstructs a test pattern as a weighted sum of normal condition patterns. The weights are obtained by applying a radial basis function which measures the similarities between the test pattern and the normal condition patterns. Then, the anomaly score is defined by the error between the test pattern and its reconstruction.
GMM [42] is used to model the distribution of the normal condition patterns. Then, the anomaly score of a test pattern is defined from its likelihood under the fitted model, with a lower likelihood corresponding to a more anomalous pattern.
AE [43] has been illustrated in Section III-B. It is trained on normal condition data, and the reconstruction error $\|x - \hat{x}\|^2$ is used for detecting anomalies according to the basic assumption of reconstruction-based anomaly detection [13]. This work defines $A_{AE}(x) = \|x - \hat{x}\|^2$ as the anomaly score of the AE. It should be noted that the encoders and generators of AE and AE-GAN have the same architecture. Also, the setting of the Adam coefficients $\beta_1$, $\beta_2$, batch size $L_{batch}$, learning rate $\eta$ and number of epochs $N_{epoch}$ is the same for AE and AE-GAN.
AnoGAN for anomaly detection [35] associates an optimal latent variable $z_{optimal}$ to a pattern $x$ by minimizing the reconstruction error. Then, the anomaly score is defined as $A_{AnoGAN}(x) = \|x - G(z_{optimal})\|^2$. It should be noted that AnoGAN shares the GAN module with the proposed AE-GAN. Therefore, the parameter settings used for optimizing the latent variable $z_{optimal}$, i.e. the Adam coefficients $\beta_1$, $\beta_2$, learning rate $\eta$ and number of epochs $N_{epoch}$, are the same as those used for AE-GAN training.
VAE-β differs from AE since it encodes the input data into a multivariate latent distribution, which is constrained by the Kullback-Leibler divergence between the parametric posterior and the true posterior [44]. The use of a VAE-β for anomaly detection requires training the VAE on normal condition data with the objective of minimizing the sum of the reconstruction error $\|x - \hat{x}\|^2$ and a term equal to $\beta \cdot KL(p_G \| p_{X_{nor}})$. Then, the anomaly score of a test pattern is defined by $\|x - \hat{x}\|^2$. The network architecture (number of neurons and model layers) and hyperparameter setting (batch size $L_{batch}$, number of epochs $N_{epoch}$, learning rate $\eta$ and Adam coefficients $\beta_1$, $\beta_2$) of the developed VAE-β are set equal to those of AE and AE-GAN.
Deep-SVDD has been introduced in [45] with the objective of learning a neural network transformation $\phi(\cdot; W)$ from an input space to an output space, defined in such a way that normal patterns fall within a hypersphere of minimum volume. Then, the anomaly score of a test pattern is defined as the distance between its projection in the output space and the hypersphere center.

B. Synthetic Case
Three synthetic case studies, which will be referred to as Cone, Two Spheres and Bowl, are considered to verify the anomaly detection performance of the proposed AE-GAN anomaly detector on data that mimic the complexity of real industrial applications. In all three case studies the training set is formed by 3000 normal condition patterns, and the test set by 643 normal condition patterns and 642 abnormal condition patterns. Cone patterns are sampled from a 3-D Gaussian distribution with mean $[4, 0, 0]$ and variance $\mathrm{diag}(1, 1, 1)$. Then, only the patterns inside a cone of bottom radius 2 and height 3 are kept. Two Spheres patterns are sampled from two 3-D Gaussian distributions with mean $[\pm 4, 0, 0]$ and variance $\mathrm{diag}(1, 1, 1)$. Then, only the patterns inside two 3-D spheres of radius 2, whose centers are located at $[\pm 4, 0, 0]$, are kept. Bowl patterns are uniformly sampled from a hemisphere with radius 6 and center point $[0, 0, 0]$.
In all three case studies, abnormal condition data are sampled from a uniform distribution in the hypercube with range $[-10, 10]$ in each dimension. Then, the patterns within the normal condition region are deleted.
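The generation of the Two Spheres data set and of the corresponding abnormal patterns can be sketched with rejection sampling, as below. The equal mixing of the two Gaussian components, the random seed and the function names are assumptions of this example; the centers, radius, variances, hypercube range and sample sizes are those stated above.

```python
import random

CENTERS = [(-4.0, 0.0, 0.0), (4.0, 0.0, 0.0)]
RADIUS = 2.0

def in_normal_region(p):
    """True if p lies inside one of the two spheres."""
    return any(sum((a - c) ** 2 for a, c in zip(p, ctr)) <= RADIUS ** 2
               for ctr in CENTERS)

def sample_normal(n, rng):
    """Draw from unit-variance Gaussians at the centers, keep points in a sphere."""
    out = []
    while len(out) < n:
        ctr = rng.choice(CENTERS)                    # assumed equal mixing
        p = tuple(rng.gauss(c, 1.0) for c in ctr)
        if in_normal_region(p):
            out.append(p)
    return out

def sample_abnormal(n, rng):
    """Draw uniformly in [-10, 10]^3, delete points in the normal region."""
    out = []
    while len(out) < n:
        p = tuple(rng.uniform(-10.0, 10.0) for _ in range(3))
        if not in_normal_region(p):
            out.append(p)
    return out

rng = random.Random(0)
train = sample_normal(3000, rng)
test_normal = sample_normal(643, rng)
test_abnormal = sample_abnormal(642, rng)
```

The truncation to the spheres is what makes the density non-smooth at the boundary, which is the property that the synthetic cases are meant to reproduce.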
Implementation details. The AE-GAN contains three sub-networks, namely generator, discriminator and encoder, each implemented by a Multilayer Perceptron (MLP) neural network with two hidden layers of 50 neurons. The GAN module is composed of the discriminator and the generator. According to [38] and [46], during model training each iteration of the generator is followed by 5 iterations of the discriminator. The Latent Space Layer acts as output layer of the encoder and input layer of the generator, and its number of neurons is 2 for the Bowl and 3 for the Cone and Two Spheres. The choice of 2 neurons for the Bowl case study is motivated by the observation that 2 coordinates are enough to describe the normal condition data, whereas 3 coordinates are needed in the other case studies. Both Encoder and Generator use the Rectified Linear Unit (ReLU) [47] as activation function of their hidden layers, whereas the Discriminator uses Leaky ReLU [48] with a leaky rate of 0.2. The Adam optimizer is used to train the GAN (Generator and Discriminator modules) and the AE (Encoder and Generator), with learning rate $\eta$ = 0.0002, coefficients $\beta_1$ = 0.9, $\beta_2$ = 0.999, batch size $L_{batch}$ = 100 and number of epochs $N_{epoch}$ = 1000. For OC-SVM, GMM, AAKR, AE and AnoGAN, the threshold to detect abnormal conditions is set equal to the maximum anomaly score among the patterns of the training set. Figure 4 shows that the GAN provides a satisfactory reconstruction of the distribution of the normal condition patterns in the three case studies. Figure 5 shows the latent space produced by the AE-GAN when fed with Manifold patterns.
Note that the latent variables corresponding to normal patterns are located near the core of the Normal distribution, whereas those of the abnormal condition data are widely spread. Since the Generator is developed to map patterns of the latent space into normal condition data, the distributions of normal and abnormal patterns are disjoint, and the majority of the abnormal condition patterns can be distinguished. Table I reports that the AE-GAN provides the most satisfactory accuracy and F-score in all three case studies. In particular, it achieves zero false and missed alarms on the Cone. In the Two Spheres and Bowl case studies, although the proposed AE-GAN method cannot achieve the smallest false and missed alarm rates at the same time, it provides the most satisfactory accuracy and F-score, and its precision and recall are always among the best three. It is, therefore, possible to conclude that the proposed AE-GAN obtains the best trade-off between false and missed alarms, which is a key ability in risk-critical applications such as anomaly detection. Notice that the state-of-the-art DL methods (VAE-β and Deep-SVDD) tend to remarkably underperform with respect to the proposed AE-GAN when applied to the Bowl, due to their tendency to be trapped in local minima when applied to highly curved surfaces.
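The thresholding rule (maximum anomaly score on the training set) and the performance metrics discussed above can be sketched as follows. The sketch assumes that abnormal patterns are the positive class (label 1), that an alarm is raised when the score strictly exceeds the threshold, and that the F-score is the F1-score; the function names and the toy scores are hypothetical.

```python
def max_score_threshold(train_scores):
    """Threshold equal to the largest anomaly score among normal training patterns."""
    return max(train_scores)


def detection_metrics(scores, labels, threshold):
    """Compute detection metrics; labels are 1 for abnormal and 0 for normal."""
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        alarm = score > threshold
        if alarm and label == 1:
            tp += 1
        elif alarm and label == 0:
            fp += 1  # false alarm
        elif not alarm and label == 1:
            fn += 1  # missed alarm
        else:
            tn += 1

    def ratio(num, den):
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "f_score": ratio(2 * precision * recall, precision + recall),
        "false_alarm_rate": ratio(fp, fp + tn),
        "missed_alarm_rate": ratio(fn, fn + tp),
    }


# Toy anomaly scores, for illustration only.
threshold = max_score_threshold([0.1, 0.4, 0.3])
metrics = detection_metrics([0.2, 0.5, 0.9, 0.3], [0, 0, 1, 1], threshold)
print(threshold, metrics)
```

With this rule no training pattern raises an alarm, so false alarms on the test set arise only from normal patterns that score higher than every training pattern.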

C. Anomaly Detection for Automatic Door in High-Speed Train
Real Industrial Dataset. The real industrial dataset is collected from the automatic door components of high-speed trains. A current sensor (recording tractive force) and a decoder sensor (recording position) record the state of the door during the opening and closing processes. Since the duration of the door operation varies, the sensors record for a fixed duration of 855 time units to ensure that the entire operation process is covered. The dataset contains 138 components operated in normal condition (100 used for training, 20 for validation and 18 for testing), 22 components with fault type A and 33 components with fault type B. This work uses the signals of both the door opening and closing processes to detect whether a component is normal or abnormal; therefore, the start times of door opening and closing need to be synchronized to derive a multivariate time series.

Figure 6

Effect of the window size. This paragraph investigates the effect of different window sizes on the convergence of $JSD_{LB}(p_G \| p_{X_{nor}})$. In the experiment, the window size $L_W$ is set to 1, 3, 5 and 50 time steps, and the normal component signals in the time window starting at time 400 are used to train the GAN (see Equation (18)); the size of the latent space of the GAN is set to 4 × $L_W$. During GAN training, the J-S divergence at each iteration of the generator optimization is recorded (see Figure 7). Notice that when the window size is equal to 50, the input dimension becomes 200, which is large in comparison to the number of training patterns (100). Figure 7 shows that this causes mode collapse of the GAN, which is revealed by generated data that are nothing but random and by the non-convergence of the JS divergence. On the other hand, as the window size gets smaller, $JSD_{LB}(p_G \| p_{X_{nor}})$ gradually converges close to 0.
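The relation between window size and input dimension can be made concrete with a minimal sketch. It assumes that the multivariate time series has 4 channels (current and position signals for both opening and closing), which is consistent with an input dimension of 200 for a window of 50 time steps; the function name and the dummy series are hypothetical.

```python
N_STEPS = 855      # fixed recording duration, in time units
N_CHANNELS = 4     # assumption: 2 sensors x 2 processes (opening, closing)
N_TRAIN = 100      # number of normal training components


def extract_window(series, start, window_size):
    """Flatten the window [start, start + window_size) into one input pattern.

    `series` is a list of time steps, each a list of channel values.
    """
    window = series[start:start + window_size]
    if len(window) < window_size:
        raise ValueError("window exceeds the length of the series")
    return [value for step in window for value in step]


# Dummy series, used only to show the resulting dimensions.
series = [[float(t)] * N_CHANNELS for t in range(N_STEPS)]

for window_size in (1, 3, 5, 50):
    pattern = extract_window(series, 400, window_size)
    print(window_size, len(pattern), len(pattern) > N_TRAIN)
```

Only the window of 50 time steps yields more input dimensions (200) than training patterns (100), which is the setting in which mode collapse is reported.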
Effect of the adaptive noise. This paragraph investigates the effect of the adaptive noise on the convergence of $JSD_{LB}(p_G \| p_{X_{nor}})$. In the experiment, the normal component signals at time 460 are used to train the GAN (see Figure 8b), the parameters of the adaptive noise added to the data are set to γ = 0.02 and δ = 0.001 (see Equation (17)), and the size of the latent space of the GAN is set to 4. During GAN training, $JSD_{LB}(p_G \| p_{X_{nor}})$ at each iteration of the generator optimization is recorded (see Figure 8a). The experiment shows that the original data distribution is non-smooth, which causes the non-convergence of the GAN. It also verifies that, after adding adaptive noise to the data distribution with non-smooth density, the original distribution is converted to a smooth distribution, so that $JSD_{LB}(p_G \| p_{X_{nor}})$ converges to 0 according to Section IV-C.

Fig. 8. (a) The effect of adaptive noise on the convergence of the J-S divergence; (b) the normal component data, whose distribution has non-smooth density, at time 460.
Effect of using variants a) and b). The results reported in Table II and Figure 12 show that variant a) outperforms variant b). This is due to the fact that variant b), which develops a single universal AE-GAN for all windows, is less specific than variant a), which develops a dedicated AE-GAN for each time window. Notice, however, that variant b) is more computationally efficient than variant a).
Notice that the AE-GAN activation function is ReLU and the batch size $L_{batch}$ is set to 20 in all cases. Figure 9 shows the optimization results of the AE-GAN hyperparameters (the default initial AE-GAN hyperparameters are indicated by the solid black line): number of epochs $N_{epoch}$ = 1000, number of discriminator iteration steps for each generator iteration step k = 5, latent space size $N_z$ = 4, number of hidden layers = 2 and number of hidden neurons = 200. The normal component signals at time 400 have been used to train the AE-GAN. Due to the limited computing power, a successive greedy search is used for the optimization, and the order of search is: number of epochs, number of discriminator iteration steps for each generator iteration step, latent space size, number of hidden layers and number of hidden neurons. The optimization objective is $JSD_{LB}(p_G \| p_{X_{nor}})$ (Section IV-B). Observing Figure 9, one can notice that: i) a large number of epochs, e.g. larger than 1000, can degrade the JS divergence; ii) a too small k makes GAN training unstable, whereas a large k (k > 5) degrades the JS divergence; iii) the larger the latent size, the more stable the GAN training; iv) the number of hidden layers has less impact on GAN training than k and the latent size; v) a larger number of hidden neurons brings better performance. After training the AE-GAN with the optimal hyperparameters, an example of the generator distribution is shown in Figure 10.

Fig. 10. An example of the true data distribution and of the generated data distribution produced by the optimal AE-GAN. The true data come from the normal component signals at time 400. The experiment shows that the true data distribution can be nearly perfectly reproduced, which satisfies the basic prerequisite of GAN-based anomaly detection methods.
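The successive greedy search over the hyperparameters can be sketched as follows: one hyperparameter at a time is varied in the stated order, while the others are kept at their current best values. The default values are those given in the text; the candidate grids, the function names and the toy objective are hypothetical. In particular, the toy objective is only a stand-in for training the GAN and evaluating the lower bound of the J-S divergence, which is not reproduced here.

```python
def successive_greedy_search(objective, defaults, grid, order):
    """Optimize one hyperparameter at a time, keeping the best value found so far."""
    best = dict(defaults)
    best_value = objective(best)
    for name in order:
        for candidate in grid[name]:
            trial = dict(best)
            trial[name] = candidate
            value = objective(trial)
            if value < best_value:  # a smaller objective is better
                best, best_value = trial, value
    return best, best_value


# Default initial hyperparameters reported in the text.
defaults = {"n_epoch": 1000, "k": 5, "n_z": 4, "n_layers": 2, "n_neurons": 200}
order = ["n_epoch", "k", "n_z", "n_layers", "n_neurons"]

# Hypothetical candidate values, for illustration only.
grid = {
    "n_epoch": [500, 1000, 2000],
    "k": [1, 5, 10],
    "n_z": [2, 4, 8],
    "n_layers": [1, 2, 3],
    "n_neurons": [50, 200, 400],
}

# Toy objective: relative distance from an arbitrary target configuration.
target = {"n_epoch": 1000, "k": 5, "n_z": 8, "n_layers": 2, "n_neurons": 400}


def toy_objective(h):
    return sum(abs(h[name] - target[name]) / target[name] for name in target)


best, best_value = successive_greedy_search(toy_objective, defaults, grid, order)
print(best, best_value)
```

The search evaluates the objective once per candidate value, i.e. a number of trainings equal to the sum of the grid sizes rather than their product, at the price of ignoring interactions between hyperparameters.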
This work compares the proposed AE-GAN with OC-SVM, AAKR, GMM, AE, VAE-β and Deep-SVDD. AnoGAN is not included in the comparison because finding the optimal latent variable $z_{optimal}$ for each training and test pattern is very computationally intensive. Two strategies are used for the compared methods. The first is to treat the whole multivariate time series as the input data pattern and obtain the anomaly score (Section V-A); then, the threshold (Section IV) is set to the maximum value of the anomaly scores among the normal training patterns.

Fig. 11. Example of the base anomaly detector weights $\alpha_m$ of, e.g., AE-GAN variant (a), at each time step. The weights of the base anomaly detectors suddenly drop at time 500, when, interestingly, the value of the original signal (Figure 6) becomes constant, which means that the signal after time 500 is irrelevant to component health monitoring.
The second strategy is similar to the proposed ensembled anomaly detector with AE-GAN: it uses non-overlapping sliding time windows (of size 1) to split the multivariate time series, treats each time window as a separate data pattern for anomaly detection and obtains the corresponding anomaly score (Section V-A); then, it uses the proposed Algorithm 2 (see Section IV-C) to obtain the ensembled anomaly detection result. The ensembled compared methods are referred to as OC-SVM (Ens), AAKR (Ens), GMM (Ens), AE (Ens), VAE-β (Ens) and Deep-SVDD (Ens).
Table II compares the anomaly detection results for different values of the percentile c in Algorithm 2. Notice that variants a) and b) of the proposed AE-GAN achieve the best performance for the majority of the considered percentiles. The best F-score (0.7750) is obtained by variant a) when the percentile c is set to 75%, whereas the best F-score of variant b) (0.7527) is obtained when c is 85%. Overall, variant a) performs better than variant b), since the latter introduces discrete time into the data space (see Equation (20)), which makes it difficult for the generator, which has a continuous data space, to fit a data space containing discrete times. Notice that GMM has not been applied to this dataset, given the large dimensionality of the data, which makes the matrix computation unfeasible. Notice also that the methods not based on an ensemble of classifiers (OC-SVM, AAKR, AE, VAE-β, Deep-SVDD) provide in output a single pattern class for each test pattern and, therefore, do not allow computing the percentiles of the output distribution.

Fig. 12. ROC curves of the proposed AE-GAN with variants (a) and (b), and of the methods used for the comparison. The x axes report the false positive rates and the y axes the true positive rates. In the ablation study, AE-GAN(a-I) is the experiment applying AE-GAN variant (a) without adding adaptive noise, and AE-GAN(a-II) is the experiment applying AE-GAN variant (a) with the mode-collapsed generator using window size 50 (see Fig. 7).
By using the proposed improved AdaBoost ensemble learning (Algorithm 2), the F-score and AUC are improved for nearly all the compared methods (Table II). To obtain a comprehensive view of the anomaly detection performance, we build the ROC curve by varying the percentile c and compute the AUC (Figure 12), which shows that the proposed AE-GAN variant a) outperforms all the other compared methods. Additionally, the ablation studies AE-GAN (a-I, a-II), whose results are reported in Figure 12, prove the effectiveness of the proposed strategies: adding adaptive noise has a more significant impact on the performance than optimizing the AE-GAN hyperparameters. The interpretation is that the proposed method can address the difficulties of real industrial data, including complex, non-smooth and manifold distributions.
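The construction of a ROC curve and of the corresponding AUC can be sketched as follows. Note that this sketch sweeps a threshold on a scalar anomaly score of each test component, which is a simplification: in this work the operating point is moved by varying the percentile c of Algorithm 2, which is not reproduced here. Function names and toy scores are hypothetical, and abnormal components are assumed to be the positive class (label 1).

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) pairs for decreasing thresholds."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the ROC curve, computed with the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Toy scores, for illustration only.
scores = [0.1, 0.4, 0.35, 0.8]
labels = [0, 0, 1, 1]
curve = roc_points(scores, labels)
print(curve, auc(curve))
```

An AUC of 0.5 corresponds to a detector that ranks normal and abnormal components at random, whereas an AUC of 1 corresponds to a perfect separation of the two classes.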
To further analyze the performance of the proposed method, we construct two test sets from the industrial dataset. Test Set 1 contains 18 normal condition patterns and 22 anomalous patterns of fault class A only, and Test Set 2 contains the same 18 normal condition patterns and 33 anomalous patterns of fault class B only. The AUC on Test Set 1 is 0.6304 for AE-GAN(a) and 0.5407 for AE-GAN(b), whereas on Test Set 2 it is 0.9474 for AE-GAN(a) and 0.8469 for AE-GAN(b), which shows that the proposed method detects the anomalies of fault class B better than those of fault class A. The reason is that, according to the original data, the data distribution of fault type A is almost the same as that of the normal condition data, whereas the data distribution of fault type B is clearly distinguishable from it.
Advantages: i) Differently from the state-of-the-art methods, the JS divergence approximation introduced in this work allows assessing the performance of the GAN and thus optimizing the hyperparameters of the AE-GAN anomaly detector without the use of true anomalous data; ii) The combination of the AE-GAN with the AdaBoost algorithm allows automatically filtering the task-related features by assigning different weights to the base anomaly detectors, and provides interpretable results; iii) The proposed method is capable of simultaneously addressing the following challenges of real industrial anomaly detection problems: treating long-term time series with complex, manifold and non-smooth data distributions.
Disadvantages: i) More hyperparameters need to be tuned during the optimization of the GAN; ii) Although the inference time is competitive, the computational effort needed to train the AE-GAN is larger.

VI. CONCLUSION
In this paper, an anomaly detection method based on the use of a GAN and AdaBoost ensemble learning has been proposed for high-speed train automatic doors, for which abnormal condition data are not available. To obtain the anomaly score, e.g. the reconstruction error, the latent variable corresponding to the data pattern in the GAN needs to be queried; we propose to embed an auxiliary encoder in front of the generator to avoid locally optimal solutions for data with manifold distribution. Furthermore, we derive a lower bound of the Jensen-Shannon divergence between the generator distribution and the normal condition data distribution to optimize the AE-GAN hyperparameters. To overcome real industrial challenges, such as 1) non-smooth densities of the data distributions and 2) high dimensionality, we propose to add adaptive noise to the data and to adapt the AdaBoost algorithm to integrate AE-GAN base anomaly detectors, each of which treats a time window separately for anomaly detection. Extensive experiments conducted on both synthetic and real industrial datasets demonstrate that the proposed ensembled AE-GAN anomaly detection method outperforms state-of-the-art anomaly detection methods for long-term multivariate time series.

Piero Baraldi received the B.S. degree in nuclear engineering and the European Ph.D. degree in radiation science and engineering from the Politecnico di Milano, Italy, in 2002 and 2006, respectively.
He has been a Full Professor of nuclear engineering with the Department of Energy, Politecnico di Milano since September 2021. He is the coauthor of two books and more than 200 papers in international journals and proceedings of international conferences. His main research efforts are currently devoted to the development of methods and techniques for system health monitoring, fault diagnostics, prognostics, and maintenance. He has been an invited Keynote Lecturer at the plenary sessions of the European Safety and Reliability Conference, ESREL 2014, Wrocław, Poland; the 2016 Prognostics and System Health Management Conference, Chengdu, China; and the 4th International Conference on System Reliability and Safety (ICSRS 2019), Rome, Italy. He has been invited to present four tutorials at international conferences.
Xuefei Lu received the Ph.D. degree in statistics from Bocconi University, Italy.
She has been working as a Post-Doctoral Fellow with the Politecnico di Milano and a Lecturer with The University of Edinburgh. In 2021, she joined the SKEMA Business School as an Assistant Professor. Her research interests include statistical machine learning and uncertainty quantification.
Enrico Zio (Senior Member, IEEE) received the first Ph.D. degree from the Politecnico di Milano, Italy, in 1996, and the second Ph.D. degree in probabilistic risk assessment from MIT, in 1998. He is currently a Full Professor with the Centre for Research on Risk and Crises (CRC), École de Mines, ParisTech, PSL University, France; a Full Professor and the President of the Alumni Association at the Politecnico di Milano; a Distinguished Guest Professor at Tsinghua University, Beijing, China; an Adjunct Professor with the City University of Hong Kong, Beihang University, Beijing, China, and Wuhan University, China; and the Co-Director of the Center for REliability and Safety of Critical Infrastructures (CRESCI) and the Sino-French laboratory of Risk Science and Engineering (RISE), Beihang University. His research focuses on the modeling of the failure repair maintenance behavior of components and complex systems; the analysis of their reliability, maintainability, prognostics, safety, vulnerability, resilience and security characteristics; and the development and use of Monte Carlo simulation methods, artificial techniques, and optimization heuristics. He is the author or coauthor of seven books and more than 300 articles in international journals, the Chairman and the Co-Chairman of several international conferences, an associate editor of several international journals, and a referee of more than 20.
Open Access funding provided by 'Politecnico di Milano' within the CRUI CARE Agreement