An End-to-End, Real-Time Solution for Condition Monitoring of Wind Turbine Generators

Data-driven wind generator condition monitoring systems largely rely on multi-stage processing involving feature selection and extraction followed by supervised learning. These stages require expert analysis, are potentially error-prone and do not generalize well between applications. In this paper …


Introduction
Increasingly, countries around the world seek to replace their carbon-emitting power plants with renewable sources such as wind, sunlight, tides, waves and geothermal heat. An important consideration for renewable plant operators is the maintenance of their assets which, for offshore wind turbines, has been estimated at approximately 25% of installation cost [1]. While the Levelized Cost of Electricity (LCOE) from offshore wind turbines has decreased significantly [2], newer technologies for Condition Monitoring (CM) have the potential to drive down cost further. Transitioning to renewable electricity sources will not happen overnight, and in the meantime some of these tools are also applicable to high-carbon sources, helping to mitigate their impact [3].
CM systems need to operate on a diverse range of wind turbines across different sites, each having its own sophisticated control system [4]. The authors of [4] describe existing data collection systems which use vibration analysis, oil particle counters, ultrasonic testing, acoustic emissions and many other techniques. The first generation of wind turbine CM systems used for diagnostics and prognostics relied on physical …

The Examined Demagnetized Rotor Conditions
A healthy rotor and two rotors with uniform symmetrical PM faults were prototyped for this study. The uniform demagnetization faults are manifested as identical modulation of the PM segments' flux density under one pole (B_r) [7]. To emulate this, two rotors with a symmetric B_r reduction at the PM edges were prototyped. Each rotor represents a different uniform fault level: the rotors F13 and F50, corresponding to 13% and 50% of B_r reduction in the leading edge of a PM pole arc, are shown in Figure 2.

Test-Rig Description
The performance of the proposed algorithm was obtained from experiments on a laboratory test-rig whose configuration is shown in Figures 3 and 4. A PMSM was coupled to a Direct Current (DC) motor connected to a resistor load bank. The PMSM was driven by a Parker 890SSD drive operating in Permanent Magnet Alternating Current (PMAC) vector control mode. The PMSM phase currents were sensed by LEM LA305-S current transducers connected to a NI-9205 voltage input module, mounted on a NI-compactDAQ-9178 chassis measurement system. The rotational speed of the PMSM was obtained from the resolver and sent directly to the drive feedback and to the computer. The PMSM can be operated using the three different types of rotors presented in Figures 1 and 2 to emulate the healthy and faulted rotor conditions.

Descriptive Statistics of the Data
Using a data acquisition card (DAQ) from National Instruments, we collected phase currents representing various significant system operating states. We framed Speed and Load as continuous variables and selected four equally spaced points covering their whole operating ranges (generator speed range and load range): 1/4, 2/4, 3/4 and 4/4. For the examined generator, this translates to speeds of 450 rpm, 900 rpm, 1350 rpm and 1800 rpm, and loads of 0.275 kW, 0.55 kW, 0.825 kW and 1.1 kW, respectively. All data were sampled at 5 kHz and split into examples containing 250 measurements, in line with findings from previous research [10] indicating this size to be a good trade-off between size and prediction accuracy. Switching between demagnetization levels makes data collection and real-time testing slower, as it requires disassembly and reassembly of the test generator and its rotor between measurements; while relatively time-consuming due to practical constraints, this process does enable extraction of essential feature data in the time domain. We framed the health state as a categorical variable with three states: Healthy, F13 (13% demagnetization) and F50 (50% demagnetization). Figures 5 and 6 show the distribution of the recorded data as a function of Speed, Load and State. Healthy data were collected first, followed by the F50 and F13 cases, as more data were required for accurate prediction.
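As a concrete illustration of the segmentation step described above, the following sketch splits a 5 kHz recording into non-overlapping examples of 250 measurements. The signal here is a synthetic stand-in for the actual rig data:

```python
import numpy as np

# Illustrative sketch (synthetic signal, not the actual rig data): split a
# phase-current recording sampled at 5 kHz into non-overlapping examples of
# 250 measurements each, as described in the text.
FS = 5_000          # sampling frequency in Hz
WINDOW = 250        # measurements per example (50 ms of signal)

def make_examples(signal: np.ndarray, window: int = WINDOW) -> np.ndarray:
    """Drop the trailing remainder and reshape into (n_examples, window)."""
    n = len(signal) // window
    return signal[: n * window].reshape(n, window)

# Two seconds of a synthetic 30 Hz current waveform stand in for a recording.
t = np.arange(2 * FS) / FS
examples = make_examples(np.sin(2 * np.pi * 30 * t))
print(examples.shape)  # (40, 250)
```

Each row of `examples` then corresponds to one input instance for the models described below.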

Figure 7 illustrates the differences between low- and high-speed samples. With 250 measurements we capture one current cycle at 450 rpm or four at 1800 rpm, regardless of health state: the Speed of the generator is positively correlated with the number of cycles in the electric signal (Figure 7). Ceteris paribus, increasing the sample size results in higher accuracies but fewer predictions per unit of time, a logical consequence of the model receiving more data and thus increasing its chance of better predictions.

Figure 7. A set of current samples at low speed, low load, 100% magnetization (a) and full speed, full load, 50% demagnetized (b). The x-axis represents the measurement's time index and the y-axis its value.
For training and validation of the models we used an 80/20 split of the data. Given that we use our models for diagnostics and the samples are independent and identically distributed (i.i.d.), this is a valid procedure for identifying patterns in signals. If, however, models with temporal dependency are needed (forecasting or prognostic models, where current and past data are used to predict the future), a more advanced validation methodology, such as walk-forward cross-validation, is appropriate [11].
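A minimal sketch of such an 80/20 shuffled split, valid because the samples are i.i.d. (the arrays below are random stand-ins for the real examples):

```python
import numpy as np

# Shuffle-then-split is valid for i.i.d. diagnostic samples; X and y are
# synthetic placeholders for the real current examples and health labels.
rng = np.random.default_rng(seed=0)
X = rng.normal(size=(1000, 250))      # 1000 examples of 250 measurements
y = rng.integers(0, 3, size=1000)     # health state: 0=Healthy, 1=F13, 2=F50

idx = rng.permutation(len(X))         # shuffle before splitting
cut = int(0.8 * len(X))               # 80% for training
X_train, X_val = X[idx[:cut]], X[idx[cut:]]
y_train, y_val = y[idx[:cut]], y[idx[cut:]]
print(len(X_train), len(X_val))  # 800 200
```

For temporally dependent models, this shuffle would leak future information into training, which is why walk-forward validation is preferred there.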

Convolutional Neural Network (CNN) Model
We experimented in Keras [12] with the one-dimensional CNN architecture template described in [13], which has been shown to be a strong time series classification model over a wide range of datasets [14]. Different hyperparameters were explored, such as the number of layers and the filter counts and kernel sizes in each layer.
The general architecture is made up of repeated applications of a convolution function C_w(x) with two operands: x (the input) and w (the kernel to be learned); it is defined as (1) for continuous and as (2) for discrete signals:

C_w(x)(t) = ∫ x(a) w(t − a) da (1)

C_w(x)[n] = Σ_m x[m] w[n − m] (2)

Each layer of a convolutional block has multiple kernels (filters), each of a predefined size (see the hyperparameters in Figure 8). Applying a filter is essentially performing a dot product between the input and the shifted kernel. To keep the output of the convolution the same length as the input, the latter is padded with zeros. A batch normalization layer BN_γβ(x) is defined as:

BN_γβ(x_i) = γ (x_i − μ_B) / √(σ_B² + ε) + β

where B = {x_1, …, x_m} is the current mini-batch with mean μ_B and variance σ_B², and γ, β are parameters to be learned in training. The activation function used is the Rectified Linear Unit (RELU), defined as:

RELU(x) = max(0, x) (7)

The CNN applies a series of three Conv_w, BN_γβ and Act blocks to the input x, as shown in Figure 8. To implement our approach we utilized Keras, which provides convolutional layer functionality for 1-, 2- and 3-dimensional data. The Conv1D layer used for our models has several parameters which control the number of filters, kernel sizes, strides, padding, initialization, etc. While the number of filters was kept fixed, the kernel sizes differed between blocks (see Figure 8), as this achieved better validation results on our data. Stride controls the amount by which the filter moves at each step and for our purposes was left at the default of one. We followed each Conv1D layer with a BN layer, implemented in Keras as the BatchNormalization function. Its use is to normalize and scale the inputs from previous layers, which in turn results in faster training and lower validation errors [15]. Using BN layers before activation layers has been shown to be successful in mitigating the internal covariate shift problem in the task of predicting faults in bearings and gearboxes [16].
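The discrete convolution and batch normalization operations above can be checked numerically. The sketch below uses illustrative values rather than learned weights:

```python
import numpy as np

# Numerical sketch of the discrete convolution (Equation (2)) with "same"
# zero-padding, and of batch normalization over a toy mini-batch.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # input signal
w = np.array([0.25, 0.5, 0.25])           # kernel (would be learned in practice)

y = np.convolve(x, w, mode="same")        # zero-padding keeps len(y) == len(x)

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """BN_γβ(x): normalize by mini-batch mean/variance, then scale and shift."""
    mu = batch.mean(axis=0)
    var = batch.var(axis=0)
    return gamma * (batch - mu) / np.sqrt(var + eps) + beta

B = np.stack([x, x + 1.0])                # toy mini-batch {x_1, x_2}
print(len(y))                             # 5 -- length preserved by padding
```

After normalization, the mini-batch has approximately zero mean and unit variance per position before γ and β rescale it.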
Keras provides various activation functions via the Activation layer implementation: RELU, LeakyRELU, ELU, SELU, as well as the classical sigmoid and tanh. While RELU provided better results in our experiments and represents a good default, it will be interesting to explore how it fares against its derivatives. A potential issue of RELU is the "dying RELU" problem, which occurs when the weighted sum fed into the activation function is less than zero, resulting in a zero output and gradient; see Equation (7) and [17]. For this reason, computationally more expensive functions, such as LeakyRELU, that have a predefined slope for values less than zero might yield better results in some settings. Keras also provides implementations of several pooling functions (Max, Average, Global) over 1, 2 and 3 dimensions. This functionality downsamples the input, allowing the next layers to work with smaller representations and avoiding overfitting. The combination of a Dense layer with Global Average Pooling in 1D (GAP) was used in our experiments, as this facilitates the use of Class Activation Maps (CAMs), discussed in Section 4.2.
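The block structure described above can be sketched in Keras. The filter count and kernel sizes follow those reported for our experiments (three blocks of 100 filters with kernel sizes 100, 25 and 10); the input shape (250 time steps, here assumed to carry three phase-current channels), padding mode and optimizer settings are illustrative assumptions, not the exact training configuration:

```python
from tensorflow import keras
from tensorflow.keras import layers

# Sketch of the three Conv1D/BN/RELU blocks with GAP and a softmax head for
# the three health states. Input channels and optimizer are assumptions.
def build_classifier(n_timesteps=250, n_channels=3, n_classes=3):
    inputs = keras.Input(shape=(n_timesteps, n_channels))
    x = inputs
    for kernel_size in (100, 25, 10):
        x = layers.Conv1D(100, kernel_size, padding="same")(x)  # length-preserving
        x = layers.BatchNormalization()(x)                      # BN before activation
        x = layers.Activation("relu")(x)
    x = layers.GlobalAveragePooling1D()(x)                      # GAP enables CAMs
    outputs = layers.Dense(n_classes, activation="softmax")(x)
    model = keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_classifier()
print(model.output_shape)  # (None, 3)
```

Swapping the final layer for a single linear neuron and an MSE loss turns the same backbone into the regression variant used for Speed and Load.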
Speed and Load are parameters for which we have data covering a large range of values (1/4, 2/4, 3/4 and 4/4 of their maximum). We formulated the prediction of these operating parameters as regression problems with continuous dependent variables. In addition, in the case of Speed, the available data allowed us to experiment further in real time. The speed of the rig was programmed a priori and was estimated using a laser tachometer for comparison with our numeric predictions.
The two regression models were trained using the Mean-Squared Error (MSE) loss, as shown in Figure 9. MSE is used to compute the errors made on the training and validation sets and is defined as:

MSE = (1/N) Σ_{i=1}^{N} (Ŷ(i) − Y(i))²

where Ŷ(i) is the output of our network for input instance i and Y(i) is the ground truth.
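In code, the MSE definition above amounts to a single line; the speed predictions below are made-up values in rpm:

```python
import numpy as np

# MSE between network outputs Ŷ and ground truth Y, on illustrative
# (made-up) speed predictions in rpm.
def mse(y_hat: np.ndarray, y: np.ndarray) -> float:
    return float(np.mean((y_hat - y) ** 2))

y_true = np.array([450.0, 900.0, 1350.0, 1800.0])
y_pred = np.array([455.0, 895.0, 1356.0, 1798.0])
print(mse(y_pred, y_true))  # 22.5
```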

Model Performance
Without knowing the hyperparameters that work for a given or similar domain, fitting the best-capacity deep learning model requires empirical investigation. We experimented with different numbers of CNN blocks and hyperparameters such as filter numbers and sizes. Three convolution blocks of 100 filters with kernel sizes of 100, 25 and 10 achieved the best results in our experiments; the different kernel sizes were used to detect small, medium and large patterns. As expected, smaller networks tend to take longer (in terms of epochs) to reach acceptable levels of MSE. For Speed prediction, we achieved an MSE of 46.61 on the training data and 40.85 on the validation set representing 20% of the data. This result was achieved using rpm as units, ranging from 450 (1/4 speed) to 1800 (4/4 speed). The Load model was trained on a range from 1/4 to 4/4 and achieved an MSE of 0.0040 (0.0018 on the validation set representing 20% of the data) after 100 epochs.
An important capability of the entire CNN-based system is to generalize the learned knowledge beyond the training cases. For this, we programmed the test-rig to run for 7 min, first at the speeds seen during training (Figure 10a: the 450 rpm, 900 rpm, 1350 rpm and 1800 rpm levels marked in red), and then over several unseen cases, from 450 rpm up to 1800 rpm in increments of 100 rpm (Figure 10b). While errors were close to zero in the first case, we note very small errors even for unseen cases (up to 50 rpm deviation from actual). The deviations tend to have higher volatility at lower speeds, which are more challenging to address as the samples contain less information (fewer electric cycles) (Figure 10b). The relationship between model complexity (parameters and inference times) and model performance in terms of accuracy has long been understood to be logarithmic [18]. With sustainability and explainability in mind, simpler models with fewer parameters, and thus lower computational complexity, should be explored first ("Green AI"). Increases in the number of parameters of CNNs are justified when the function that models operational states or failures is complex and cannot be captured by the proposed architecture. Fitting a model of the right capacity to the task is essential: a low-capacity model might underfit while a high-capacity model might overfit the training data. This is known as the bias/variance trade-off and is explained in depth in [19]. The total cost of the system grows linearly with the cost of processing a single example, the size of the training dataset and the number of hyperparameter experiments. We also considered the strong CNN baseline architecture for time series classification proposed by Wang et al. [13] as a starting point for our models.
The learning rate α is an essential parameter of the network which controls how much weight adjustment is made with respect to the gradients. A small value of α results in the network converging more slowly, whilst a larger value might result in the network learning more quickly but missing the local minima of non-convex functions. MSE, the cost function used in our regression, is convex, meaning that its local minimum is also its global minimum. For Speed we used a learning rate of 0.1, while for Load it was 0.01. During training we utilized ReduceLROnPlateau, a Keras callback function that is invoked when the learning process stagnates. This tool monitors the loss function and, if no improvement is seen for a "patience" number of epochs (in our case 5), the learning rate is reduced by half. The patience is defined by the user, with the default value being 10. During our experiments, we found that a patience of 5 resulted in the training converging faster to lower levels of the loss function. We formulated Health State prediction as a multi-class classification problem with three categorical states of the magnets. For future consideration, it is worth noting that classification can be framed as binary (one of two classes is predicted), multi-class (one of many classes is predicted), multi-label (several classes are predicted at once) or hierarchical (one class is predicted which is further divided into subclasses or grouped into super-classes) [20]. In fact, magnetization is a continuous variable where any value between 0 and 100% may occur, and a regression model for demagnetization will be considered in our future work.
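The plateau-halving behaviour we rely on can be illustrated with a small plain-Python re-implementation of the scheduling logic (not Keras's actual code, just the rule: halve the learning rate after "patience" epochs without improvement):

```python
# Illustrative re-implementation of the ReduceLROnPlateau rule used in
# training: monitor the loss; if it has not improved for `patience` epochs,
# multiply the learning rate by `factor`.
def schedule_lr(losses, lr=0.1, factor=0.5, patience=5):
    best = float("inf")
    wait = 0
    for loss in losses:
        if loss < best:
            best, wait = loss, 0     # improvement: reset the counter
        else:
            wait += 1
            if wait >= patience:
                lr *= factor         # plateau detected: halve the rate
                wait = 0
    return lr

# Loss improves for three epochs, then stagnates for five: one halving.
print(schedule_lr([1.0, 0.8, 0.6] + [0.6] * 5))  # 0.05
```

In training, the equivalent is passing `ReduceLROnPlateau(factor=0.5, patience=5)` in the callbacks list of `model.fit`.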
To move the architecture from regression to classification, we added a final dense layer with three neurons representing the potential classes. The softmax function (9) maps the outputs of the dense layer into the [0, 1] range to be used in the cross-entropy cost computation (10):

Cost(x, y) = − Σ_{i=1}^{3} y_i log(x_i) (10)

where x encodes the output of our network and y encodes the true class of the output. The GAP layer enables the use of Class Activation Maps (CAMs) to visualize the contributing regions in the time series which lead to a specific classification: where k represents the k-th filter activation in the previous layer and n represents the size of the vector x.
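The softmax and cross-entropy steps can be written out directly; the logits below are made-up network outputs for the three health states:

```python
import numpy as np

# Softmax over the three-neuron output layer and cross-entropy against a
# one-hot true label. Logits are illustrative, not real network outputs.
def softmax(z):
    e = np.exp(z - z.max())          # subtract max for numerical stability
    return e / e.sum()

def cross_entropy(x, y):
    """Cost(x, y) = -Σ_i y_i log(x_i), x = softmax output, y = one-hot label."""
    return float(-np.sum(y * np.log(x)))

logits = np.array([2.0, 0.5, -1.0])          # dense-layer output before softmax
probs = softmax(logits)                      # probabilities summing to 1
y_true = np.array([1.0, 0.0, 0.0])           # true class: Healthy
print(round(probs.sum(), 6))  # 1.0
```

The cost is low when the probability mass sits on the true class and grows as it shifts to the wrong classes, which is what gradient descent exploits.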

With the cross-entropy function (Equation (10)) as the cost function, and the architecture and hyperparameters set as in Figure 8, the CNN trains for 100 epochs, reaching 99.67% accuracy on the validation set representing 20% of the data (Figure 11a). As a function of the number of epochs, the accuracy follows the rule of diminishing returns, with the highest levels plateauing after 100 epochs (a sign that the training has reached a local minimum of the cost function). These validation data were set aside at the beginning of the experiment.
Supervised classification algorithms are often presented with confusion matrices to help understand on which classes the model makes errors and where it is most accurate. Here (Figure 11b), we observe that the network identifies the F13 case perfectly (100% accuracy), with the 0.34% of errors being made on the F50 and Healthy cases.
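A confusion matrix of the kind shown in Figure 11b can be built directly from predicted and true health states; the labels and predictions below are made-up, not our experimental results:

```python
import numpy as np

# Sketch of building a confusion matrix from predicted vs. true health
# states (0=Healthy, 1=F13, 2=F50). The data here are illustrative only.
def confusion_matrix(y_true, y_pred, n_classes=3):
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1                 # rows: true class, columns: predicted
    return cm

y_true = [0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 2, 1, 1, 2, 2, 0]        # two errors: Healthy→F50 and F50→Healthy
print(confusion_matrix(y_true, y_pred))
```

Off-diagonal cells count misclassifications, so a perfectly identified class (such as F13 in our results) shows a zero in every off-diagonal cell of its row.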

Discussion
To our knowledge, this is the first paper that presents an operational and highly accurate working prototype trained in an end-to-end approach, with real-time capability and efficient storage and monitoring of predicted diagnostics.

Related Work
Recent algorithmic developments, allied to the availability of data and increased computing power, suggest that deep learning is the most advanced technique for pattern recognition [19]. Being able to stack hundreds of layers of neurons has expanded the range of complex functions that can be learned from a variety of structured and unstructured data. Fawaz et al. [21] provide a detailed account of existing end-to-end time series architectures that utilize deep learning for prediction.
These approaches have begun to be adapted for machine health monitoring models (Zhao et al. [5]). The authors explain that machine learning models for diagnostics fall into three categories:

1.
Traditional physics-based models: these models assume a mathematical understanding behind failures and are rigid in updating with new data; 2.
Conventional data-driven models: bottom-up approaches that offer more flexibility but are unable to model large scale data; such models require expert knowledge and hand-crafted features; 3.
Deep-learning models: a bottom-up approach based on Artificial Neural Networks (ANNs), which find discriminating features in data without the need for an expert and represent a stepping stone towards end-to-end learning.
Hence, deep learning offers significant potential for more timely, accurate and economical diagnostics to achieve a lower maintenance cost which is essential for the adoption of renewables.
Ince et al. [22] describe work using CNNs for real-time motor fault detection. They developed an elegant and flexible solution which combines feature extraction and classification at a low computational cost using a CNN. From a three-phase squirrel-cage induction motor, they obtained 2600 healthy and 2600 faulty bearing samples. Their model consists of three hidden convolution layers and two deep Multilayer Perceptron (MLP) layers. These CNN-based results are competitive with complex models requiring preprocessing using Fourier or Wavelet Packet Transforms, in terms of accuracy (97.4%), sensitivity (97.8%), specificity (97.0%) and positive prediction (97.0%).
Wang et al. [13] introduced an end-to-end CNN model for time series classification that claimed state-of-the-art performance on 44 time series datasets from the University of California Riverside (UCR) archive [14]. Their CNN architecture achieved a lower Mean Per-Class Error than approaches such as MLPs, ResNet, Multi-Scale CNN (MCNN) and several others. To interpret the decision making behind their CNN model, the authors [13] used Class Activation Maps (CAMs) to identify the contributing regions in the raw time domain. This relies on Global Average Pooling layers, which also reduce the number of parameters and improve generalization. With previous claims of state-of-the-art performance on univariate time series classification, MCNN [23] has seen recent successes in CM of rolling bearings in shipborne antennas [24]. The drawback of MCNN is the need for additional data-preparation operations (down-sampling, smoothing, sliding windows, etc.), which require expert intervention.
Sun et al. [25] proposed a CNN and SVM to diagnose motor faults using vibration signals. The model accurately distinguishes (97.8 to 100%) between six motor conditions: normal, stator winding defect, unbalanced rotor, defective bearing, broken bar, bowed rotor. However, as well as introducing the complexity of new parameters for the SVM (choice of kernel function, C-penalty parameter, degree of the polynomial kernel, etc.), vibration signals have been identified as potentially unreliable.
Kao et al. [26] showed that vibration signals often suffer from noise that cannot be fully suppressed and influences diagnostic accuracy. To address these issues, the authors proposed a CNN model utilizing stator current to predict five different motor states, including two demagnetization fault states and two bearing fault states. Further, varying the speed between 150 and 3000 rpm results in more discriminative features being selected. Using a CNN achieves better performance (98.8%) compared with a Wavelet Packet Transform model (98.1%), a traditional methodology requiring expert hyperparameter tuning (e.g., mother wavelet, number of features, etc.).
Jeong et al. [27] trained a CNN to predict demagnetization faults and inter-turn short circuits for Interior Permanent Magnet Synchronous Machines (IPMSMs). Using stator currents, the authors extracted nine components of the Fourier Transform, which were used with a shallow, two-layer CNN. The authors report 99.87% accuracy on the training set and 98.96% on the test set. For demagnetization fault prediction, they identify a fundamental component much larger than normally explained by current flow due to the back electromotive force. As was shown with our procedure, a deeper CNN is able to extract the relevant features without the intermediate step of a Fourier Transform.
A normal behavior model was proposed by Kong et al. [28] that uses both a CNN and a Gated Recurrent Unit (a type of RNN). This fused model was trained on a healthy Supervisory Control and Data Acquisition (SCADA) dataset, which allows for residual analysis when functioning in real-time. At each time t, m features are produced by convolving SCADA sensor data with a filter of size one, extracting spatial features. These spatial features are then fed into an autoregressive Gated Recurrent Unit (GRU) model that predicts future values for each sensor. Predicted values are compared to real ones, and deviations indicate departure from the healthy state. The authors showcase their approach on a use case involving gear crack and compare their method favorably to several others.
Given that ANNs come in many types of architecture (e.g., Multi-Layer Perceptrons or MLPs, CNN, Recurrent Neural Networks (RNN), Transformers, etc.) the selection of the right approach for a given problem is challenging. Compounding this problem is the fact that they are generally flexible enough to make relatively good predictions even for wrong input data. MLPs with non-linear activations have been shown, through the universal approximation theorem, to be able to represent any function with small error [29]. In classification problems where there is spatial (e.g., neighborhood pixels in 2D images) or sequential relatedness (as in univariate or multivariate signals), CNNs represent an appropriate choice. A major benefit from using CNNs is their sparse interaction with kernels that are much smaller than the input which makes them more efficient, with fewer parameters needed and a lower memory footprint [19]. The convolution operation results in outputs which are a function of relatively small neighborhoods predefined by the user in convolution layer kernel sizes.
For our diagnostic task, we have used a CNN, as it can capture the relationships between neighboring measurements when point-predictions are made on real-time samples (our models see only the smallest, most recent batch of measurements).
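As a concrete illustration of this choice, a compact Keras sketch of a 1D-CNN classifier over 250-sample current windows follows; the filter counts and kernel sizes here are illustrative assumptions, not the exact architecture of Figure 8:

```python
# Minimal 1D-CNN classifier over raw current windows, sketched with the
# Keras API used in this work. Layer sizes are illustrative assumptions.
from tensorflow import keras
from tensorflow.keras import layers

def build_classifier(window: int = 250, n_classes: int = 3) -> keras.Model:
    return keras.Sequential([
        layers.Input(shape=(window, 1)),            # one raw current channel
        layers.Conv1D(100, kernel_size=10, activation="relu"),
        layers.Conv1D(100, kernel_size=10, activation="relu"),
        layers.GlobalAveragePooling1D(),            # fewer parameters, better generalization
        layers.Dense(n_classes, activation="softmax"),
    ])

model = build_classifier()
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
```

The small kernels act only on local neighborhoods of the signal, which is the sparse-interaction property discussed above.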
When longer term temporal dependencies between measurements are expected and the task is to predict the system's future state (as in prognostics), RNN variants might become more relevant. Unlike CNNs, which share parameters in kernels, RNNs make use of internal memory states which are passed between consecutive computations. RNNs, such as the Gated Recurrent Unit (GRU) and Long Short-Term Memory (LSTM), have been successfully used in other sequence prediction problems in domains ranging from natural language modelling [30], to physics [31], speech recognition [32] and so on. A requirement of RNNs is that sequential data are processed in an order that limits their parallelization. A more recent architecture called Transformer [33] overcomes this limitation achieving state-of-the-art results on Natural Language Processing (NLP) tasks with potential for time series forecasting [34]. This architecture will be investigated in the future.
A major benefit from using models, such as CNN, is their powerful hierarchical representation of learned features. While the initial layers capture small patterns and variations in the input signals, the complexity of the learned features increases as a function of network depth. Transfer learning is about reusing lower layers of a pretrained model (Model 1 Section A in Figure 12) to speed up the training of a second model (Model 2 in Figure 12) on a related task. In some cases, the training time can be effectively reduced by a factor of 4 as in [35]. Future work will investigate the reuse of weights via transfer learning for the task of predicting failures.
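The reuse of lower layers described above can be sketched in Keras as follows; the cut point (the layer before the old head) and the new softmax head are assumptions for illustration, not the exact procedure of [35]:

```python
# Transfer-learning sketch: reuse the convolutional base of a trained
# "Model 1" and train only a new classification head ("Model 2").
from tensorflow import keras
from tensorflow.keras import layers

def transfer(base_model: keras.Model, n_new_classes: int) -> keras.Model:
    # keep everything up to (and including) the penultimate layer
    feature_extractor = keras.Model(base_model.input,
                                    base_model.layers[-2].output)
    feature_extractor.trainable = False   # freeze the pretrained weights
    new_head = layers.Dense(n_new_classes, activation="softmax")
    return keras.Model(feature_extractor.input,
                       new_head(feature_extractor.output))
```

Only the new head's weights are updated during training, which is what shortens training on the related task.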

Interpreting CNN Results
An argument against deep learning is lack of interpretability. Unlike decision trees, which have self-explainable, tree-like decision structure, the patterns learned by CNNs are encoded in synaptic weights which are stored as a matrix of floats. Using backpropagation, these weights are adjusted during training in such a way as to minimize our cost function of choice (e.g., MSE).
While model-agnostic methods (e.g., partial dependence plots) are generally applicable [36], special tools have been developed to aid interpretation. The layers of a CNN learn increasingly complex features as a function of depth; to visualize these features, one can optimize:

ts^{*} = \operatorname{argmax}_{ts} \frac{1}{n} \sum_{i=1}^{n} act_{L}(ts) \qquad (12)

where ts can be a time series representing a sample; the optimization essentially selects the sample which maximizes the activations in layer L. While this is a valid approach, it does not work well when multiple elements are present at the same time, yielding mixed features. The solution here was to start with an empty series of 250 measurements (the same size as our sample data), which was tweaked through multiple iterations so as to maximize \sum act_{L}. Finally, we have 100 maximized activations, one for each convolution filter; an effective way to summarize and plot them is by clustering into four groups (using K-means [37]); see Figure 13, which presents the first (a) and last (b) activation layers of the model trained to predict generator speed. As expected, the increasing feature complexity illustrates the hierarchical nature of feature learning in convolutional networks. The shapes of the series in Figure 13b show patterns that are intuitive for understanding speed: some filters are activated when a sample has more cycles (high speed) or fewer cycles (low speed).
Another tool of note is Class Activation Maps (CAMs), which show which segments in the time domain cause the network to make specific classifications. The equation describing CAMs is:

CAM_{c}(x) = \sum_{k} w_{k}^{c} S_{k}(x)

where S_{k}(x) represents the k-th filter activation in the last convolutional layer at temporal location x and w_{k}^{c} represents the weight of the final softmax function for the output from filter k and class c. Figure 14 shows the parts of the signal that the CNN finds relevant for predicting the state as healthy, 13% and 50% demagnetized, at high speeds and high loads. We illustrate this by coloring in red the discriminative segments and in blue the ones that the network considers irrelevant. The CNN considers larger, uniform segments in the time domain to confidently predict the healthy case, valleys for the 13% demagnetized case, and peaks and valleys for the 50% demagnetized case.
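The CAM computation reduces to a weighted sum of the last convolutional layer's activations; a minimal numpy sketch follows (the shapes and weight values are toy assumptions, not values from our model):

```python
import numpy as np

def class_activation_map(S, w, c):
    """CAM_c(x) = sum_k w[k, c] * S[x, k].

    S: (T, K) activations of the K filters of the last conv layer over
       T temporal locations; w: (K, C) softmax weights; c: class index.
    """
    return S @ w[:, c]

# toy example: 4 temporal locations, 2 filters, 3 classes
S = np.array([[1.0, 0.0],
              [0.5, 0.5],
              [0.0, 1.0],
              [1.0, 1.0]])
w = np.zeros((2, 3))
w[:, 1] = [2.0, -1.0]                 # weights feeding class c = 1
cam = class_activation_map(S, w, 1)   # high values mark discriminative segments
```

Plotting the signal colored by cam, as in Figure 14, highlights the temporal regions driving the classification.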

Storage and Visualization
To gather data, we utilize NI-DAQmx [38] which is a Python API for interacting with the National Instruments Data Acquisition Cards. For developing the CNNs, we utilize Keras API [12] which provides a layer of abstraction above TensorFlow [39]. Knowledge of TensorFlow is particularly useful if finer control of the underlying models is needed.
Once the models have been trained, with the laptop connected to the test-rig, a callback function is called 20 times per second, requesting the CNNs to predict speed, load and health state. The results are packed into a JSON document and sent to InfluxDB [40], an open source, time-series optimized database (TSDB). Compared to traditional database systems, TSDBs offer fast and easy range queries, high write performance, data co-location, data compression, scalability and usability [41]. As of August 2019, InfluxDB dominates the ranking of DB-Engines [42], which collects monthly information about the usage of database management systems.
To visualize the predictions in real-time, we utilize Grafana [43], an open source analytics and monitoring platform. Grafana can connect to various DBs natively, providing specific query editors. It supports time series plots, heatmaps, histograms, geomaps and many other tools which aid in understanding data. In Figure 15, we present a screenshot of Grafana's dashboard showing Speed and Load as time series plots, Speed as a gauge, and Health Status as a text box. Of relevance for diagnostics and maintenance is Grafana's alert system, which can be configured visually in the dashboard. Criteria such as "Speed > threshold in the last hour" can trigger alerts, with custom messages sent via email, Slack, PagerDuty, etc.
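The JSON documents sent to InfluxDB can be shaped as below; the measurement and field names are illustrative assumptions, not the exact schema used in this work:

```python
import time

def build_point(speed, load, state):
    """Shape one prediction as an InfluxDB JSON point.

    Measurement and field names ("generator_diagnostics", "speed", etc.)
    are hypothetical, for illustration only.
    """
    return {
        "measurement": "generator_diagnostics",
        "time": int(time.time() * 1e9),   # epoch timestamp, nanosecond precision
        "fields": {"speed": float(speed), "load": float(load), "state": state},
    }

# writing with the influxdb client (requires a running server):
# from influxdb import InfluxDBClient
# client = InfluxDBClient(host="localhost", port=8086, database="turbine")
# client.write_points([build_point(1500.0, 0.6, "Healthy")], time_precision="n")
```

Each callback then appends one such point, giving the 20 points per second that Grafana queries for display.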



Conclusions
Before its universal adoption, renewable energy has various hurdles to cross, such as intermittency, variable capacity and the high cost of maintenance. This study provides requirements for, and analyzes the efficacy of, establishing an autonomous deep learning application for CM of a Type IV permanent magnet wind turbine generator. The paper further describes an end-to-end, real-time set of models for system diagnostics, an integral part of operation and maintenance.
We have shown how to create a complete and functional diagnostic system from a fundamental level, starting with test-rig hardware setup, data acquisition, end-to-end model training and validation, real-time operation, efficient signal storage and interactive visualization. Moreover, the system only relies on raw current signals which are readily available, cost effective to monitor and usually already captured in commercial wind turbine systems for control purposes.
Using a purpose-built Type IV wind turbine generator test rig, we sampled currents at 5 kHz representing different levels of speed, load and health state (magnetization). Splitting into examples of 250 measurements allowed us to train models to make 20 predictions per second, a rate which can theoretically increase with higher sampling rates. Given the continuous range of generator speed and load, we sampled uniformly four equidistant points in their operating range for training data selection. Cast as a regression problem, both models achieved low MSE (x and y). We have shown that the speed model can predict and generalize well beyond the specific ranges on which it was trained.
Deep Learning with CNNs has been used with MSE (Speed and Load) and Cross-Entropy (Health State) as cost functions. Compared to traditional data-driven approaches that require feature engineering, these models combine multiple feature extraction stages with a classifier resulting in end-to-end architectures that run on raw data. This represents a paradigm shift from time-consuming, error-prone feature processing of raw signals.
Utilizing Class Activation Maps enables visualization of the segments in the time domain which trigger the models' classification decisions. This has the potential to guide and inform a physics-based understanding of the relation between signals and fault states. Connected to the test-rig, our system makes 20 predictions per second in real time, stores them in an efficient time-series oriented database (InfluxDB) and interactively visualizes them using Grafana.
Future work will expand the collection of models to include other operational and fault states. We will extend the work on magnetization and frame it as a regression problem, aiming to predict arbitrary levels of demagnetization. Moreover, we will consider non-uniform demagnetization, where magnets fail arbitrarily around the shaft. Further, we are interested in exploring diagnostics for other types of failure, such as those occurring in bearings. For faults that cannot be reliably predicted using sensed currents alone, we are interested in sensor data fusion and multivariate CNNs.