Online learning of windmill time series using Long Short-term Cognitive Networks

Forecasting windmill time series is often the basis of other processes such as anomaly detection, health monitoring, or maintenance scheduling. The amount of data generated on windmill farms makes online learning the most viable strategy to follow. Such settings require retraining the model each time a new batch of data is available. However, updating the model with the new information is often very expensive to perform using traditional Recurrent Neural Networks (RNNs). In this paper, we use Long Short-term Cognitive Networks (LSTCNs) to forecast windmill time series in online settings. These recently introduced neural systems consist of chained Short-term Cognitive Network blocks, each processing a temporal data chunk. The learning algorithm of these blocks is based on a very fast, deterministic learning rule that makes LSTCNs suitable for online learning tasks. The numerical simulations using a case study with four windmills showed that our approach reported the lowest forecasting errors when compared to a simple RNN, a Long Short-term Memory, a Gated Recurrent Unit, and a Hidden Markov Model. What is perhaps more important is that the LSTCN approach is significantly faster than these state-of-the-art models.


Introduction
Humanity's sustainable development requires the adoption of less environmentally aggressive energy sources. Over the last years, renewable energy sources (RES) have increased their presence in the energy matrix of several countries. These clean energies are less polluting, renewable, and abundant in nature. However, limitations such as volatility and intermittency reduce their reliability and stability for power systems. This hinders the integration of renewable sources into the main grid and increases their generation costs [38].
Power generation forecasting [13] is one of the approaches adopted to facilitate the optimal integration of RES in power systems. Overall, the goal of power generation forecasting is to know in advance the possible disparity between generation and demand due to fluctuations in energy sources [1]. Forecasting methods used for renewable energies are based on physical, statistical, or machine learning (ML) models. Although ML models often achieve the highest performance compared to physical and statistical models, their deployment in real applications is limited [17]. On the one hand, most ML models require feature engineering before building the model and lack interpretability. On the other hand, these methods usually assume that the training data are completely available in advance. Hence, most ML methods are unable to incorporate new information into the previously constructed models [40].
Within the approaches to clean energy, wind energy has shown sustained growth in installed capacity and exploitation in recent years [2]. However, wind energy has some peculiarities to be considered when designing new forecasting solutions. Firstly, wind-based power generation can heavily be affected by weather variability, which means that the power generation fluctuates with extreme weather phenomena (i.e., frontal systems or rapidly evolving low-pressure systems). Weather events are unavoidable, but their impact can be minimized if anticipated in advance. Secondly, wind generators are dynamic systems that behave differently over time (i.e., due to wear of turbine components, maintenance, etc). Finally, although following certain patterns, weather conditions have a high level of uncertainty. These characteristics make traditional ML methods inadequate to model the dynamics of these systems properly. This means that new approaches are needed to improve the prediction of wind generation. The development of algorithms capable of learning beyond the production phase will also allow them to be kept up-to-date at all times [27].
In this paper, we tackle the task of forecasting power generation in windmills using recurrent neural networks. This task is modeled as an online learning problem since windmill data is not static (i.e., turbines will continue to operate and new pieces of data will be available). The proposed model relies on a recently introduced neural system termed Long Short-term Cognitive Network (LSTCN) [30] that allows for online learning. In this model, each iteration processes a data chunk using a Short-term Cognitive Network (STCN) block [31] that operates with the knowledge transferred from the previous block. This means that the model can easily be retrained without compromising what the network has learned from previous data chunks. The numerical simulations using a case study comprised of four windmills show that our model (i) outperforms other recurrent neural networks when it comes to forecasting error, and (ii) reports significantly faster training and test times.
The remainder of the paper is organized as follows. Section 2 revises the literature about recurrent neural networks used to forecast windmill time series. Section 3 presents the proposed LSTCN-based power forecasting model for an online learning setting. Section 4 describes the case study, the state-of-the-art recurrent models used for comparison purposes, and the simulation results. Finally, Section 5 concludes the paper and suggests further research directions to be explored.

Forecasting models with recurrent neural networks
Neural networks are a family of biologically-inspired computational models that have found applications in many fields. One of the most prominent engineering applications of neural models is to support the operation and maintenance of wind turbines. In this area, neural models dedicated to the analysis of temporal data have proven to be essential. This is motivated by the fact that typical data describing the operation of a wind turbine are collected by sensors forming a supervisory control and data acquisition (SCADA) system [11,42]. Such data come in the form of long sequences of numerical values, thus making Recurrent Neural Networks (RNNs) the right choice for their processing. This section presents the literature on the applications of RNNs for data analysis in the area of wind turbine operation and maintenance support.
RNNs differ from other neural networks in the way the input data is propagated. In standard neural networks, the input data is processed in a feed-forward manner, meaning the signal is transmitted unidirectionally. In RNN models, the signal goes through neurons that can have backward connections from further layers to earlier layers [6]. Depending on a particular neural model architecture, we can restrict the layers with feedback connections to only selected ones. The overall idea is to allow the network to "revisit" nodes, which mimics the natural phenomenon of memory [18]. RNNs turned out to be useful for accurate time series prediction tasks [39], including wind turbine time series prediction [9].
As reported by Zhang et al. [46], the task of analyzing wind turbine data often involves building a regression model operating on multi-attribute data from SCADA sensors. Such models can help us understand the data [10,16].
Currently, the most popular variant of RNN in the field of wind turbine data processing is the Long Short-Term Memory (LSTM) model [15,29]. In this model, the inner operations are defined by neural gates called cell, input gate, output gate, and forget gate. The cell acts as the memory, while the other components determine the way the signal propagates through the neural architecture [47]. The introduction of these specialized units helped prevent (to some extent) the gradient problems associated with training RNN models [37].
Existing neural network-based approaches to wind turbine data forecasting do not pay enough attention to the issue of model complexity and efficiency. In most studies, authors reduce the available set of input variables rather than optimizing the neural architecture used. For example, Feng et al. [12] used LSTM model with hand-picked three SCADA input variables, while Sun et al. [36] used eleven SCADA variables. Qian et al. [33] also used LSTM to predict wind turbine data. In their study, the initial set of input variables consisted of 121 series, but this was later reduced to only three variables and then to two variables using the Mahalanobis distance method. The issue of preprocessing and feature selection was also raised by Wang et al. [41], suggesting Principal Component Analysis (PCA) to reduce the dimensionality of the data.
LSTM has been found to perform well even when the processed time series variables are of incompatible types. In that regard, it is worth citing the study of Lei et al. [21], who used LSTM to predict two qualitatively different types of time series simultaneously: (i) vibration measurements that have a high sampling rate and (ii) slow varying measurements (e.g., bearing temperature). It should be noted that existing studies bring additional techniques that enhance the capabilities of the standard LSTM model. For example, Cao et al. [5] propose segmenting the data and using segment-related features instead of raw signals. Xiang et al. [43] also do not use raw signals. Instead, they use Convolutional Neural Networks (CNNs) to extract the dynamic features of the data, which is then fed to LSTM. A similar approach, combining CNN with LSTM, was presented by Xue et al. [44]. Another interesting technique was introduced by Chen et al. [7], who combined LSTM with an auto-encoder (AE) neural network so that their model can detect and reject anomalies while achieving better results for nonanomalous data. Liu et al. [25] used wavelet decomposition together with LSTM and found that it achieves better results than standard LSTM, but this comes at the cost of increased time complexity (training time increases by about 30%). Other studies on LSTM and wind power prediction have focused on tuning the LSTM architecture, for example, by testing different transformation functions [45] or by adding a specialized imputation module for missing data [22].
In addition, the bidirectional LSTM model [14] has also been applied to forecast wind turbine data. The application of this model is present in the study of Zhen et al. [48] and, in a deeper architecture, in the study of Cao et al. [4].
While most of the recently published studies using neural models to predict multivariate wind turbine time series employ LSTM, there are also several alternative approaches focusing on other RNN variants. For example, there are several papers on the use of Elman neural networks in forecasting multivariate wind turbine data [23,24]. Kramti et al. [20] also applied Elman neural networks, but with a slightly modified architecture. In addition, we should mention the work of López et al. [26], which involved Echo State Network and LSTM. Finally, it is worth mentioning the work of Kong et al. [19], in which the task of processing data from wind turbines is implemented using CNNs and Gated Recurrent Unit (GRU) [8]. The latter neural architecture is a variant of RNN, which can be seen as a simplification of the LSTM architecture. GRU was also used in the study of Niu et al. [32], which employs attention mechanisms to reduce the forecasting error.
There are other models equipped with reasoning mechanisms similar to the one used by neural networks. In particular, the concept of "neuron" can also be found in Hidden Markov Models (HMMs) [35]. Such neurons are implemented as states, and the set of states essentially plays a role analogous to that of hidden neurons in a standard neural network. HMMs have also found applications in wind power forecasting. The studies of Bhaumik et al. [3] and Qu et al. [34] should be mentioned in this context. Both research teams highlight decent predictions and robustness to noise in the data.
As pointed out by Manero et al. [28], the task of comparing wind energy forecasting approaches described in the literature is challenging due to several factors such as the differences in time series datasets, the alternative forecast horizons, etc. In this paper, we conducted experiments for key state-of-the-art models on our data alongside the LSTCN formalism. The methodology adopted allows us to draw conclusions about the forecasting accuracy of different models and compare their empirical computational complexity.

Long Short-term Cognitive Network
In this section, we elaborate on the LSTCN model used for online learning of multivariate time series.

Data preparation for online learning simulations
Let x ∈ R be a variable observed over a discrete time scale within a period t ∈ {1, 2, . . . , T}, where T ∈ N is the number of observations. Hence, a univariate time series can be defined as a sequence of observations X = (x^(1), x^(2), . . . , x^(T)). Similarly, we can define a multivariate time series as a sequence X = (x^(1), x^(2), . . . , x^(T)) where x^(t) ∈ R^M comprises the values of the M variables observed at time t. A model F is used to forecast the next L < T steps ahead. In this paper, we assume that the model F is built as a sequence of neural blocks with local learning capabilities, each able to capture the trends in the current time patch (i.e., a chunk of the time series) being processed. Both the network architecture and the parameter learning algorithm will be detailed in the following sub-sections.
Let us assume that X ∈ R^(M×T) is a dataset comprising a multivariate time series (Figure 1a). Firstly, we need to transform X into a set of Q tuples of the form (X^(t−R), X^(t+L)), with t − R > 0 and t + L ≤ T, where R represents how many past steps we use to forecast the following L steps ahead (see Figure 1b). In this paper, we assume that R = L for the sake of simplicity. Secondly, each component in the tuple is flattened such that we obtain a Q × N matrix where N = M(R + L). Finally, we create a partition P = {P^(1), . . . , P^(k), . . . , P^(K)} from the set of flattened tuples such that P^(k) ∈ R^(C×N) is the k-th time patch involving two data pieces P1^(k) and P2^(k). The former (orange section in Figure 1c) is regarded as the input, while the latter (blue section in Figure 1c) is the corresponding output.
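The data preparation steps above can be sketched as follows. This is a minimal illustration on a toy array; the function name is ours, and since the paper does not specify whether the sliding windows overlap, we assume a stride of one step:

```python
import numpy as np

def prepare_patches(X, L, C):
    """Slice a multivariate series X (M x T) into flattened
    (input, output) tuples and group them into time patches.

    Each tuple pairs R = L past steps with the next L steps; each
    flattened piece has M*L columns, so a full tuple has N = M(R+L)
    values, as in the paper. Patches hold C tuples each.
    """
    M, T = X.shape
    inputs, outputs = [], []
    for t in range(0, T - 2 * L + 1):  # stride of one step (our assumption)
        inputs.append(X[:, t:t + L].flatten())            # past R = L steps
        outputs.append(X[:, t + L:t + 2 * L].flatten())   # next L steps
    P1, P2 = np.array(inputs), np.array(outputs)
    # partition the Q tuples into patches of C rows each
    return [(P1[i:i + C], P2[i:i + C]) for i in range(0, len(P1), C)]

# toy example: M = 2 variables, T = 40 observations
X = np.random.rand(2, 40)
patches = prepare_patches(X, L=4, C=2)
P1_first, P2_first = patches[0]
print(P1_first.shape)  # (2, 8): C = 2 rows of M * L = 8 values each
```

Each element of `patches` corresponds to one P^(k), split into its input piece P1^(k) and output piece P2^(k).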

Network architecture and neural reasoning
In the online learning setting, we consider a time series (regardless of the number of observed variables) as a sequence of time patches of a certain length. Such a sequence refers to the partition P = {P (1) , . . . , P (k) , . . . , P (K) } obtained with the data preparation steps discussed in the previous subsection. Hence, the proposed network architecture consists of an LSTCN model able to process the sequence of time patches.
An LSTCN model can be defined as a collection of STCN blocks, each processing a specific time patch and transferring knowledge to the following STCN block in the form of a weight matrix. Figure 2 shows the recurrent pipeline of an LSTCN involving three STCN blocks to model a multivariate time series decomposed into three time patches. It should be highlighted that learning (see Figure 4) happens inside each STCN block to prevent the information flow from vanishing as the network processes more time patches. Moreover, weights estimated in the current STCN block are transferred to the following STCN block to perform the next reasoning process (see Figure 3). These weights will no longer be modified in subsequent learning processes, which allows preserving the knowledge learned up to the current time patch. That makes our approach suitable for online learning settings.
The reasoning within an STCN block involves two gates: the input gate and the output gate. The input gate operates the prior knowledge matrix W1^(k) ∈ R^(N×N) with the input data P1^(k) ∈ R^(C×N) and the prior knowledge matrix B1^(k) ∈ R^(N×1) denoting the bias weights. Both matrices W1^(k) and B1^(k) are transferred from the previous block and remain locked during the learning phase performed in that STCN block. The result of the input gate is a temporal state H^(k) ∈ R^(C×N) that represents the outcome the block would have produced given P1^(k) had the block not been adjusted to the block's expected output P2^(k). Such an adaptation is done in the output gate, where the temporal state is operated with the matrices W2^(k) ∈ R^(N×N) and B2^(k) ∈ R^(N×1), which contain learnable weights. Figure 3 depicts the reasoning process within the k-th block. Equations (1) and (2) show the short-term reasoning process of this model in the k-th iteration:

H^(k) = f(P1^(k) W1^(k) ⊕ B1^(k))    (1)

P̂2^(k) = f(H^(k) W2^(k) ⊕ B2^(k))    (2)

where ⊕ denotes the row-wise addition of the bias vector, f(x) = 1/(1 + e^(−x)), and P̂2^(k) is an approximation of the expected block's output. Notice that we assume that the values to be forecast are in the [0, 1] interval. As mentioned, the LSTCN model consists of a sequential collection of STCN blocks. In this neural system, the knowledge from one block is passed to the next one using an aggregation procedure (see Figure 2). This aggregation operates on the knowledge learned in the previous block (that is to say, the W2^(k−1) matrix in Figure 4). In this paper, we use the non-linear operator W1^(k) = g(W2^(k−1)) with g(x) = tanh(x) in all our simulations. However, we can design operators combining the knowledge in both W1^(k−1) and W2^(k−1).
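The two-gate reasoning above can be sketched as follows. This is our own illustrative implementation under the shape conventions given in the text; the weights here are random placeholders standing in for the prior and learnable knowledge matrices:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def stcn_forward(P1, W1, B1, W2, B2):
    """Short-term reasoning of one STCN block.

    Input gate:  H = f(P1 W1 (+) B1)       -- frozen prior knowledge
    Output gate: P2_hat = f(H W2 (+) B2)   -- learnable weights
    where (+) adds the bias vector to every row.
    """
    H = sigmoid(P1 @ W1 + B1.T)        # temporal state, C x N
    P2_hat = sigmoid(H @ W2 + B2.T)    # approximation of the expected output
    return H, P2_hat

# toy dimensions: C = 5 patch rows, N = 8 features
rng = np.random.default_rng(0)
C, N = 5, 8
P1 = rng.random((C, N))
W1, W2 = rng.random((N, N)), rng.random((N, N))
B1, B2 = rng.random((N, 1)), rng.random((N, 1))
H, P2_hat = stcn_forward(P1, W1, B1, W2, B2)
print(P2_hat.shape)  # (5, 8), all values in (0, 1)
```

The aggregation between blocks then reduces to `W1_next = np.tanh(W2)` (and likewise for the bias), matching g(x) = tanh(x).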
There is an important detail to be discussed. Once we have processed the available sequence (i.e., performed K short-term reasoning steps with their corresponding learning processes), the whole LSTCN model will narrow down to the last STCN block. Therefore, that network will be used to forecast new data chunks as they arrive and a new learning process will follow, as needed in online learning settings.

Parameter learning
Training the LSTCN in Figure 2 means training each STCN block with its corresponding time patch. The learning process within a block is partially independent of other blocks as it only uses the prior weight matrices that are transferred from the previous block. As mentioned, these prior knowledge matrices are used to compute the temporal state and are not modified during the block's learning process.
The learning task within an STCN block can be summarized as follows. Given the temporal state H^(k) resulting from the input gate and the block's expected output P2^(k), we need to compute the matrices W2^(k) ∈ R^(N×N) and B2^(k) ∈ R^(N×1). Figure 4 shows the rationale of this learning process. Mathematically speaking, the learning is performed by solving a system of linear equations that adapts the temporal state to the expected output. Equation (5) displays the deterministic learning rule solving this regression problem:

(W2^(k) | B2^(k))ᵀ = ((Φ^(k))ᵀ Φ^(k) + λΩ^(k))⁻¹ (Φ^(k))ᵀ P2^(k)    (5)

where Φ^(k) = (H^(k) | A) such that A ∈ R^(C×1) is a column vector filled with ones, Ω^(k) denotes the diagonal matrix of (Φ^(k))ᵀ Φ^(k), and λ ≥ 0 denotes the ridge regularization penalty. This learning rule assumes that the activation values of the inner neurons are standardized. When the final weights are returned, they are adjusted back to their original scale.
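The deterministic rule can be sketched as a single ridge-regression solve. This is a simplified illustration with names of our choosing; it omits the standardization and rescaling steps mentioned in the text:

```python
import numpy as np

def stcn_fit(H, P2, lam=1e-2):
    """Deterministic learning rule of Equation (5):
    solve (Phi^T Phi + lam * Omega) W = Phi^T P2, with Phi = (H | 1)
    and Omega the diagonal of Phi^T Phi.

    Returns the learnable weights W2 (N x N) and bias B2 (N x 1).
    """
    C, N = H.shape
    Phi = np.hstack([H, np.ones((C, 1))])  # append the all-ones column A
    G = Phi.T @ Phi
    Omega = np.diag(np.diag(G))            # diagonal matrix of Phi^T Phi
    W = np.linalg.solve(G + lam * Omega, Phi.T @ P2)
    return W[:-1, :], W[-1:, :].T          # split weights and bias rows

rng = np.random.default_rng(1)
H = rng.random((20, 4))    # temporal state, C = 20, N = 4
P2 = rng.random((20, 4))   # expected block output
W2, B2 = stcn_fit(H, P2)
print(W2.shape, B2.shape)  # (4, 4) (4, 1)
```

Because the rule is one linear solve rather than an iterative gradient descent, retraining on a new patch is cheap, which is what makes the block suitable for online learning.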
It shall be noted that we need to specify W1^(0) and B1^(0) in the first STCN block. We can use a transfer learning approach from a previous learning process, or this knowledge can be provided by domain experts. When this information is not available, we fit a single STCN block without an intermediate state (i.e., H^(0) = P1^(0)) on a smoothed representation of the whole (available) time series. The smoothed time series is obtained using the moving average method for a given window size.
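The smoothing step can be sketched as a simple moving average applied per variable; this is a generic implementation, not the paper's code:

```python
import numpy as np

def moving_average(X, w=10):
    """Smooth each variable (row) of X (M x T) with a simple
    moving average of window size w. The output has T - w + 1
    columns since only full windows are kept."""
    kernel = np.ones(w) / w
    return np.array([np.convolve(row, kernel, mode='valid') for row in X])

X = np.random.rand(3, 100)
print(moving_average(X, w=10).shape)  # (3, 91)
```

The smoothed series is then used only once, to fit the initial block that provides W1^(0) and B1^(0).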

Numerical simulations
In this section, we will explore the performance (forecasting error and training time) of the proposed LSTCN-based online forecasting model for windmill time series.

Description of windmill datasets
To conduct our experiments, we adopted four public datasets from the ENGIE web page 1 . Each dataset corresponds to a windmill where measurements were recorded every 10 minutes from 2013 to 2017. The time series of each windmill contains 264,671 timestamps. Eight variables concerning the windmill and environmental conditions were selected: generated power, rotor temperature, rotor bearing temperature, gearbox inlet temperature, generator stator temperature, wind speed, outdoor temperature, and nacelle temperature.
As for the pre-processing steps, we removed duplicated timestamps, imputed missing timestamps and values, and applied a min-max normalization. Moreover, the data preparation procedure described in Figure 1 was applied to each dataset. Table 1 displays a descriptive summary of all datasets after normalization, where the minimum, median and maximum of the absolute Pearson's correlation values among the variables are denoted as min, med, max, respectively. We split each dataset using a hold-out approach (80% for training and 20% for testing purposes). As for the performance metric, we use the mean absolute error (MAE) in all simulations reported in this section. In addition, we report the training and test times of each forecasting model. The training time (in seconds) of each algorithm was computed by adding the time needed to train the algorithm on each time patch. Finally, we arbitrarily fix the patch size to 1024.
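For completeness, the error metric reduces to the mean of absolute deviations over all forecast values (a standard definition, sketched here rather than taken from the paper's code):

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error over all variables and forecast steps."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(np.abs(y_true - y_pred)))

print(mae([0.2, 0.5, 0.9], [0.25, 0.45, 0.8]))  # ~0.0667
```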

Recurrent online learning models
We contrast the LSTCNs' performance against four recurrent learning models used to handle online learning settings. The models adopted for comparison are a fully-connected Recurrent Neural Network (RNN) where the output is fed back to the input, GRU, LSTM, and HMM.
The RNN, LSTM and GRU networks were implemented using Keras v2.4.3, while HMM was implemented using the hmmlearn library 2 . The training of these models was adapted to online learning scenarios. In practice, this means that RNN, GRU and LSTM were retrained on each time patch using the prior knowledge structures learned in previous learning steps. In the HMM-based forecasting model, the transition probability matrix is passed from one patch to another, and it is updated based on the new information.
In the LSTCN model, we used L ∈ {6, 48, 72} such that R = L (from now on we will only refer to L) and a smoothing window size w = 10. Notice that, given the sampling interval of the data, six steps represent one hour while 72 steps represent half a day. We did not perform parameter tuning since the online learning setting demands fast re-training of these recurrent models when a new data chunk arrives. It would not be feasible to fine-tune the hyperparameters in each iteration since such a process is computationally demanding. Instead, we retained the default parameters reported in the corresponding Keras layers. In the HMM-based model, we used four hidden states and Gaussian emissions to generate the predictions. These parameter values were arbitrarily selected without further experimentation.

Results and discussion
As mentioned, the knowledge used by the first STCN block is extracted from a smoothed representation of the available time series data. Nevertheless, we can start with a zero-filled matrix if such knowledge is not available. Figure 5 shows the MAE of the predictions on the training set of one windmill in both settings. Starting from scratch (no knowledge about the data), the LSTCN starts predicting with a large MAE in the first time patch. As new data is received, the network updates its knowledge and reduces the prediction error. In this simulation, we used five time patches such that each STCN block is fitted on the newly received data. If the time series does not contain outliers, the error is expected to show little variability from patch to patch. As expected, the LSTCN trained with general knowledge of the time series (used as a warm-up) generates small errors from the first time patch.

Tables 2, 3 and 4 show the simulation results for L = 6, L = 48 and L = 72, respectively. More explicitly, we report the training and test errors, and the training and test times (in seconds). The LSTCN model obtained the lowest MAE values in all cases (the lowest test error for each windmill is highlighted in boldface). These results allow us to conclude that our approach is able to produce better forecasting results when compared with well-established recurrent neural networks.
Another clear advantage of LSTCN over these state-of-the-art algorithms is the reduced training and test times. Re-training the model quickly when a new piece of data arrives, while retaining the knowledge learned so far, is a key challenge in online learning settings. Recurrent neural models such as RNN, LSTM and GRU use a backpropagation-based learning algorithm to compute the weights regulating the network behavior. The algorithm needs to iterate multiple times over the data with limited vectorization possibilities. Overall, there is a trade-off between accuracy and training time when it comes to the batch size. The smaller the batch size, the more accurate the predictions are expected to be. However, smaller batch sizes make the training process slower. Another issue with gradient-based optimization methods is that they usually operate in a stochastic fashion, thus making them quite sensitive to the initial conditions. Notice that HMM also requires several iterations to build the probability transition matrix.

As the last experiment, we further investigate the effect of the prediction horizon (i.e., the number of steps ahead to be predicted in a single reasoning step) on the LSTCNs' performance. Figures 6, 7, 8 and 9 show the moving average of power predictions for the four turbines using the last STCN block, respectively. The reader can notice that the performance deteriorates as the prediction horizon increases. Despite this undesirable behavior, the performance of our LSTCN-based model is still superior to traditional RNN models. Therefore, the results advocate for applying the model in real-time systems that can benefit from shorter prediction horizons where forecasting accuracy still plays an important role.
To alleviate the problem caused by larger prediction horizons, we could increase the patch size in our model (or decrease the batch size of the other recurrent models used for comparison purposes). In that way, we will have more data in each training process, which will likely lead to models with improved predictive power. Alternatively, we could adopt an incremental learning approach to reuse data concerning previous time patches as defined by a given window parameter. However, we should be aware that many online learning problems operate on volatile data that is only available for a short period.

Concluding remarks
In this paper, we investigated the performance of Long Short-term Cognitive Networks to forecast windmill time series in online learning scenarios. This recently introduced recurrent neural system consists of a sequence of Short-term Cognitive Network blocks. Each of these blocks is trained with the data available at that moment in time such that the learned knowledge is propagated to the next blocks. Therefore, the network is able to adjust its knowledge to new information, which makes this model suitable for online settings since we retain the knowledge learned from previous learning processes.
The experiments conducted using four windmill datasets reported that our approach outperforms other state-of-the-art recurrent neural networks in terms of MAE. In addition, the proposed LSTCN-based model is significantly faster than these recurrent models when it comes to both training and test times. Such a feature is of paramount relevance when designing forecasting models operating in online learning modes. Regrettably, the overall performance of all forecasting models deteriorated when increasing the number of steps ahead to be predicted. While this result is not surprising, further efforts are needed to build forecasting models with better scalability properties as defined by the prediction horizon.