Spatio-Temporal Wind Speed Forecasting using Graph Networks and Novel Transformer Architectures

This study focuses on multi-step spatio-temporal wind speed forecasting for the Norwegian continental shelf. The study aims to leverage spatial dependencies through the relative physical location of different measurement stations to improve local wind forecasts. Our multi-step forecasting models produce either 10-minute, 1- or 4-hour forecasts, with 10-minute resolution, meaning that the models produce more informative time series for predicted future trends. A graph neural network (GNN) architecture was used to extract spatial dependencies, with different update functions to learn temporal correlations. These update functions were implemented using different neural network architectures. One such architecture, the Transformer, has become increasingly popular for sequence modelling in recent years. Various alterations have been proposed to better facilitate time series forecasting, of which this study focused on the Informer, LogSparse Transformer and Autoformer. This is the first time the LogSparse Transformer and Autoformer have been applied to wind forecasting and the first time any of these or the Informer have been formulated in a spatio-temporal setting for wind forecasting. By comparing against spatio-temporal Long Short-Term Memory (LSTM) and Multi-Layer Perceptron (MLP) models, the study showed that the models using the altered Transformer architectures as update functions in GNNs were able to outperform these. Furthermore, we propose the Fast Fourier Transformer (FFTransformer), which is a novel Transformer architecture based on signal decomposition and consists of two separate streams that analyse the trend and periodic components separately. The FFTransformer and Autoformer were found to achieve superior results for the 10-minute and 1-hour ahead forecasts, with the FFTransformer significantly outperforming all other models for the 4-hour ahead forecasts.


Introduction
In the context of the global climate debate, wind has emerged as a prominent renewable energy resource. The main contributions of this study can be summarised as follows:
• Test and compare the performance of different Transformer architectures for use in wind forecasting, namely the vanilla Transformer, LogSparse Transformer, Informer and Autoformer. This is the first paper to formulate many of these in a GNN setting and the first to apply such models to wind forecasting.
• We propose a new alteration of the Transformer architecture, namely the Fast Fourier Transformer (FFTransformer), and show its competitive performance in wind forecasting. The novel architecture is based on wavelet decomposition and an adapted Transformer architecture consisting of two streams. One stream analyses periodic components in the frequency domain with an adapted attention mechanism based on fast Fourier transform (FFT), and another stream similar to the vanilla Transformer, which learns trend components.

Related Works
Autoregressive moving average methods, such as the autoregressive integrated moving average (ARIMA) model, are robust and easy to implement, making them popular for wind forecasting. Kavasseri et al. [4] studied fractional-ARIMA models to perform one- and two-day-ahead wind speed forecasts, managing to outperform both a persistence and an ARIMA model for four potential wind generation sites in North Dakota. The persistence model is a commonly used benchmark in wind-speed forecasting, where the forecasted value, ŵ^s_(t+1), is simply taken as the last recorded value, w^s_t, i.e. ŵ^s_(t+1) = w^s_t. Since wind speed time series are characterised by both long-term trends and high-frequency variation, Singh et al. [5] proposed the RWT-ARIMA model, combining the ARIMA model with wavelet transform (WT) to decompose the signal into multiple sub-series with different frequency characteristics. Various decomposition techniques have been studied for wind time series, showing the potential benefits of introducing some signal decomposition into the forecasting models, such as complete ensemble empirical mode decomposition [6], variational mode decomposition [7] and wavelet packet decomposition [8,9]. Support vector regressor (SVR) and K-nearest neighbour (KNN) algorithms have also been popular within wind forecasting [10,11,12]. The KNN algorithm is based on finding similar points in the available data and can be fast in both training and testing. SVRs have been shown to yield very good forecasting performance [11], but do not scale well for larger datasets, resulting in longer computation times [10].
In this study, we focus on neural network architectures, which lie at the heart of modern ML and have become increasingly popular for wind speed forecasting, due to their ability to model non-linear relationships. Multi-Layer Perceptrons (MLP) have been successfully used both in isolation [13,14] and in combination with other methods [15,16]. Recurrent (RNN) and convolutional neural networks (CNN) represent the quintessential DL architectures for sequence modelling and are typically favoured over MLPs for wind forecasting [17]. The long short-term memory (LSTM) unit is an alteration of the vanilla RNN architecture, where gating mechanisms and skip connections are introduced to mitigate the problem of vanishing or exploding gradients [18]. Li et al. [19] proposed a hybrid architecture, using empirical wavelet transform and the regularised extreme learning machine, together with an LSTM as the main predictor. Due to the recurrent architecture, the LSTM network relies on encoding all the relevant input information into a fixed-length memory cell, which can cause information loss and limit the network's ability to retain information across longer sequences. Within the context of natural language processing (NLP), the attention mechanism was introduced by Bahdanau et al. [20], to allow the networks to directly attend to previous hidden states according to their importance. Many studies have focused on integrating attention mechanisms with LSTMs to further help the models learn long-term dependencies that can improve forecasting performance [21,22].
Oord et al. [23] proposed the WaveNet architecture for generating raw audio waveforms. The main ingredient of WaveNet is dilated causal convolution, which is a 1D convolutional operation where the causality ensures that the model cannot violate the sequence ordering, while the dilation increases the receptive field by skipping input values with a certain step. Dilated causal convolution is well suited for time series modelling and has been successfully deployed for some wind forecasting studies [24,25]. Other popular DL-based architectures used for wind forecasting also include deep belief networks [26], RNNs with Gated Recurrent Units [27] and the N-BEATS model [28].
Building on the attention mechanism, the Transformer was proposed by Vaswani et al. [29] as a new sequence transduction model, particularly focused on NLP. The Transformer is fundamentally different from previous models in that it does not rely on recurrence or convolution, making it better at learning long-term dependencies. However, since the complexity scales quadratically with sequence length, various alterations of the original architecture have been proposed to alleviate the computational limitations and make the models better suited for time series forecasting, such as the Longformer [30], FEDformer [31], Temporal Fusion Transformer [32], LogSparse Transformer [33], Informer [34], Reformer [35] and Autoformer [36]. Both [37] and [38] managed to outperform LSTM models for wind forecasting by using a Transformer, comprised of an encoder and decoder, as in [29]. A bidirectional LSTM-Transformer model achieved superior results compared to a gated recurrent unit (GRU) and an LSTM model in [39]. Wang et al. [40] proposed a model based on the Informer together with convolutional layers that extract temporal features at different frequencies, to forecast the average wind power over the next three hours. The Spatial-Temporal Graph Transformer Network (STGTN) extends the previous research by leveraging both spatial and temporal correlations within a wind farm, to more accurately forecast wind speeds at a turbine level, 10 minutes to 1 hour ahead [41]. Despite some efforts at employing different Transformer-based architectures for wind forecasting, namely the vanilla Transformer and Informer, the research has nevertheless been relatively scarce. In this study, we therefore aim to further research the performance of different Transformer architectures, investigating the Informer, Autoformer and LogSparse Transformer, which are yet to be thoroughly tested for wind forecasting.
Spatial dependencies can be important for improving meteorological forecasts, such as for wind. Considering the 14 measurement stations depicted in Fig. 1 used for this study, spatio-temporal forecasting aims to leverage spatial dependencies between different stations, through the physical distance between them, to improve the local forecasts for the different sites. The spatio-temporal interdependence of different physical locations can be particularly important for meteorological forecasts since physical properties, such as wind fields, are non-stationary in both space and time, meaning that historical time series for different physical locations should be jointly considered to better learn global trends and propagation in both space and time. Some studies organise spatial data, i.e. the physical location of different measurement points, into an ordered grid, where the features for a particular location are assigned to a specific cell [42,43,44]. A CNN is then used to extract spatial features, while another network, such as a CNN [42] or LSTM [43], is used to learn temporal correlations. However, considering the complex topology of the different measurement stations in Fig. 1, the strict ordering of the input data required for CNNs might not be able to effectively represent the underlying spatial relationships. Graph neural networks (GNN) can better facilitate arbitrary spatial ordering and have therefore been widely used for spatio-temporal forecasting. Khodayar et al. [45] made forecasts at different wind sites simultaneously, where each site was represented by a node in an undirected graph, using a Graph Convolutional Network (GCN) and an LSTM to extract spatial and temporal features, respectively. Similarly, Stańczyk et al. [46] also used a GCN, but with a CNN to learn temporal dependencies. Instead of using static edge features, Wang et al. [47] constructed edge features based on the time-varying spatial correlation between wind sites and used a GCN for wind farm cluster power forecasting. Finally, the M2STAN model was also proposed for spatio-temporal multi-step wind power forecasting [48], which employs a Graph Attention Network (GAT) and a Bidirectional GRU for spatial and temporal correlation modelling, respectively, along with a Transformer network for multi-modal feature fusion. This study will further build on the ideas of using GNNs for spatio-temporal forecasting, but also focus on recent advancements in Transformer-based architectures for time series analysis. Furthermore, the FFTransformer is also proposed, which is a novel Transformer architecture for time series forecasting, based on signal decomposition and learning trend and periodic components separately. Finally, even though many studies focus on single-step forecasts, predicting average wind speeds over some future interval, we also aim to improve predictions by considering multi-step forecasting, producing higher-resolution time series for the forecasting windows.

Multi-Layer Perceptron
MLPs are feed-forward networks that learn weights, θ, which map the input to output, y ≈ f(x|θ). A chain structure, where multiple layers are stacked, gives depth to the model, e.g. y ≈ f^(3)(f^(2)(f^(1)(x))) for a model with two hidden layers. To improve the learning ability of the model, non-linearities, such as the ReLU or sigmoid functions, are applied to the neuron outputs. Optimal weights are determined by minimising a differentiable loss function using backpropagation, which updates the network weights by propagating gradients of the loss with respect to the weights back through the network [49].
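To make this concrete, the following is a minimal PyTorch sketch of such a feed-forward network and one backpropagation step (PyTorch is the framework used for the experiments in this study); the layer sizes, activation and learning rate are purely illustrative and not the tuned values from Table 1.

```python
import torch
import torch.nn as nn

class MLP(nn.Module):
    """Feed-forward network y ~= f(x | theta) with two hidden layers."""
    def __init__(self, d_in: int, d_hidden: int, d_out: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_in, d_hidden), nn.ReLU(),      # first hidden layer + non-linearity
            nn.Linear(d_hidden, d_hidden), nn.ReLU(),  # second hidden layer
            nn.Linear(d_hidden, d_out),                # linear output layer
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

# Weights are found by minimising a differentiable loss with backpropagation:
model = MLP(d_in=8, d_hidden=64, d_out=1)
optimiser = torch.optim.Adam(model.parameters(), lr=1e-3)
x, y = torch.randn(32, 8), torch.randn(32, 1)          # dummy batch
loss = nn.functional.mse_loss(model(x), y)
loss.backward()                                        # gradients of the loss w.r.t. the weights
optimiser.step()
```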

LSTM
Even though sequence analysis using DL has largely been dominated by RNNs, the original architecture is prone to exploding or vanishing gradients, resulting in significant information loss for longer sequences. The long short-term memory (LSTM) unit introduces gating mechanisms and skip connections to alleviate some of these shortcomings [18]. An illustration showing the internal workings of an LSTM unit is given in Fig. 2, where an input gate controls the contribution of the new input, x t , to the memory, while the forget and output gates control what information to be kept in memory and encoded in the output, h t , respectively.

Transformer Architectures
The Transformer architecture was proposed as a model for sequence analysis, without any recurrence or convolution, but instead based on the attention mechanism [29]. If we consider a time series forecasting task with d input features to predict a single output feature, the Transformer model should take an input, x^(e) ∈ R^(S×d), and produce an output, y ∈ R^(P×1), where S is the look-back window and P the forecasting horizon. The original architecture consists of an encoder and a decoder, as shown in Fig. 4, where the encoder should encode a longer input into a hidden state representation for the decoder to decode. Inputs to the decoder, x^(d) ∈ R^((L+P)×d), are typically shorter than those to the encoder, containing the last L elements of the encoder inputs, where L < S, and placeholders are used for the last P indices, which correspond to the forecasting locations.
Inputs are first linearly transformed to E-dimensional space and then added with some positional encoding, so that the model can make use of the sequence order without any recurrence. Unless otherwise stated, it will be assumed that the positional encoding added to the encoder and decoder inputs is the sine-cosine positional encoding proposed in the original architecture [29]. The multi-head attention (MHA) block in the encoder employs full self-attention, where each attention operation can attend to the full input sequence. Scaled dot-product attention takes Q, K and V as inputs, which represent the queries, keys, and values, respectively. Dot products between keys and queries are computed, before being passed to a softmax function to obtain the attention weights, which are then multiplied by the values to produce the final outputs. This process can be summarised as

Attention(Q, K, V) = softmax( (Q W^(q)) (K W^(k))^T / √d_k ) V W^(v),

where W^(·) ∈ R^(E×d_(·)) are weights of the different linear transformations (LT), E is the hidden dimensionality and d_k the dimension of keys and queries. Performing multi-head attention, with h separate projections, allows the module to simultaneously attend to different information in the input series. Outputs from the MHA are then concatenated and linearly projected to produce a single output.
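A simplified sketch of scaled dot-product attention and the multi-head projections is given below; it follows the standard formulation in [29], with illustrative dimensions rather than those used in this study.

```python
import math
import torch
import torch.nn as nn

def scaled_dot_product_attention(q, k, v):
    """softmax(Q K^T / sqrt(d_k)) V for inputs of shape (..., seq_len, d_k)."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)   # (..., len_q, len_k)
    weights = torch.softmax(scores, dim=-1)              # attention weights
    return weights @ v                                    # weighted sum of values

class MultiHeadAttention(nn.Module):
    """h parallel attention heads, each with its own projections W^(q), W^(k), W^(v)."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)            # projection after concatenation

    def forward(self, q, k, v):
        B = q.size(0)
        def split(x):  # (B, L, d_model) -> (B, n_heads, L, d_head)
            return x.view(B, -1, self.n_heads, self.d_head).transpose(1, 2)
        out = scaled_dot_product_attention(split(self.w_q(q)), split(self.w_k(k)), split(self.w_v(v)))
        out = out.transpose(1, 2).contiguous().view(B, -1, self.n_heads * self.d_head)
        return self.w_o(out)
```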
A residual connection and layer normalisation are applied to the outputs before the signal is passed to an MLP, typically with a single hidden layer, which is applied to each position in the series separately. Multiple encoder layers, with different weights, are stacked to add depth to the model and hence a stronger function approximation ability.
The decoder follows almost the same architecture as the encoder, but with an additional MHA block, referred to as the cross-attention, which precedes the MLP. The cross-attention module is the same as the other MHA blocks but takes the encoder outputs as inputs to the keys and values. Additionally, the first MHA block in the decoder uses masking to prevent information flow from subsequent positions.
Even though the attention mechanism alleviates the problem of information loss for longer sequences, by allowing every position to directly attend to all other positions, its complexity scales quadratically with sequence length. Furthermore, since the Transformer was originally developed for machine translation and other NLP tasks, various modifications have emerged, which aim to both combat the complexity limitations and further adapt the architecture to facilitate time series forecasting problems. A short summary of some of these will now be given in the subsequent sections.

LogSparse Transformer [33]
The LogSparse Transformer introduces two novel alterations. Sparse attention, where each position is only allowed to attend to other positions with an exponential step size and itself, significantly reduces the space complexity.
Since the point-wise attention operation described in Sec. 3.3 is insensitive to local context, causal 1D-convolution was used to compute keys and queries, instead of pointwise linear transformations. The modified transformation of keys and queries, which will here be referred to as convolutional attention, might be particularly advantageous for time series forecasting, as local context could be very important for signals characterised by high-frequency fluctuations or noise.
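The two ideas can be sketched as follows: a log-sparse attention mask, where each position attends to itself and to positions at exponentially increasing distances, and a causal 1D convolution used to compute queries and keys. This is a simplified illustration of the mechanisms described in [33], not the authors' exact implementation.

```python
import torch
import torch.nn as nn

def logsparse_mask(seq_len: int) -> torch.Tensor:
    """Boolean mask where position i may attend to itself and to positions at
    exponentially increasing distances (i-1, i-2, i-4, i-8, ...)."""
    mask = torch.zeros(seq_len, seq_len, dtype=torch.bool)
    for i in range(seq_len):
        mask[i, i] = True
        step = 1
        while i - step >= 0:
            mask[i, i - step] = True
            step *= 2
    return mask  # True = allowed, False = masked out

class CausalConvQK(nn.Module):
    """Causal 1D convolution used in place of point-wise linear maps for queries
    and keys, so that each query/key is aware of its local context."""
    def __init__(self, d_model: int, kernel_size: int = 3):
        super().__init__()
        self.pad = kernel_size - 1                       # left padding only -> causal
        self.conv = nn.Conv1d(d_model, d_model, kernel_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, L, d_model)
        x = nn.functional.pad(x.transpose(1, 2), (self.pad, 0))
        return self.conv(x).transpose(1, 2)
```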

Informer [34]
Instead of introducing sparsity through a fixed pattern decided by heuristic methods, the ProbSparse self-attention mechanism proposed for the Informer introduces sparsity by locating the most dominant queries and only computing full attention for these. Dominant queries are taken as those that maximise a surrogate for the KL-divergence between a uniform distribution and the query's attention probability distribution. Furthermore, self-attention distilling in the encoder highlights dominant attention by halving cascading layer inputs through 1D-convolution and MaxPooling, which makes the model much more efficient for very long sequences.
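A simplified sketch of the query-selection step is given below, using the max-mean sparsity measurement from [34] as the surrogate for the KL-divergence; for brevity, the scores are computed against all keys, whereas the original implementation first samples a subset of keys.

```python
import math
import torch

def probsparse_attention(q, k, v, top_u: int):
    """Compute full attention only for the top-u 'dominant' queries; remaining
    outputs are filled with the mean of the values (the Informer default).
    Shapes: q (B, L_q, d), k (B, L_k, d), v (B, L_k, d)."""
    B, L_q, d = q.shape
    scores = q @ k.transpose(-2, -1) / math.sqrt(d)            # (B, L_q, L_k)
    # Sparsity measurement M(q_i, K): max over keys minus mean over keys,
    # a surrogate for the KL-divergence to the uniform distribution.
    m = scores.max(dim=-1).values - scores.mean(dim=-1)        # (B, L_q)
    top_idx = m.topk(top_u, dim=-1).indices                    # dominant query positions

    out = v.mean(dim=1, keepdim=True).expand(B, L_q, v.size(-1)).clone()  # default: mean of values
    for b in range(B):                                          # simple loop kept for clarity
        sel = top_idx[b]
        attn = torch.softmax(scores[b, sel], dim=-1)            # (top_u, L_k)
        out[b, sel] = attn @ v[b]
    return out
```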

Autoformer [36]
Unlike the Informer and LogSparse Transformer, Wu et al. [36] proposed significant alterations to both the overall Transformer architecture and the attention module, to better facilitate time series forecasting. Instead of the scaled dot-product attention, the Autoformer introduces the Auto-Correlation mechanism, which uses keys and queries to decide on the most important time-delay similarities through autocorrelation, and a time-delay aggregation, which rolls the series according to the selected time delays before adding them together to produce the outputs. Different to point-wise attention, the Auto-Correlation mechanism finds dependencies based on periodicity and is specifically designed for time series forecasting. Series decomposition is applied after every Auto-Correlation and MLP module, using average pooling to decompose a signal, X, into trend and seasonal components, X_t and X_s, respectively:

X_t = AvgPool(Padding(X)),    X_s = X − X_t.

Figure 4: The vanilla Transformer [29] in an encoder-decoder setting, adapted to facilitate forecasting rather than categorical outputs, with multivariate inputs and univariate outputs. B in the inputs and outputs refers to the batch size; for a single prediction, B = 1.
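A minimal PyTorch sketch of this series decomposition is given below, padding the series ends so that the moving average keeps the original length; the kernel size is illustrative.

```python
import torch
import torch.nn as nn

class SeriesDecomposition(nn.Module):
    """Split a series X into a trend part X_t = AvgPool(Padding(X)) and a
    seasonal part X_s = X - X_t, as in the Autoformer."""
    def __init__(self, kernel_size: int = 25):
        super().__init__()
        self.kernel_size = kernel_size
        self.avg = nn.AvgPool1d(kernel_size, stride=1)

    def forward(self, x: torch.Tensor):
        # x: (batch, seq_len, features). Pad by repeating the end points so the
        # moving average keeps the original sequence length.
        front = x[:, :1, :].repeat(1, (self.kernel_size - 1) // 2, 1)
        back = x[:, -1:, :].repeat(1, self.kernel_size // 2, 1)
        padded = torch.cat([front, x, back], dim=1)
        trend = self.avg(padded.transpose(1, 2)).transpose(1, 2)
        seasonal = x - trend
        return trend, seasonal

trend, seasonal = SeriesDecomposition(25)(torch.randn(2, 96, 8))
```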

Graph Neural Networks
A graph can be defined as a tuple with node- and edge-specific features, G = (V, E) [50]. For the spatial forecasting problem, node features, v_i ∈ V, can represent attributes associated with a particular measurement station, while edge features, e_ij ∈ E, contain information on the relationship between two nodes, i and j, such as the Euclidean distance between two stations. A GNN is comprised of stacked graph blocks, which perform per-edge and per-node updates in the following order:

e′_ij = φ^(e)(e_ij, v_i, v_j),        v′_j = φ^(v)(ē_j, v_j),

where φ^(·) are the update functions, which could be represented by a neural network such as an LSTM, MLP or Transformer. (·)′ and (·)̄ represent updated and aggregated features, respectively. Aggregated edge features, ē_j, are computed using an aggregation function, ρ^(e→v), as

ē_j = ρ^(e→v)({e′_ij : i ∈ R_j}),

which could for example be a sum or mean operation. R_j is an index set containing all nodes sending to node j, and defines the connectivity of the graph. A visualisation of the GNN architecture is given in Fig. 3.
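The following sketch illustrates a single graph block with mean aggregation, using small MLPs as the update functions φ^(e) and φ^(v) for brevity (whereas this study uses LSTM-, MLP- or Transformer-based update functions operating on time series).

```python
import torch
import torch.nn as nn

class GraphBlock(nn.Module):
    """One GNN block: e'_ij = phi_e(e_ij, v_i, v_j); e_bar_j = mean over senders;
    v'_j = phi_v(e_bar_j, v_j)."""
    def __init__(self, d_node: int, d_edge: int, d_hidden: int):
        super().__init__()
        self.phi_e = nn.Sequential(nn.Linear(d_edge + 2 * d_node, d_hidden), nn.ReLU(),
                                   nn.Linear(d_hidden, d_edge))
        self.phi_v = nn.Sequential(nn.Linear(d_edge + d_node, d_hidden), nn.ReLU(),
                                   nn.Linear(d_hidden, d_node))

    def forward(self, v, e, senders, receivers):
        # v: (N, d_node) node features, e: (M, d_edge) edge features,
        # senders/receivers: (M,) index tensors defining graph connectivity.
        e_new = self.phi_e(torch.cat([e, v[senders], v[receivers]], dim=-1))   # per-edge update
        e_bar = torch.zeros(v.size(0), e_new.size(-1)).index_add(0, receivers, e_new)
        counts = torch.zeros(v.size(0)).index_add(0, receivers, torch.ones(len(receivers)))
        e_bar = e_bar / counts.clamp(min=1).unsqueeze(-1)                      # mean aggregation
        v_new = self.phi_v(torch.cat([e_bar, v], dim=-1))                      # per-node update
        return v_new, e_new
```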

Multilevel Wavelet Decomposition
Discrete wavelet decomposition (DWD) is a method which can be used to decompose a time series signal into its trend and periodic components, referred to as the approximate and detail signals, respectively, using low- and high-pass filters. An input signal x = x_1, x_2, ..., x_S is decomposed by convolving a low-pass filter l = l_1, l_2, ..., l_K and a high-pass filter h = h_1, h_2, ..., h_K, where K ≪ S, over the input:

A1_i = Σ_{k=1}^{K} l_k · x_{2i−k+1},        (7)
D1_i = Σ_{k=1}^{K} h_k · x_{2i−k+1},        (8)

where A1_i and D1_i are the i-th elements of the approximate and detail components obtained from a single-level discrete wavelet decomposition. DWD can also be stacked in multiple layers, referred to as multilevel DWD (MDWD), to extract multiple periodic components of different frequency characteristics. In a multi-layer setting, the approximate component from one layer is fed as input to the next, resulting in multiple detail and a single approximate signal, as shown in Fig. 5d, where 'D1', 'D2', 'D3' and 'D4' are the detail components and 'A4' the approximation. The Daubechies wavelet, Db4, for the low- and high-pass coefficients, l and h in eq. (7) and (8), has been shown to be suitable for wind forecasting [51]. The decomposition of a wind speed time series is shown in Fig. 5b, where inverse MDWD is applied independently to the outputs from the MDWD. By applying FFT to each time series in Fig. 5b, it can be seen in Fig. 5c that the detail (D1, D2, D3, D4) and approximate (A4) components yield very different frequency characteristics. The four detail components exhibited clear periodic information, with peaks at different frequencies, while the approximation, 'A4', contained most of the low-frequency trend information, clearly visualised in Fig. 5a. As a result, we argue that MDWD is a suitable decomposition method for wind speed time series, where the special filters and multiple layers potentially allow us to extract more informative trend and periodic information than simpler decomposition methods, such as series decomposition.
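In practice, such a decomposition can be computed with an off-the-shelf wavelet library. The sketch below uses the PyWavelets package (an assumption for illustration, not necessarily the implementation used in this study) to perform a four-level Db4 decomposition of a 1D signal and reconstruct the A4 and D1-D4 components independently, as in Fig. 5b.

```python
import numpy as np
import pywt  # PyWavelets

def mdwd_components(x: np.ndarray, wavelet: str = "db4", level: int = 4):
    """Four-level multilevel discrete wavelet decomposition of a 1D signal x,
    returning the reconstructed approximation (A4) and detail (D4..D1) series,
    each with the same length as x."""
    coeffs = pywt.wavedec(x, wavelet, level=level)       # [cA4, cD4, cD3, cD2, cD1]
    components = {}
    for i, name in enumerate(["A4", "D4", "D3", "D2", "D1"]):
        # Keep one set of coefficients, zero the rest, then apply inverse MDWD.
        kept = [c if j == i else np.zeros_like(c) for j, c in enumerate(coeffs)]
        components[name] = pywt.waverec(kept, wavelet)[: len(x)]
    return components

# Example: decompose a noisy sinusoid standing in for a wind speed series.
t = np.arange(1024)
x = np.sin(2 * np.pi * t / 144) + 0.3 * np.random.randn(len(t))
parts = mdwd_components(x)
assert np.allclose(sum(parts.values()), x, atol=1e-6)     # components sum back to the signal
```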

Dataset
Due to the future potential for offshore wind energy [52], this study decided to focus on offshore wind speed forecasting, using meteorological measurements recorded on the Norwegian continental shelf. The data was made available by the Norwegian Meteorological Institute and is openly available through the Frost API (https://frost.met.no/index.html). Ten-minute averaged measurements of air temperature, air pressure, dew point, relative humidity, wind direction and speed, and maximum wind speed in the 10-minute interval were used as input features to forecast the wind speed time series. Wind direction was decomposed into its sine and cosine components in order to fully capture its circular characteristics. For every time-step, we would therefore have a feature matrix, f_t ∈ R^(N×8), corresponding to the eight recorded measurements for the N available stations. The forecasting problem then becomes

V̂_(t+1), V̂_(t+2), ..., V̂_(t+P) = F(f_(t−S+1), ..., f_(t−1), f_t),

where F is the model, P the number of future time-steps to forecast, V̂_(t+j) ∈ R^(N×1) the predicted wind speeds at time (t + j) ∀ j ∈ {1, 2, ..., P}, and S the look-back window, i.e. the number of previous time-steps used to make the forecasts. For the period between June 23, 2015, and February 28, 2022, 14 out of 25 available stations had some periods with measurements on all the relevant meteorological variables, and are shown in Fig. 1.
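As an illustration of the input construction, the sketch below assembles the eight features for one time-step, with wind direction split into its sine and cosine components; the column names are hypothetical placeholders for the corresponding Frost API fields.

```python
import numpy as np
import pandas as pd

def build_feature_matrix(df: pd.DataFrame) -> np.ndarray:
    """Build f_t in R^{N x 8} for one time-step from per-station measurements.
    Column names below are hypothetical placeholders for the Frost API fields."""
    wd = np.deg2rad(df["wind_direction"].to_numpy())
    features = np.column_stack([
        df["air_temperature"].to_numpy(),
        df["air_pressure"].to_numpy(),
        df["dew_point"].to_numpy(),
        df["relative_humidity"].to_numpy(),
        np.sin(wd),                          # circular wind direction ->
        np.cos(wd),                          # sine and cosine components
        df["wind_speed"].to_numpy(),
        df["max_wind_speed"].to_numpy(),
    ])
    return features                          # shape (N_stations, 8)
```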
The first 60% of the data was used for training, whereas the following 20% and the remainder were used for validation and testing, respectively. If measurements for a single time-step were missing for a particular station, linear interpolation was used to fill the missing entries. However, if there were consecutive time-steps missing, the station would not be considered for these periods. For different periods, there would be a variable number of stations that had available data, meaning that the models should be able to take any subset of the 14 stations as input.

Fast Fourier Transformer
Since many time series, such as wind measurements, are characterised by both trend and periodic components, we propose the Fast Fourier Transformer (FFTransformer), based on signal decomposition and an adapted attention mechanism, named FFT-Attention. Given the different frequency characteristics for wind speed time series discussed in Sec. 3.5, the FFTransformer is comprised of two streams, one which analyses signals with clear periodicity and another that should learn trend components, i.e. the detail and approximate signals in Fig. 5b, respectively. Here, we used MDWD to decompose the input signals into their periodic and trend components, due to its capability to clearly extract different-level frequencies using multiple levels of low- and high-pass filters. Nevertheless, the architecture discussed is not limited to MDWD, and other methods for signal decomposition could be used, such as the 'Series Decomposition' in the Autoformer.
As shown in Fig. 6, the right-hand stream in both the encoder and decoder is exactly the same as for the original Transformer in Fig. 4. However, to better facilitate the analysis of periodic signals in the left-hand stream, we introduce the FFT-Attention mechanism, which performs attention in the frequency domain. In particular, as the first operation of FFT-Attention, FFT is applied to the key, query and value inputs. Real and imaginary components of the FFT outputs are concatenated with the frequency values, to provide information on the corresponding frequency for the values in a particular position, similar to the motivation behind positional encoding, before being fed to an MHA block. Outputs from the FFT-Attention module are then concatenated and projected, as for the other attention mechanisms, and finally, an inverse FFT transforms the values back to the time domain. Any attention mechanism could be used in place of all the MHA blocks in Fig. 6, such as the ProbSparse or convolutional attention. The final network outputs from both streams (FFT Attention and Trend) are then added and linearly transformed to produce the predictions.
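A simplified sketch of the FFT-Attention idea is given below: the inputs are transformed with a real FFT, the real and imaginary parts are concatenated with the frequency values and embedded, standard multi-head attention is applied in the frequency domain, and an inverse FFT maps the result back to the time domain. The plain MHA block and the embedding dimensions are simplifications of the actual module in Fig. 6.

```python
import torch
import torch.nn as nn

class FFTAttention(nn.Module):
    """Attention in the frequency domain: rFFT -> concat(real, imag, freq) ->
    multi-head attention -> project -> inverse rFFT back to the time domain."""
    def __init__(self, d_model: int, n_heads: int = 4):
        super().__init__()
        self.embed = nn.Linear(2 * d_model + 1, d_model)    # real + imag + frequency value
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.proj = nn.Linear(d_model, 2 * d_model)          # back to (real, imag) pairs

    def forward(self, q, k, v, seq_len: int):
        def to_freq(x):                                      # x: (B, L, d_model)
            xf = torch.fft.rfft(x, dim=1)                    # (B, L//2 + 1, d_model), complex
            freqs = torch.fft.rfftfreq(x.size(1), device=x.device)
            freqs = freqs.view(1, -1, 1).expand(x.size(0), -1, -1)
            return self.embed(torch.cat([xf.real, xf.imag, freqs], dim=-1))
        out, _ = self.attn(to_freq(q), to_freq(k), to_freq(v))
        real, imag = self.proj(out).chunk(2, dim=-1)
        return torch.fft.irfft(torch.complex(real, imag), n=seq_len, dim=1)

y = FFTAttention(d_model=32)(torch.randn(2, 64, 32), torch.randn(2, 64, 32),
                             torch.randn(2, 64, 32), seq_len=64)
```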
After some experimentation, it was found that not adding temporal or positional encoding to the detail components (i.e. inputs to the FFT-Attention) yielded the best results. This seemed sensible, as adding positional encoding in the time domain would not translate to the frequency domain. Instead, the concatenation of frequency values to the FFT outputs, served somewhat the same purpose as the positional encoding did for the trend components. Nevertheless, since this method was fairly trivial, further research studying better encoding strategies for the frequency domain could be desirable.

Spatio-Temporal Framework
All models were constructed in an encoder-only setting, i.e. without decoders, and used as update functions, φ^(·), in two-layer GNNs. To facilitate the GNN framework, the dataset was constructed as graphs. Measurement stations were represented by nodes, with the historical time series of the eight meteorological variables assigned to the input node features. Since there would be a variable number of available measurement stations at different times, the input graphs would not be the same at different times, but would have a variable number of nodes. This was desirable, as it meant that we did not have to discard the data for all stations or interpolate missing values for a particular time interval where a few stations had missing data, as would be the typical case for a CNN. Considering a node, i, its input features were v_i ∈ R^(1×(S+P)×8), where S is the historical look-back window, i.e. the number of previous time-steps used to predict the next P time-steps. Placeholders were used for the last P indices of the node input features to the first graph block, since we do not have information on recorded values in the future. Zero-values were used as placeholders for the seasonal inputs to the Autoformer, as well as for the detail components in the FFTransformer model. It should be noted that placeholders were added subsequent to the 'MDWD' and prior to the 'Embed + Encoding' block in Fig. 6, meaning that inputs to the MDWD were v_i ∈ R^(1×S×8). Differences in latitude and longitude between stations were assigned to the input edge features as e_ij = [(lon_j − lon_i), (lat_j − lat_i)], for two stations, i and j. A pseudocode summarising the spatio-temporal forecasting framework is given in Algorithm 1.
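The sketch below illustrates how one input graph could be assembled for a given window: node features are the S-step histories padded with placeholders for the P forecast steps (zeros are used here purely for illustration), and edge features are the longitude and latitude differences for a complete graph, including self-loops.

```python
import numpy as np

def build_graph(history: np.ndarray, lon: np.ndarray, lat: np.ndarray, P: int):
    """history: (N, S, 8) past features for the N stations available in this window;
    lon/lat: (N,) station coordinates. Returns node features (N, S+P, 8) with
    placeholders for the P forecast steps, edge features (N*N, 2) and sender/
    receiver index arrays for a complete graph (including self-loops)."""
    N, S, d = history.shape
    nodes = np.concatenate([history, np.zeros((N, P, d))], axis=1)   # placeholder forecast steps
    senders, receivers = np.meshgrid(np.arange(N), np.arange(N), indexing="ij")
    senders, receivers = senders.ravel(), receivers.ravel()
    edges = np.stack([lon[receivers] - lon[senders],
                      lat[receivers] - lat[senders]], axis=-1)       # e_ij = (lon_j - lon_i, lat_j - lat_i)
    return nodes, edges, senders, receivers
```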

Experimental Set-up
All features were scaled separately using a standard scaler, with zero mean and unit variance. Three forecasting horizons were considered for the multi-step forecasting problem, namely 1-, 6- and 24-step forecasts, corresponding to 10-minute, 1-hour and 4-hour periods. Since the study considered multi-step forecasting, a prediction was made for every 10-minute interval also in the 1- and 4-hour ahead settings, instead of simply forecasting the average wind speed over the respective periods. Models were trained to minimise the mean squared error (MSE) and hyperparameter tuning was conducted in Optuna [53]. All experiments were conducted in PyTorch on a single Nvidia RTX 2070 GPU. Every model was trained for 30 epochs using an Adam optimiser. Both look-back windows and model-specific parameters were treated as hyperparameters and decided based on the tuning procedure and computational constraints, with final model parameters given in Table 1. From the tuning, it was found that a look-back window of 32 time-steps was suitable for the 10-minute and 1-hour ahead forecasts, increased to 64 for the 4-hour forecasts, which seemed reasonable, as longer contexts might be useful to better predict trends further into the future. Some explanation of the different parameters is given in Table 1, though the authors refer the interested reader to the provided references in the 'Explanation' column for more detailed descriptions of model-specific parameters.
In Table 1, d are the model dimensionalities, while d_hidden and 'Activation' refer to the dimensionality and activation function for hidden layers in the ST-MLP model and the MLP modules in Transformer-based models. All Transformer-based models employed the same architecture for the edge-update functions, φ^(e), namely a single vanilla Transformer encoder layer [29].
For the ST-MLP and ST-LSTM models, both autoregressive and vector-output models were studied, where different training strategies, such as mixed teacher forcing, were explored for the autoregressive architectures. However, vector-based architectures were found to yield the best results for both the ST-MLP and ST-LSTM models, where the multiple forecasting steps are mapped directly using a linear projection with P outputs corresponding to the forecast locations, as

V̂ = LT(h_t),    h_t = f(X_(t−S+1), ..., X_t | θ),

where h_t and X_t are the last hidden features and the input features at time, t, respectively. LT represents the linear projection that projects the last hidden features, h_t ∈ R^(N×d), to the P number of predicted outputs, V̂ ∈ R^(N×P), as LT : R^d → R^P, where d is the hidden dimensionality of the model.
The ST-LSTM model was quite sensitive to model parameters, explaining the different dimensionalities, d, in Table 1, used for this model. Instead of a simple linear projection, it was found beneficial to use a two-layer MLP as the final head for the ST-LSTM model, with parameters given by d, d_hidden and 'Activation' in Table 1.
Considering the ST-FFTransformer, the MHA modules in the trend (right-hand) stream in Fig. 6 employed convolutional ProbSparse attention, but different to the original ProbSparse Attention, which focuses on dominant queries, it was found marginally beneficial to instead locate dominant keys. Convolutional ProbSparse Attention was also used for the MHA module in the FFT-Attention, but finding top queries, as in the original formulation. For non-dominant query locations, the original ProbSparse Attention formulation assigns mean values to the outputs [34]. However, when using the ProbSparse Attention for the FFT-Attention, we instead set the outputs for non-dominant locations to zero to introduce sparsity and only select a subset of the possible frequencies. This was thought desirable, so as to avoid outputs with a very large number of frequencies of small or similar amplitudes, which could be considered noisy.
Considering the computational time for each model, it can be seen from the last row in Table 1, that the ST-FFTransformer model was slower than the other models. This was mainly due to the MDWD layer, which was not optimised to run in parallel over the different input features. Furthermore, both the ST-Autoformer and ST-FFTransformer models rely on computing FFTs, which are fairly slow and explain why these took longer to compute forecasts. However, since the time to compute any forecasts was much smaller than one second, and considerably less than the 10-minute sampling rate of the data, this was not considered a significant problem for this study and is therefore left for future work where computational speed might be more important.
The persistence model was used as a benchmark against which to compare all the other models. For the spatio-temporal setting, the persistence model uses the last recorded wind speed for a station as the forecast over the entire horizon, i.e. V̂_(t+1) = V̂_(t+2) = ... = V̂_(t+P) = V_t, using the notation in Sec. 4.1. Even though this is quite a trivial method for making forecasts, the model can achieve fairly accurate results in the short term and is therefore used as an important baseline to outperform.
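A minimal implementation of this baseline is given below.

```python
import numpy as np

def persistence_forecast(wind_speed_history: np.ndarray, P: int) -> np.ndarray:
    """Persistence baseline: repeat the last recorded wind speed of every station
    over the whole forecasting horizon.
    wind_speed_history: (N_stations, S) -> forecasts: (N_stations, P)."""
    last = wind_speed_history[:, -1:]          # V_t for every station, shape (N, 1)
    return np.repeat(last, P, axis=1)          # V_hat_(t+1) = ... = V_hat_(t+P) = V_t
```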

Forecasting Error
To evaluate the predictive performance of the different models, we start by comparing the mean absolute (MAE) and squared (MSE) errors, given by the following equations:

MAE = (1/n) Σ_{i=1}^{n} |y_i − ŷ_i|,        MSE = (1/n) Σ_{i=1}^{n} (y_i − ŷ_i)²,

where n is the total number of samples and ŷ the model predictions, which should be close to the targets, y. For each forecasting horizon, every model was trained five times, with the average errors on the test set given in Table 2. Before computing the metrics, the predictions and labels were transformed back to a meters-per-second scale for better interpretability of the results. The percentage improvement values in Table 2 are relative to the persistence model and are provided in order to highlight the relative forecasting performances. These are also visualised in Fig. 7, where the shaded regions show the variability from the five training iterations of each model as ± one standard deviation.
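A small sketch of the evaluation procedure is given below, where `mean` and `std` denote the training-set scaler statistics used to transform the standard-scaled predictions and targets back to m/s before the metrics are computed.

```python
import numpy as np

def mae(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    return float(np.mean(np.abs(y_true - y_pred)))

def mse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    return float(np.mean((y_true - y_pred) ** 2))

def evaluate(y_true_scaled, y_pred_scaled, mean: float, std: float):
    """Transform standard-scaled wind speeds back to m/s before computing the
    metrics, using the scaler statistics fitted on the training set."""
    y_true = y_true_scaled * std + mean
    y_pred = y_pred_scaled * std + mean
    return {"MAE": mae(y_true, y_pred), "MSE": mse(y_true, y_pred)}
```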
Looking at the ST-Transformer model in Table 2, it was evident that the model did not report any remarkable advancements. As seen in Fig. 7, the ST-MLP model showed significantly more variability between training iterations, compared to the other models. Since this variability was uncontrollable, it meant that one could not reliably conclude on the ST-MLP model's exact performance, which was undesirable.
The ST-LogSparse and ST-Informer performed consistently better than the ST-Transformer model across all forecasting horizons in terms of both MSE and MAE, which showed the potential improvements brought by the ProbSparse and convolutional attention mechanisms for wind forecasting. Even though the ST-LogSparse and ST-Informer models reported slightly inferior forecasting performance in terms of MAE for the single-step forecasts, compared to the persistence model, both showed approximately a five percent improvement in MSE. Since MSE penalises large errors more heavily than the MAE metric, it meant that for the single-step forecasts, the persistence model on average produced slightly smaller errors, but suffered from occasional larger errors that were penalised more heavily by the MSE. In general, all Transformer-based models attained better results than the ST-MLP and ST-LSTM models for the multi-step forecasts. The ST-Autoformer and ST-FFTransformer performed remarkably well for the 1- and 6-step forecasts, achieving more than three times the improvement reported for the third best model in terms of MSE for the 1-step forecasts, and nearly a doubling compared to the ST-MLP for the 6-step forecasts. Even though the ST-Autoformer model performed very well for the 1- and 6-step forecasts, its performance was seen to degrade for the 24-step setting, where it was inferior to all other Transformer-based models, as clearly visualised in Fig. 7.
The ST-FFTransformer on the other hand, continued to improve, achieving superior results compared to all other models also for the 4-hour forecasts. Both the ST-FFTransformer and ST-Autoformer achieved the best accuracy for three settings each, likely because of data decomposition. For the 1-and 6-step ahead forecasts, the difference between the two models was very small, while the ST-FFTransformer substantially outperformed the ST-Autoformer for the longer forecasting horizon of four hours. In the short term, wind time series fluctuates rapidly, while for longer time periods, it might become more important to learn trend components as well as oscillations. One reason for the superior performance of the FFTransformer architecture for the longer forecasting setting might be that it separately processes the frequency and trend components, using the two streams described in Sec. 4.2, with attention and feed forward modules applied to both. On the other hand, the Autoformer model is mainly focused on processing periodic components using the Auto-Correlation module and does not perform significant processing on extracted trend components after series decomposition, making it potentially struggle to learn longer-term trends that lack periodicity. Such a hypothesis was strengthened by the fact that all other models seemed to perform comparatively better for the longer forecasting setting, compared to the ST-Autoformer. Because of the two streams used in the ST-FFTransformer, it is able to perform well for both short-term forecasts, where periodic behaviour is thought more important, as well as for the four-hour forecasts, where non-oscillating trend components become increasingly important. Even though the ST-FFTransformer achieved good results across different horizons, it did not outperform the ST-Autoformer in the short-term. This was despite more advanced decomposition using MDWD, which was initially thought better at extracting trend and periodic components at different frequencies, compared to the simple moving average operation used in the Autoformer. In terms of decomposition, this could indicate that the MDWD is not necessarily superior to the 'Series Decomposition', or more likely, that the repeated use of decomposition in the Autoformer might work slightly better than the single decomposition performed on the inputs in the ST-FFTransformer. For future research it would be interesting to see how the FFTransformer architecture would perform with Auto-Correlation modules in place of the FFT Attention blocks in Fig. 6 and experimenting with some data decomposition applied to every encoder or decoder layer.
Considering the 1-step forecasts, the ST-Autoformer model exhibited considerable variability between training iterations, compared to the ST-FFTransformer, seen by the shaded regions in Fig. 7, which was undesirable. We therefore argue for the competitive forecasting performance of our proposed FFTransformer architecture, which performed consistently well across all forecasting horizons and with limited variability. To investigate the physical interpretation of the forecasting results in relation to wind energy production, two additional MAE metrics were computed and provided in Table 3, which correspond to the estimated errors in kW and kWh. Powers were estimated based on the NREL 5 MW reference wind turbine for offshore system development [54]. By transforming the true and predicted wind speeds to powers using the reference turbine's power curve and then calculating the MAEs, the results show crude estimates for the average power errors using the different models. For the first metric in Table 3, results were fairly similar to those discussed in Table 2, but arguably more interpretable, in terms of understanding the consequence of differing predictive performances and potential risks associated with the proposed models.
Instead of looking at the point-wise power difference between the true and predicted values, an operator might be primarily concerned with the total overall energy for a future time interval. For the Interval MAE, predictions were again transformed to power values, before the values associated with a particular forecast interval were summed. Finally, since each step was associated with a 10-minute interval, summed values were divided by six, to convert them to total energy estimates in kWh. Therefore, the Interval MAE provides an estimate of the difference between the predicted and true total energies produced for the relevant forecast horizon. The performance of different models was similar to previous results, but with all models showing greater improvements over the persistence model. Considering the ST-FFTransformer, it managed to reduce the error by 300 kWh for the 4-hour forecasts, corresponding to a 19% improvement over the persistence model. This was fairly notable, as it also meant an additional ≈ 5% improvement compared to most other models, and illustrated the potential cost savings and benefits of more accurate forecasting models.
Figure 9: MAEs for the different available measurement stations in the test set under the 6-step forecasting setting. The error bars are ± 2 standard deviations from the five training iterations of each model.
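The two power-based metrics above can be sketched as follows, mapping wind speeds to power by interpolating a reference power curve; the curve points are rough illustrative values for a generic 5 MW turbine and not the exact NREL 5 MW reference curve, which should be substituted in practice.

```python
import numpy as np

# Illustrative (speed m/s, power kW) points for a generic 5 MW turbine power curve.
# These are NOT the exact NREL 5 MW reference values; substitute the real curve.
CURVE_SPEED = np.array([0.0, 3.0, 5.0, 7.0, 9.0, 11.4, 25.0, 25.1, 40.0])
CURVE_POWER = np.array([0.0, 0.0, 400.0, 1500.0, 3000.0, 5000.0, 5000.0, 0.0, 0.0])

def to_power(wind_speed: np.ndarray) -> np.ndarray:
    """Map wind speed (m/s) to power (kW) by interpolating the power curve."""
    return np.interp(wind_speed, CURVE_SPEED, CURVE_POWER)

def power_mae_kw(v_true: np.ndarray, v_pred: np.ndarray) -> float:
    """Point-wise MAE between true and predicted powers, in kW."""
    return float(np.mean(np.abs(to_power(v_true) - to_power(v_pred))))

def interval_mae_kwh(v_true: np.ndarray, v_pred: np.ndarray) -> float:
    """MAE of the total energy over the forecast interval, in kWh.
    v_true, v_pred: (n_forecasts, P) with 10-minute steps -> divide the summed
    power by 6 to convert kW * (1/6 h) steps into kWh."""
    e_true = to_power(v_true).sum(axis=1) / 6.0
    e_pred = to_power(v_pred).sum(axis=1) / 6.0
    return float(np.mean(np.abs(e_true - e_pred)))
```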

Station Predictability
Since the models made forecasts for all 14 stations in Fig. 1 simultaneously, it was thought interesting to inspect the average errors for each station independently. Fig. 9 shows the average MAEs for every station under the 6-step forecasting setting, along with error bars of ±2σ, where σ is the standard deviation computed based on five separately trained models. It was seen that there was significant variability in how well the models were able to make forecasts for the different stations, with MAEs ranging from 0.53 m/s for some stations to 0.74 m/s for others. Even though there were distinct differences in the performance of different models, as discussed in the previous section, stations associated with higher or lower MAEs seemed consistent across all models. The data only contained historic information on the specific measurement stations in Fig. 1, with a fixed physical layout, meaning that spatial features would not change and that the meteorological data was likely to follow a particular distribution, specific to the area considered. As a result, a station might be inherently easier to forecast than others, due to its location relative to surrounding stations and due to the dominant wind fields for this specific geographical area potentially being preferable for forecasting at a particular location.

Graph Connectivity
For the results discussed so far, all input data was constructed as complete graphs, meaning that all nodes had edge features sending to all other nodes in the network and to themselves. This trivial method for formulating the graphs could result in very large inputs, significantly increasing the computational and memory costs. For instance, if a graph had 14 nodes, it would result in 14² = 196 edges. Some of the edges might be redundant, meaning that the relevant spatial information could be provided by a subset of the edges, in addition to excess information potentially making training more challenging. Even though this study did not focus on discussing better connectivity strategies for wind forecasting or using learnable adjacency matrices, we conduct a brief investigation into whether some of the connections could potentially be removed. Fig. 10 plots the different MAEs on the test data, as we increase the number of connections in the input graphs. 'n_closest' refers to the number of closest neighbours to which we allow a node to connect. Looking at Fig. 1, this meant that if 'n_closest' was set to three, Valhall A would only have edges sending to it from Sleipner B, Granefeltet and Oseberg Sør. The first row in Fig. 10 shows the MAEs for different 'n_closest' values, while the second row contains the same information, but normalised by MAE_0, which was the MAE when 'n_closest' = 0, i.e. the MAE for predictions made without any spatial information. While the first row provides information on the relative differences between models, the second row visualises the percentage improvements gained from including additional edges for a particular model.
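The sketch below illustrates one way the 'n_closest' connectivity could be constructed, keeping a self-loop for every node and measuring distances in the (lon, lat) plane; this is an illustrative reconstruction rather than the exact procedure used for Fig. 10.

```python
import numpy as np

def n_closest_edges(lon: np.ndarray, lat: np.ndarray, n_closest: int):
    """Return sender/receiver index arrays where every node receives edges only
    from its n_closest nearest neighbours (plus a self-loop), with distances
    measured in the (lon, lat) plane. n_closest = N - 1 recovers the complete graph."""
    N = len(lon)
    coords = np.stack([lon, lat], axis=1)
    dists = np.linalg.norm(coords[None, :, :] - coords[:, None, :], axis=-1)  # (N, N)
    senders, receivers = [], []
    for j in range(N):                       # j = receiving node
        order = np.argsort(dists[j])         # closest first; order[0] == j itself
        for i in order[: n_closest + 1]:     # self-loop + n_closest neighbours
            senders.append(i)
            receivers.append(j)
    return np.array(senders), np.array(receivers)
```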
First, MAEs were seen to rapidly decrease as we increased the number of edges, before converging to constant values when more than around five neighbours were considered for the edges. The sharp decrease indicated that the models were able to successfully leverage spatial correlations to improve forecasts, proving the effectiveness of the proposed GNN architectures. Nevertheless, since MAEs converged to constant values for non-complete graphs, it indicated that a number of connections could potentially be removed without impairing predictive performance. Further work would therefore be desirable to investigate better methods with which to construct the graphs or to learn optimal connectivity for spatio-temporal wind forecasting.
Looking at the second row of Fig. 10, it was seen that the percentage reduction in MAEs was greater for the longer forecasting horizons. For the immediate short-term (i.e. prediction length of 1), spatial correlations were thought less important than for the 1-and 4-hour forecasts, due to the large physical distances between nodes resulting in wind fields not having time to propagate in the immediate short-term. The added benefit of having long-range connections between nodes far apart, was therefore greater for the 6-and 24-step settings, than for the 1-step ahead forecasts. Comparing the 6-and 24-step forecasts, the latter also converged slightly later, which was likely due to the longer-term forecasts taking advantage of information from nodes further away from the target. The percentage change in MAEs was also greater for the 24-step setting than for the 6-step, even though the difference was not as big as between the 1-and 6-step settings, which indicated that the 24-step forecasts might have benefited from even larger geographical information than was available. Overall, it was concluded that all models were able to take advantage of spatial information in making forecasts, with the Transformer-based architectures generally showing slightly larger improvements than the ST-MLP model.

Conclusion
In recent years, Transformer-based models have presided over sequence-based deep learning, often superseding recurrent or convolutional models. Nevertheless, research employing these architectures for wind forecasting has been scarce. This study considered different Transformer architectures as the main predictor for spatio-temporal multi-step forecasting, focusing on the LogSparse Transformer, Informer and Autoformer. This is the first time many of these have been applied to wind forecasting and placed in a spatio-temporal setting using GNNs. Additionally, the novel FFTransformer was proposed, which is based on signal decomposition using wavelet transform and an adapted attention mechanism in the frequency domain. Results show that the FFTransformer architecture was very competitive, achieving results on par with the Autoformer-based model for the 1-and 6-step forecasts, while significantly outperforming all other models for the longer 24-step forecasts. Even though the vanilla Transformer architecture generally did not yield significant improvements over an MLP model, it was seen that the convolutional attention in the LogSparse Transformer and the ProbSparse Attention of the Informer, were able to slightly improve prediction performance. By estimating the associated prediction errors in kW and kWh, we showed the potential physical effects of different forecasting performances with regards to the power grid, with the FFTransformer model showing an additional 5 % improvement over all other models for the 4-hour forecasts. Nevertheless, obtaining the powers based on the NREL 5 MW reference turbine, the method was fairly trivial and it would be desirable to further test the different models on real wind power datasets. By removing graph connections in the input data, we showed that the proposed GNN architectures were successful in leveraging spatial correlations to improve local forecasts. However, it was also seen that some connections in the input data might be redundant, calling for additional research into more efficient approaches for graph connectivity in the context of wind forecasting. Since the FFTransformer model is not restricted to a particular signal decomposition technique or attention mechanism, slight alterations from the particular set-up used in this study might also be relevant and could be easily implemented and tested to facilitate different applications. We therefore hope that this study will spark further research into modifications and other applications of the FFTransformer, as well as investigation into the applicability of different Transformer-based architectures for use in wind forecasting.