Temporal collaborative attention for wind power forecasting

Wind power serves as a clean and sustainable form of energy. However, its generation is fraught with variability and uncertainty, owing to the stochastic and dynamic characteristics of wind. Accurate forecasting of wind power is indispensable for the efficient planning, operation, and integration of wind energy systems into the power grid.


Introduction
Wind power is a renewable and clean energy source that has been growing rapidly in recent years, reaching a global capacity of 934 GW in 2022 [1]. However, wind power generation is also fluctuating and intermittent, due to the stochastic and dynamic nature of wind speed and direction, as well as other meteorological factors [2]. These characteristics pose a threat to the stability of electric power systems and hinder the effective application and management of wind energy [3]. Therefore, forecasting wind power accurately is crucial for the planning, operation, and integration of wind energy systems into the power grid [4]. Wind power forecasting (WPF) is the task of estimating the amount of wind power that can be produced by one or more wind turbines or wind farms in a given location and time period [5]. Wind power forecasting can help wind farm operators and grid managers to plan ahead, adjust the power supply and demand, reduce the risk of power outages or curtailments, and optimize the economic benefits.

Accurate WPF is nevertheless challenging. Wind power depends nonlinearly on wind speed and other meteorological variables, and rudimentary statistical methods often fall short in capturing these nonlinear relationships, leading to forecasts that are either suboptimal or biased. To mitigate this issue, the adoption of nonlinear modeling techniques capable of learning complex data patterns is imperative. The temporal variations and seasonal changes in wind power generation introduce another layer of complexity, termed nonstationarity. The study [8] indicates that such variations can alter the data distribution and the underlying dynamics over time, thereby affecting the reliability of models that assume constant parameters or stable conditions. Consequently, there is a pressing need for adaptive modeling approaches that can adjust to these changing conditions to yield consistent forecasts. High dimensionality is another significant challenge, as wind power generation is influenced by a multitude of factors that operate across different spatial and temporal scales. These include weather conditions, geographical locations, turbine configurations, and grid conditions. As Karamichailidou et al. [9] have noted, the sheer volume of data generated from these multiple sources complicates data processing and analysis. Therefore, models capable of handling high-dimensional data are essential for effective forecasting. Finally, wind power generation is affected by various sources of uncertainty, such as measurement errors, model errors, parameter errors, and forecast errors [10]. These sources of uncertainty may introduce errors or deviations in the forecasts, affecting the decision-making and risk management of wind power integration. To address this difficulty, uncertainty quantification and propagation techniques can be used to provide probabilistic forecasts or confidence intervals [11,12].
In recent decades, a plethora of methods have been developed to tackle the challenges associated with WPF, as evidenced by a range of seminal works [13][14][15][16]. These works can be broadly divided into three categories: physical methods, statistical methods, and hybrid methods. Physical methods primarily rely on numerical weather prediction (NWP) models, which employ mathematical equations to simulate atmospheric dynamics and physics [8]. While these physically based models, such as NWP systems [17,18], are adept at providing detailed long-term wind forecasts, they are computationally intensive and may suffer from inaccuracies due to model simplifications. Moreover, the computational burden becomes particularly pronounced during the downscaling process [19]. Statistical methods, on the other hand, employ data-driven models to learn empirical relationships between input variables, such as historical wind and meteorological data, and output variables like future wind power generation [5]. These models are generally efficient for short-term forecasts but are susceptible to overfitting and may exhibit reduced accuracy over extended prediction horizons [20]. Traditional statistical models like the persistence model (PM) and various autoregressive (AR) types, including ARMA and ARIMA, have been adapted to account for the non-stationary and complex nature of wind speeds. For example, fractional ARIMA models have demonstrated significant improvements in 24-hour and 48-hour wind speed forecasts compared to the PM [21]. However, these models often make the simplifying assumption of Gaussian distributions in wind speed data, which may not always hold true [22]. Hybrid methods aim to amalgamate the strengths of both physical and statistical approaches, thereby providing robust and reliable forecasts across various time horizons [8]. Despite their potential, these methods may encounter challenges in effectively integrating disparate models or data sources [23].
Nevertheless, existing methods still have some limitations and drawbacks. One of them is that most methods do not consider the importance and relevance of each input variable or time step. They treat all the input data equally or use fixed weights, which may lead to poor or biased forecasts. For example, ARIMA [2], ELM [3], and SVR models [5] do not differentiate the importance of different wind fields, time steps, or meteorological factors, respectively. Another notable limitation is the inadequate exploitation of collaborative and directional information present in the input data. Many existing methods overlook the interactions and correlations among different variables or time steps, leading to forecasts that may be either incomplete or redundant. For instance, both the ARIMA and SVR models fail to capture the directional nuances of wind speed and direction, as well as the interactions among different input variables or geographical locations. Moreover, a majority of the methods lack effective integration between long-term and short-term information from the input data. They either focus on one of these aspects to the exclusion of the other or resort to simplistic concatenation techniques, potentially resulting in information loss or inconsistencies in the forecasts. For example, ELM networks and SVR methods primarily concentrate on long-term information derived from NWP equations or short-term information from multi-temporal-scale (MTS) models, or they attempt to combine both types of information through rudimentary concatenation. Furthermore, most existing methods do not consider the spatial dependencies among different wind turbines or wind farms, which may affect the accuracy and reliability of the forecasts. These methods often rely on fixed or predefined weights for each input variable or time step, which may not reflect their contextual relevance for forecasting. Therefore, there is a need for a novel method that can address these challenges and provide accurate and robust forecasts for WPF.
In this paper, we address the following research problem: how can wind power generation be accurately forecast using a data-driven method that captures the temporal and spatial dependencies, as well as the long-term and short-term patterns, in the data, and that dynamically adjusts the weights of each input variable and time step based on their contextual relevance for forecasting? The research objectives of this paper are: (1) to develop a novel data-driven method for WPF that can accurately forecast wind power generation by capturing the temporal and spatial dependencies, as well as the long-term and short-term patterns, in the data; (2) to explore how the proposed method can dynamically adjust the weights of each input variable and time step based on their contextual relevance for forecasting, and how this can improve the representation and interpretation of the data.
To address these objectives, we propose Temporal Collaborative Attention (TCOAT), a novel method for WPF that advances the state of the art in several ways. First, it is the first method that integrates temporal collaborative attention with temporal fusion for WPF, capturing both the temporal and spatial dependencies, as well as the long-term and short-term patterns, in the data. Existing methods either use fixed or predefined weights for each input variable or time step, or use simple attention mechanisms that do not consider the directional information or the global information in the data. Second, it introduces collaborative attention units (CAUs), which transform the input data into a tensorial representation capable of capturing directional dependencies, and compute attention scores and memory weights for each tensor direction. CAUs model the interactions and correlations among different variables or time steps using self-attention and cross-attention, and thereby enhance the representation and interpretation of the data. Existing methods either do not use attention mechanisms, or use single-directional or single-dimensional attention mechanisms that do not capture the complex relationships in the data. Third, it designs a temporal fusion layer, which effectively integrates the long-term and short-term information in the data, fusing them using concatenation and mapping operations together with hierarchical feature extraction and aggregation. The temporal fusion layer captures both global and local data characteristics, and extracts hierarchical features for WPF. Existing methods either do not use temporal fusion, or use simple concatenation or averaging techniques that may result in information loss or inconsistencies in the data. TCOAT is an end-to-end model that can learn directly from raw wind power data without any preprocessing or post-processing steps.
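The collaborative attention in the CAUs builds on standard self- and cross-attention primitives, which are specified for TCOAT in Section 4. Purely as an illustration of those two primitives, here is a minimal single-head NumPy sketch; the shapes, variable names, and single-head form are our assumptions, not the paper's CAU definition.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Scaled dot-product attention: each query position attends over all keys."""
    scores = q @ k.T / np.sqrt(q.shape[-1])   # (T_q, T_k) pairwise relevance
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
T, d = 24, 8                       # e.g. 24 hourly time steps, 8 features each
x = rng.standard_normal((T, d))    # one input sequence (illustrative)
y = rng.standard_normal((T, d))    # a second sequence/tensor direction

self_out, w_self = attention(x, x, x)     # self-attention within one sequence
cross_out, w_cross = attention(x, y, y)   # cross-attention between sequences
```

Self-attention relates positions of one sequence to each other, while cross-attention lets one sequence query another; the CAUs combine both to model interactions among variables and time steps.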
The main contributions of this paper are:
• We propose TCOAT, a novel method for WPF that uses attention mechanisms to capture the temporal and spatial dependencies in wind power generation data, and to dynamically adjust the weights of each input variable and time step according to their relevance for forecasting.
• We introduce CAUs, a novel component of TCOAT that learns the directional information and the global information from the data, and models the interactions and correlations among different variables or time steps using self-attention and cross-attention.
• We design a temporal fusion layer, a novel component of TCOAT that effectively integrates the long-term and short-term information in the data, fusing them using concatenation and mapping operations together with hierarchical feature extraction and aggregation.
• We evaluate TCOAT's performance and generality on two real-world wind power generation datasets from different climate zones. One dataset is from Greece, which has a Mediterranean climate. The second dataset, whose precise location is not disclosed, is derived from a wind farm situated in flat, inland terrain. We compare TCOAT with twenty-two state-of-the-art methods on various forecasting tasks and metrics. The results demonstrate that TCOAT outperforms existing methods in terms of accuracy, generality, and robustness for WPF.
The rest of this paper is organized as follows: Section 2 presents related work. Section 3 describes the study materials. Section 4 presents the proposed TCOAT model. Section 5 describes the experimental settings and reports and discusses the experimental results. Section 6 concludes this paper and outlines some future research directions.

Wind power forecasting models
Wind power forecasting is essential for the integration and operation of wind energy in power systems. It can help optimize the scheduling and dispatch of power generation, reduce the uncertainty and variability of wind power, and enhance the reliability and security of the grid. Wind power forecasting can be classified into four categories according to the forecasting horizon: very short-term (up to 6 h ahead), short-term (6 to 48 h ahead), medium-term (2 to 10 days ahead), and long-term (more than 10 days ahead) [24]. Different forecasting methods have different strengths and weaknesses, depending on the forecasting horizon, the spatial and temporal resolution, the input data, and the evaluation metrics. In this section, we review some of the main approaches for wind power forecasting, namely physical, statistical, and deep neural network (DNN) methods, and discuss their advantages, disadvantages, and challenges.
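The four horizon classes of [24] can be encoded as a small helper function; note that how values exactly on the class boundaries (6 h, 48 h, 10 days) are assigned is our assumption, since the quoted ranges overlap at the edges.

```python
def horizon_category(hours_ahead: float) -> str:
    """Classify a forecasting horizon into the four classes of [24].

    Assignment at the exact boundaries (6 h, 48 h, 240 h) is an assumption.
    """
    if hours_ahead <= 6:
        return "very short-term"
    elif hours_ahead <= 48:
        return "short-term"
    elif hours_ahead <= 10 * 24:     # up to 10 days ahead
        return "medium-term"
    else:
        return "long-term"
```

For example, a day-ahead forecast (24 h) falls in the short-term class, while a 4-day forecast falls in the medium-term class.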

Physical approaches
Physical models, employing numerical weather prediction (NWP) techniques, are crucial in wind power forecasting. These models are based on atmospheric physics and solve equations of fluid dynamics and thermodynamics. They can account for complex terrain and environmental factors, and provide forecasts for multiple variables, such as wind speed, direction, temperature, and pressure. These variables can affect the power output and the fatigue and damage of wind turbines, as well as the power flow and congestion in transmission lines [25]. One of the most prominent examples of physical models is the Weather Research and Forecasting (WRF) model, which excels in medium- to long-term forecasting. However, these models also have drawbacks, such as requiring high computational resources and extensive meteorological data, which can hamper their real-time application.
Several studies have applied and compared the performance of physical models for wind power forecasting, using different input data, forecasting horizons, and evaluation metrics. For example, Jacondino et al. [26] compared different physics schemes for wind forecasting in Brazil using WRF. They found that the best setup used a local closure PBL, a single-moment microphysics, a two-layer land surface, a profile adjustment cumulus, and cloud-aerosol radiation. Wang et al. [27] proposed a method to correct the wind forecast of the WRF model using random forest (RF) and machine learning (ML) techniques. They used WRF output, GTS observations, and ERA5 reanalysis data as inputs, and evaluated the forecasts using RMSE and spatial distribution. They found that the RF-based method improved the average forecast accuracy of 10 m wind, 2 m temperature, and sea level pressure by 40%, 36%, and 50%, respectively, compared to the original WRF model. They also found that adding an MLP-based feature selector to the RF model further improved the accuracy by 5%. Zheng et al. [28] used the WRF-RF model for short-term wind power prediction at a wind farm in China. They used data from an NWP model and a wind tower as inputs, and evaluated the forecasts using RMSE and MAPE. They found that the WRF-RF model improved the accuracy of wind power prediction, especially at higher wind speeds. Zhao et al. [29] created a hybrid method that combines a WRF ensemble, a fuzzy system, and a cuckoo search algorithm to forecast wind speed for wind farms. The method reduces NWP errors, selects and weighs the best ensemble members, and performs better than other models in different regions.
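The RMSE and MAPE metrics that recur in these comparisons are standard; a minimal NumPy implementation is sketched below (the epsilon guard against near-zero targets in MAPE is our own convention, not taken from the cited studies).

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mape(y_true, y_pred, eps=1e-8):
    """Mean absolute percentage error (%); eps guards near-zero targets."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    rel = np.abs(y_true - y_pred) / np.maximum(np.abs(y_true), eps)
    return float(np.mean(rel) * 100.0)
```

RMSE penalizes large deviations quadratically and keeps the units of the target, while MAPE reports a scale-free percentage but is unstable near zero power output, which is one reason normalized variants such as NMAE and NRMSE also appear in the literature.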
One of the advantages of physical models is that they can provide forecasts for any location, even where historical data is not available or sufficient. This is especially useful for remote or offshore wind farms, where data collection can be challenging or costly [30]. However, physical models also face some challenges and limitations. One of them is that they depend on the quality and availability of input data, such as initial and boundary conditions, which can introduce uncertainties and errors in the forecasts. For example, errors in the initial wind speed or direction can propagate and amplify over time, leading to inaccurate forecasts [31]. Another challenge is that they require high-resolution spatial and temporal grids, which can increase the computational complexity and cost of the models. This can limit the applicability of physical models for short-term or very short-term forecasting, where fast and frequent updates are needed [32]. Moreover, physical models may not account for local effects, such as topography, land use, and vegetation, which can influence wind power production at specific sites. These effects can be difficult to model or parameterize, and may require site-specific calibration or validation [33]. Finally, physical models may not be able to capture the stochastic and nonlinear nature of wind power fluctuations, especially in short-term horizons. These fluctuations can be caused by random or chaotic phenomena, such as gusts, ramps, or cut-offs, which can be hard to predict or simulate [34].

Statistical approaches
Statistical models, such as the autoregressive moving average (ARMA) model and its extension, the autoregressive integrated moving average (ARIMA) model, are widely utilized in wind power forecasting. These models are based on linear regression of observed values and can handle time series data effectively. The ARIMA model, in particular, addresses the non-stationarity of wind data, making it more adaptable to varied forecasting scenarios. However, these models require high-quality, stationary historical data to maintain accuracy, which may limit their applicability in some conditions.
Several studies have applied and compared the performance of ARMA and ARIMA models for wind power forecasting, using different input data, forecasting horizons, and evaluation metrics. For example, Cao et al. [35] combined the ARMA model for forecasting up to one hour ahead with the pattern-matching method for forecasting up to six hours ahead. They found that the ARMA model had better accuracy on shorter time scales, while the pattern-matching method was more accurate for longer time scales. Milligan et al. [36] tested various alternative ARMA models for forecasting horizons of up to six hours, and found that the ARMA(1,24) model had the best performance of all the models tested. They also observed that the forecasting accuracy decreased significantly for greater forecasting horizons, and that the ARMA models managed to surpass the persistence model in most cases. Ahn et al. [37] proposed a short-term wind power forecasting method using an ensemble model based on wavelet transform and ARIMAX techniques. They claimed that their method outperformed the single ARIMAX model and other benchmark models in terms of accuracy and reliability. Zhang et al. [38] proposed a hybrid model based on DWT, SARIMA, and LSTM to forecast short-term offshore wind power. They used DWT to decompose the power signal into linear and nonlinear components, and applied SARIMA and LSTM to capture the seasonal and dynamic patterns, respectively. They achieved lower NMAE and NRMSE than single models or other hybrid models. Sheoran and Pasari [39] forecasted wind speed at four Indian locations. They updated model parameters dynamically and compared with standard ARIMA and persistence models. They showed that window-sliding ARIMA was more accurate, robust, flexible, and efficient.
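To make the persistence-versus-autoregressive comparison above concrete, the sketch below fits an AR(1) coefficient by ordinary least squares on a synthetic series and compares its one-step-ahead error against the persistence model. The synthetic data, the split sizes, and the bare AR(1) fit (rather than a full ARMA/ARIMA estimation) are all illustrative assumptions, not the setups of the cited studies.

```python
import numpy as np

rng = np.random.default_rng(42)
phi, n = 0.5, 2000
y = np.zeros(n)
for t in range(1, n):                        # synthetic AR(1) stand-in for a wind series
    y[t] = phi * y[t - 1] + rng.standard_normal()

train, test = y[:1500], y[1499:]             # keep one overlapping value for the lag

# Least-squares estimate of the AR(1) coefficient from the training split.
phi_hat = np.linalg.lstsq(train[:-1, None], train[1:], rcond=None)[0][0]

def rmse(err):
    return float(np.sqrt(np.mean(err ** 2)))

pred_ar = phi_hat * test[:-1]                # one-step-ahead AR(1) forecasts
pred_pm = test[:-1]                          # persistence: next value = current value
rmse_ar = rmse(test[1:] - pred_ar)
rmse_pm = rmse(test[1:] - pred_pm)
```

On a series whose autocorrelation is well below 1, the fitted AR(1) model beats persistence, mirroring the pattern reported for ARMA models against the persistence baseline; on near-random-walk data the gap shrinks, which is why persistence remains a competitive very short-term benchmark.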
Statistical models have some advantages and drawbacks for wind power forecasting. One of the advantages is that they are simple, fast, and easy to implement. They can provide reliable forecasts for short-term horizons, such as minutes or hours ahead, which are useful for operational planning and scheduling. They can also capture the autocorrelation and seasonality of wind power data, which are important features for forecasting. Moreover, statistical models can be combined with other methods, such as physical models, machine learning models, or ensemble methods, to improve their performance and robustness. For example, Singh et al. [40] proposed a hybrid method that used ARIMA and artificial neural networks (ANNs) to forecast wind power at different time scales. Bazionis et al. [41] presented a critical review of various forecast models, including statistical models, and compared their performance indices. However, statistical models also have some drawbacks and challenges. One of them is that they ignore external factors that affect wind power generation, such as weather conditions, terrain features, or turbine characteristics. These factors can introduce uncertainties and errors in the forecasts, especially for longer-term horizons, such as days or weeks ahead. Another drawback is that statistical models are sensitive to outliers and noise in the data, which can distort the model parameters and reduce the forecast accuracy. Furthermore, statistical models may fail to capture complex and nonlinear patterns in the wind power data, such as ramps, gusts, or cut-offs, which can cause significant deviations from the expected values. These patterns can be influenced by random or chaotic phenomena, which are hard to model or predict by linear regression. For instance, Messner et al. [42] conducted a comprehensive review and statistical analysis of errors in wind power forecasts, and found that the error dispersion factor, which measures the variability of errors, depended on the size of the wind farm, the forecasting horizon, and the class of the forecasting method. Olson et al. [43] developed an empirical model that used inputs from a numerical weather prediction (NWP) model to forecast wind power, and compared it with a statistical model.

DNN-based approaches
Deep Neural Networks (DNNs) are composed of multiple interconnected layers of artificial neurons that can learn complex and nonlinear mappings between the input data and the output data [2,44]. DNNs can be classified into different types and categories based on their network structures and functions, such as Recurrent Neural Networks (RNNs), Convolutional Neural Networks (CNNs), Attention Mechanisms (AMs), Transformers, and Graph Neural Networks (GNNs) [45]. These types of DNNs have different characteristics, strengths, and weaknesses for wind power forecasting, depending on the input features, the forecasting horizon, and the uncertainty quantification [46].
Several DNN-based methods have been proposed for wind power forecasting, which can be categorized according to their input features, network architectures, and forecasting horizons. Huang et al. [47] used LSTM, CNN, and fully connected layers to capture the long-term dependencies and local features in wind power data, achieving high accuracy and robustness for short-term forecasting. In contrast, Alcantara et al. [48] used CNN and fully connected layers to extract local and global features from spatial data and capture the spatial dependencies in wind power generation, achieving high accuracy and efficiency for medium-term forecasting. Liu et al. [49] used AM, LSTM, and fully connected layers to assign different weights to different inputs or time steps according to their relevance and importance for wind power forecasting, achieving high accuracy and scalability for long-term forecasting. Moreover, Sun et al. [50] used spatio-temporal correlations and transformer neural networks for short-term multi-step wind power forecasting, evaluating the quality of spatial information using distance- and correlation-based metrics, and modeling the wind power using a multi-head attention mechanism. They outperformed several baseline and state-of-the-art methods in two case studies, but ignored the wind direction, seasonal variation, and weather factors. Wu et al. [51] integrated multidimensional data and spatial correlations for wind speed forecasting, using a Wind Transformer to capture temporal features and a GNN [52] to aggregate spatial features. They outperformed existing methods in accuracy and stability. Liu et al. [53] forecasted the ultra-short-term power of a wind farm cluster based on power fluctuation pattern recognition and a spatio-temporal graph neural network, segmenting the power series into different patterns, training a separate model for each pattern, and capturing the dynamic spatio-temporal correlation between adjacent wind farms under different patterns. However, their approach had limitations such as high computational cost, a fixed pattern partition, and a lack of uncertainty quantification.
Compared to other methods, DNN-based wind power forecasting methods have both advantages and drawbacks. On the one hand, they can learn complex and nonlinear patterns and dependencies from historical data, without requiring explicit physical or statistical assumptions. They can also handle noisy, incomplete, or high-dimensional data, and adapt to changing conditions and scenarios. Moreover, they can be combined with other methods, such as physical models, statistical models, or optimization methods, to improve their performance and robustness. For instance, Lu et al. [54] proposed a hybrid method that used IVMD-SE data preprocessing, an MC-LSTM predictor, and PSO optimization to forecast short-term wind power, which can handle complex data characteristics and improve forecasting accuracy and robustness. Wu et al. [55] presented a comprehensive review on DNN-based approaches in wind forecasting applications, and categorized the existing methods into four types: RNN-based, RBM-based, CNN-based, and AE-based models. They also discussed the advantages and disadvantages of each type, as well as future research directions. On the other hand, deep learning models also have some limitations and challenges in wind forecasting. One of them is that they require high-quality and large-scale data, which may not be easily obtained or sufficient in some scenarios. Another limitation is that they are susceptible to overfitting and underfitting, which may affect their accuracy and robustness. Moreover, deep learning models are computationally expensive and time-consuming, and they need careful tuning of hyperparameters and network architectures. Furthermore, deep learning models are hard to interpret and explain, as they lack transparency and physical meaning. Therefore, it is important to address these issues and improve the performance and reliability of deep learning models in wind forecasting.

Summary
Statistical models, physical models, and DNN-based methods are three main categories of existing methods for wind power forecasting. However, each category has its own advantages and drawbacks, such as data requirements, computational complexity, accuracy, robustness, and interpretability. In this paper, we propose a novel method for wind power forecasting, TCOAT, that integrates temporal collaborative attention and temporal fusion, which can capture both the temporal and spatial dependencies, as well as the long-term and short-term patterns, in wind power generation data. TCOAT also introduces collaborative attention units (CAUs) and a temporal fusion layer, which can enhance the representation and interpretation of the data, and extract hierarchical features for wind power forecasting. TCOAT differs from the existing DNN-based methods in terms of its design and components, and demonstrates the unique contributions of this work by providing accurate and robust forecasts for wind power forecasting. TCOAT consists of four main components: a temporal encoder, a spatial encoder, a collaborative attention module, and a transformer decoder. TCOAT can handle both sequential and spatial data, capture long-term and short-term dependencies, provide attention maps, and achieve state-of-the-art performance. TCOAT is an end-to-end model that can learn directly from raw wind power data without any preprocessing or post-processing steps.

Attention mechanisms in time series forecasting
Attention mechanisms have been widely adopted in natural language processing tasks [56,57], as they can effectively capture long-range dependencies in sequential data. This is especially beneficial for energy forecasting, where temporal patterns and relationships span across various time scales. In this subsection, we review the recent developments in applying attention mechanisms for energy forecasting, and critically examine their strengths and weaknesses.
We categorize the existing works that use attention mechanisms for energy forecasting into two groups: single-energy forecasting and multi-energy forecasting. Single-energy forecasting aims to forecast one type of energy source or load, such as wind power [58,59] or electrical load [60]. These works employ various deep learning architectures with attention mechanisms, such as CNNs, RNNs, LSTMs, or dual-attention mechanisms, to achieve high accuracy and robustness for single-energy forecasting tasks. However, they also face some limitations, such as neglecting external factors that may influence energy generation or demand, requiring large amounts of data, or being computationally expensive. Multi-energy forecasting aims to forecast multiple types of energy sources or loads simultaneously, such as electrical and thermal loads or wind and solar power [61][62][63]. These works propose novel methods for short-term or day-ahead multi-energy load forecasting based on a CNN-BiGRU architecture, a CNN-Seq2Seq model, or an attention-mechanism-based transfer learning model. These works demonstrate high efficiency and accuracy for multi-energy forecasting tasks, but they also encounter some challenges, such as handling complex and large-scale datasets, accounting for seasonal variations in energy data, or ensuring the generalizability of the model to different energy systems. In addition to these works that directly focus on energy forecasting tasks, there are also some works that apply attention mechanisms to related tasks that could indirectly benefit energy forecasting applications. For instance, Tekin et al. [64] use attention mechanisms on convolutional LSTMs for spatio-temporal weather forecasting, while Khan et al. [65] present a dual-stream network with an attention mechanism for photovoltaic power forecasting. These works could provide valuable insights and techniques for energy forecasting applications, as weather conditions are important factors that affect energy generation and demand. However, these works also have some drawbacks, such as ignoring the nonlinear and nonstationary characteristics of energy data, or failing to capture both long-term and short-term patterns.
Unlike existing approaches, in this work we employ collaborative attention units that consist of self-attention and cross-attention mechanisms to model intricate interactions among variables and time steps. Moreover, we utilize a temporal fusion layer to integrate long-term and short-term information, deploying concatenation and mapping operations for hierarchical feature extraction. TCOAT is end-to-end trainable, eliminating the need for preprocessing or post-processing steps, and can be learned directly from raw wind power data. To the best of our knowledge, TCOAT is the first method that combines temporal collaborative attention with temporal fusion for wind power forecasting.
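The temporal fusion layer itself is specified later in the paper; the concatenation-and-mapping idea behind it can be sketched generically as follows. The dimensions, the single linear map, and the ReLU nonlinearity are illustrative assumptions, not TCOAT's actual layer.

```python
import numpy as np

rng = np.random.default_rng(1)
d_long, d_short, d_out = 16, 16, 8

h_long = rng.standard_normal(d_long)     # summary of long-term context (illustrative)
h_short = rng.standard_normal(d_short)   # summary of recent time steps (illustrative)

# Concatenate the two summaries, then map them into a fused representation
# with a learned linear projection followed by a nonlinearity.
W = rng.standard_normal((d_out, d_long + d_short)) * 0.1
b = np.zeros(d_out)
fused = np.maximum(0.0, W @ np.concatenate([h_long, h_short]) + b)  # ReLU mapping
```

The point of learning the map `W`, rather than simply concatenating or averaging, is that the model can weight long-term and short-term evidence per output feature instead of treating the two summaries symmetrically.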

Materials
This study uses two datasets of wind power generation data and associated meteorological data with different characteristics and challenges. The first dataset, referred to as the Greece dataset, obtained from the European Network of Transmission System Operators for Electricity (ENTSO-E), contains hourly data from 18 locations in Greece from 2017-01-01 to 2020-12-31. ENTSO-E is an online platform that aggregates energy data from 42 participants in the European centralized energy market. The second dataset, known as the WSTD2 dataset, can be accessed at https://zenodo.org/records/5516550 [66]. This dataset includes hourly data from 200 randomly chosen turbines situated in a flat-terrain inland wind farm, covering the period from 2010-09-01 to 2011-08-31. Wind power generation data exhibit time-varying and nonlinear characteristics, as they are influenced by meteorological factors such as wind speed, wind direction, temperature, and humidity, which change over time and affect the efficiency and stability of wind power generation [2]. Moreover, wind power generation data show periodic patterns due to diurnal, seasonal, and climatic variations [49]. These datasets are suitable for evaluating the performance and generalizability of the proposed model.
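Summaries like the daily production curves and hourly heatmaps in Fig. 1 can be derived by simple aggregation of the hourly records. The NumPy sketch below uses synthetic hourly data as a stand-in, since the actual file layouts of the ENTSO-E and WSTD2 datasets are not assumed here; the diurnal sine term and noise level are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(7)
hours = 365 * 24
t = np.arange(hours)

# Synthetic stand-in for one location's hourly wind power (arbitrary units),
# with a crude diurnal cycle plus noise, clipped at zero.
power = np.clip(5 + 2 * np.sin(2 * np.pi * (t % 24) / 24)
                + rng.standard_normal(hours), 0, None)

hour_by_day = power.reshape(365, 24)          # rows = days, cols = hours (heatmap-style)
daily_mean = hour_by_day.mean(axis=1)         # daily production curve
mean_profile = hour_by_day.mean(axis=0)       # average diurnal profile over the year
```

Reshaping the hourly series into a (days, hours) matrix makes both views cheap: row means give the daily curve of Fig. 1(a)/(b), and the matrix itself is the heatmap of Fig. 1(c)/(d).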
Fig. 1(a) shows the daily wind power production changes over time for one location in Greece from the Greece dataset. Three main patterns can be observed: (1) the power output varies significantly from month to month, with higher values in May and September and lower values in April and August, reflecting the long-term changes in wind power generation due to climatic factors such as temperature, precipitation, and pressure; (2) the power output also varies within each day, with lower values around 8 to 12 o'clock and higher values around 16 to 20 o'clock, reflecting the diurnal changes in wind power generation due to solar radiation and atmospheric stability; (3) the power output does not exhibit obvious seasonal patterns, such as higher values in winter and lower values in summer, reflecting the uncertainty and randomness of wind power generation, as it may be affected by weather, equipment, policy, and other factors that cause anomalies or fluctuations. Fig. 1(c) shows the heatmap of hourly production within a year for the same location. It can be observed that the power output has a clear diurnal pattern, with higher values in the afternoon and lower values in the morning, and some seasonal patterns, with higher values in spring and autumn and lower values in summer and winter. However, these patterns are not consistent or regular, as there are some outliers or deviations that indicate the uncertainty and variability of wind power generation.
Fig. 1(b) shows the daily wind power production changes over time from the WSTD2 dataset. Similar patterns can be observed: (1) the power output varies significantly from month to month, with peaks in April and November and troughs in July and August; (2) the power output also fluctuates within each day, with lower values between 14:00 and 18:00 and higher values between 07:00 and 15:00; (3) the power output does not show clear seasonal trends, such as higher values in winter and lower values in summer. These patterns reflect the influence of various factors such as climatic elements, solar radiation, atmospheric stability, and other unpredictable factors like weather conditions and equipment performance. Fig. 1(d) shows the heatmap of hourly production within a year for the same location. It can be observed that the power output has a similar diurnal pattern to the Greece dataset, with higher values in the afternoon and lower values in the morning, and some seasonal patterns, with higher values in spring and winter and lower values in summer and autumn. However, these patterns are also inconsistent and irregular, as there are some outliers or deviations that reflect the inherent uncertainty and variability of wind power generation.
Given the space constraints of the article, two datasets are strategically employed to assess the proposed TCOAT model. The Greece dataset is dedicated to verifying the performance of the model, validating its accuracy and robustness, and facilitating a comparative analysis with twenty-two state-of-the-art methods across various forecasting tasks and metrics. The WSTD2 dataset, on the other hand, examines the model's generality and its ability to handle diverse data characteristics and scenarios. This approach ensures a comprehensive validation of the model's reliability and wide applicability.
The data preparation phase tackles the prevalent issue of missing values in wind power data with tailored strategies for each dataset. For the Greece dataset, missing data points are replaced with average values calculated from similar dates in the past. This method preserves the integrity, precision, consistency, and temporal patterns of the data, enhancing the robustness of the subsequent forecasting tasks. The WSTD2 dataset, which covers only a single year and offers no additional data for imputation, adopts a different approach: missing values are filled with zeros, ensuring the completeness of the dataset and making it suitable for evaluating the adaptability of the proposed model.

Methodology
This section delineates the research problem, outlines the data preprocessing steps, and provides a detailed description of the proposed TCOAT model.

Problem formulation
We consider the problem of predicting the future values of wind power generation using multivariate time series data. Wind power generation is a renewable energy source that depends on both temporal and spatial factors, such as weather conditions, seasonal patterns, and geographical locations. Therefore, forecasting wind power generation is a challenging task that requires capturing the temporal and spatial dependencies of the data, as well as the long-term and short-term patterns. Formally, let X ∈ R^{T×N} be the input data, where T is the number of time steps and N is the number of variables. The input data consists of wind power generation data and associated meteorological data for a given region at a given resolution. Let Y be the output data, where h is the prediction horizon. The output data is the wind power generation for the next period. The goal is to learn a mapping function f : X → Y that minimizes a loss function L(Y, Ŷ), where Ŷ = f(X) is the predicted data.
The mapping function f can be decomposed into four sub-functions, where f1 is the data preprocessing function, f2 is the long-term temporal representation function, f3 is the collaborative attention unit and fusion function, and f4 is the short-term temporal representation function. Each sub-function can be expressed as follows: • f1 : X → Z, where Z ∈ R^{B×T×N} is the normalized data, and B and T are the batch size and the length of input time steps, respectively. • f2 : Z → L, where L ∈ R^{B×T×N} is the tensor representation of the data.
The loss function L(Y, Ŷ) is defined as the mean squared error (MSE) between the true data and the predicted data. The goal is to minimize this loss function by learning the optimal parameters of the mapping function f. The problem can be formulated as the optimization problem θ* = argmin_θ L(Y, f(X; θ)), where θ denotes the parameters of the mapping function f. The training procedure (Algorithm 1) proceeds as follows:
Z ← normalize the input data ⊳ Normalization (Eq. (2))
L ← extract the long-term temporal representation of the input data Z ⊳ LTR (Eq. (9))
foreach direction d ← 0 to 2 do: C_d ← calculate the collaborative attention representation using CAU(L, d)
F ← fuse the collaborative representations ⊳ Temporal fusion (Eqs. (23); (24); (25))
S ← extract the short-term temporal representation of the input data Z ⊳ STR (Eq. (26))
Ŷ ← F ⊕ S ⊳ Prediction (Eq. (27))
Loss ← compare Y and Ŷ using MSE ⊳ MSE error (Eq. (28))
Backward using the Adam optimizer [67]; return θ
This optimization problem can be solved with Adam, a variant of stochastic gradient descent (SGD) with momentum and weight decay. Adam adaptively adjusts the learning rate, making the update step size suitable for each parameter.
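As an illustration of the optimization step, minimizing the MSE loss with Adam, the following plain-numpy sketch applies Adam to a toy linear stand-in model; the model, learning rate, and step count are illustrative assumptions, not TCOAT's actual architecture or settings.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.01, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update; returns the updated (theta, m, v)."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)          # bias-corrected first moment
    v_hat = v / (1 - b2 ** t)          # bias-corrected second moment
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 3))           # toy inputs (stand-in for the model)
w_true = np.array([0.5, -1.0, 2.0])
y = X @ w_true                         # toy targets

w, m, v = np.zeros(3), np.zeros(3), np.zeros(3)
losses = []
for t in range(1, 501):
    y_hat = X @ w
    losses.append(np.mean((y - y_hat) ** 2))   # MSE loss (Eq. (28))
    grad = -2 * X.T @ (y - y_hat) / len(y)     # gradient of the MSE
    w, m, v = adam_step(w, grad, m, v, t)
```

The loss decreases monotonically in expectation as the parameters converge toward `w_true`, mirroring the role of Adam in Algorithm 1.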

Overview
Fig. 2 presents a schematic illustration of the proposed Temporal Collaborative Attention (TCOAT) model. The pseudocode detailing the training process of TCOAT is provided in Algorithm 1. The primary objective of TCOAT is to forecast future values in multivariate time series data, with a specific focus on wind power generation, a domain influenced by both temporal and spatial variables. The entire process comprises four sequential stages: data preprocessing, long-term temporal representation, collaborative attention unit and fusion, and short-term temporal representation. To encapsulate the intricate temporal and spatial dependencies, as well as the long-term and short-term patterns inherent in the data, the TCOAT model is structured around four core components: a Long-term Temporal Representation (LTR), Collaborative Attention Units (CAUs), a Temporal Fusion Layer, and a Short-term Temporal Representation (STR).
The LTR component aims to extract a long-term temporal representation from the input data. The extracted data is then processed by CAUs, which are capable of learning directional information and global information from the data. The CAUs then obtain a collaborative attention representation in multiple directions. Subsequently, the Temporal Fusion Layer integrates these multi-directional collaborative representations, generating fusion data that encapsulates the global characteristics of the original data. Simultaneously, the STR component extracts a short-term temporal representation from the original data. The final prediction is produced by integrating the output of the STR and the Temporal Fusion Layer using a residual network.
The architecture is designed to synergize the strengths of Recurrent Neural Networks (RNNs) and attention mechanisms in the context of multivariate time series forecasting. While RNNs are adept at modeling sequential data dependencies, attention mechanisms excel at discerning the significance and relevance of individual data elements. The integration of these two methodologies enables the generation of predictions that are both accurate and robust. A comprehensive discussion of each stage is given in the following subsections.

Data preprocessing
The data preprocessing function f1 is responsible for normalizing the input data X ∈ R^{T×N} and splitting it into windowed multivariate time series (MTS) with a prediction horizon h. To mitigate the impact of outliers on the learning process of the model and encourage faster convergence, Min-Max normalization is utilized. This technique scales all values of X into a range between 0 and 1. In comparison to Z-Score normalization, Min-Max normalization offers a more straightforward computational process and maintains the original distribution of the data, which can enhance the training process of the model and reduce the influence of outliers. The normalization formula is x' = (x − min(x)) / (max(x) − min(x)), where min(x) and max(x) are the minimum and maximum values of x, respectively. The de-normalization formula, x = x' · (max(x) − min(x)) + min(x), is applied to the outputs of the model in the post-processing stage to recover the original scale of the data. The splitting process uses an h-horizon split to transform the data into a supervised learning problem. Given a time series X' ∈ R^{T×N} with T consecutive time intervals and N variables, the h-horizon split slides a window of size w over the series, where the window size w determines how many past observations are used as inputs for each prediction. The left part is the normalized inputs of the model, denoted by Z ∈ R^{(T−w−1)×w×N}, and the right part is the normalized outputs of the model, denoted by Y ∈ R^{(T−w−1)×N}. Several consecutive instances in (Z, Y) are denoted by (Z_B, Y_B) ∈ R^{B×w×N} × R^{B×1×N}, where B is the batch size, w is the window size, and N is the number of variables. The splitting process can be generalized to multiple steps ahead by changing the value of h.
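The normalization and h-horizon split described above can be sketched as follows; the toy series, window size, and the choice of the first variable as the prediction target are assumptions for illustration.

```python
import numpy as np

def minmax(x):
    """Scale each variable into [0, 1]; the inverse is de_minmax."""
    lo, hi = x.min(axis=0), x.max(axis=0)
    return (x - lo) / (hi - lo), lo, hi

def de_minmax(z, lo, hi):
    """Recover the original scale of the data (post-processing step)."""
    return z * (hi - lo) + lo

def h_horizon_split(series, window, h=1):
    """Turn a (T x N) series into supervised pairs: the last `window`
    observations predict the value h steps ahead (target: first variable)."""
    X, Y = [], []
    for i in range(len(series) - window - h + 1):
        X.append(series[i:i + window])
        Y.append(series[i + window + h - 1, 0])
    return np.stack(X), np.array(Y)

ts = np.arange(20, dtype=float).reshape(10, 2)   # toy (T=10, N=2) series
z, lo, hi = minmax(ts)
X, Y = h_horizon_split(z, window=3, h=1)         # X: (7, 3, 2), Y: (7,)
```

Note that constant columns would make the min-max denominator zero; real wind power data would need a guard for that edge case.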

Long-term temporal representation (LTR)
To capture the impact of long-term temporal variables on future wind power generation, we use a gated recurrent unit (GRU) module to extract the hidden representation of the input data. A GRU is a type of recurrent neural network (RNN) that can model the sequential dependencies of the data and handle the vanishing gradient problem [8]. A GRU consists of two gates, a reset gate and an update gate, which control the information flow and the memory state of the network [9]. Given an input tensor Z ∈ R^{B×T×N}, we first apply a linear transformation to each slice of the tensor along the window dimension, denoted by Z_t ∈ R^{B×N}, where t = 1, 2, …, T. Then, we feed each transformed slice to a GRU cell and obtain the hidden state h_t ∈ R^{B×H} at each time step. The GRU cell is defined as follows:
z_t = σ(W_z x_t + U_z h_{t−1} + b_z),
r_t = σ(W_r x_t + U_r h_{t−1} + b_r),
h̃_t = tanh(W_h x_t + U_h (r_t ⊙ h_{t−1}) + b_h),
h_t = (1 − z_t) ⊙ h_{t−1} + z_t ⊙ h̃_t,
where the W, U and b terms are the weight matrices and bias vectors of the corresponding linear transformations. The output of this module is then fed to the next module for collaborative attention and fusion.
We employ the GRU module to extract the impact of long-term temporal variables on future time steps. Based on the previous hidden state h_{t−1}, the hidden representation h_t of the GRU can be derived (Eq. (15)). To get the final temporal representation, the GRU outputs of the T steps are concatenated and then given a linear transformation: L = [h_1; h_2; …; h_T] · W_l + b_l, where L is the final output of the GRU module, and W_l and b_l are the weight matrix and bias parameters of the linear transformation, respectively. [Algorithm 2 (CAU pseudo-code): V ← L_r · W_d ⊳ multiply learnable weight (Eq. (16)); calculate the directional score (Eqs. (17), (18), (19)); calculate the symmetric attention representation and multiply the learnable weight (Eq. (21)); calculate the directional score (Eqs. (17), (18), (19)); C_d ← H_d · W_c ⊳ collaborative attention representation (Eq. (22)); return C_d.]
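As an illustration of the LTR recurrence, a minimal numpy GRU cell unrolled over a look-back window might look like the following; the paper's implementation uses PyTorch, and the weight shapes and random initialization here are assumptions.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_cell(x, h, W, U, b):
    """One GRU step. W: (3, H, N) input weights, U: (3, H, H) recurrent
    weights, b: (3, H) biases -- illustrative shapes, not the paper's."""
    z = sigmoid(W[0] @ x + U[0] @ h + b[0])              # update gate
    r = sigmoid(W[1] @ x + U[1] @ h + b[1])              # reset gate
    h_tilde = np.tanh(W[2] @ x + U[2] @ (r * h) + b[2])  # candidate state
    return (1 - z) * h + z * h_tilde                     # new hidden state

rng = np.random.default_rng(0)
N, H, T = 4, 8, 6                      # variables, hidden size, window
W = rng.normal(0, 0.1, (3, H, N))
U = rng.normal(0, 0.1, (3, H, H))
b = np.zeros((3, H))

h = np.zeros(H)
for t in range(T):                     # unroll over the look-back window
    h = gru_cell(rng.normal(size=N), h, W, U, b)
```

In the full LTR module, the hidden states of all T steps would be collected, concatenated, and passed through the final linear transformation.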

Collaborative attention unit (CAU)
The CAU is a key component of the TCOAT model, designed to capture both directional and global information from the input data. It consists of two steps: a Directional Transformation (DT) and a Symmetric Attention (SA). Fig. 3 and Algorithm 2 show the detailed structure and pseudo-code of the CAU, respectively. The TCOAT model integrates multiple CAUs, enabling the representation of collaborative attention from various directions. This multi-directional approach enhances the model's ability to capture complex patterns in the data, ensuring a smooth and logical flow of information.

Directional transformation (DT)
In the directional transformation step, we first apply a rectified linear unit (ReLU) function to the input tensor L to ensure the reliability of long-term temporal series feature extraction. Then, we multiply the rectified input tensor with a learnable weight matrix W_d to transform it and learn the temporal patterns during the training process: V = ReLU(L) · W_d, where V ∈ R^{B×T×N} is the result of the multiplication and W_d ∈ R^{N×N} is the weight matrix. Next, we feed V into a softmax layer to enlarge the differences along several aspects: the instance dimension, the temporal dimension, and the variate dimension. The instance attention is employed to observe and highlight the connections between consecutive instances; it enhances the key look-back windows by highlighting the outbreak values. The attention mechanism on the instance dimension is A_0 = softmax_0(V). The temporal attention is employed to observe and highlight the connections between input and output time steps; it enhances the key time steps by using the feedback of the output time step. The attention mechanism on the temporal dimension is A_1 = softmax_1(V). The variate attention is employed to observe and highlight the connections between the input time series and the output target; it enhances the key factors by using the feedback of the output. The attention mechanism on the variate dimension is A_2 = softmax_2(V). The highlighted score tensor A_d is then multiplied with the processed input tensor ReLU(L) to generate the transformed tensor H_d dynamically: H_d = A_d ⊙ ReLU(L), where H_d ∈ R^{B×T×N} is the transformed tensor, A_d is the highlighted score tensor, and ReLU(L) is the input tensor processed by the ReLU function.
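The dimension-wise softmax at the heart of the directional transformation can be sketched as below; the tensor sizes (B=5 instances, T=6 time steps, N=4 variables) and the elementwise combination of scores with the rectified input are illustrative assumptions.

```python
import numpy as np

def softmax(a, axis):
    """Numerically stable softmax along one axis."""
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
L = np.maximum(rng.normal(size=(5, 6, 4)), 0.0)  # ReLU-rectified input (B,T,N)
W_d = rng.normal(size=(4, 4))                    # learnable weight (assumed N x N)
V = L @ W_d                                      # transformed tensor

A_instance = softmax(V, axis=0)   # highlights connections between instances
A_temporal = softmax(V, axis=1)   # highlights key time steps
A_variate  = softmax(V, axis=2)   # highlights key variables
H0 = A_instance * L               # direction-0 transformed tensor (assumed ⊙)
```

Each attention tensor sums to one along its own axis, so the three directions weight the same scores in three complementary ways.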

Symmetric attention (SA)
In the symmetric attention step, we enhance the temporal features of the transformed tensor from different directions. We multiply the transformed tensor H_d with a learnable weight matrix W_s to produce a score tensor S: S = H_d · W_s, where S ∈ R^{B×T×N} is the score tensor, H_d is the transformed tensor, and W_s is the learnable weight matrix.
Then we apply a softmax function to S to obtain a highlighted score tensor S_d, which assigns weights to each element in S according to its importance for future prediction. The softmax function is applied along different dimensions, depending on the direction d. The specific operations are the same as in formulas (17), (18), and (19).
Finally, we multiply the highlighted score tensor S_d with a learnable weight matrix W_c to calculate the final attention tensor C_d, which represents the collaborative attention representation from direction d: C_d = S_d · W_c, where C_d ∈ R^{B×T×N} is the attention tensor in the d-th direction, S_d is the highlighted score tensor, and W_c is the learnable weight matrix.
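Putting the two steps together, one CAU along a direction d might be sketched as follows; the weight shapes and the exact composition of the operations are assumptions consistent with Eqs. (16)–(22), not the authors' exact code.

```python
import numpy as np

def softmax(a, axis):
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cau(L, d, W_q, W_s, W_c):
    """Sketch of one collaborative attention unit along direction d
    (0 = instance, 1 = temporal, 2 = variate); weights are illustrative."""
    Lr = np.maximum(L, 0.0)                  # ReLU rectification
    Hd = softmax(Lr @ W_q, axis=d) * Lr      # directional transformation (DT)
    Sd = softmax(Hd @ W_s, axis=d)           # symmetric attention scores (SA)
    return Sd @ W_c                          # collaborative representation C_d

rng = np.random.default_rng(1)
B, T, N = 4, 6, 3
L = rng.normal(size=(B, T, N))
C = [cau(L, d,
         rng.normal(size=(N, N)),
         rng.normal(size=(N, N)),
         rng.normal(size=(N, N))) for d in range(3)]   # three directions
```

The three outputs C_0, C_1, C_2 all share the input's shape, which is what allows the temporal fusion layer to concatenate them with the original series.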

Temporal fusion
To effectively benefit from these attention representations, a temporal fusion layer is proposed to aggregate the outputs from the CAUs. The graphical process of the temporal fusion layer is plotted in Fig. 4.
The temporal fusion layer consists of a global autoregression (GAR) layer and a linear layer. The outputs from the CAUs are first concatenated together; the input tensor is also concatenated to capture short-term temporal dynamics: F = [Z; C_0; C_1; C_2], where F ∈ R^{B×T×4N} is the concatenated tensor, Z is the normalized time series from data preprocessing, C_0, C_1, and C_2 are the attention tensors viewed from dimensions 0, 1, and 2, respectively, and [;] denotes the concatenation operation.
The concatenated tensor is passed through the GAR layer to learn the various temporal patterns: P = F · W_g + b_g, where P ∈ R^{B×1×4N} is the learned temporal pattern tensor, W_g ∈ R^{T×4N} is the weight matrix, and b_g is a bias. The learned pattern tensor is then passed through a linear fusion layer to generate consecutive model outputs: O = P · W_o + b_o, where O ∈ R^{B×1×N} is the model output, W_o ∈ R^{4N×N} is the weight matrix, and b_o is a bias.
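The fusion pipeline, concatenation, a GAR-style per-channel weighted sum over the window, and a linear mapping, can be sketched as follows; the einsum form of the GAR layer is an assumed reading of the T×4N weight shape, and all tensors are random stand-ins.

```python
import numpy as np

rng = np.random.default_rng(2)
B, T, N = 4, 6, 3
Z = rng.normal(size=(B, T, N))                      # normalized inputs
C0, C1, C2 = (rng.normal(size=(B, T, N)) for _ in range(3))  # CAU outputs

F = np.concatenate([Z, C0, C1, C2], axis=2)         # (B, T, 4N)

# GAR layer: a per-channel weighted sum over the window (assumed form,
# consistent with a weight of shape T x 4N producing a B x 1 x 4N output)
W_g = rng.normal(size=(T, 4 * N))
P = np.einsum('btk,tk->bk', F, W_g)[:, None, :]     # (B, 1, 4N)

W_o = rng.normal(size=(4 * N, N))
O = P @ W_o                                         # fused output (B, 1, N)
```

The collapse of the time axis in the GAR step is what turns a whole look-back window into a single fused step ready for prediction.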

Short-term temporal representation (STR)
In general, future wind power generation is influenced more by short-term temporal data than by long-term temporal data. Simple models, such as linear models or RNN-based methods, can be used to capture short-term time series features; the lower part of Fig. 4 presents the processing. Different short-term feature-capture models can be applied to various horizon-split datasets to obtain various outcomes: S = R(Z_τ), where Z_τ ∈ R^{B×τ×N} is the input data of the short-term temporal part, representing the last τ time steps of the input temporal data, S ∈ R^{B×1×N} is the short-term temporal process result, and R is the short-term temporal process function (linear, GRU, and other models are optional). Then, in order to obtain the final forecast data, we combine the short-term data features with the long-term data features: Ŷ = F ⊕ S, where Ŷ ∈ R^{B×1×N} is the final output of TCOAT, F represents the final output of the temporal fusion, S represents the final output of the short-term temporal process, and ⊕ is the symbol of the residual operation.
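A sketch of the short-term branch and the residual combination, using the linear option for R and an assumed look-back of τ = 2 steps; the fusion output is a random stand-in here.

```python
import numpy as np

rng = np.random.default_rng(3)
B, T, N, tau = 4, 6, 3, 2
Z = rng.normal(size=(B, T, N))

# Short-term branch: a linear map over the last tau steps (one of the
# options the text allows; a GRU could be substituted for R)
W_s = rng.normal(size=(tau, 1))
S = np.einsum('btn,to->bon', Z[:, -tau:, :], W_s)  # (B, 1, N)

F = rng.normal(size=(B, 1, N))   # stand-in for the temporal-fusion output
Y_hat = F + S                    # residual combination F ⊕ S
```

Because both branches emit a (B, 1, N) tensor, the residual combination is a plain addition, letting the short-term branch correct the fused long-term estimate.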

Experiments
In this section, we describe the data, experimental settings, model implementations, and results. We also compare the proposed model with other methods.

Data and experimental settings
In the experiments, the Greece dataset served as the primary source of data. To further validate the generalizability of the TCOAT model, the WSTD2 dataset was also incorporated. This auxiliary dataset offers a unique set of data characteristics for analysis. Detailed introductions to both datasets can be found in Section 3.
The preprocessing of the primary dataset was meticulously executed as follows: (1) Aggregation: hourly data points were aggregated into daily averages to reduce data noise and complexity; (2) Standardization: the daily values were standardized using their annual mean values to account for changes in wind turbine installations over time; (3) Min-Max scaling: we applied min-max scaling to the standardized values, normalizing them into a range between 0 and 1, which is optimal for neural network training. The primary dataset was divided into training and testing sets in a 4:1 ratio to ensure ample data for training while preserving the temporal sequence. For the additional dataset, we followed a similar preprocessing methodology and divided it into training and testing sets using an 80:20 ratio, considering its unique characteristics and the need to validate the model across varied conditions.
The TCOAT model was implemented using PyTorch v2.0.0 [68]. The computational resources included a server with an Intel® Xeon® Gold 5218R CPU (2.10 GHz), 256 GB of memory, and four Tesla V100-PCIE-16 GB GPUs. The prediction horizon was set to one day for the primary dataset, a standard in wind power forecasting [69], with input intervals of 1, 3, 5, and 7 days. These intervals were chosen to evaluate the model's performance over different time scales, reflecting the trade-off between accuracy and complexity.

Comparison baselines
In this section, we present a comprehensive comparison of the proposed TCOAT model with twenty-two state-of-the-art methods for multivariate time series forecasting. We summarize the main features and characteristics of these methods in Table 1, including the model type, the components, the advantages, and the disadvantages. The model type indicates whether the method is based on statistical, machine learning, or deep learning techniques. The components describe the main modules or layers of the method that are used to capture the temporal and collaborative patterns in the data. The advantages highlight the strengths or benefits of the method for multivariate time series forecasting. The disadvantages point out the limitations or drawbacks that may affect its performance or applicability.

Model configurations
We conducted five repeated experiments on the wind power time series data to evaluate the performance of each method. We used this approach instead of cross-validation to preserve the temporal order of the data, which is essential for forecasting tasks. We trained the models using the Adam optimizer [67] with the mean squared error (MSE) as the loss function, following previous studies that showed their effectiveness for wind power forecasting [89]. We applied the grid search method to optimize the hyper-parameters for each method over a predefined range of values. The optimal hyper-parameters for each baseline method obtained by the grid search are shown in Table 6.
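The grid search procedure can be sketched generically as below; the grid values and the mock validation function are placeholders for illustration, not the paper's actual search space or training loop.

```python
import itertools
import numpy as np

# Illustrative hyper-parameter grid (placeholder values)
grid = {"window": [3, 5, 7], "hidden": [16, 32], "lr": [1e-2, 1e-3]}

def validate(cfg, seed):
    """Stand-in for train-and-evaluate; returns a mock validation MSE."""
    rng = np.random.default_rng(seed)
    return float(rng.uniform(0.1, 1.0))

best_cfg, best_mse = None, float("inf")
for i, values in enumerate(itertools.product(*grid.values())):
    cfg = dict(zip(grid.keys(), values))    # one point of the grid
    score = validate(cfg, seed=i)
    if score < best_mse:                    # keep the lowest validation MSE
        best_cfg, best_mse = cfg, score
```

In practice `validate` would train the model on the training split and score it on a held-out portion that preserves temporal order.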

Evaluation metrics
To evaluate the performance of the proposed TCOAT model and compare it with other methods, we use three evaluation metrics that are commonly used in the field of energy prediction: Mean Square Error (MSE), Mean Absolute Error (MAE), and the Coefficient of Variation of the Root Mean Square Error (CVRMSE). These metrics measure the accuracy and reliability of the prediction models, and reflect the characteristics and challenges of wind power data.
MSE is a scale-dependent metric that measures the average squared difference between the predicted and actual values. MSE is sensitive to outliers and large errors, meaning it penalizes large deviations more than small ones. MSE is defined as MSE = (1/n) Σ_{i=1}^{n} (y_i − ŷ_i)², where n is the number of samples, y_i is the actual value, and ŷ_i is the predicted value. MAE is another scale-dependent metric that measures the average absolute difference between the predicted and actual values. MAE is less sensitive to outliers and large errors than MSE, meaning it treats all errors equally. MAE is defined as MAE = (1/n) Σ_{i=1}^{n} |y_i − ŷ_i|. CVRMSE is a scale-independent metric that measures the normalized root mean square error relative to the mean of the actual values; it can compare the performance of different models or datasets with different scales or units. CVRMSE is defined as CVRMSE = sqrt((1/n) Σ_{i=1}^{n} (y_i − ŷ_i)²) / ȳ, where ȳ is the mean of the actual values.
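The three metrics can be implemented directly from their definitions; the small arrays below are toy values for illustration.

```python
import numpy as np

def mse(y, y_hat):
    """Mean squared error: penalizes large deviations more than small ones."""
    return np.mean((y - y_hat) ** 2)

def mae(y, y_hat):
    """Mean absolute error: treats all errors equally."""
    return np.mean(np.abs(y - y_hat))

def cvrmse(y, y_hat):
    """Root-mean-square error normalized by the mean of the actual values."""
    return np.sqrt(mse(y, y_hat)) / np.mean(y)

y = np.array([2.0, 4.0, 6.0])       # toy actual values
y_hat = np.array([2.5, 3.5, 6.0])   # toy predictions
```

Because CVRMSE divides by the mean of the actuals, it is unit-free and can compare models across datasets with different scales.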
The lower the values of these metrics, the better the performance of the prediction model. However, these metrics also have some limitations. For example, MSE and MAE do not consider the temporal correlation or order of the time series data, which may affect the prediction accuracy. CVRMSE may not reflect the absolute error or deviation of the prediction model, which may affect the reliability of the evaluation. The results for the linear models indicate that they are not suitable for wind power forecasting because they cannot handle the short-term variations and long-term dependencies of wind power.
• Recurrent neural networks (RNNs) such as LSTM, GRU, and ED generally achieve second-best or third-best results across various horizons. Notably, the GRU model consistently showed the best performance among these RNNs, effectively capturing the temporal dependencies in wind power data. This is indicative of the strength of its gating mechanism and its ability to leverage historical information for current data predictions. However, when compared to our proposed TCOAT model, both LSTM and GRU models, despite their merits, exhibit limitations. The LSTM, known for its ability to handle long-term dependencies, falls short in terms of predictive accuracy and flexibility when dealing with the complex, non-linear patterns characteristic of wind power forecasting. This is evident from the performance metrics in Table 3, where TCOAT consistently outperforms LSTM, especially in terms of MSE, MAE, and CVRMSE across all forecast horizons. The LSTM model's limitations in inference power and reliance on substantial training data quality and quantity become apparent when juxtaposed with the advanced capabilities of TCOAT.
Our model incorporates the novel integration of dynamic attention mechanisms, collaborative attention units for assimilating multi-dimensional data, and a temporal fusion layer for effective long-term and short-term pattern analysis. These contribute to its performance and address the gaps observed in traditional LSTM models. This comparative analysis underscores the novelty and effectiveness of the TCOAT model in wind power forecasting, offering a more accurate, reliable, and nuanced approach than existing LSTM-based methods.
• CNN- and RNN-based models (CNN1D, CRNN, CRNNRes, and LSTNet) perform better than linear and linear-variant models, but worse than RNN-based models. CNN1D has a similar performance to GAR, which means it does not extract useful features from input sequences effectively. CRNN models use both local features and historical information to enhance prediction accuracy, unlike pure convolutional neural networks that rely only on local features. For short-term horizon prediction tasks (ℎ=1), CRNNRes models are more stable than CRNN models, indicating that the residual connection helps to capture the low-horizon features. However, CRNNRes performs worse than CRNN for the ℎ=5 and ℎ=7 prediction tasks. This may be because the residual window is not large enough to support long-term forecasting with sufficient residual information. LSTNet adds a skip window to CRNNRes, which splits the input sequence into small segments and models them using a GRU. The performance of LSTNet is similar to CRNNRes, which means that skipping windows does not help to learn useful representations. Overall, the hybrid RNN models (i.e., CRNN and CRNNRes) are not as effective as the RNN model alone, which suggests that convolutional models are not sufficiently accurate to represent temporal dependencies by capturing regional features.
• Self-attention-based models (Transformer, Informer, Autoformer, and Fedformer) have slightly better results than linear models but worse than RNN models. Among
them, Informer has the best performance, but only slightly better than Transformer. Autoformer and Fedformer are only better than Transformer when ℎ=1 but slightly worse than Transformer when ℎ=3, ℎ=5, or ℎ=7. This suggests that their attention mechanism and series decomposition are not effective in finding periodic patterns and dependencies in wind power data.
• CNN-RNN and attention models (DSANet and TPA-LSTM), which combine the two core components of the hybrid attention approach, perform better than CNN-RNN models. TPA-LSTM achieves the second-best performance among all methods at ℎ=3 and ℎ=7, demonstrating that its temporal pattern attention can capture long-term dependencies through these two core components.
In the wind power prediction experiment conducted on the Greece dataset, four time horizons were considered: 1 day ahead, 3 days ahead, 5 days ahead, and 7 days ahead. From a single time horizon perspective, the GRU model emerged as the best-performing baseline method. However, TCOAT outperformed all methods, achieving the best results across all metrics and time horizons, particularly at lower time horizons. To provide a comprehensive performance measure, the average over the four time horizons was calculated for each method's MSE, MAE, and CVRMSE. In this context, TCOAT demonstrated significant improvements. Specifically, when compared with the GRU model, the top-performing baseline among the twenty-two state-of-the-art methods evaluated, TCOAT achieved a maximum reduction of 5.62%, 2.59%, and 2.85% in MSE, MAE, and CVRMSE, respectively. This highlights the effectiveness of TCOAT in enhancing the accuracy of wind power prediction and demonstrates that its novel components (LTR, CAUs, STR, and temporal fusion) can effectively capture the temporal dependencies and collaborative patterns in wind power data across different time scales. Moreover, TCOAT is more robust to the increase of horizon than other methods.
Fig. 5 shows the comparison of the actual and predicted wind power by TCOAT, GRU, and MSL. TCOAT performs the best in tracking the wind power variations, especially when the prediction horizon ℎ is medium (3), as it can identify part of the peak-trough trend in Fig. 5(b), while the other two benchmarks fail to respond. However, as ℎ increases, all methods tend to underestimate the peak values and lose accuracy. GRU can capture the time-dependent relationship at ℎ=1 by using the RNN component to learn features from historical information, but it fails to predict the peaks and valleys of the time series accurately at ℎ=3, ℎ=5, and ℎ=7. TCOAT leverages a long-term pattern framework to capture long-term representations, a directional collaborative attention mechanism to focus on relevant features, and a short-term pattern framework to capture short-term representations, which enables it to produce more accurate and realistic predictions.
Fig. 6 displays the normalized results from actual and predicted values. The Pearson correlation coefficient (PCC) between the actual and predicted values of the different models is also annotated in the figure. The prediction error increases as the observation size increases, which is consistent with Fig. 5. This indicates the difficulty of predicting wind power accurately during high-fluctuation periods, especially for long-term forecasting. Most of the data points lie above the diagonal line, which means that the prediction model tends to overestimate the actual values when they reach the peak. This could be due to the instability of the wind power data during high-fluctuation periods, which makes it hard for the prediction model to find stable patterns; the wind power data has a high standard deviation and a low autocorrelation during these periods, indicating a high degree of randomness and unpredictability. It could also be due to the prediction model's limitation in capturing sudden changes in the data, which leads to a premature reaction in forecasting the peak values. Nevertheless, TCOAT still outperforms the other methods in terms of PCC, which reflects the strength of the linear relationship between the actual and predicted values. PCC is an important indicator of the prediction model's performance, as it measures how well the model can capture the trend and pattern of the data. TCOAT has a higher PCC than the other methods for all four forecasting horizons.
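The PCC annotated in Fig. 6 is the standard Pearson correlation, computable directly from its definition; the toy series below is for illustration.

```python
import numpy as np

def pcc(y, y_hat):
    """Pearson correlation coefficient between actual and predicted series."""
    yc, pc = y - y.mean(), y_hat - y_hat.mean()
    return float((yc * pc).sum() / np.sqrt((yc ** 2).sum() * (pc ** 2).sum()))

y = np.array([1.0, 2.0, 3.0, 4.0])
r = pcc(y, 2.0 * y + 1.0)   # equals 1.0 for an exactly linear prediction
```

Because PCC is invariant to affine rescaling of the predictions, it isolates how well a model tracks the trend of the data, independent of any systematic bias or scale error.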

Model ablation study
We designed the TCOAT model to capture the complex temporal and collaborative patterns in wind power data. To evaluate how each component contributes to the accuracy of the model, we conducted an ablation study comparing TCOAT with six variants that remove one or more components. We tested the models on four different prediction horizons (ℎ = 1, 3, 5, 7) and report the results in Table 4.
The results show that TCOAT outperforms all the variants on all metrics and horizons, demonstrating the effectiveness of its novel components. Each component plays an important role in improving the performance of the model, and removing any component leads to a significant drop in performance. We discuss the impact of each component in detail below.
• LTR: This component captures the long-term changes in wind power generation by using a recurrent neural network to learn features from historical information. Removing LTR (w/o LTR) results in poor performance, especially for longer horizons. This indicates that LTR captures the long-term dependencies in wind power data and helps the model make more accurate predictions. Overall, the ablation study confirms that each component of TCOAT is effective and necessary for predicting wind power generation. The TCOAT model structure design considers not only the influences on wind power but also the short-term and long-term temporal dependence and collaborative attention in the time series.
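The long-term encoder described above summarizes a historical window into a feature vector with a recurrent unit. The following is a minimal numpy sketch of a GRU cell used that way; the layer sizes and the random input window are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GRUCell:
    """Minimal GRU cell: the kind of recurrent unit an LTR-style module
    uses to summarize a long historical window into one feature vector."""
    def __init__(self, input_size, hidden_size):
        s = 1.0 / np.sqrt(hidden_size)
        self.Wz = rng.uniform(-s, s, (hidden_size, input_size + hidden_size))
        self.Wr = rng.uniform(-s, s, (hidden_size, input_size + hidden_size))
        self.Wn = rng.uniform(-s, s, (hidden_size, input_size + hidden_size))
        self.hidden_size = hidden_size

    def step(self, x, h):
        xh = np.concatenate([x, h])
        z = sigmoid(self.Wz @ xh)                            # update gate
        r = sigmoid(self.Wr @ xh)                            # reset gate
        n = np.tanh(self.Wn @ np.concatenate([x, r * h]))    # candidate state
        return (1 - z) * n + z * h

    def encode(self, series):
        """Run over the whole window and return the final hidden state."""
        h = np.zeros(self.hidden_size)
        for x in series:
            h = self.step(x, h)
        return h

# Hypothetical window: 24 time steps, 4 meteorological features each.
window = rng.standard_normal((24, 4))
ltr = GRUCell(input_size=4, hidden_size=8)
features = ltr.encode(window)
print(features.shape)  # (8,)
```

The gating structure is what lets the final hidden state retain information from early time steps, which is why removing such a component hurts most at longer horizons.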

Generalization study
This subsection evaluates the generalization ability of our proposed TCOAT model by utilizing an auxiliary dataset, WSTD2. We compare TCOAT with twenty-two state-of-the-art methods for wind power forecasting, using three metrics: MSE, MAE, and CVRMSE. We consider a single time horizon: 1 day ahead (h = 1).
Before comparing TCOAT with other methods, we conducted some experiments to find the optimal settings for our model. We fixed the prediction horizon (h) and the output window size (H) at 1, the batch size (B) at 32, and the CAUs settings at 0, 1, and varied the window size (T) from 1 to 25. We used the same evaluation metrics and experimental settings as before, running each experiment five times and reporting the average results. The results indicated that the best performance was achieved when T = 22, so we set T to 22 for subsequent experiments. Next, we fixed the prediction horizon (h), the output window size (H), and the window size (T) at 1, 1, and 22, respectively, and varied the batch size (B) from 2^0 to 2^7. The results showed that the best performance was achieved when B = 2^7, so we set B to 2^7 for the final comparison with other methods.
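The tuning procedure above is a simple one-dimensional sweep over each hyperparameter while the others stay fixed. It can be sketched as below; the `evaluate` function is a synthetic stand-in for training and scoring the model (deliberately shaped so its minimum matches the settings reported above), not the paper's code.

```python
import statistics

def evaluate(window_size, batch_size, seed):
    """Placeholder for one training run returning a validation error.
    Synthetic bowl-shaped score for illustration only."""
    return (window_size - 22) ** 2 * 0.01 + abs(batch_size - 128) * 0.001 + seed * 0.0

def sweep(param_values, fixed, n_runs=5):
    """Average each candidate setting over n_runs runs; return the best."""
    scores = {}
    for v in param_values:
        runs = [evaluate(seed=s, **{**fixed, **v}) for s in range(n_runs)]
        scores[tuple(v.values())] = statistics.mean(runs)
    return min(scores, key=scores.get)

# Stage 1: vary the window size T from 1 to 25 with the batch size fixed at 32.
best_T, = sweep([{"window_size": t} for t in range(1, 26)], {"batch_size": 32})
# Stage 2: vary the batch size B over 2^0 .. 2^7 with T fixed at its best value.
best_B, = sweep([{"batch_size": 2 ** k} for k in range(8)], {"window_size": best_T})
print(best_T, best_B)  # 22 128
```

Such a coordinate-wise sweep is cheaper than a full grid search, at the cost of possibly missing interactions between T and B.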
We group the methods into nine categories based on their main techniques: linear models, linear model variations, recurrent neural networks (RNNs), CNN- and RNN-based models, self-attention-based models, CNN-RNN and attention models, graph attention-based models, a shapelet learning model, and our proposed model. The results in Table 5 show that:
• Linear models (GAR, AR, and VAR) perform poorly, indicating that the wind power data has complex nonlinear and temporal patterns that cannot be captured by simple linear models. Among them, AR slightly outperforms GAR and VAR, suggesting that the wind power data has some autocorrelation structure.
• Linear model variations (DLinear, NLinear, and FiLM) perform slightly better than the linear models, indicating that the wind power data has some nonlinear patterns that can be captured by adding nonlinear activation functions or feature-wise linear modulation. However, FiLM performs the worst among all methods, suggesting that this technique is not suitable for wind power data.
• RNNs (LSTM, GRU, and ED) perform well, achieving the second- and third-best results among all methods. This indicates that the wind power data has strong temporal dependencies that can be captured by RNNs.
• CNN- and RNN-based models (CNN1D, CRNN, CRNNRes, and LSTNet) perform worse than the RNNs, indicating that the wind power data does not contain much spatial information that can be captured by CNNs. CRNNRes performs worse than CRNN, suggesting that the residual module does not find effective features. CNN1D, CRNNRes, and LSTNet have similar performance, suggesting that they share similar limitations in modeling wind power data.
• Self-attention-based models (Transformer, Informer, Autoformer, and Fedformer) outperform the linear models but not the RNNs, indicating that the wind power data has some long-range dependencies that can be captured by self-attention, but also short-term dependencies that are better captured by RNNs. Among them, Autoformer achieves the best MSE and the second-best MAE among all methods, indicating that it can learn effective features from the wind power data. Informer performs slightly worse than Autoformer, and Transformer slightly worse than Informer, suggesting that probabilistic sparse attention and series-decomposition techniques are beneficial for wind power forecasting. Fedformer performs poorly, suggesting that the frequency-enhanced decomposed transformer architecture is not suitable for wind power data.
• Hybrid CNN-RNN and attention models (DSANet and TPA-LSTM) perform differently. DSANet outperforms the CNN- and RNN-based models but not the RNNs, indicating that the dual self-attention network can capture both local and global dependencies in wind power data. TPA-LSTM performs poorly, similar to FiLM and Fedformer, indicating that the temporal pattern attention technique is not effective for wind power data.
• Graph attention-based models (StemGNN and GAIN) perform similarly, outperforming the linear models, the CNN- and RNN-based models, and the self-attention-based models, but not the RNNs. This indicates that the graph attention technique can capture the spatial-temporal dependencies in wind power data. StemGNN performs slightly better than GAIN, suggesting that the spatial-temporal embedding technique is beneficial for wind power forecasting.
• MSL, which learns shapelets from the wind power data, outperforms the self-attention-based and graph attention-based models but still lags behind the RNNs. This indicates that the shapelet learning technique can capture some local patterns in wind power data, but not the global patterns.
• TCOAT, our proposed model, achieves the best results on all metrics, demonstrating its generalization ability. Compared to the best baseline, GRU, TCOAT improves the MSE, MAE, and CVRMSE by 2.5%, 0.4%, and 1.26%, respectively.
In summary, we have shown that TCOAT can generalize well to different wind power datasets and outperform the existing methods for wind power forecasting. This indicates that TCOAT can effectively capture the complex nonlinear and temporal patterns in wind power data and provide accurate and reliable forecasts for wind power generation.

Conclusion and future work
Wind power forecasting stands as a pivotal task for the effective integration and management of wind energy systems. Accurate forecasting not only optimizes the operation and maintenance of wind turbines but also mitigates the uncertainty and risk associated with power supply, thereby amplifying both the economic and environmental advantages of wind power generation. In this research, we introduced Temporal Collaborative Attention (TCOAT), a data-driven approach designed to capture the intricate temporal and spatial dependencies inherent in wind power generation data. TCOAT employs attention mechanisms to dynamically adjust the weights of each input variable and time step based on their contextual relevance for forecasting. Furthermore, the model incorporates collaborative attention units to assimilate both directional and global information from the input data. It also employs self-attention and cross-attention mechanisms to explicitly model the interactions and correlations among different variables or time steps. Additionally, TCOAT features a temporal fusion layer that effectively integrates long-term and short-term information through concatenation and mapping operations, as well as hierarchical feature extraction and aggregation.
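The self-attention and cross-attention mechanisms mentioned above both reduce to the same scaled dot-product form, differing only in where the queries, keys, and values come from. A minimal numpy sketch, with shapes and random inputs as illustrative assumptions rather than the model's actual configuration:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d)) V — the weighting step that lets a model
    re-weight variables or time steps by contextual relevance."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(1)
wind = rng.standard_normal((16, 8))   # hypothetical: 16 time steps, 8 features
other = rng.standard_normal((16, 8))  # hypothetical exogenous series

# Self-attention: queries, keys, and values all come from the same series.
self_out, self_w = scaled_dot_product_attention(wind, wind, wind)
# Cross-attention: queries from one series attend to another series.
cross_out, cross_w = scaled_dot_product_attention(wind, other, other)

assert self_out.shape == (16, 8)
assert np.allclose(self_w.sum(axis=-1), 1.0)  # each row is a distribution
```

In this form, self-attention models dependencies within a single series, while cross-attention models the correlations between different variables, which is the distinction the paragraph above draws.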
To evaluate the performance of TCOAT, we conducted extensive experiments on two real-world wind power datasets from different regions with distinct climate conditions. Our empirical results, compared with twenty-two state-of-the-art methods, show that TCOAT surpasses them in terms of both accuracy and robustness, especially for short-term and very short-term forecasting horizons. A model ablation study further confirms the effectiveness of each component of TCOAT, while a parameter sensitivity analysis reveals the influence of various hyperparameters on the model's performance. The experiment using the second dataset as an additional dataset verifies the generality of the proposed model.
However, TCOAT also has some limitations and challenges that need to be addressed in future work. First, TCOAT does not provide any uncertainty quantification or probabilistic forecasts, which may affect the decision-making and risk management of wind power integration. Second, the current implementation of TCOAT does not consider the real-time scenario, in which the model must adapt to the changing dynamics or patterns of wind power generation over time; it may therefore require periodic retraining or updating. Third, we have only tested our method on onshore datasets, while more complex remote or offshore wind farms, such as those affected by sea-state conditions, need to be considered for testing. Furthermore, data privacy and security issues may also arise when sharing or transferring data across different parties or regions.
For future work, we aim to address these limitations and generalize our method to other forms of renewable energy, such as solar and hydropower, and to explore multi-source or multi-region forecasting. We also intend to enrich our model by incorporating external factors such as grid load or market price, with the goal of enhancing forecasting accuracy and reliability. Moreover, we plan to explore more advanced attention mechanisms, such as transformer or graph attention, to further improve the model's representation learning and feature extraction capabilities.

Fig. 2 .
Fig. 2. The schematic illustration of the proposed TCOAT model. The model consists of four components: a long-term temporal representation (LTR) module that uses a GRU network to learn features from historical data, a collaborative attention unit (CAU) module that uses attention mechanisms to capture the directional and global information from the data, a temporal fusion module that uses concatenation and mapping operations to integrate the collaborative information with long-term information, and a short-term temporal representation (STR) module that uses a residual network to learn features from local data.

Fig. 3 .
Fig. 3. The schematic illustration of the CAU. The CAU transforms the input data into a tensorial representation and computes attention scores and memory weights for each tensor direction in the DT step. The SA step uses symmetric self-attention and attention mechanisms in three directions to collaboratively enhance the interactions and correlations among different variables or time steps.
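Reading the DT step loosely, "attention for each tensor direction" can be pictured as softmax-weighted pooling of a 3-D representation along each of its axes. The sketch below is one interpretation for illustration only, not the authors' implementation; the axis layout and scoring rule are assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

def directional_attention(tensor, axis):
    """Softmax-weighted pooling of a 3-D tensor along one direction.
    Scores here are simply the mean activation over the other two axes."""
    other = tuple(i for i in range(tensor.ndim) if i != axis)
    scores = tensor.mean(axis=other)                   # one score per slice
    w = np.exp(scores - scores.max())
    w /= w.sum()                                       # attention weights
    shape = [1, 1, 1]
    shape[axis] = -1
    return (tensor * w.reshape(shape)).sum(axis=axis)  # weighted pooling

# Hypothetical tensorial representation: (time, variable, embedding).
x = rng.standard_normal((12, 4, 6))
pooled = [directional_attention(x, axis=a) for a in range(3)]
print([p.shape for p in pooled])  # [(4, 6), (12, 6), (12, 4)]
```

Each direction yields a different summary of the same tensor, which is the sense in which the three directional attentions provide complementary views for the SA step to combine.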

Fig. 4 .
Fig. 4. The schematic illustration of the conditional fusion procedures. The layer concatenates the input data and the outputs from the CAUs, and then applies a global autoregression (GAR) layer and a linear layer to fuse them. The layer also splits the input data into different parts and feeds them into a short-term temporal representation (STR) module that uses a residual network to capture the short-term variations of the data. The final predictions are obtained by combining the outputs from the temporal fusion layer and the STR module.

Fig. 5 .
Fig. 5. The visualized comparison of TCOAT and the other three methods against the real values. Unit of h: day.

Fig. 6 .
Fig. 6. The correlation visualizations of TCOAT predictions and three other benchmark predictions; Unit of h: day.

Table 1
Benchmark methods for wind power forecasting.

Table 2
The prediction results of attention combinations on the wind datasets in terms of MSE, MAE, and CVRMSE, where CAUs{1}, CAUs{1, 2}, or CAUs{1, 2, 3} denotes directional attention on the 1st, the 1st and 2nd, or all three aspects, respectively. The best results are shown in bold, the second-best results are underlined, and the worst results are in wavy lines; Unit of h: day.

Table 3
Performance comparison in wind power prediction. The best results are shown in bold, the second-best results are underlined, and the worst results are in wavy lines; Unit of h: day.
MSL learns shapelets from historical data to represent wind power patterns. Its performance is low at all horizons, suggesting that shapelets are not effective features for wind power forecasting.

Table 4
Model ablation study. The best results are shown in bold, the second-best results are underlined, and the worst results are in wavy lines; Unit of h: day.

• CAUs(DT): This step transforms the input data into a tensorial representation, aligns the information in a certain direction dimension, and ensures a balanced alignment. Removing CAUs(SA) (w/o CAUs(SA)) performs worse than w/o CAUs(DT), suggesting that CAUs(SA) can learn meaningful attention representations from different directions and improve the performance of the model.
• CAUs: This component is composed of CAUs(DT) and CAUs(SA). Removing the entire CAUs module (w/o CAUs) results in a significant drop in performance, indicating that CAUs can capture the collaborative patterns in wind power data.
• STR: Removing STR (w/o STR) also results in a significant drop in performance, showing that STR can capture the short-term dependencies in wind power data and help the model make more accurate predictions.

Table 5
Performance comparisons on wind power prediction. The best results are shown in bold, the second-best results are underlined, and the worst results are in wavy lines; Unit of h: day.