A multivariate time series graph neural network for district heat load forecasting

Heat load prediction is essential for energy efficiency and carbon reduction in district heating systems. However, heat load is influenced by many factors, such as building characteristics, consumption behavior, and climate, making its prediction challenging. Traditional methods based on physical models are complex and insufficiently accurate, whereas most data-driven statistical methods ignore customer energy consumption behaviors and their correlation, and do not account for the temporal inertia of consumption. This paper proposes a graph ambient intelligence (GAIN) method for heat load prediction, which classifies customers based on their load profiles and uses collaborative attention on temporal graphs to capture their associations and the weather impact on heat loads. GAIN also incorporates recursive and autoregressive methods to model the temporal inertia of consumption. The proposed method is evaluated on four metrics and compared with fifteen baseline methods. The results show that GAIN achieves the lowest daily forecasting errors in terms of RMSE, MAE, and CV-RMSE, with values of 6.972, 4.442, and 0.191, respectively. Besides, the proposed method achieves a maximum reduction of 25%, 29%, and 25% in RMSE, MAE, and CV-RMSE, respectively, compared to other methods when taking meteorological factors into account.


Introduction
Today, the energy crisis and environmental issues have become global concerns. The building sector accounts for 30%-40% of energy consumption worldwide, and in Europe, it accounts for 40%-50%, with thermal energy demand comprising 13% of this figure [1]. District heating systems have been strongly advocated in numerous countries, becoming essential in achieving energy-saving and emission-reduction targets. These targets include the European Union's net-zero carbon emission goal by 2050 [2] and China's aim to reach peak carbon emissions by 2030 and carbon neutrality by 2060 [3].
District heating systems serve as essential infrastructure, generating and supplying heat to buildings in urban areas using a diverse range of energy sources, such as biomass, natural gas, solar energy, industrial waste heat, and geothermal energy. Geothermal energy, as a renewable and clean source of heat and power, can be harnessed to drive combined heat and power (CHP) systems for various applications, including district heating. Utilizing combined heat and power improves overall system efficiency [4], which can contribute to reduced emissions and increased energy security in urban areas; indeed, the main advantage of district heating systems lies in this ability to harness combined heat and power [5]. Fig. 1 illustrates a simplified schematic of a district heating system, which consists of a CHP plant, supply and return pipes, and buildings. Heat from the CHP plant is distributed to different end-users, such as residential households, via supply pipes, while water is returned to the CHP plant for reheating through the return pipes. The difference between the supply and return water generally reflects the efficiency of the heating devices within a building. The heating network typically comprises a hierarchical structure, including primary and secondary networks, with heat exchangers facilitating heat transfer between the two [6]. In most existing district heating systems, the heating temperature and supply are manually controlled based on outdoor temperatures and subjective human judgment or experience, which can lead to considerable heat waste, especially during sudden changes in weather conditions or the emergence of new heat demands. For district heating systems, optimal control should aim for efficient operation, which can only be achieved by matching heat production to the actual demand of end users [7]. Estimating heat load is crucial and
is usually a prerequisite for the operation of a district heating system. In particular, short-term predictions, ranging from a few hours to a few days, are used to predict the heat load. Accurate forecasting allows heat production (i.e., heating temperature and flow rate) to be matched to actual demand, thereby supplying heat on demand, reducing production costs and distribution losses, and lowering return temperatures. This is particularly important for CHP plants to improve coordination between the plant and the grid [8]. In recent years, with the digitization of energy systems, intelligent systems have realized fine-grained control over major thermal devices, which helps to provision energy supply and improve energy efficiency. For example, Adamski et al. [9] propose a predictive control system that achieves 20.4% thermal energy savings in the studied building compared to the traditional weather-based control method. Kapalo et al. [10] point out that accurate heat load prediction is essential for a district heating system to optimize the supply and demand structure and to improve automatic boiler control, such as the timely and efficient control of each unit to ensure efficiency and reliability. In addition, accurate load forecasting is essential for demand-side management. Recent studies [11] show that with the implementation of demand-side management, including short-term demand forecasting (2-3 h), the thermal load of individual buildings can be reduced by an average of 25%. Therefore, forecasting should be a key component of district heating management.
Heat load prediction is at the core of the district heating system, and accurate prediction plays a key role in energy efficiency and carbon reduction [12]. However, heat load can be affected by many factors, such as the physical characteristics of buildings, consumption behavior, and climate, making its prediction a significant challenge [13]. Traditional prediction methods based on thermodynamic models are complex and insufficiently accurate, whereas most data-driven statistical methods ignore the temporal inertia of consumption itself and do not capture the potential associations between similar customers or the impact of weather factors on heat loads. All of these limit prediction capability. To address these issues, this paper proposes a deep learning-based graph ambient intelligence (GAIN) method for heat load prediction. Graph neural networks (GNNs) are a class of deep learning models that can learn from graph-structured data, such as molecules, knowledge graphs, and social networks [14]. However, GNNs remain underexplored for time series prediction, especially for district heat load forecasting. GNNs are suitable for this problem because they can model the complex relationships between customers, heat load, and meteorological factors in a graph representation. Moreover, they can overcome the limitations of existing methods by capturing both local and global patterns in time series data. The main contributions and significance of this paper are as follows:
• We propose a novel GAIN model for district heat load prediction. This model takes into account the intrinsic correlation between customers and the causal relationship between time steps in a time series.
• To integrate the external factors affecting the prediction performance, we design a separate neural network branch for extracting patterns from meteorological external factors and a collaborative temporal graph attention to fuse heat load and meteorological observations. This novel fusion mechanism has a powerful feature selection capability that leads to a significant improvement in accuracy.
• We conduct a comprehensive evaluation of the proposed model over four prediction time horizons using a real district heating dataset with corresponding meteorological data. The results show the efficiency and desirable performance of the proposed model, which outperforms other state-of-the-art methods by up to 25%, 29%, and 25% in terms of RMSE, MAE, and CV-RMSE, respectively.
The remainder of this paper is organized as follows. Section 2 discusses related work. Section 3 describes the dataset. Section 4 presents the proposed model. Section 5 presents the experiments used to evaluate the model. Section 6 concludes the paper and presents future work.

Physical models for heat load prediction
Research on heat load prediction can be divided into two categories: thermodynamic and data-driven models. Thermodynamic modeling is a traditional approach, also known as ''white box'' modeling, which relies on complex mathematical construction methods. It requires a large number of physical parameters for heating, such as flow rate, temperature, and volume, as well as building parameters, such as material, area, and orientation. For example, Zhang et al. [15] estimate the heat load of a residential building using a white-box model that takes into account parameters including indoor climate, building characteristics, and solar radiation. The model is calibrated using the particle swarm optimization (PSO) algorithm to search for the optimal combination of parameters in the heat load estimation. The second type of physical model, the ''gray box'' model, is often used to address the difficulty of determining optimal parameters for physical models [16]. Gray-box models typically use physical knowledge to define the model structure and then use historical data to estimate the model parameters [17]. For example, Thilker et al. [18] create a non-linear gray-box model to estimate the heat demand of a water-heated school building in Denmark, using meteorological weather observations as input. However, a significant limitation of the above physical models is their sensitivity to the physical building parameters [19], which can render them less robust and necessitate more frequent recalibration or updates.
Z. Wang et al.

Feature selection for data-driven heat load prediction models
Data-driven models refer to machine learning or deep learning-based models, which require a large amount of data for training [20]. In addition to historical heat load data, meteorological data is the most commonly used auxiliary dataset for heat load prediction [21]. The selection of appropriate exogenous input variables is crucial to the accuracy of heat load prediction and has been extensively studied. Among others, Song et al. [22] conducted a Pearson correlation study on the available variables and selected the positively correlated variables (heat supply and return temperatures) and the negatively correlated outdoor temperature as features for training a deep learning model. Gong et al. [6] used Pearson and least absolute shrinkage and selection operator (LASSO) methods to optimize the feature set, including system parameters, meteorological parameters, and time steps, resulting in four feature sets with a total of 28 features as the final input. Meteorological parameters, including outdoor temperature, solar radiation intensity, wind speed, and humidity, have been widely used for studying the impact on heat load (e.g., [23,24]). An early study [25] shows that the outside temperature can affect the heat load of a building by 60%, wind speed by 1% to 4%, and solar radiation intensity by 1% to 5%. Moreover, the study in [26] reveals that the higher the intensity of solar radiation, the greater the indoor temperature. Therefore, the intensity of solar radiation can affect the heat consumption of those heat users with control equipment. In our study, we use the available meteorological data to perform correlation analysis and select four meteorological parameters as additional features of our model. Additionally, we evaluate their individual contributions to model performance through a feature ablation study.

Data-driven heat load prediction models
Currently, significant efforts have been devoted to data-driven heat load prediction, mainly due to its excellent non-linear modeling capabilities and the availability of fine-grained heat load smart meter data [27]. The most popular data-driven models for heat load prediction can be grouped into three categories: traditional statistical methods, classical machine learning methods, and advanced machine learning methods.
Traditional statistical methods have been widely employed in predicting heat load due to their high interpretability and computational efficiency [28]. Bujalski et al. [29] proposed a data-driven model for hourly heat load prediction based on the Generalized Additive Model during the off-season. Jagait et al. [30] used the autoregressive integrated moving average (ARIMA) for online load prediction to alleviate the concept drift problem. However, most of these methods rely on stringent assumptions that may not hold in practical scenarios [20], consequently leading to worse performance.
Classical machine learning methods have an advantage over traditional statistical methods in capturing non-linear relationships. Cui [31] verifies the effectiveness of the bidirectional long short-term memory network (BiLSTM) for short-term heat load forecasting. Zhao et al. [17] proposed a residential district heating prediction model based on an improved convolutional neural network (CNN). Ding et al. [32] designed a multi-input artificial neural network, which achieves good performance in long- and short-term heat load prediction. Song et al. [33] proposed a model combining a convolutional neural network and a long short-term memory algorithm, namely CNN-LSTM, for heat load prediction. They found that the combined model can achieve better prediction accuracy in district heating systems with thermal inertia problems than others, including SVM and ensemble learning algorithms. However, due to their limitations in capturing global and long-term dependencies, these methods may not achieve promising accuracy in complex scenarios.
Advanced machine learning methods, such as Transformers and graph neural networks, possess a remarkable ability to capture complex associations and have been highly effective in energy prediction tasks [24].
Gong et al. [34] designed a novel framework for heating load forecasting based on the Informer architecture, a variant of the Transformer that employs self-attention mechanisms to capture long-term dependencies [35]. Wang et al. [36] proposed a multi-task multi-energy load forecasting model based on Decoder and Transformer components. This model further verifies the effectiveness and generalization capability of self-attention mechanisms. Hu et al. [37] put forward a spatiotemporal graph convolutional network to track building energy consumption. The results demonstrate that the GNN enables modeling the interdependency of features and capturing dynamic nonlinear representations. In [38], an improved graph neural network is employed for controlling wind energy and shows good performance. However, most advanced machine learning methods neglect the diversity of customer behavior patterns in energy consumption prediction, which can limit the accuracy of prediction models. Moreover, the potential of graph neural networks with attention mechanisms in heat load prediction has not yet been verified.
Most of the above studies are considered conventional methods for heat load prediction, and some of them have applied current deep neural network-based approaches to improve prediction performance. However, most of these methods ignore customer energy consumption behaviors and their correlations and do not account for the temporal inertia of consumption itself. In contrast, in this study, we design a GNN-based model structure that captures the intrinsic correlation between customer, heat load, and meteorological data, as well as the temporal dependencies in time series. Moreover, few studies have considered the economic analysis of the district heating system and its integration with solar desalination. A comprehensive framework for assessing the economic feasibility and sustainability of water-related interventions, including desalination, has been proposed in [39,40], which considers the costs and benefits from multiple perspectives, such as financial, environmental, social, and institutional.

Study materials
The data used in this study include district heating consumption data, related building registration data, and meteorological data. The heating consumption data were from Danish residential buildings in Aalborg [41] and are publicly available on the Zenodo repository at https://doi.org/10.5281/zenodo.6563114. The original dataset consists of data from 3127 smart heat meters for the period 2018-01-01 to 2020-12-31, with hourly resolution. The data were anonymized, cleaned, and provided as comma-separated CSV files. The background data include dwelling type, building year of construction, and energy level. Heat load data were collected for dwelling types including apartments, townhouses, single-family houses, and non-residential buildings, numbering 88, 474, 2460, and 8, respectively. In this study, we focus on heat load prediction for single-family houses. The meteorological data were obtained from https://confluence.govcloud.dk, collected from weather stations located near the district heating areas in Aalborg.
The heat load of a building can be affected by a variety of factors, including customer activities, indoor and outdoor climate, and building characteristics. Therefore, in a data-driven model, it is necessary to include appropriate external variables as model inputs in addition to historical heat load data to improve accuracy. As mentioned earlier, considerable research has considered meteorological factors for improving heat load prediction accuracy. Similarly, in this study, we select appropriate meteorological variables based on the following analysis. We first conduct a correlation analysis of the nine meteorological factors in the dataset and obtain the results shown in Fig. 2.
As can be seen, grass temperature and outdoor temperature are two temperature features that demonstrate a high correlation with heat load observations and a strong inter-correlation with each other. To reduce the redundancy of temperature information in the predictive model, we choose the latter in our study. This is reinforced by prior studies that highlight outdoor temperature as the foremost meteorological factor impacting building heating loads, e.g., [42,43]. Besides, solar radiation intensity is also a critical factor that can impact the indoor environment: if the building has an opaque envelope, solar radiation can increase the building surface temperature, thus reducing heat loss; if the building has a translucent envelope, the sunlight entering the interior can increase the indoor temperature, thus reducing the heating load demand [43]. Therefore, we have included solar radiation intensity as a meteorological factor in our study. In addition, Fig. 2 highlights a robust correlation between relative humidity and cloud cover, although these two factors exhibit a relatively minor impact on heat load. Given the well-established connection between relative humidity and building heat transfer [44], we have chosen relative humidity as the third meteorological factor. Wind speed, air pressure, leaf moisture, and precipitation have weak correlations with the heat load data. To improve the diversity of meteorological features without being overly inclusive, we have also included wind speed as a factor, as it is relatively less correlated with the other three selected factors.
Therefore, outdoor temperature, relative humidity, solar radiation intensity, and wind speed have been chosen as the final additional features for our modeling. This selection is appropriate because, while additional variables can typically improve model accuracy, they can also increase model complexity, potentially leading to overfitting. Fig. 3 shows the four selected outdoor meteorological parameters, and Fig. 4 shows the percentiles of the heat load for the entire year 2018. From the overall pattern of heat consumption, we can visually observe a negative correlation between outdoor temperature (and the correlated solar radiation intensity) and heat load. In other words, a higher outdoor temperature corresponds to a lower heat load, for example in summer, while a lower outdoor temperature corresponds to a higher heat load, for example in winter. As for the other two factors, wind speed and humidity, a slight correlation with the heat load can be visually observed. The quantitative results of the correlation studies between the heat load and the four meteorological parameters, as well as the basic statistics of the heat load and meteorological data, are presented in Table 1.
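The correlation-based screening described above can be sketched as follows. This is an illustrative example with toy data and assumed variable names, not the paper's code or dataset; it ranks candidate meteorological features by the absolute value of their Pearson correlation with the heat load.

```python
# Illustrative sketch: rank candidate meteorological features by the absolute
# Pearson correlation with heat load. Toy data; variable names are assumptions.
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Toy data: heat load falls as outdoor temperature rises (negative correlation).
heat_load = [10.0, 9.0, 7.5, 6.0, 4.0, 3.0]
candidates = {
    "outdoor_temp": [-5.0, -2.0, 1.0, 5.0, 10.0, 14.0],
    "wind_speed":   [3.0, 4.0, 2.5, 3.5, 3.0, 2.8],
}
# Sort features by |r|, strongest correlation first.
ranked = sorted(candidates,
                key=lambda k: abs(pearson(candidates[k], heat_load)),
                reverse=True)
```

On these toy numbers, outdoor temperature dominates the ranking with a strongly negative coefficient, mirroring the qualitative pattern reported in the correlation study.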

Methodology
This section formulates the problem definition of heat load prediction and illustrates the detailed techniques of the proposed GAIN. The main notations used in this paper are listed in Table 2.

Problem formulation
The prediction problem can be formulated using historical heat load observations and exogenous observations (meteorological data) as the input to predict the future load over a specific time horizon. The district heat load observations are denoted by $Y = \{y_1, y_2, \ldots, y_T\} \in \mathbb{R}^{T \times N'}$, where $T$ is the number of time steps and $N'$ is the number of districts. The meteorological observations are represented as $M = \{m_1, m_2, \ldots, m_T\} \in \mathbb{R}^{T \times N^{\#}}$, where $N^{\#}$ is the number of factors (i.e., solar radiation intensity, outdoor temperature, humidity, and wind speed in this study). At a time step $t$, the heat load observations and the meteorological observations can be formulated as $y_t \in \mathbb{R}^{1 \times N'}$ and $m_t \in \mathbb{R}^{1 \times N^{\#}}$, respectively. To learn the short-term temporal dependencies of the time series, we introduce a sliding window of size $w$ to generate the model input, representing the consecutive observations of a period. The resulting heat load and meteorological time series can be formulated as $Y_{t+1:t+w} \in \mathbb{R}^{w \times N'}$ and $M_{t+1:t+w} \in \mathbb{R}^{w \times N^{\#}}$, respectively. Therefore, the prediction can be seen as a multivariate time series prediction problem, and the learning process with the two time series can be formulated as follows:
$$\hat{Y}_{t+h} = F\big([Y_{t-w:t}; M_{t-w:t}]\big) \quad (1)$$
where $\hat{Y}_{t+h} \in \mathbb{R}^{1 \times N'}$ are the prediction values at the $h$th time step ahead of $t$; $Y_{t-w:t} \in \mathbb{R}^{w \times N'}$ and $M_{t-w:t} \in \mathbb{R}^{w \times N^{\#}}$ represent the historical observations within a window of size $w$; $[\cdot\,;\cdot]$ represents the concatenation operation; and $F(\cdot)$ is the linear/non-linear mapping function that will be learned in this study.
Since the original time series may contain noise, such as information redundancy and nonstationary fluctuations, which impairs the predictive stability and accuracy of the model, we first use representation learning functions to extract the key patterns from the original time series, and then use the extracted key patterns as the input for prediction. Thus, the prediction model is further formulated as follows:
$$\hat{Y}_{t+h} = F\big([F_1(Y_{t-w:t}); F_2(M_{t-w:t})]\big) \quad (2)$$
where $F_1(\cdot)$ and $F_2(\cdot)$ are the representation learning functions that capture key patterns from the heat load and the meteorological time series, respectively. To incorporate the effect of the correlation between the exogenous meteorological variables and the heat load, we add an additional representation learning component $F_3(\cdot)$ to the input of the model and obtain the following prediction function:
$$\hat{Y}_{t+h} = F\big([F_1(Y_{t-w:t}); F_2(M_{t-w:t}); F_3([Y_{t-w:t}; M_{t-w:t}])]\big) \quad (3)$$
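The sliding-window construction underlying this formulation can be sketched as follows. This is a minimal illustrative helper (not the paper's code): each sample pairs a window of `w` consecutive observations with the target value `h` steps after the end of the window.

```python
# Minimal sketch of the h-step-ahead sliding-window split: a window of w
# consecutive observations predicts the value h steps after the window's end.
def sliding_windows(series, w, h):
    """Return (input_window, target) pairs from a univariate series."""
    samples = []
    for t in range(len(series) - w - h + 1):
        window = series[t:t + w]          # observations t .. t+w-1
        target = series[t + w + h - 1]    # value h steps beyond the window
        samples.append((window, target))
    return samples

# Example: a toy series 0..9, window size 4, one step ahead.
pairs = sliding_windows(list(range(10)), w=4, h=1)
```

For a multivariate setting, the same indexing applies along the time axis while the window carries all districts and meteorological factors at each step.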

Proposed model
In this subsection, we describe in detail the proposed GAIN model and the metrics for the model evaluation.

Overview
Fig. 5 presents an overview of the model, which is a deep neural network trained from historical heat load data and meteorological data. Note that in our experiments later in Section 5, we use GAIN(+) to denote the model trained from both types of data and GAIN to denote the model trained from heat load data only. The upper left corner of the figure indicates the processing of the heat load data, while the lower left corner indicates the processing of the weather data. For heat load data processing, we first perform global clustering to obtain different clusters of customers with similar consumption behaviors, mainly to allow the model to capture the potential relationship between the customers within each cluster. Subsequently, we perform normalization and apply a convolution operation on the obtained results to extract key features from the heat load observations (the convolution operation is adept at local feature engineering [45]). The convolutional output of the heat load and meteorological observations is fed to a collaborative temporal graph attention component to learn the local associations between time steps. An enhanced temporal representation (ETR) with a recursive component is then used to establish the relationship between heat load and weather observations and to improve the ability to learn long-term temporal dependencies. Multiple ETR outputs are combined and then processed by a linear layer, where we add autoregressions of both linear representations of the heat load and meteorological data to adjust the output of the linear layer. Lastly, the final prediction results are obtained by de-normalization (see the lower right corner of the figure). In the following, we describe the components of the proposed neural network structure in more detail.

Global clustering
The living habits of the inhabitants can lead to different preferences in the heat load. To learn the association between the behavior of the inhabitants and the fluctuations in the heat load, we first employ a K-means algorithm to cluster the customers into $K$ groups based on the heat load observations in the training set.
Heat load observations are inherently time series samples, whose centroids cannot be obtained directly by the standard clustering algorithm. To address this limitation, we utilize the DTW Barycenter Averaging (DBA) [46] approach to average a set of time series and find the centroids for the K-means method. Dynamic Time Warping (DTW) [47] is the core of the DBA approach and is widely used as a similarity measure between time series. The optimization process of DTW can be expressed as:
$$DTW(y_a, y_b) = \min_{\pi} \sum_{(\tau_1, \tau_2) \in \pi} d\big(y_{a,\tau_1}, y_{b,\tau_2}\big)$$
where $\pi = [\pi_0, \ldots, \pi_L]$ is the optimal alignment path between the two time series; $\tau_1$ and $\tau_2$ both represent time steps; the path $\pi$ satisfies the boundary and monotonicity conditions $\pi_0 = (1, 1)$ and $\pi_L = (w, w)$, where $w$ is the length of the input samples; and $y_a \in \mathbb{R}^{w}$ and $y_b \in \mathbb{R}^{w}$ denote heat load observations. The DBA approach designs an averaging algorithm to calculate the distance between each time series and the candidate barycenter. The K-means algorithm based on DBA is designed to minimize the following objective function:
$$J = \sum_{k=1}^{K} \sum_{n=1}^{N_c} r_{n,k} \, DTW\big(y_n, c_k\big)$$
where $y_n$ represents the observations of the $n$th household belonging to the cluster and $N_c$ represents the number of households; $C$ denotes the set of candidate barycenters, and $c_k \in C$ indicates the $k$th candidate barycenter; $r_{n,k}$ is a binary variable denoting whether the time series $y_n$ belongs to the $k$th cluster, $k \in \{1, 2, \ldots, K\}$, with $K$ representing the number of clusters.
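The DTW distance at the core of this clustering step can be sketched with the standard dynamic-programming recurrence. This is an illustrative implementation, not the paper's code; in practice a library such as tslearn provides DBA-based K-means directly, and the one-step cluster assignment below is only a toy stand-in for the full iterative algorithm.

```python
# Standard dynamic-programming DTW with absolute-difference cost (illustrative).
def dtw(a, b):
    """DTW distance between two sequences."""
    n, m = len(a), len(b)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible predecessor paths.
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

def assign(profiles, barycenters):
    """One K-means assignment step: nearest barycenter under DTW."""
    return [min(range(len(barycenters)), key=lambda k: dtw(p, barycenters[k]))
            for p in profiles]
```

Note how DTW absorbs local time shifts: a profile and a stretched copy of it have distance zero, which Euclidean K-means would not recognize.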

Time series transformation
Time series transformation is a preprocessing step of the GAIN model. Because the magnitudes of heat consumption differ across customers, training directly on raw values would introduce learning bias and reduce prediction accuracy. Therefore, to eliminate the impact of these numerical differences, it is necessary to apply normalization to scale the input samples.
The z-score and min-max are two main normalization techniques. The z-score normalization is suitable for processing data with extreme values, but it adjusts the relative differences between samples. The min-max normalization maintains the relative differences and scales the inputs to the range [0, 1]. Due to the relatively stable fluctuations of the heat load and meteorological observations, we employ min-max normalization in the proposed GAIN to normalize the original input data. The normalization and recovery processing are formulated as follows:
$$\tilde{X} = \frac{X - \min(X)}{\max(X) - \min(X)}, \qquad X = \tilde{X}\,\big(\max(X) - \min(X)\big) + \min(X)$$
where $X \in \mathbb{R}^{S \times d}$ denotes the input data, $S$ is the number of input samples, and $d$ is the dimension of the input; $\tilde{X}$ is the normalized data; and $\max(\cdot)$ and $\min(\cdot)$ return the maximal and minimal values of $X$, respectively. After normalization, the heat load observations are adjusted to $\tilde{Y}$, and the meteorological observations are adjusted to $\tilde{M}$. Then, the normalized data are split into multiple pairs for the prediction; that is, the normalized values in $[t, t+w]$ are used to predict the values at the $(t+h)$th time step. This split process is also referred to as the h-ahead-step split [48].
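The min-max transform and its inverse can be sketched as follows (a minimal illustrative helper, not the paper's code); the recovered values must match the originals exactly, which is what makes de-normalizing the final predictions safe.

```python
# Min-max normalization to [0, 1] and its exact inverse (illustrative sketch).
def minmax_normalize(x):
    """Scale a sequence to [0, 1]; return (normalized, min, range)."""
    lo, hi = min(x), max(x)
    scale = hi - lo
    return [(v - lo) / scale for v in x], lo, scale

def minmax_recover(x_norm, lo, scale):
    """Invert minmax_normalize using the stored min and range."""
    return [v * scale + lo for v in x_norm]

norm, lo, scale = minmax_normalize([2.0, 4.0, 6.0, 10.0])
```

The relative spacing of the samples is preserved, which is the property that motivates choosing min-max over z-score here.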

Collaborative temporal graph attention (CTGAT)
In GAIN, we introduce the Collaborative Temporal Graph Attention (CTGAT) module, which is the core design of the model. The CTGAT contains two multi-head graph attention (GAT) components, which learn the temporal dependencies from heat load and meteorological observations in parallel (represented as the multiple layers in Fig. 6). The learning process of multi-head GAT is plotted in Fig. 6. This design considers a potential correlation between time steps within a look-back window, and the relationship between time steps is formulated as a complete graph, as shown on the left in Fig. 6.
To obtain representative patterns and reduce information redundancy, CTGAT first employs a convolutional layer along the variable dimension to highlight the key features of the heat load observations. The kernels of the convolutional layer have a size of $s$ in the variable dimension. The convolutional operation of the $i$th filter can be formulated as:
$$h_{k,i} = W_i * \tilde{Y}_k + b_i$$
where $h_{k,i}$ denotes the output of the $i$th filter for the $k$th cluster group, and the final convolution output of the $k$th group is $H_k$, the stack of all filter outputs; $\tilde{Y}_k$ is the heat load observation of the $k$th cluster group; the symbol $*$ represents the convolutional operation, $W_i$ is the learnable weighting matrix, and $b_i$ is the bias term.
Subsequently, we construct two graph structures from the outputs of the convolutional layer and the original meteorological observations, respectively, and these graph structures are delivered to the GAT layer to learn the potential associations between heat loads, and between heat load and meteorological factors. The graph has $w$ nodes, described as $\mathcal{G}_k = \{n_{k,1}, n_{k,2}, \ldots, n_{k,w}\}$, where $n_{k,t}$ denotes the feature representation of the $k$th cluster group at time step $t$. The adjacent nodes of each node include all of the other time steps in the look-back window.
The correlation between the features of two nodes, $i$ and $j$, can be expressed as:
$$e_{i,j} = \mathrm{LeakyReLU}\big(a^{\top} [W n_i \,\|\, W n_j]\big)$$
where $e_{i,j}$ denotes the importance of node $j$ to node $i$; LeakyReLU indicates a nonlinear activation function; $a \in \mathbb{R}^{2d'}$ and $W \in \mathbb{R}^{d' \times d}$ are both learnable weight matrices, and $d'$ is the output dimension of the node features.
After obtaining the correlation between nodes, the softmax function is employed to normalize the coefficients and obtain the attention score $\alpha$ by:
$$\alpha_{i,j} = \frac{\exp(e_{i,j})}{\sum_{u \in \mathcal{N}_i} \exp(e_{i,u})}$$
where $\mathcal{N}_i$ represents the adjacent nodes of node $i$, $\alpha_{i,j}$ indicates the normalized attention score of node $j$ to node $i$, and $\exp(\cdot)$ denotes the exponential function.
Then, the attention score is utilized to transform the input vectors and obtain the final output features for every node:
$$\tilde{n}_i^{(m)} = \sigma\Big(\sum_{j \in \mathcal{N}_i} \alpha_{i,j}^{(m)} W^{(m)} n_j\Big)$$
where $\tilde{n}_i^{(m)}$ represents the weighted feature of node $i$ in the $m$th attention mechanism, $W^{(m)}$ is the learnable weighting matrix used to linearly transform the input graph, and $\sigma(\cdot)$ is the sigmoid activation function. Finally, the outputs of the multiple attention mechanisms are concatenated to generate the final representation:
$$\tilde{n}_i = \big\Vert_{m=1}^{M} \tilde{n}_i^{(m)}$$
where $\tilde{n}_i$ is the weighted feature of node $i$, $M$ represents the number of attention mechanisms, and $Md'$ is the output feature dimension of node $i$; $G'$ and $G^{\#}$ are the final outputs of the CTGAT, representing the heat load observations and the meteorological observations, respectively.
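The softmax normalization and attention-weighted aggregation at the heart of this module can be sketched numerically. This is a toy scalar-feature example (illustrative only, no learned weights): raw scores are normalized over the neighbors, then used to form a weighted sum of neighbor features.

```python
# Toy sketch of graph-attention aggregation: softmax-normalize raw scores,
# then take the attention-weighted sum of neighbor features (no learned weights).
import math

def softmax(scores):
    """Normalize raw attention scores so they sum to 1."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(features, raw_scores):
    """Weighted sum of neighbor feature vectors using softmax attention."""
    alpha = softmax(raw_scores)
    dim = len(features[0])
    return [sum(a * f[d] for a, f in zip(alpha, features)) for d in range(dim)]

alpha = softmax([1.0, 1.0, 1.0])               # equal scores -> uniform weights
out = attend([[3.0], [6.0], [9.0]], [0.0, 0.0, 0.0])  # uniform -> plain mean
```

With equal raw scores the aggregation reduces to the mean of the neighbors; the learned scores in CTGAT instead emphasize the time steps most relevant to the current node.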

Enhanced temporal representation (ETR)
As mentioned earlier, temporal graph attention is superior at capturing short-term temporal dependencies, but long-term temporal dependencies are commonly ignored. To enhance the learning ability for long-term temporal characteristics, we feed the outputs of CTGAT into a gated recurrent unit (GRU). The heat load observations are also integrated into the GRU to capture the temporal dynamics between the attention representation and the raw heat load observations. The hidden representation $h_t$ of the GRU is calculated from the prior state $h_{t-1}$, the CTGAT outputs, and the heat load observations:
$$z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z), \quad r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r),$$
$$\tilde{h}_t = \tanh\big(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h\big), \quad h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$$
where $x_t$ concatenates the CTGAT output and the heat load observation at time step $t$, $z_t$ and $r_t$ are the update and reset gates, and $\odot$ denotes element-wise multiplication. Then, the GRU outputs of the $K$ groups are concatenated, followed by a linear transformation to obtain the final temporal representation:
$$O^{E} = W^{E}\,[h^{(1)}; h^{(2)}; \ldots; h^{(K)}] + b^{E}$$
where $O^{E}$ is the output temporal representation, generated by extracting the short- and long-term temporal dependencies; $W^{E}$ and $b^{E}$ are the learnable weighting matrix and bias parameters, respectively.
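A single GRU update can be sketched with a scalar hidden state to make the gating concrete. This is a standard GRU cell with toy hand-set weights (an illustrative assumption, not GAIN's learned parameters or hidden size).

```python
# Scalar GRU cell sketch (hidden size 1, standard GRU gate equations).
# Weights in `w` are toy values, not learned parameters.
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(x, h_prev, w):
    """One GRU update: update gate z, reset gate r, candidate state, new state."""
    z = sigmoid(w["wz"] * x + w["uz"] * h_prev + w["bz"])
    r = sigmoid(w["wr"] * x + w["ur"] * h_prev + w["br"])
    h_cand = math.tanh(w["wh"] * x + w["uh"] * (r * h_prev) + w["bh"])
    return (1.0 - z) * h_prev + z * h_cand

w = {"wz": 1.0, "uz": 0.0, "bz": 0.0,
     "wr": 1.0, "ur": 0.0, "br": 0.0,
     "wh": 1.0, "uh": 1.0, "bh": 0.0}

h = 0.0
for x in [0.5, -0.2, 0.8]:   # run a short input sequence through the cell
    h = gru_step(x, h, w)
```

The update gate decides how much of the previous state survives each step, which is what lets the ETR carry long-term temporal information beyond the attention window.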

Autoregressive representation concatenation (ARC)
The nonlinear learning components have the capability to extract high-level potential relationships, but excessive non-linearity may make the neural network outputs insensitive to the scale of the inputs. To improve the predictive robustness of the proposed GAIN, two autoregression components are used to capture the linear characteristics of the heat load and meteorological observations.
For the heat load observations, the autoregression component makes a linear mapping along the temporal dimension, which can be expressed as:

$$h_t^{ar} = \sum_{k=0}^{q^{ar}-1} W_k^{ar}\, y_{t-k} + b^{ar},$$

where $h_t^{ar}$ is the output linear representation of the target time series, and $q^{ar}$ is the length of the look-back window of the autoregression component; $W^{ar}$ and $b^{ar}$ are both learnable parameters.
For the meteorological observations, we first use a convolutional layer to upscale the feature dimension. This process not only enhances non-linearity by replacing the input features with nonlinear combinations, but also improves the expressiveness of the neural network. Then, similar to the processing of the target time series, an autoregression component is used to capture the temporal information linearly. Finally, the output dimension of the meteorological branch is made consistent with the dimension of the heat load observations through a linear transformation:

$$h_t^{E} = W_3\big(W_2\,\mathrm{Conv}(E_t; W_1)\big) + b,$$

where $h_t^{E}$ denotes the output linear representation of the exogenous time series; $W_3$, $W_2$, and $W_1$ are the weighting matrices; and $b$ is the bias term.
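Both autoregressive branches reduce to a few linear maps. The NumPy sketch below is hedged: all parameter names are ours, and a pointwise matrix product with a ReLU stands in for the convolutional upscaling.

```python
import numpy as np

def ar_heat_load(y_window, w, b):
    """Linear autoregression over the look-back window of heat load
    observations: a weighted sum of the last q values plus a bias."""
    return y_window @ w + b

def ar_meteorological(M, W1, w_t, W3, b):
    """Meteorological branch (sketch): W1 plays the role of the
    convolution that upscales the feature dimension (with a ReLU for
    non-linearity), w_t linearly mixes the temporal axis, and W3 maps
    the result to the heat load dimension. M is (time steps, features)."""
    H = np.maximum(M @ W1, 0.0)   # feature upscaling + non-linearity
    z = w_t @ H                   # linear autoregression over time
    return z @ W3 + b             # align with the heat load dimension
```

The heat load branch is a plain AR model, while the meteorological branch sandwiches the linear AR step between a feature upscaling and a dimension-aligning projection, mirroring the order of operations described above.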
The final prediction of the proposed GAIN is obtained by integrating the ETR and ARC outputs:

$$\hat{Y}_{t+h} = O_t + h_t^{ar} + h_t^{E},$$

where $\hat{Y}_{t+h}$ is the output of the proposed GAIN, whose dimension matches the input dimension $D'$ of the district heat load data, and the final prediction results are obtained after de-normalization.
The Mean Square Error (MSE) is adopted as the loss function in the training process, which can be formulated as:

$$\mathcal{L}_{MSE} = \frac{1}{T} \sum_{t=1}^{T} \sum_{n} \sum_{i} \big(y_{t+h,n,i} - \hat{y}_{t+h,n,i}\big)^2,$$

where $T$ is the length of input time steps in the training process, $y_{t+h,n,i}$ denotes the $n$th training sample's actual heat load of the $i$th district at time step $t+h$, and $\hat{y}_{t+h,n,i}$ is the corresponding predicted value.
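The fusion and loss can be sketched in a few lines. Note the elementwise-sum fusion is an assumption on our part (a common way to combine a nonlinear and an autoregressive branch), not necessarily the paper's exact operator.

```python
def gain_output(etr_out, arc_out):
    """Fuse the nonlinear ETR representation with the linear ARC
    representation via an (assumed) elementwise sum."""
    return [e + a for e, a in zip(etr_out, arc_out)]

def mse_loss(y_true, y_pred):
    """Mean squared error averaged over all training samples."""
    errs = [(a - b) ** 2 for a, b in zip(y_true, y_pred)]
    return sum(errs) / len(errs)
```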

Evaluation metrics
The root mean square error (RMSE), mean absolute error (MAE), and coefficient of variation of RMSE (CV-RMSE) are combined to measure the prediction performance. Each metric has a different purpose: MAE and RMSE are both scale-related metrics, but MAE depends on the absolute errors, while RMSE is based on squared errors. CV-RMSE is a scale-independent metric commonly used in energy prediction, which eliminates the RMSE's dependence on the scale of the data. These metrics are formulated as follows:
• Root Mean Squared Error: $\mathrm{RMSE} = \sqrt{\frac{1}{|\mathcal{D}_{test}|}\sum_{(t,i)\in \mathcal{D}_{test}} \big(Y_{t,i} - \hat{Y}_{t,i}\big)^2}$
• Mean Absolute Error: $\mathrm{MAE} = \frac{1}{|\mathcal{D}_{test}|}\sum_{(t,i)\in \mathcal{D}_{test}} \big|Y_{t,i} - \hat{Y}_{t,i}\big|$
• Coefficient of Variation of Root Mean Square Error: $\mathrm{CV\text{-}RMSE} = \mathrm{RMSE} / \mathrm{mean}(Y)$
where $Y_{t,i}$ and $\hat{Y}_{t,i}$ represent the real and predicted heat load values at time $t$ of the $i$th household, respectively; $\mathcal{D}_{test}$ represents the test set and $\mathrm{mean}(\cdot)$ is used to calculate the mean value. The smaller the values of RMSE, MAE, and CV-RMSE, the better the performance.
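These three metrics are straightforward to implement; a minimal NumPy sketch is:

```python
import numpy as np

def rmse(y, yhat):
    """Root mean squared error: penalizes large deviations quadratically."""
    y, yhat = np.asarray(y, float), np.asarray(yhat, float)
    return float(np.sqrt(np.mean((y - yhat) ** 2)))

def mae(y, yhat):
    """Mean absolute error: average magnitude of the errors."""
    y, yhat = np.asarray(y, float), np.asarray(yhat, float)
    return float(np.mean(np.abs(y - yhat)))

def cv_rmse(y, yhat):
    """RMSE divided by the mean of the actual values, which removes
    the dependence on the scale of the data."""
    return rmse(y, yhat) / float(np.mean(np.asarray(y, float)))
```

For example, for actual loads [2, 2] and predictions [1, 3], RMSE and MAE are both 1 while CV-RMSE is 0.5, illustrating how CV-RMSE rescales the error by the mean load.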

Experiments
This section carries out experiments to evaluate the proposed model and presents experimental settings, model implementations, data, and results.

Experimental settings and data
The GAIN model was implemented using the deep learning framework PyTorch v1.12.1. All experiments were carried out on a server equipped with an Intel(R) Xeon(R) Gold 6226R CPU (2.90 GHz) with 128 GB of memory and were accelerated by two NVIDIA RTX A6000 GPUs. The data used for the experiments were described in Section 3. Because the data distribution differs across household types, we select the heat load observations of single-family households, which constitute the largest number of samples in the dataset, to conduct the experiments.

Baseline methods
To evaluate the proposed GAIN model, we select 15 multivariate time series prediction methods as baselines for comparison. Their brief descriptions are as follows.
• Global Autoregression (GAR) [28] uses an autoregressive component to capture global features.
• Long Short-Term Memory (LSTM) [49] captures temporal dependence by cycling four gate units.
• Gated Recurrent Unit (GRU) [50] is a variant of the recurrent neural network with a more concise gate architecture.
• Encoder-Decoder (ED) [50] employs LSTM components in the encoding and decoding stages, respectively.
• Convolution Neural Network (CNN) [51] is a two-layer learning structure using one-dimensional convolution operations.
• Convolution Recurrent Neural Network (CRNN) [52] combines the recurrent neural network with the CNN component.
• Convolution Recurrent Neural Network with Residual (CRNN-Res) [52] uses residual components to avoid the loss of detailed information caused by convolution operations.
• TPA-LSTM [53] uses a recurrent neural network with an attention mechanism to capture nonlinear interdependencies between time steps and series.
• LSTNet [54] captures long-term dependencies and periodic patterns through redesigned convolutional and recurrent structures.
• MTNet [55] employs memory components, encoders, and autoregressive components to model complex temporal patterns and dependencies.
• Multivariate Shapelet Learning (MSL) [48] learns multiple crucial subsequences from historical observations.
• Dense [32] is a non-linear artificial neural network with fewer parameters that performs well in district heating load prediction.
• BiLSTM [31] is a variant of the LSTM that has been used to predict district heating load.
• Hybrid CNN-LSTM (HCLSTM) [33] is a spatio-temporal prediction algorithm that has shown promising performance in short-term heating load prediction.
• Informer [35] is a variant of the vanilla Transformer architecture that has been used in district heat load prediction [34].

Model configurations
We use the grid search method to tune the hyperparameters of each method. Due to the chronological nature of the heat load time series, we use five repeated experiments instead of cross-validation. The Adam optimizer [56] is used to train the models, and the mean squared error (MSE) is selected as the loss function. The training epochs and the learning rate are tuned to their optimal states for each method, while the other training-related constant parameters are kept the same. Table 7 lists the candidate hyperparameters of the baseline methods.
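An exhaustive grid search of this kind can be sketched as follows; `train_eval` is a hypothetical callback that trains a model under a given configuration and returns its validation error (internally averaging the five repeated runs, as described above).

```python
from itertools import product

def grid_search(train_eval, grid):
    """Exhaustive grid search (sketch): evaluate every hyperparameter
    combination with the supplied callback and keep the configuration
    with the lowest validation error."""
    best_cfg, best_err = None, float("inf")
    for values in product(*grid.values()):
        cfg = dict(zip(grid.keys(), values))
        err = train_eval(cfg)           # train + evaluate one configuration
        if err < best_err:
            best_cfg, best_err = cfg, err
    return best_cfg, best_err
```

The cost grows multiplicatively with the grid size, which is why only the training epochs and learning rate are searched while the remaining constants are held fixed.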

Results and analysis
This subsection presents the experimental results covering hyperparameter evaluation, comparison with the baseline methods, correlation analysis of the predictions, a model ablation study, and a feature ablation study.

Hyperparameter evaluation
The sliding window size indicates the association between the past $p$ days and the future $t+h$ days. An appropriate window size can help the prediction model better capture potential energy consumption patterns. It is worth noting that, in order to explore the temporal relationship between time steps, the window size should be greater than 1. Therefore, to determine the optimal window size for the proposed model, we make predictions for a time horizon of $h = 1$ and keep the other parameters constant while increasing the window size from 2 to 21. To make fair comparisons, we repeat each experiment five times to avoid the effect of accidental values. The experimental results are shown in Fig. 7.
As shown in Fig. 7, the model achieves the best performance for a window size of 11, while for smaller or larger window sizes, the model shows sub-optimal performance. For example, when the window size is set to 2-5 days, it is difficult to achieve good accuracy. This is due to the greater randomness of small windows, which include many unexpected events, inflection points, and small fluctuations that affect short-term energy patterns and thus prediction performance. When using a relatively large window size, e.g., 15-21 days, although the large window contains more temporal information, it also contains more serial fluctuations. Thus, relatively small or large window sizes can weaken the extraction of key temporal information and increase prediction errors. Therefore, for the rest of the experiments, we use the medium window size $p = 11$, which has a good capability of capturing important fluctuations while reducing the effect of stochastic fluctuations, with less parameter complexity.
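The window/horizon setup above can be sketched as follows: the past p observations form the input and the value h steps ahead is the target. This is a simplified univariate version of the paper's windowing.

```python
def sliding_windows(series, p, h):
    """Build (input window, target) pairs from a time series:
    the past p values predict the value h steps ahead."""
    pairs = []
    for t in range(len(series) - p - h + 1):
        pairs.append((series[t:t + p], series[t + p + h - 1]))
    return pairs
```

Enlarging p increases the temporal context available to the model but also the amount of serial fluctuation each sample carries, which is the trade-off the experiment above probes.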
Furthermore, the number of clusters, $K$, is another key hyperparameter, which determines the number of heat load groups and graph learning components. To find the optimal number of clusters, we conduct experiments varying the number of clusters from 2 to 6 under the optimal window size, a prediction time horizon of $h = 1$, and fixed values of the other uncorrelated structural parameters. The results in Fig. 8 show that when the number of clusters is set to 3, the overall performance is relatively superior.
These results can be explained by the fact that the number of clusters plays a crucial role in the accuracy and efficiency of the proposed method. Specifically, if the number of clusters is set too low, the model may not capture sufficient detail in the heat load observations, which results in excessive heterogeneity within the clusters. Conversely, if the number of clusters is set excessively high, data points that should be grouped together may be separated into different clusters, leading to excessive homogeneity between the clusters. Both scenarios can significantly impact the accuracy and efficacy of the proposed method. When the number of clusters $K$ is set to 3, the prediction model can effectively extract the potential behavior patterns of customers while avoiding excessive model complexity.
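One simple way to compare candidate values of K is the within-cluster sum of squares (inertia). The sketch below implements plain K-means for that purpose; note the paper selects K by prediction performance, so this is only an illustrative companion diagnostic.

```python
import numpy as np

def kmeans_inertia(X, K, iters=20, seed=0):
    """Plain K-means (sketch): iterate assignment and centroid update,
    then return the within-cluster sum of squares (inertia).
    X: (n_samples, n_features) load profile matrix."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), K, replace=False)]
    for _ in range(iters):
        d = ((X[:, None, :] - centers[None]) ** 2).sum(-1)  # squared distances
        labels = d.argmin(1)                                # nearest center
        for k in range(K):
            if (labels == k).any():
                centers[k] = X[labels == k].mean(0)         # centroid update
    return float(((X - centers[labels]) ** 2).sum())
```

Inertia always decreases as K grows, so it is read for an "elbow" rather than minimized directly; the paper's criterion (prediction error versus K) sidesteps that ambiguity.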

Comparison with baselines
Table 3 summarizes the performance results of all methods based on three metrics and four time horizons, $h$ = 1, 3, 5, 7, for short-term prediction. The best results are highlighted in bold font. In general, the performance of all methods decreases with increasing time horizons. First, when meteorological factors are not considered, we can observe the following findings. The proposed GAIN model demonstrates superior performance on all three evaluation metrics across all four time horizons, with particularly strong results at short horizons. Specifically, GAIN achieves a maximum reduction of 2%, 3%, and 2% in RMSE, MAE, and CV-RMSE, respectively. This result demonstrates the good generalization capability of the proposed GAIN. The Informer model exhibits superior performance in general and achieves the second-best results for forecasting horizons of 5 and 7. This is attributed to its utilization of multiple attention mechanisms, which enable the model to capture multi-scale temporal representations. MTNet has the second-best performance when $h$ = 1 and $h$ = 3; it captures temporal dependencies with a recurrent learning component and a local attention mechanism. However, as the horizon increases, relying merely on local relationship information is insufficient for MTNet to make accurate predictions. When $h$ is set to 1 and 3, GAR outperforms most of the RNN variants (i.e., LSTM, GRU, ED, and BiLSTM) and hybrid RNN models (i.e., CRNN, CRNN-Res, TPA-LSTM, LSTNet, and HCLSTM). This phenomenon highlights the importance of linear representation in short-term daily heat load prediction, as the trend and fluctuation of heat load observations at this data granularity are relatively stable. The CRNN model outperforms the vanilla RNN and CNN architectures in short-horizon forecasting tasks, which suggests the potential benefit of integrating a CNN component to encode input features and enhance prediction accuracy. The combination of convolution and RNN is the core component of TPA-LSTM,
LSTNet, and MTNet, but these methods further incorporate autoregressive linear characteristics and exhibit more stable performance than traditional CRNN architectures, e.g., HCLSTM, CRNN, and CRNN-Res. The MSL and Dense models exhibit relatively good performance in short-horizon forecasting tasks. However, they are unable to effectively extract consecutive temporal dependencies and display significant performance fluctuations as the forecasting horizon $h$ increases. Then, in the presence of meteorological factors, we compare our proposed method, GAIN(+), against 10 benchmark models, excluding TPA-LSTM, LSTNet, MSL, and BiLSTM, as they do not take exogenous factors into account in their prior studies. According to the experimental results presented in Table 3, GAIN(+) outperforms all other benchmark methods in all four time horizons, achieving a maximum reduction of 25%, 29%, and 25% in RMSE, MAE, and CV-RMSE, respectively. Notably, GAIN(+) also achieves a maximum reduction of 3%, 4%, and 3% in RMSE, MAE, and CV-RMSE compared to GAIN, highlighting the effectiveness and necessity of considering meteorological factors in our model for short-term heat load prediction tasks. In our design, we utilize the CTGAT module to extract the key information of heat load and meteorological observations, while the linear autoregressive properties of meteorological factors are also taken into account, which further improves the prediction performance. To further compare GAIN and GAIN(+), we randomly select the heat load of a household for prediction over different time horizons and obtain the results shown in Fig. 9.
Fig. 7. The GAIN performance by varying the window size $p$ in terms of three metrics; for each metric, the optimal value is marked by the red dashed line.
Table 3 note: The best result for each metric is highlighted in bold; ''(+)'' denotes a prediction method that integrates meteorological factors; unit of $h$: day.
Intuitively, GAIN(+), which takes meteorological factors into account, fits the real heat load observations better than GAIN, especially when the time horizon is set to 1 or 3. This is because, with consideration of meteorological factors, GAIN(+) can more effectively capture the fluctuation of the peak load. During the highlighted period, the prediction accuracy of GAIN(+) is significantly better than that of GAIN. We also note that as the time horizon increases, the prediction performance of the two methods gets closer, and several fluctuations are observed during periods of high volatility. This indicates that there is uncertainty in the effect of meteorological factors on heat load prediction for large time horizons.
Regarding the benchmark methods, the major observations are summarized as follows. The prediction accuracy of GAR(+) and Dense(+) is significantly worse than that of their counterparts without weather factors. This result suggests that these models have limitations in extracting potential representations from exogenous inputs, which can cause issues in the weight allocation of these features and ultimately affect the prediction performance. HCLSTM(+) benefits from the weather factors and shows better performance than HCLSTM under all four time horizons. This phenomenon reflects the potential of the CRNN architecture to enhance prediction accuracy by integrating exogenous factors, while also demonstrating the effectiveness of weather factors in heat load prediction. Other methods, such as LSTM, ED, CNN, CRNN, CRNN-Res, and Informer, also show slight improvements when $h$ is set to 1 or 3. This result further emphasizes the important role of weather factors in short-term heat load prediction tasks.

Analysis of predictions
Fig. 10 shows the comparison of the actual and predicted heat load values by GAIN, Informer, HCLSTM, and MTNet. Among the four methods, GAIN performs the best in tracking the heat load variations, especially when the prediction horizon $h$ is short (1 or 3). However, as $h$ increases, all the methods tend to underestimate the peak values and lose accuracy. Informer, which mainly uses the attention mechanism, shows more fluctuations in its forecasts than MTNet, HCLSTM, and GAIN, which incorporate RNNs. This suggests that RNNs are more effective in capturing the temporal dependencies and reducing forecasting errors. GAIN leverages the graph neural network framework to capture the complex, nonlinear relationships between customers, heat load, and meteorological factors, which enables it to produce more accurate and realistic predictions.
Fig. 11 shows the normalized results and Pearson correlation coefficient (PCC) between the actual values and the values predicted by GAIN. The prediction error increases as the observation size increases, which is consistent with the previous figure. This indicates the difficulty of predicting the heat load accurately during high-fluctuation periods, especially for long-term forecasting. Most of the data points lie below the diagonal line, which means that the prediction model tends to lag behind the actual values when they reach the peak. This could be due to the instability of the heat load data during high-fluctuation periods, which makes it hard for the prediction model to find stable patterns. It could also be due to the prediction model's limitation in capturing sudden changes in the data, which leads to a delayed reaction in forecasting the peak values. Nevertheless, GAIN still outperforms the other methods in terms of PCC for all four forecasting horizons.
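The PCC used here is the standard Pearson correlation between actual and predicted series; a minimal implementation is:

```python
import numpy as np

def pcc(y, yhat):
    """Pearson correlation coefficient between actual and predicted
    heat loads: covariance of the centered series divided by the
    product of their standard deviations."""
    y, yhat = np.asarray(y, float), np.asarray(yhat, float)
    yc, pc = y - y.mean(), yhat - yhat.mean()
    return float((yc * pc).sum() / np.sqrt((yc ** 2).sum() * (pc ** 2).sum()))
```

A value of 1 means the predictions track the actual values up to a positive linear rescaling, which is why PCC complements the scale-dependent RMSE and MAE.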
The effect of weather factors on the prediction performance is shown in Figs. 12 and 13. Informer(+) exhibits unstable and fluctuating results, especially for $h$ = 7. This could be due to its complex learning architecture, which may not be able to filter out the noise from the external data that affects the prediction target. HCLSTM(+) shows stable but inaccurate results, especially for the peak values. On the contrary, GAIN(+) performs the best among the three methods in terms of fit and PCC values for all four horizons, especially for $h$ = 1. However, adding weather features also increases the uncertainty and worsens the prediction during high-volatility periods, as seen in Figs. 13(c) and 13(d). This confirms the challenge of predicting the heat load with meteorological factors for long-term forecasting.

Model ablation study
To better understand the contribution of each GAIN component to the accuracy of the final model, we performed ablation experiments for four time horizons and obtained the results presented in Table 4. Finally, we remove the ARC component (w/o ARC), and the performance shows a significant drop. This implies the importance of the autoregressive linearity property: the fluctuations of daily heat load observations are more stable than those of hourly observations, so simple autoregression can make a proper linear adjustment to the prediction [54].
In conclusion, the ablation study validates the efficacy of the GAIN model structure design, which takes into account not only the heat load and meteorological influences, but also the short-term and long-term time dependence in the time series and the local correlation between time steps. In addition, linear features are incorporated into the proposed model design.

Feature ablation study
The proposed GAIN(+) considers four meteorological factors: outdoor temperature, solar radiation intensity, wind speed, and relative humidity. To investigate their contributions to model performance, we conducted a feature ablation study and present the results in Table 5.
First, we remove the outdoor temperature factor (w/o OT), which is the dominant ingredient related to the heat load [60]. GAIN(+) then exhibits the worst prediction accuracy in all four time horizons. A potential explanation is that outdoor temperature is the meteorological factor most perceptible to the human body, so prediction models can easily establish the causal relationship between outdoor temperature and heat load. Without outdoor temperature, the prediction model lacks an important causal relationship, leading to an increase in forecast error. Second, we remove the feature of solar radiation intensity (w/o SRI), and the prediction accuracy decreases slightly. The intensity of solar radiation is one of the causes of temperature variations and also indirectly influences relative humidity [61]. As shown in Fig. 3, the trends of solar radiation intensity and outdoor temperature are similar, indicating their causal relevance. Therefore, ignoring solar radiation does not have a significant effect on prediction accuracy. Third, we remove the feature of wind speed (w/o WS), and the obtained model has the second-best performance in terms of RMSE across the four time horizons. This result suggests that wind speed contributes to heat load prediction accuracy, but much less than outdoor temperature. This may be because wind speed is an air property that is only indirectly related to air temperature. Previous studies [36] have verified that wind speed is valuable auxiliary information. Lastly, when relative humidity (w/o RH) is not considered, the performance is better than w/o SRI but worse than w/o WS. Previous research [62] has shown that relative humidity plays an important role in both the indoor environment and energy conservation.
The above ablation studies of the four meteorological factors show that their contributions to the model's prediction performance vary in importance, which has been fully considered in our study.

Conclusions and future work
Heat load prediction plays a pivotal role in the operation of energy stations and assists with energy demand-side management. In this paper, we propose a graph ambient intelligence algorithm for district heating load forecasting. We first applied clustering to identify different customer groups based on their load profiles. Then, we designed a collaborative temporal graph attention mechanism for extracting features from heat load and meteorological observations, enabling the discovery of causal relationships between time steps. To improve the capability of capturing temporal dependencies, we introduced a recurrent neural network into the proposed GAIN structure, allowing it to capture both short-term and long-term temporal dependencies.

Table 6
Performance comparison based on two metrics and four time horizons in three category collections on the JERICHO-E-usage dataset. The best results are marked in bold and the second-best results are underlined; unit of $h$: day. Note that the RMSE and MAE values are scaled by $10^6$ (i.e., reported ×10^6). For the commerce and industry datasets, the number of clusters $K$ in GAIN is set to 3, while for the residential dataset, $K$ is set to 2.
Several directions for future work exist. First, the interpretability of the proposed GAIN model will be further investigated. Second, further consideration will be given to the application of graph neural networks to multiple energy loads and household types. Third, we will extend the prediction from short-term to long-term time horizons and improve generalization performance.

Fig. 2. The Pearson correlation coefficient between meteorological factors and heat load.

Fig. 8. The GAIN performance with varying numbers of clusters $K$, in terms of three metrics.

Z. Wang et al.

Fig. 10. The visualizations comparing the real values with the GAIN predictions and the four other benchmark predictions; unit of $h$: day.

Fig. 11. The correlation visualizations of GAIN predictions and four other benchmark predictions; unit of $h$: day.

Fig. 12. The visualizations comparing the real values with the GAIN(+) predictions and the four other benchmark predictions; unit of $h$: day.

Table 1
The statistics of heat load and meteorological factors.

Table 2
Symbols and semantics.
$r_t$, $z_t$, and $c_t$ represent the reset gate, update gate, and cell state at timestamp $t$, respectively; $\odot$ is the Hadamard product, and $\sigma(\cdot)$ denotes the sigmoid activation function; $W_r$, $W_z$, and $W_c$ are weights; $b_r$, $b_z$, and $b_c$ are the corresponding biases.

Table 3
Performance comparison based on three metrics and four time horizons in heat load prediction.

Table 4
Model ablation study. The best results are shown in bold, and the worst results in wavy lines; unit of $h$: day.

Table 5
Feature ablation study. The best results are marked in bold and the second-best results are underlined; unit of $h$: day.
First, we remove the K-means structure (w/o K-means) and feed the unclustered data directly into the CTGAT. The results show that w/o K-means does not perform well, especially for relatively large time horizons. This is because the clustering component allows the prediction model to learn potential similarities and regularities of the load profiles within each cluster. Second, we remove the convolution component in the CTGAT (w/o convolution), and the results show considerable prediction error. Convolution filters can combine or separate correlations among different feature dimensions and discover local dependency patterns [57]. When the convolution component is removed, the ability to learn local heat load patterns is compromised; local patterns typically have the greatest effect on the accuracy of small-time-horizon predictions. Third, we remove the TGAT (w/o TGAT) and feed the convolution output directly into the ETR structure. The results show that w/o TGAT performs worse than the original GAIN, especially for large time horizons. Temporal graph attention treats each time step as an individual node in a time window and establishes correlations between different time steps [58]. The correlations of local time steps can capture the temporal dynamics between load profiles, and for large-time-horizon predictions, the key ingredient is to capture the causal relationship between time steps. TGAT has superiority in capturing global temporal dependencies and enhancing the stability of the model with large time horizons [59]. Similarly, to verify the validity of the recurrent layer, we use the GAR component instead of the ETR component (w/o ETR). Interestingly, the prediction accuracy of w/o ETR decreases only slightly for small time horizons, while the prediction error becomes prominent for large time horizons. This suggests that recurrent neural networks are more suitable
for establishing temporal dependencies over large time horizons, while short-term local dependencies can be obtained with the TGAT component.
We also correlated heat load and meteorological data in the temporal dimension. Moreover, we considered the linear characteristics of time series in our model to increase the diversity of feature representation and to enhance the robustness of the model. Finally, we conducted comprehensive experiments to evaluate our model, including comparisons with fifteen baseline methods, correlation analysis of predictions, a model structure ablation study, and a feature ablation study. The experimental results have shown that the proposed model outperforms all the baseline methods, demonstrating the effectiveness of our model design.

Table 8
Abbreviations and meanings.