ELFNet: An Effective Electricity Load Forecasting Model Based on a Deep Convolutional Neural Network with a Double-Attention Mechanism

Forecasting energy demand is critical to ensure the steady operation of the power system. However, present approaches to estimating power load are still unsatisfactory in terms of accuracy, precision, and efficiency. In this paper, we propose a novel method, named ELFNet, for estimating short-term electricity consumption, based on the deep convolutional neural network model with a double-attention mechanism. The Gramian Angular Field method is utilized to convert electrical load time series into 2D image data for input into the proposed model. The prediction accuracy is greatly improved through the use of a convolutional neural network to extract the intrinsic characteristics from the input data, along with channel attention and spatial attention modules, to enhance the crucial features and suppress the irrelevant ones. The present ELFNet method is compared to several classic deep learning networks across different prediction horizons using publicly available data on real power demands from the Belgian grid firm Elia. The results show that the suggested approach is competitive and effective for short-term power load forecasting.


Introduction
The rapid development of society has led to a significant rise in energy consumption and the depletion of traditional energy sources. This has created a growing demand for the efficient use of energy, particularly in our daily lives. The absence of a consistent pattern in power demand could lead to an imbalance between supply and demand, resulting in energy losses [1]. Electricity load forecasting plays a crucial role in informing long-term policy decisions aimed at addressing these challenges. It also provides valuable information to the electrical sector, suppliers, and market regulators [2].
Numerous researchers have proposed various forecasting approaches to enhance the accuracy of electricity load forecasting [3,4]. These approaches can be broadly categorized into two groups. The first group comprises classic statistical models, including linear regression models [5,6], autoregressive moving average models [7,8], and autoregressive integrated moving average models [9].
Despite their long-standing use, speed of calculation, and practicality, traditional methods often struggle with accuracy and effectiveness when analyzing electrical load data that contain significant random elements and high levels of nonlinearity [10]. The second category of prediction techniques involves machine learning algorithms, such as artificial neural networks, support vector regression, decision trees, and XGBoost.
With the increasing improvement in the available computational resources, deep learning (DL) has become a highly successful data-driven technology in the field of forecasting electricity load [11]. The ability of deep neural networks (DNNs) to forecast electricity loads with strong nonlinearity has been exploited [12]. Talaat et al. introduce a medium- to short-term load forecasting (MTLF; STLF) model that can be used to forecast load at different times of the month and on different days [13]. However, DNNs have the drawback of not being able to simulate some delicate changes in time series, and the training process is vulnerable to vanishing gradients and easily causes the prediction results to reach a local optimum [14]. Fekri et al. propose an online adaptive RNN, which is a load forecasting method that can continuously learn from newly arrived data and adapt to new patterns [15]. Jagait et al. use RNN and ARIMA for load forecasting under concept drift [16]. Bui et al. propose a multi-scale RNN model with short-term and long-term memory for load forecasting [17]. While RNNs struggle with long-term time dependencies and cannot effectively represent large time series, LSTMs have emerged as a solution to this issue [18]. Zang et al. combine LSTM with the self-attention mechanism (SAM) to develop a hybrid model with two input channels [19]. Bashir et al. proposed a hybrid approach using Prophet and LSTM models to predict accurate loads. The Prophet model predicts the raw load data using both linear and nonlinear data, and the nonlinear data are trained using LSTM [20]. Memarzadeh et al. proposed a new hybrid forecasting model for short-term power load and price forecasting. The proposed method consists of three modules: wavelet transform to eliminate the fluctuation behavior of the power load and price time series, feature selection based on entropy and mutual information, and finally LSTM to train the model [21]. The incorporation of gating units like forgetting gates and memory gates has significantly enhanced the ability of these models to address long-term time dependencies and spatial complexity. Historically, LSTM and its variations have been widely utilized in power load forecasting models.
Due to the advantages of CNNs in processing image data with strong nonlinearity and the ability to extract more intrinsic data features, the convolutional neural network (CNN) was adopted to process electricity load data [22]. Singh et al. proposed a novel STLF model based on 2D CNN. The discussion focused on an overview of available prediction techniques, the implemented CNN architecture, the feature selection process, and the performance of the model on a test dataset [23]. Imani used CNN to extract nonlinear relationships between load values. In addition, a load-temperature cube was composed of hourly load and temperature values for a week. Another CNN was trained using the load-temperature cube to learn the hidden nonlinear load-temperature features. Finally, SVR was used for load forecasting [24]. Wang and Oates first encoded univariate time series to images using the Gramian angular fields (GAFs) and Markov transition fields (MTFs) methods. The images were then utilized as inputs for a CNN. This image-based framework pioneered a new class of deep learning algorithms for time series analysis [25,26]. Since then, many techniques have been introduced to convert time series data into images for use as the inputs of CNNs. The application of a relative position matrix (RPM) time-series encoding approach to convert raw time series into 2D images is investigated to develop an efficient CNN architecture for autonomously learning a higher-level representation of raw time-series data [27]. To feed each CNN, a multi-resolution imaging technique based on Gramian angular fields (GAF) is utilized, allowing for the analysis of diverse time-related periods for a single observation [28]. A 2D representation approach called Relative Position Matrix (RPM) is presented to transform raw time series data to 2D pictures, and an enhanced CNN model is provided to categorize these 2D images. The conversion approach for time series is applied based on the hue-saturation-value (HSV) color space, which makes it simple to compare colors since it can transmit the brightness, hue, and vividness of a color very naturally [29]. Hong et al. proposed a new method for predicting solar radiation by encoding time series data into images using Gramian angular fields and convolutional LSTM (ConvLSTM) networks. The preprocessed data become a five-dimensional input tensor that is well-suited to ConvLSTM. The ConvLSTM network uses convolution operations in its input-to-state transition and state-to-state transition [30]. A local phase binary encoding operation was executed to create the histogram of a 2D phase encoding of power signals incorporating neighborhood information, and the suggested encoding method greatly reduced the dimensions of the appliance signals, in addition to improving the discriminating capacity of classifier models [31]. Multivariate time series data were converted from 1D signals to 2D visuals using a variety of encoding approaches, including the Gramian Angular Summation Field (GASF), Gramian Angular Difference Field (GADF), Markov Transition Field (MTF), and Recurrence Plot (RP) [32].
In order to further leverage the machine learning capabilities of CNNs, more improvements have been developed. AlexNet [33], the first deep convolution application; the Inception structure [34], which expands on the classical CNNs; the ResNet strategy [35,36], which was proposed to solve the vanishing gradient problem in deep convolutional networks; and the ResNeXt method [37], which unifies the concepts of group convolution and residual networks, are all well-known improved forms or variants of CNNs. The transformer model put forth by Vaswani et al. [38] utilized the attention mechanism to process natural language. The attention mechanism model provides an advantage in the area of natural language processing due to its simplicity and low parameter count. The channel attention module and the spatial attention module were proposed by Woo et al. [39] and implemented in the CNN. Each channel serves as a feature detector for the channel attention module. The spatial attention module makes use of the space of the feature map, while the channel attention module concentrates on the most significant input images and improves or suppresses various channels for various tasks to optimize the networks' capacity for representation. In addition to being a useful complement to channel attention, the spatial relationship creates a spatial attention map that identifies the most crucial regions of the network for processing and may be utilized for adaptive feature optimization with the input feature map.
Inspired by previous works, this article presents a new deep prediction network framework for predicting the electricity load. Taking advantage of the powerful image data-processing capabilities of CNN, the proposed model first converts the time series into images via the statistically interpretable GAF method, which substantially increases the prediction performance of the power load data with strong nonlinear features and environmental fluctuations, and then uses a deep CNN to extract nonlinear features from the electricity load data. Two types of attention mechanism, i.e., the channel attention (CA) and spatial attention (SA) modules, are applied to extract the data's hierarchical features to minimize information loss, and residual connections are adopted in the appropriate convolutional layers to ensure the proper convergence of model training and parameter updates. The novel concepts and main contributions of this paper are as follows:
(1) A novel deep convolutional attention mechanism model is proposed to solve the issue of electricity load forecasting with strong nonlinear features, which optimizes the deep characteristics of the power load data using the convolution layer, residual connections, and the CA and SA mechanisms;
(2) The proposed deep learning structure is designed to reduce the randomness of the data and guarantee the robustness of machine learning. It can also easily extract the intrinsic properties of the data thanks to the Gram matrix principle, which is used to convert time series data into image data;
(3) Across different time scales, the proposed model can directly output the multi-step prediction results, reduce the prediction error, and achieve better prediction performance.
The remainder of the paper will be structured as follows: the background theory of the proposed prediction approach is provided in Section 2, the fundamental structure of the proposed model is described in Section 3, the experimental data, evaluation metrics, and results are presented in Section 4, and the final conclusion is given in Section 5.

Background Theory
ELFNet is a deep convolutional residual network that is proposed for electricity load prediction. It works by first converting time series into pictures and then using deep convolutional residual networks with a double-attention mechanism. The next subsections provide a detailed introduction to the theory underlying the network structure.

Gramian Angular Field
Typically, we present the time series in a Cartesian coordinate system, with the vertical axis representing the related real values and the horizontal axis representing the timestamp. In order to reduce information loss, we create a bidirectional mapping between the one-dimensional time series and the two-dimensional space, as proposed by Wang et al. [25], to replace the traditional Cartesian coordinate system with a polar coordinate system. Given a time series with actual observations X = (x_1, x_2, ..., x_n), the time series X is first normalized into the interval [1/2, 1]:

$$\tilde{x}_i = \frac{1}{2} + \frac{x_i - \min(X)}{2\,(\max(X) - \min(X))} \tag{1}$$

Then, we present the normalized time series in the polar coordinate system using the following mathematical formula, encoding the values of the time series as angles and the time stamps as polar radii:

$$\phi_i = \arccos(\tilde{x}_i), \quad \tilde{x}_i \in [1/2, 1] \tag{2}$$

$$r_i = \frac{t_i}{N} \tag{3}$$

where t_i represents the time stamp of the time series, and N is a constant factor used to standardize the range of the polar coordinate system. The mapping transformation (2), (3) will only provide one result in the polar coordinate system. Differing from other methods of converting time series to images, this mapping transformation has a uniquely accurate inverse mapping and, unlike the Cartesian coordinate system, the polar coordinate system preserves the absolute time relationship of the time series.
When we rescale the time series to the interval [1/2, 1], the corresponding inverse cosine function values will fall in the interval [0, π/3]. After converting the time series to a polar coordinate system, this article will identify the temporal correlation at different time intervals by considering the angular sum between each pair of points. The Gramian angular summation field (GASF) is defined as follows:

$$G_{ij} = \cos(\phi_i + \phi_j) \tag{4}$$

From the time series X, the corresponding Gram matrix can be obtained from the Gramian angular summation field, and the elements in the Gram matrix G(X) are obtained via Equation (4):

$$G(X) = \begin{bmatrix} \cos(\phi_1 + \phi_1) & \cdots & \cos(\phi_1 + \phi_n) \\ \vdots & \ddots & \vdots \\ \cos(\phi_n + \phi_1) & \cdots & \cos(\phi_n + \phi_n) \end{bmatrix} \tag{5}$$

where G(X) contains the time dependency, which increases successively from the upper left of the matrix to the lower right corner. G_ij indicates the relative correlation for the time series with point i and point j, and the elements of the diagonal G_ii contain the angular information corresponding to the original values of the time series. Meanwhile, the actual values of the original time series can be reconstructed from the principal diagonal values of the matrix by using an inverse transformation. The length of the time series to be converted determines the size of the transformed Gram matrix, as is shown in Figure 1, and from there we can determine the size of the converted image. The image training set used in this study was set to be 64 × 64 pixels in size.
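The encoding described above can be illustrated with a minimal pure-Python sketch. It assumes a linear min-max rescaling into [1/2, 1] and the standard GASF definition G_ij = cos(φ_i + φ_j); the function names and the four sample load values are hypothetical and are not taken from the paper's implementation.

```python
import math

def rescale_half_to_one(series):
    """Linearly rescale a series into [1/2, 1] (min-max form, assumed)."""
    lo, hi = min(series), max(series)
    if hi == lo:
        raise ValueError("a constant series cannot be rescaled")
    return [0.5 + 0.5 * (x - lo) / (hi - lo) for x in series]

def gasf(series):
    """Return the Gramian angular summation field as a list of lists."""
    scaled = rescale_half_to_one(series)
    # Values in [1/2, 1] map to angles in [0, pi/3].
    phi = [math.acos(x) for x in scaled]
    return [[math.cos(a + b) for b in phi] for a in phi]

def reconstruct_scaled(gram):
    """Invert the diagonal: G_ii = cos(2*phi_i) = 2*x_i**2 - 1, with x_i > 0."""
    return [math.sqrt((gram[i][i] + 1.0) / 2.0) for i in range(len(gram))]

# Hypothetical load readings, used only to exercise the functions.
load = [8200.0, 8450.0, 9100.0, 8800.0]
G = gasf(load)
recovered = reconstruct_scaled(G)
```

Because the rescaled values are strictly positive, the square root on the diagonal recovers them without sign ambiguity, which is one way to read the "uniquely accurate inverse mapping" mentioned in the text. A series of length 64 would yield the 64 × 64 image size used in the study.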

Residual Convolutional Structure
The concept of residual networks was developed to highlight the nonlinear representation capabilities of deep CNNs, where deep sub-convolutional networks combined with residual connections may efficiently avoid the degradation issues in the network [40]. The basic residual convolutional module used in this study is shown in Figure 2. To achieve the desired output feature map size and downsampling accuracy, we can adjust the size of the convolution kernel and the convolution stride. After extracting the image's feature information using a 2D-CNN, we can improve the nonlinear expression of the convolution using a ReLU nonlinear activation function, and then apply a MaxPool layer to reduce the network's computation and obtain the feature map we are looking for. In this procedure, the following mathematical formula is used to represent the 2D-CNN convolutional module:

$$F_{out} = \mathrm{Conv.unit}(F_{in}) + \mathrm{Residual} = \mathrm{MaxPool}(\sigma(\mathrm{Conv2d}(F_{in}))) + \mathrm{Residual} \tag{6}$$

where F_in denotes the output feature of the previous layer, F_out denotes the output feature of the module, Conv2d denotes the 2D convolution operation, σ denotes the ReLU activation function, Residual denotes the residual connections, and MaxPool is the maximum pooling operation.
Channel Attention Mechanism
The channel attention mechanism was introduced to modify the features extracted by convolution; the modified features can keep the valuable features and suppress the non-valuable ones [41]. The main idea of the mechanism is to use some network structures to calculate the attention weights, which are combined with the feature map to build an improved attention feature map. The channel attention module has a dual-channel feature that uses global average pooling and global maximum pooling, respectively, to obtain different feature information through two different pooling channels. The obtained features are then input into the same MLP, and the outputs are used to generate channel attention weights via a sigmoid function; these weights are finally multiplied by the input feature map to produce the attention-enhanced features. It is crucial to highlight that the CA module does not change the input data's dimensions; hence, the input and output dimensions remain the same. Figure 3 depicts the whole process of the channel attention mechanism, which can be written as:

$$CA(F_{in}) = \sigma(\mathrm{MLP}(\mathrm{AvgPool}(F_{in})) + \mathrm{MLP}(\mathrm{MaxPool}(F_{in}))) \tag{7}$$

$$F_{out} = CA(F_{in}) * F_{in} \tag{8}$$

where F_in denotes the input feature matrix, σ denotes the sigmoid activation function, MLP denotes the multilayer perceptron, CA(F_in) denotes the attention weight matrix of the input features, F_out denotes the output feature map after attention enhancement, and * denotes the matrix multiplication.
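A minimal pure-Python sketch of the channel attention computation described above follows. It assumes the CBAM-style layout of Woo et al. [39]: a shared two-layer MLP with a ReLU hidden layer, and channel-wise scaling of the input by the resulting weights. The weight matrices `w1` and `w2`, the hidden width, and the tiny feature map are hypothetical illustration values, not the paper's trained parameters.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]

def shared_mlp(vec, w1, w2):
    """Two-layer MLP with a ReLU hidden layer (biases omitted for brevity)."""
    hidden = [max(0.0, h) for h in matvec(w1, vec)]
    return matvec(w2, hidden)

def channel_attention(f_in, w1, w2):
    """f_in is a list of C channels, each an H x W list of lists."""
    # Global average pooling and global max pooling, one value per channel.
    avg = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in f_in]
    mx = [max(map(max, ch)) for ch in f_in]
    # The same MLP processes both pooled vectors; the outputs are summed.
    logits = [a + m for a, m in zip(shared_mlp(avg, w1, w2),
                                    shared_mlp(mx, w1, w2))]
    weights = [sigmoid(z) for z in logits]
    # Scale every element of channel c by its attention weight.
    f_out = [[[weights[c] * v for v in row] for row in ch]
             for c, ch in enumerate(f_in)]
    return weights, f_out

# Hypothetical 2-channel, 2 x 2 feature map and MLP weights (hidden width 1).
f_in = [[[1.0, 2.0], [3.0, 4.0]], [[0.5, 0.5], [0.5, 0.5]]]
w1 = [[0.5, -0.25]]      # shape (1, C)
w2 = [[1.0], [-1.0]]     # shape (C, 1)
ca_weights, f_out = channel_attention(f_in, w1, w2)
```

The output has the same C × H × W shape as the input, matching the statement that the CA module leaves the data dimensions unchanged.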

Spatial Attention Mechanism
Using the attention mechanism, Wang et al. [42] transform the spatial information in the original image into a different space while maintaining the key information. A spatial converter module is used to perform the necessary spatial transformation of the spatial domain information in order to extract the key information. The input feature map data are passed through the maximum pooling layer and the average pooling layer to generate two feature maps. The two feature maps are then merged by a concatenation operation and convolved to construct a feature map with one channel. After this, the spatial attention of the corresponding feature map is obtained by applying the sigmoid activation function. Similar to the channel attention module, the spatial attention mechanism does not alter the dimensional information of the data.
According to the flow chart in Figure 4, the spatial attention is calculated using Equations (9) and (10):

$$SA(F_{in}) = \sigma(\mathrm{conv}([\mathrm{AvgPool}(F_{in}); \mathrm{MaxPool}(F_{in})])) \tag{9}$$

$$F_{out} = SA(F_{in}) * F_{in} \tag{10}$$

where conv is the two-dimensional convolution operation, σ denotes the sigmoid activation function, [A; B] stands for the concatenation of matrix A and matrix B, SA(F_in) indicates the attention weight matrix of the input features, F_out represents the output feature map after attention enhancement, and * denotes the matrix multiplication.
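The spatial attention steps can be sketched in the same pure-Python style. This sketch assumes that pooling is taken across the channel axis, that the two pooled maps are combined by a single zero-padded convolution with one kernel per pooled map, and that the resulting map scales every channel position by position. The helper `conv2d_same`, the 3 × 3 kernel size, and all numeric values are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def conv2d_same(planes, kernels, bias=0.0):
    """Zero-padded convolution (cross-correlation form) of stacked planes.

    Each plane has its own odd-sized k x k kernel; results are summed into
    a single output plane with the same height and width as the input.
    """
    h, w = len(planes[0]), len(planes[0][0])
    k = len(kernels[0])
    pad = k // 2
    out = [[bias] * w for _ in range(h)]
    for plane, kernel in zip(planes, kernels):
        for i in range(h):
            for j in range(w):
                for di in range(k):
                    for dj in range(k):
                        y, x = i + di - pad, j + dj - pad
                        if 0 <= y < h and 0 <= x < w:
                            out[i][j] += kernel[di][dj] * plane[y][x]
    return out

def spatial_attention(f_in, kernels, bias=0.0):
    """f_in is a list of C channels, each an H x W list of lists."""
    c, h, w = len(f_in), len(f_in[0]), len(f_in[0][0])
    # Pool across channels to get two H x W maps.
    avg = [[sum(f_in[ch][i][j] for ch in range(c)) / c for j in range(w)]
           for i in range(h)]
    mx = [[max(f_in[ch][i][j] for ch in range(c)) for j in range(w)]
          for i in range(h)]
    # Concatenate [avg; max] and reduce to one channel with a convolution.
    logits = conv2d_same([avg, mx], kernels, bias)
    sa = [[sigmoid(v) for v in row] for row in logits]
    f_out = [[[sa[i][j] * f_in[ch][i][j] for j in range(w)]
              for i in range(h)] for ch in range(c)]
    return sa, f_out

# Hypothetical input and kernels, used only to exercise the functions.
f_in = [[[1.0, 2.0], [3.0, 4.0]], [[0.5, 0.5], [0.5, 0.5]]]
kernels = [[[0.1] * 3 for _ in range(3)], [[0.05] * 3 for _ in range(3)]]
sa_map, f_out = spatial_attention(f_in, kernels)
```

As with channel attention, the output keeps the C × H × W shape of the input; only the relative emphasis of each spatial position changes.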

Structure of the Proposed Models
Traditional time series forecasting methods have a limited ability to extract nonlinear features, while the majority of the current electricity load data contain nonlinear characteristics. We examine the nonlinear properties of electricity load data using the 2D-CNN model, which performs well in a variety of fields. Before inputting the data into the model, we transform the time series data into image data using GAF with a statistical interpretation, since the 2D-CNN model performs incredibly well when the data are input as images and has a very strong ability to extract features. The covariance matrix at various time points in GAF is represented by the Gram matrix. The matrix includes both the time series data and the relationships between the data at various time periods. The proposed model receives the time series data as the input, and to further simplify the training procedure, we add a double attention mechanism to the model to enhance the extracted features.
Based on the above discussion, a novel, deep convolutional attention mechanism model for electricity load forecasting is proposed. The deep features of the input data are extracted using a deep convolutional network, and the extracted features are then filtered using channel attention and spatial attention, providing more weight to the features that are valuable and less weight to the features that are worthless. Multi-step ahead forecasting can be expressed as a prediction of {X_{t+k}} based on a given time series {X_t}; here, t = 1, 2, ..., T, k = 1, 2, ..., K, where k is the forecast horizon and T is the total number of samples in the time series. Therefore, the proposed method forecasts the {X_{t+1}, X_{t+2}, X_{t+3}} electricity load horizons. Figure 5 illustrates the fundamental flowchart of the proposed model. In the preprocessing stage, the time series is first rescaled to the interval [1/2, 1], and then the rescaled time series is transformed into images via GAF and the resulting image dataset is used as the input of the deep convolutional attention network. We then continuously adjust the weight parameters via backpropagation to obtain the final prediction results. More detailed mathematical expressions are given in Equations (11)-(13). Here, X_t denotes the time series observations; GAF(•) denotes the transformation of the input matrix X_t into a Gram matrix with the size of t × t; σ denotes the activation function; Conv.unit(•) indicates the convolution operation on the matrix from Section 2.2. CA means the attention operation on the input features stated in Section 2.3, and similarly, SA is the spatial attention operation. Finally, Equation (13) repeats the same structure four times, corresponding to the four identical tandem structures in Figure 5.
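The multi-step formulation of the proposed model, in which a history window is mapped to the next K values, can be sketched as a sliding-window sample builder. This is an illustrative assumption about how samples might be assembled, not the paper's stated procedure; `make_windows`, the toy series, and the window length are hypothetical (a window of 64 would match the 64 × 64 GAF image size mentioned in the text).

```python
def make_windows(series, window, horizon):
    """Pair each length-`window` history with the next `horizon` values."""
    samples = []
    for start in range(len(series) - window - horizon + 1):
        history = series[start:start + window]
        targets = series[start + window:start + window + horizon]
        samples.append((history, targets))
    return samples

# Toy series standing in for load readings; horizon=3 mirrors
# the {X_{t+1}, X_{t+2}, X_{t+3}} targets in the text.
series = list(range(10))
pairs = make_windows(series, window=4, horizon=3)
```

Each history window would then be rescaled and encoded as a GAF image before being passed to the network, while the targets stay in the original series domain.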

Experiment Study
In this section, we present the dataset used in this paper, the evaluation metrics, and the experimental results in detail; the remainder of this section contains evaluations of the obtained results, a performance improvement analysis, and a comparison of the deep learning methods.

Experiment Data
All models in this research were implemented in PyTorch, and real-time electricity load data gathered by Elia at 15 min intervals were used. The data have 8828 time points and span three months of electricity load from 1 March 2022 to 1 June 2022. The general information and statistical properties of the dataset are shown in Table 1. The model was trained using the first 6180 data points as the training set and its performance was tested using the last 2648 data points (a ratio of nearly 7:3). The proposed model can estimate electricity load at multiple scales, including at 1 h, 2 h, and 3 h time intervals. Throughout the training process, MSELoss was chosen as the loss function and was optimized using the Adam optimizer. The training batch size was set to 32; the learning rate was set to 0.001. All models were trained on an AMD Ryzen 7 5800H CPU @ 3.2 GHz and a GeForce GTX 1060 6G GPU from NVIDIA, Santa Clara, CA, USA.
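The chronological split described above can be checked with a few lines of Python. The sketch uses only the counts stated in the text (8828 points, the first 6180 for training); the helper name and the placeholder series are hypothetical.

```python
def chronological_split(series, n_train):
    """Split a series in time order: earliest points train, latest points test."""
    return series[:n_train], series[n_train:]

N_POINTS = 8828   # total 15 min readings stated in the text
N_TRAIN = 6180    # training points stated in the text

series = list(range(N_POINTS))  # placeholder values, not real load data
train, test = chronological_split(series, N_TRAIN)
train_ratio = len(train) / N_POINTS
```

Splitting by position rather than at random keeps every test point later in time than every training point, which avoids leaking future information into training.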

Evaluation Metrics
A series of evaluation metrics were chosen, including root mean squared error (RMSE), mean absolute error (MAE), symmetric mean absolute percentage error (SMAPE), mean absolute percentage error (MAPE), and correlation coefficient (R), to test the suggested model and better assess its effectiveness.
Appl. Sci. 2024, 14, 6270
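As a reference point for the five metrics named above, the following sketch implements their commonly used textbook definitions. These are assumptions: SMAPE in particular has several variants, the percentage scaling is a convention, and the paper's exact formulas may differ. The sample values are hypothetical.

```python
import math

def rmse(actual, predicted):
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def mae(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def mape(actual, predicted):
    """Mean absolute percentage error, in percent (actual values must be nonzero)."""
    return 100.0 * sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted)) / len(actual)

def smape(actual, predicted):
    """Symmetric MAPE, in percent, using the (|a| + |p|) / 2 denominator variant."""
    total = sum(2.0 * abs(a - p) / (abs(a) + abs(p)) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)

def pearson_r(actual, predicted):
    """Pearson correlation coefficient between actual and predicted values."""
    n = len(actual)
    mean_a, mean_p = sum(actual) / n, sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_a * var_p)

# Hypothetical values, used only to exercise the functions.
y_true = [100.0, 200.0, 300.0]
y_pred = [110.0, 190.0, 300.0]
scores = {
    "RMSE": rmse(y_true, y_pred),
    "MAE": mae(y_true, y_pred),
    "MAPE": mape(y_true, y_pred),
    "SMAPE": smape(y_true, y_pred),
    "R": pearson_r(y_true, y_pred),
}
```

Lower is better for the four error metrics, while R approaches 1 as predictions track the actual values more closely.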

Ablation Study
In this paper, we enumerate three different combinations to compare the experimental results with prediction scales of 1 h, 2 h, and 3 h, respectively. We confirm that the modules used in ELFNet have a positive impact on the final experimental results by comparing the errors obtained by four structures with different modules: the plain CNN, CNN-CA, CNN-SA, and the full ELFNet. The prediction results of these structures are shown in Table 2. Figure 6 shows that, for the proposed ELFNet model, the five evaluation metric values for RMSE, MAE, SMAPE, MAPE, and R are 0.0316, 0.0254, 0.2246, 0.3323, and 0.9920, respectively, at the 1 h forecast. The values of the evaluation measures when utilizing only the CNN structure without the CA and SA modules were 0.0465, 0.0373, 0.2806, 0.3769, and 0.9899. The evaluation metric values for the CNN-CA model were 0.0349, 0.0274, 0.2259, 0.3179, and 0.9916, respectively. The evaluation metric values were 0.0409, 0.0321, 0.2652, 0.4021, and 0.9892 for the CNN-SA model. Based on these metrics, it can be seen that the proposed ELFNet delivers the best prediction results on four of the evaluation metrics (RMSE, MAE, SMAPE, and R) and can provide accurate and effective prediction results. The results also show that the CA module has a significant impact on the optimization of intrinsic features, and the CNN-CA model alone achieves the best result for MAPE.
The results of the 2 h ahead electricity load forecast for different deep learning structures are shown in Figure 7. The values of the evaluation metrics RMSE, MAE, SMAPE, MAPE, and R of the proposed model ELFNet with the 2 h ahead forecast are shown in Table 2 and are, respectively, 0.0346, 0.0270, 0.2402, 0.3689, and 0.9911, and the corresponding error metrics increase when the CA and SA modules are not used. The RMSE drops from 0.0442 to 0.0402 when the CA module is applied to the CNN framework, increasing forecast accuracy by 9%; other evaluation indicators also show varying degrees of improvement, demonstrating the beneficial effects of the CA and SA modules. How the CA and SA modules affect the electricity load forecasting accuracy is also shown.
The results for the 3 h ahead forecasting are shown in Figure 8. As can be seen in Table 2, the values of the RMSE, MAE, SMAPE, MAPE, and R of the proposed ELFNet model are 0.0417, 0.0333, 0.2813, 0.4118, and 0.9892, respectively, under the 3 h ahead forecasting. Without the attention modules, the MAE value is 0.0440; it is reduced from 0.0440 to 0.0377 after using the CA module, showing a reported improvement of 13.32%; similarly, after using the SA module, the MAE value is decreased from 0.0440 to 0.0379, with an improvement of 13.86%; after using both the CA and SA modules, the improvement reaches 24.32%.
It can be shown from the aforementioned analysis and experimental data that both the CA and SA modules are crucial to increasing the model's forecast accuracy at various prediction horizons.
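The percentage improvements quoted in this ablation analysis can be reproduced with a short calculation. The following is a minimal sketch, assuming the improvement is defined as the reduction of an error metric relative to the baseline value; the function name `relative_improvement` is hypothetical and is not taken from the paper.

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Return the percentage reduction of an error metric relative to a baseline.

    Assumption: improvement = (baseline - improved) / baseline * 100.
    """
    if baseline <= 0:
        raise ValueError("baseline error must be positive")
    return (baseline - improved) / baseline * 100.0


# Error values quoted in the text above.
rmse_gain_2h_ca = relative_improvement(0.0442, 0.0402)   # CNN -> CNN-CA, 2 h RMSE
mae_gain_3h_full = relative_improvement(0.0440, 0.0333)  # CNN -> ELFNet, 3 h MAE

print(f"2 h RMSE, CA module added: {rmse_gain_2h_ca:.2f}%")
print(f"3 h MAE, CA and SA modules added: {mae_gain_3h_full:.2f}%")
```

Under this definition, a larger percentage means a larger drop in error, so the same function applies to RMSE, MAE, SMAPE, and MAPE, but not to R, where higher values are better.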

Comparative Experiment Results and Analysis
In this subsection, we compare our proposed ELFNet model with several prevalent deep learning algorithms for image data in order to further validate the multi-scale prediction performance of our model. The deep learning methods selected for this comparison are ResNet-18, ResNeXt-50, and GoogLeNet. Table 3 summarizes the comprehensive results for the following evaluation metrics: RMSE, MAE, SMAPE, MAPE, and R. The results for the 1 h ahead forecasting and the discrepancies between the predicted results and actual values are shown in Figure 9. Table 3 shows that, for the proposed forecasting model ELFNet, the values of the evaluation metrics are 0.0316, 0.0254, 0.2246, 0.3323, and 0.9920, respectively, under the 1 h ahead forecast. The evaluation metrics for ResNeXt-50 are 0.0514, 0.0404, 0.3070, 0.4520, and 0.9812, respectively, whereas those for ResNet-18 are 0.0569, 0.0409, 0.0307, 0.4520, and 0.9812. The evaluation metrics for GoogLeNet have values of 0.0506, 0.0395, 0.3013, 0.3009, and 0.8806. A visualization of the experimental data can be seen in Figure 10. It is apparent from these evaluation metrics that ResNet-18 has the poorest prediction results and ELFNet has the best prediction performance for all evaluation metrics. The scatter plot of the predicted and actual value pairs is presented in Figure 9c. The black dashed line in the figure, which runs from the lower left corner to the upper right corner, is where all points would fall for a perfect prediction, so the nearer the scatter points are to this line, the more accurate the model predictions. ELFNet clearly performs better in this regard than all other models.
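For readers who want to reproduce the five evaluation metrics on their own predictions, the following is a minimal sketch using common textbook definitions. It is an assumption that these match the paper's formulas: in particular, MAPE and SMAPE are expressed here in percent, SMAPE uses the variant with the mean of absolute values in the denominator, and R is taken to be the Pearson correlation coefficient. The sample values are illustrative only and are not from the Elia dataset.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mape(actual, predicted):
    """Mean absolute percentage error in percent; assumes no actual value is zero."""
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


def smape(actual, predicted):
    """Symmetric MAPE in percent (one common variant; assumed, not confirmed by the paper)."""
    terms = (2.0 * abs(a - p) / (abs(a) + abs(p)) for a, p in zip(actual, predicted))
    return 100.0 * sum(terms) / len(actual)


def pearson_r(actual, predicted):
    """Pearson correlation coefficient between actual and predicted values."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_a * var_p)


# Illustrative values only.
actual = [10.0, 12.0, 11.0, 13.0]
predicted = [10.5, 11.5, 11.0, 13.5]

for name, fn in [("RMSE", rmse), ("MAE", mae), ("MAPE", mape),
                 ("SMAPE", smape), ("R", pearson_r)]:
    print(f"{name}: {fn(actual, predicted):.4f}")
```

A value of R close to 1 corresponds to scatter points lying close to the diagonal line in the correlation diagrams.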
The results for the 2 h ahead forecasting are shown in Figures 11 and 12. Table 3 shows that the five evaluation indicators for the ELFNet model developed in this paper are 0.0346, 0.0270, 0.2402, 0.3689, and 0.9911, respectively, at a forecast scale of 2 h. Compared to the other deep learning techniques, ELFNet produced the best result on the RMSE measure, with GoogLeNet coming in second, with a score of 0.0464. The least accurate prediction was made by ResNet-18, which suggests that relying only on residual connections to mine features is insufficient to capture all of the available characteristics of the experimental data. In this comparative experiment, ELFNet differed significantly from the other three deep learning methods on the evaluation index R. The R values of the other three techniques were nearly identical to one another, whereas the ELFNet R value was as high as 0.9911, suggesting that the model proposed in this paper has excellent nonlinear representation and robust data-fitting capabilities. Figure 12c shows that ELFNet has the smallest degree of dispersion and that its actual and predicted values lie closest to the black dotted diagonal line. It also shows how effective and precise ELFNet's prediction capabilities are.
The results for the 3 h ahead forecasting are shown in Figures 13 and 14. For the prediction scale of 3 h, the values of the evaluation metrics RMSE, MAE, SMAPE, MAPE, and R of the proposed ELFNet model are 0.0417, 0.0333, 0.2813, 0.4118, and 0.9892, respectively. ELFNet outperforms the other deep learning techniques according to the SMAPE evaluation criterion, with ResNet-18 coming in last, with a result of 0.4996. Figure 13c illustrates that, at a prediction scale of 3 h, ELFNet still maintains a high prediction accuracy, while among the other three methods ResNeXt-50 shows the highest prediction errors on some evaluation indicators. This indicates that, for some evaluation indicators, increasing the depth of the network does not further improve the prediction accuracy. The ResNet-18 error fluctuates the most, and its increase becomes progressively larger. These results demonstrate that the ELFNet model put forward in this research has great prediction stability and continues to produce reliable forecasts at various prediction scales.
A histogram of prediction deviations (Figure 15) was created to show the estimated error margin and distribution properties for all electricity prediction horizons. The normal distribution, along with the mean and variance, is shown by the black dashed line. In general, the position of the normal distribution curve is determined by the mean value, while the shape of the curve is determined by the variance. The closer the mean value is to 0 and the smaller the variance, the more accurate the prediction. In this experiment, for the 1 h prediction horizon, the mean is 21.4271 and the variance is 133.7763; for the 2 h prediction horizon, the mean is −14.9310 and the variance is 140.0519; and for the 3 h prediction horizon, the mean is 33.3051 and the variance is 171.8637. These data all indicate the accuracy of ELFNet's predictions.
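The mean and variance used to fit the normal curve over the deviation histogram can be computed as follows. This is a minimal sketch under stated assumptions: the deviation is taken as predicted minus actual, the population variance is used, and the sample values are illustrative only; the paper does not state which conventions it follows.

```python
import math
import statistics


def deviation_stats(actual, predicted):
    """Return (mean, variance) of the prediction deviations.

    Assumptions: deviation = predicted - actual, population variance.
    """
    deviations = [p - a for a, p in zip(actual, predicted)]
    mean = statistics.fmean(deviations)
    variance = statistics.pvariance(deviations, mu=mean)
    return mean, variance


def normal_pdf(x, mean, variance):
    """Density of the normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * variance)) / math.sqrt(2.0 * math.pi * variance)


# Illustrative load values only.
actual = [100.0, 110.0, 120.0, 130.0]
predicted = [102.0, 108.0, 123.0, 131.0]

mean, variance = deviation_stats(actual, predicted)
print(f"mean deviation: {mean:.4f}, variance: {variance:.4f}")
print(f"density at the mean: {normal_pdf(mean, mean, variance):.4f}")
```

The mean shifts the curve left or right, and the variance controls how wide it is, which matches the reading of Figure 15 given above.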

Robustness Experiment
The robustness of time series forecasting typically refers to the accuracy and stability of predictions in the face of noise or other signal disturbances in the original time series data [43]. Random Gaussian noise with SNRs of 20, 30, 40, 50, 60, and 70 is added to the data. Equation (20) is the calculation formula for SNR, where P_s denotes the effective power of the time series and P_n denotes the effective power of the noise.
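The noise injection step can be sketched as follows. This is a minimal illustration, assuming that the SNR is expressed in decibels as SNR = 10 log10(P_s / P_n) and that the effective power is the mean of the squared values; both are common conventions but are assumptions here, since Equation (20) is not reproduced in this text. The function names and the synthetic daily-cycle series are hypothetical.

```python
import math
import random


def signal_power(series):
    """Mean squared value of a series (assumed meaning of 'effective power')."""
    return sum(x * x for x in series) / len(series)


def add_gaussian_noise(series, snr_db, rng):
    """Add zero-mean Gaussian noise so that the expected SNR equals snr_db (in dB)."""
    p_s = signal_power(series)
    p_n = p_s / (10.0 ** (snr_db / 10.0))  # from SNR = 10 * log10(P_s / P_n)
    sigma = math.sqrt(p_n)
    return [x + rng.gauss(0.0, sigma) for x in series]


def measured_snr_db(clean, noisy):
    """Empirical SNR in dB, computed from the clean series and the added noise."""
    p_s = signal_power(clean)
    p_n = signal_power([n - c for c, n in zip(clean, noisy)])
    return 10.0 * math.log10(p_s / p_n)


rng = random.Random(0)
# Synthetic series with a 24-step cycle, standing in for hourly load data.
clean = [10.0 + math.sin(2.0 * math.pi * t / 24.0) for t in range(5000)]

for snr_db in (20, 30, 40, 50, 60, 70):
    noisy = add_gaussian_noise(clean, snr_db, rng)
    print(f"target SNR {snr_db} dB, measured {measured_snr_db(clean, noisy):.2f} dB")
```

A lower SNR means stronger noise relative to the signal, so the SNR of 20 is the most demanding of the six settings.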
The comparison between the time series after adding Gaussian noise and the original time series is depicted in Figure 16. To evaluate how accurately the model forecasts the future after noise is added, the RMSE evaluation index is utilized. Figure 17 and Table 4 show that, at the 1 h prediction horizon, the RMSE values of the electricity load data are extremely close across noise levels. For the 2 h prediction horizon, the RMSE of the prediction result for the data with an SNR of 20 is 0.05657, a significant increase over the RMSE of 0.0316 for the original data without noise. When the prediction horizon is 3 h, the errors at SNRs of 20, 30, and 40 increase to varying degrees, with a maximum increase of 41.87%. The above experimental results demonstrate that ELFNet has strong robustness and a strong anti-noise ability at various prediction horizons.

Conclusions
Accurate electricity load predictions are crucial to determine the supply and demand relationship of electric energy. High-efficiency and high-accuracy electricity load forecasts help guarantee the sensible and efficient use of electric energy and provide valuable information to the electric industry, suppliers, and other stakeholders. In this study, we propose a novel model for forecasting electricity loads, called ELFNet, based on a deep convolutional neural network with attention mechanisms. The electricity load time series is converted into 2D image input by GAF, and deep convolutional blocks extract the inherent features of this input. CA and SA modules are added after the convolutional structure to further optimize the intrinsic features extracted by the convolutional block.
We evaluated the prediction performance of the network when different attention modules were added to the deep convolutional structure. The results show that the prediction accuracy is improved to varying degrees by the addition of the CA and SA modules at prediction horizons of 1 h, 2 h, and 3 h, and the greatest improvement is achieved by adding the CA and SA modules simultaneously. Additionally, when compared to the traditional deep learning networks ResNet-18, ResNeXt-50, and GoogLeNet, ELFNet achieves the highest prediction accuracy at multiple prediction horizons. At the 1 h ahead forecasting, the RMSE value of ELFNet is 0.0316, which is lower by 44.38%, 38.42%, and 37.49% than the 0.0569 obtained by ResNet-18, the 0.0514 obtained by ResNeXt-50, and the 0.0506 obtained by GoogLeNet, respectively. At the 2 h ahead forecasting, the RMSE value of ELFNet is 0.0346, which is lower by 50.49%, 40.86%, and 25.27% than the 0.07 of ResNet-18, the 0.0586 of ResNeXt-50, and the 0.0464 of GoogLeNet, respectively. ELFNet's RMSE value for the 3 h ahead forecasting is 0.0417, which is lower than the 0.0836 obtained by ResNet-18, the 0.0661 obtained by ResNeXt-50, and the 0.0541 obtained by GoogLeNet, showing an improvement of 50.10%, 36.85%, and 22.85%, respectively. Future work will focus on increasing prediction efficiency while maintaining accuracy, optimizing the network structure of ELFNet, tuning the network's hyperparameters, and adapting the network structure to the data's inherent features.

Figure 1, and from there we can determine the size of the converted image. The image training set used in this study was set to be 64 × 64 pixels in size.

Figure 5. Flowchart of the proposed model.

• CNN: the convolutional structure proposed in Section 2.2, without CA and SA modules;
• CNN-CA: the CNN structure with only the CA module added;
• CNN-SA: the CNN structure with only the SA module added;
• ELFNet: the proposed final model in this paper (shown by Figure 5 in Section 3).

Figure 6. ELFNet forecasting results for 1 h ahead. (a) is the prediction result, and (b) is the local enlargement.

Figure 7. ELFNet forecasting results for 2 h ahead. (a) is the prediction result, and (b) is the local enlargement.

Figure 8. ELFNet forecasting results for 3 h ahead. (a) is the prediction result, and (b) is the local enlargement.

Figure 9. Forecasting results of ELFNet and deep learning methods for 1 h ahead. (a) is the prediction result, (b) is the local enlargement, and (c) is the correlation diagram between the true value and the predicted value.

Figure 12. Forecasting results of ELFNet and deep learning methods for 2 h ahead. (a) is the prediction result, (b) is the local enlargement, and (c) is the correlation diagram between the true value and the predicted value.

Figure 13. Forecasting results of ELFNet and deep learning methods for 3 h ahead. (a) is the prediction result, (b) is the local enlargement, and (c) is the correlation diagram between the true value and the predicted value.

Figure 15. Error distributions of ELFNet and other methods.

Table 1. Statistical properties of the dataset.

Table 2. Comparison of the evaluation metrics of CNN with different combinations of modules.

Table 3. Performance comparison of the evaluation metrics for ELFNet and deep learning methods.

Table 4. Comparison of evaluation metrics of CNN with different combinations of modules.