A Short-Term Load Forecasting Method Based on GRU-CNN Hybrid Neural Network Model



Introduction
Because electrical energy is difficult to store at large scale and power demand changes continuously, power generation must be dynamically balanced against changes in load [1,2]. Load forecasting plays an important role in power construction planning and power grid operation: accurate load forecasting minimizes the gap between electricity supply and demand and improves the stability of power systems [3]. Even a small increase in load forecasting error may cost millions of dollars every year [4], so it is essential to build an accurate load forecasting model. According to the predicted time range, power load forecasts can be divided into long-term, medium-term, short-term, and ultra-short-term forecasts [5]. This paper focuses on short-term load forecasting (STLF), which predicts future loads from minutes to weeks ahead; accurate STLF can help power system staff develop reasonable production plans, maintain the balance of supply and demand, and ensure grid safety while reducing resource waste and electricity costs [6,7].
With the large-scale application of smart meters and smart sensors in the power system, the degree of informatization continues to increase and a large amount of data is generated, which provides a reliable source of data for accurate load forecasting. Meanwhile, the continuous improvement of computing performance and the application of distributed parallel computing provide powerful computing capacity for massive data. Against this background, many load forecasting methods based on massive data have emerged, and they fall mainly into two categories. The first is traditional statistical methods [8], which are the most frequently used in the early literature and include the linear regression (LR) analysis approach and the autoregressive moving average (ARMA) approach [9]. Lee C. K. proposed a lifting scheme and autoregressive integrated moving average (ARIMA) models to achieve STLF in [10]. These methods can achieve short-term load forecasting to a certain extent; however, massive data contain many inherent nonlinear features that traditional statistical methods cannot learn well [11], so these methods cannot meet the accuracy requirements of STLF. The other is machine learning methods, which have been widely and successfully used in prediction and classification problems and include the artificial neural network (ANN) [12], the support vector machine (SVM) [13], and the fuzzy inference system (FIS) [14]. Because they abstract nonlinear features better, machine learning methods are well suited to nonlinear problems. In [15], Niu DX creates a system for power load forecasting using a support vector machine and ant colony optimization. In [16], the authors present short-term load forecasting models developed by using fuzzy logic and the adaptive neuro-fuzzy inference system (ANFIS).
In recent years, with the rise of artificial intelligence, ANN methods have been widely used in load forecasting. The backpropagation neural network (BPNN) was the first widely used ANN method for STLF [17]. A combined model, which used the BPNN with a multilabel algorithm based on K-nearest neighbor (K-NN) and K-means, was proposed for STLF in [18]; however, BPNN is a feedforward neural network and cannot learn time sequence data in the power system well [19]. To process time sequence data in the power system efficiently, such as holiday, weather, and temperature information, the recurrent neural network (RNN) [20], a kind of neural network designed for processing sequence data, is widely used for STLF [21]. The authors in [22] use local RNN models to deal with the problem of long-term wind speed and power forecasting based on meteorological information. However, because of the excessive depth in time and the simple hidden layer in the traditional RNN structure, gradients vanish during error backpropagation, which makes it very difficult for an RNN to learn from long historical data. In response to these shortcomings, Hochreiter and Schmidhuber proposed the long short-term memory (LSTM) recurrent neural network in 1997 [23], which overcomes the disadvantages of traditional RNNs by combining short-term memory with long-term memory through gate control. A novel method that integrates LSTM and a genetic algorithm (GA) was proposed for STLF [24], and it yielded a small mean absolute percentage error. The gated recurrent unit (GRU) [25] is a special type of recurrent neural network based on an optimized LSTM. The GRU internal unit is similar to that of the LSTM [26], except that the GRU combines the input gate and the forgetting gate of the LSTM into a single update gate.
In [27], a novel system called the multi-GRU prediction system was developed based on GRU models for the planning and operation of electricity generation. Wang proposed a novel approach to forecasting short-term photovoltaic power based on GRU networks [28]. However, the power system contains not only sequence data but also other kinds of high-dimensional data, such as spatiotemporal matrixes and image information, and the GRU model cannot handle all of these well. The convolution neural network (CNN) [29], which has been widely used in image recognition and in prediction [30], is ideal for processing high-dimensional data. When there is a strong relationship between nearby data points, CNN can capture local trend features and scale-invariant features [31,32]. In [33], the author proposed an end-to-end automatic image annotation method based on a deep CNN and multilabel data augmentation, and the model performs well in automatic image annotation.
To make full use of the various data in the power system and achieve accurate STLF, the GRU-CNN hybrid neural network model is proposed, which combines the GRU model with the CNN model. In the proposed model, the GRU module models dynamic changes in the historical load sequence data so as to better learn the latent features in time sequence data. The CNN module processes spatiotemporal matrixes and maps them into a feature vector. The GRU-CNN model combines the outputs of GRU and CNN and derives the load prediction result through the activation function. To verify the superiority of the GRU-CNN model in short-term load forecasting, the proposed method was compared with the BPNN, GRU, and CNN models in a real-world experiment. The four models were trained and tested, and the mean absolute percentage error (MAPE) and root mean square error (RMSE) were used as the evaluation indexes. The experimental results demonstrate that the GRU-CNN model achieves the best prediction performance in STLF among the four models.
This paper is organized as follows. Section 2 introduces the proposed GRU-CNN hybrid neural network and its modules. In Section 3, the GRU-CNN model is used to forecast the electrical load and is compared with BPNN, GRU, and CNN in a real-world case. Finally, the conclusion is drawn in Section 4.

The Establishment of GRU Module.
RNN is a kind of artificial neural network suited to analyzing and processing time sequence data. Unlike traditional neural networks, which rely only on the weight connections between layers, RNN uses its hidden layers to preserve information from the previous moment, so the output is influenced by both the current state and previous memories. For better understanding, the unrolled structure of RNN is shown in Figure 1, where $x^{\langle t\rangle}$ and $y^{\langle t\rangle}$ represent the input and output at time $t$, $a^{\langle t\rangle}$ represents the output of one single hidden layer at time $t$, and $\omega_{aa}^{\langle t\rangle}$, $\omega_{ax}^{\langle t\rangle}$, and $\omega_{ay}^{\langle t\rangle}$ represent the hidden layer weight matrixes, the input weight matrixes, and the output weight matrixes, respectively. Figure 1 can be represented by the following formulas:

$$a^{\langle t\rangle} = g_1\left(\omega_{aa}^{\langle t\rangle} a^{\langle t-1\rangle} + \omega_{ax}^{\langle t\rangle} x^{\langle t\rangle} + b_a\right),$$

$$y^{\langle t\rangle} = g_2\left(\omega_{ay}^{\langle t\rangle} a^{\langle t\rangle} + b_y\right),$$

where $b_a$ and $b_y$ represent the bias vectors of one single hidden layer and the output, respectively, and $g_1$ and $g_2$ are the nonlinear activation functions. RNN performs well when the output is close to its associated inputs; however, when the time interval is long and the number of weights becomes large, the input has little effect on the output because of the gradient vanishing problem. To solve the gradient vanishing and simple hidden layer structure problems of RNN, a special type of RNN called GRU was proposed.
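The recurrence described above, in which each hidden output depends on the previous hidden output and the current input, can be illustrated with a minimal pure-Python sketch. The function names, the choice of tanh for g1 and sigmoid for g2, and the use of one shared set of weights at every time step are assumptions made for illustration; the text only states that g1 and g2 are nonlinear.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def rnn_step(a_prev, x_t, w_aa, w_ax, w_ay, b_a, b_y):
    """Compute one unrolled step and return (a_t, y_t).

    Assumption: g1 = tanh and g2 = sigmoid; the text only requires that
    both are nonlinear activation functions.
    """
    recurrent = matvec(w_aa, a_prev)  # contribution of the previous hidden output
    driven = matvec(w_ax, x_t)        # contribution of the current input
    a_t = [math.tanh(r + d + b) for r, d, b in zip(recurrent, driven, b_a)]
    y_t = [sigmoid(o + b) for o, b in zip(matvec(w_ay, a_t), b_y)]
    return a_t, y_t


def rnn_forward(inputs, a_0, w_aa, w_ax, w_ay, b_a, b_y):
    """Run the step over a sequence, carrying the hidden output forward.

    Assumption: the same weights are reused at every time step.
    """
    a_t, outputs = a_0, []
    for x_t in inputs:
        a_t, y_t = rnn_step(a_t, x_t, w_aa, w_ax, w_ay, b_a, b_y)
        outputs.append(y_t)
    return a_t, outputs
```

Because every step multiplies by the recurrent weights and passes through tanh again, the influence of an early input shrinks as the sequence grows, which is the gradient vanishing behaviour that motivates GRU.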
GRU is a variant of LSTM with a gated recurrent neural network structure. Compared with LSTM, which has three gates (forgetting gate, input gate, and output gate), GRU has only two (update gate and reset gate); GRU therefore has fewer training parameters than LSTM and converges more quickly during training [34].
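The GRU structure is shown in Figure 2, where $\sigma$ and $\tanh$ are the activation functions; $c^{\langle t-1\rangle}$ is the input of the current unit, which is also the output of the previous unit; $c^{\langle t\rangle}$ is the output of the current unit, which links to the input of the next unit; $x^{\langle t\rangle}$ is the input of the training data; $y^{\langle t\rangle}$ is the outcome of this unit, generated by the activation function; $\Gamma_r$ and $\Gamma_u$ represent the reset gate and the update gate, respectively; and the candidate activation $\tilde{c}^{\langle t\rangle}$ is computed similarly to that of the traditional recurrent unit. There are two gates in GRU. One is the update gate, which controls how much previous information is carried into the current state; the value of $\Gamma_u$ ranges from 0 to 1, and the closer $\Gamma_u$ is to zero, the more previous information is retained. The other is the reset gate, which determines whether the current status and previous information are to be combined; the value of $\Gamma_r$ ranges from 0 to 1, and the smaller the value of $\Gamma_r$, the more previous information is ignored. According to Figure 2, the formulas of GRU can be written as

$$\Gamma_u = \sigma\left(\omega_u\left[c^{\langle t-1\rangle}, x^{\langle t\rangle}\right] + b_u\right),$$

$$\Gamma_r = \sigma\left(\omega_r\left[c^{\langle t-1\rangle}, x^{\langle t\rangle}\right] + b_r\right),$$

$$\tilde{c}^{\langle t\rangle} = \tanh\left(\omega_c\left[\Gamma_r \odot c^{\langle t-1\rangle}, x^{\langle t\rangle}\right] + b_c\right),$$

$$c^{\langle t\rangle} = \Gamma_u \odot \tilde{c}^{\langle t\rangle} + \left(1 - \Gamma_u\right) \odot c^{\langle t-1\rangle},$$

where $\omega_u$, $\omega_r$, and $\omega_c$ represent the training weight matrixes of the update gate, the reset gate, and the candidate activation $\tilde{c}^{\langle t\rangle}$, respectively, and $b_u$, $b_r$, and $b_c$ are the bias vectors.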
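The gate behaviour described above can be made concrete with a minimal pure-Python sketch of a single GRU step. It assumes the common formulation in which both gates use the sigmoid function, the candidate activation uses tanh, and each weight matrix acts on the concatenation of the previous output and the current input; the function names are hypothetical and the sketch is not the authors' implementation.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(weights, vector, bias):
    """Return W v + b for a list-of-rows matrix W."""
    return [sum(w * v for w, v in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def gru_step(c_prev, x_t, w_u, w_r, w_c, b_u, b_r, b_c):
    """One GRU step: returns the new unit output c_t.

    Assumption: an update gate near 0 keeps the previous output and an
    update gate near 1 replaces it with the candidate activation, which
    matches the description of the update gate in the text.
    """
    concat = c_prev + x_t  # [c<t-1>, x<t>]
    gamma_u = [sigmoid(z) for z in affine(w_u, concat, b_u)]  # update gate
    gamma_r = [sigmoid(z) for z in affine(w_r, concat, b_r)]  # reset gate
    # The reset gate scales the previous output before the candidate is formed.
    gated = [r * c for r, c in zip(gamma_r, c_prev)] + x_t
    c_tilde = [math.tanh(z) for z in affine(w_c, gated, b_c)]
    # Blend the candidate with the previous output, element by element.
    return [u * ct + (1.0 - u) * cp
            for u, ct, cp in zip(gamma_u, c_tilde, c_prev)]
```

Pushing the update-gate bias strongly negative makes the unit copy its previous output almost unchanged, while a strongly positive bias makes it adopt the candidate activation; this is how the unit can carry information across long time intervals.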

The Establishment of CNN Module.
CNN is a kind of artificial neural network that processes high-dimensional data well. It is commonly applied in visual image recognition, video recognition, and text categorization. There are many smart sensors and devices in the power system. To preserve the spatial information of the data they record, the spatiotemporal matrix is proposed; the spatiotemporal matrix data are organized by the location of the sensors and by time sequence.
The spatiotemporal matrix is shown as

$$X = \begin{bmatrix} X_1(1) & X_1(2) & \cdots & X_1(n) \\ X_2(1) & X_2(2) & \cdots & X_2(n) \\ \vdots & \vdots & \ddots & \vdots \\ X_k(1) & X_k(2) & \cdots & X_k(n) \end{bmatrix},$$

where $k$ represents the $k$th smart sensor, $n$ represents the $n$th time sequence, and $X_k(n)$ represents the data recorded by the $k$th smart sensor at time $n$. To extract the load feature from the spatiotemporal matrix, CNN is used to process it. The structure of CNN is shown in Figure 3. First, many two-dimensional spatiotemporal matrixes are stacked into three-dimensional matrix blocks, and a convolution operation is applied to these blocks. The purpose of the convolution operation is to obtain a highly abstract feature. The outputs of the convolution operation are then passed to a pooling operation. Pooling does not change the depth of the input matrix, but it reduces the size of the matrixes and the number of nodes, and thereby the number of parameters in the entire neural network. After repeated convolution and pooling operations, the highly abstract feature is obtained and flattened into a one-dimensional vector so that it can be connected with the fully connected layer. Then, the weights and bias parameters of the fully connected layer are calculated iteratively. Finally, the prediction results are obtained through the output of the activation function.

Mathematical Problems in Engineering
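The convolution, pooling, and flatten sequence described above can be sketched on a single spatiotemporal matrix in pure Python. The layout (one row per sensor, one column per time step), the helper names, the kernel, and the 2 x 2 max pooling are assumptions chosen for illustration; the text does not specify kernel sizes or the pooling type.

```python
def build_matrix(readings):
    """Arrange readings so that readings[k][n] is sensor k at time step n."""
    width = len(readings[0])
    if any(len(row) != width for row in readings):
        raise ValueError("every sensor needs the same number of time steps")
    return [list(row) for row in readings]


def conv2d_valid(matrix, kernel):
    """Slide the kernel over the matrix without padding.

    As in most deep learning libraries, the kernel is not flipped, so this
    is strictly a cross-correlation.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(matrix) - kh + 1
    out_w = len(matrix[0]) - kw + 1
    return [[sum(matrix[i + di][j + dj] * kernel[di][dj]
                 for di in range(kh) for dj in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]


def max_pool(matrix, size=2):
    """Keep the largest value in each non-overlapping size x size window."""
    out_h, out_w = len(matrix) // size, len(matrix[0]) // size
    return [[max(matrix[i * size + di][j * size + dj]
                 for di in range(size) for dj in range(size))
             for j in range(out_w)]
            for i in range(out_h)]


def flatten(matrix):
    """Turn a two-dimensional feature map into a one-dimensional vector."""
    return [value for row in matrix for value in row]
```

For a 5 x 5 matrix and a 2 x 2 kernel, the convolution yields a 4 x 4 feature map, pooling reduces it to 2 x 2, and flattening gives a vector of four features that can feed a fully connected layer.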

The GRU-CNN Hybrid Neural Networks.
To combine the advantages of the GRU module, which processes time sequence data well, with those of the CNN module, which is ideal for handling high-dimensional data, the GRU-CNN hybrid neural network is proposed; its structure is shown in Figure 4. The framework consists of a GRU module and a CNN module. The inputs are the time sequence data and spatiotemporal matrixes collected from the power system; the output is the prediction of the future load value. The CNN module is good at processing two-dimensional data, such as spatiotemporal matrixes and images. It uses local connections and shared weights to extract local features directly from the spatiotemporal matrix data and obtains an effective representation through the convolution layers and pooling layers. The CNN module contains two convolution layers and a flatten operation, and each convolution layer contains a convolution operation and a pooling operation. After the second pooling operation, the high-dimensional data are flattened into one-dimensional data, and the outputs of the CNN module are connected with the fully connected layer. The aim of the GRU module, on the other hand, is to capture long-term dependency: it learns useful information from long periods of historical data through the memory cell, while useless information is discarded through its gates. The inputs of the GRU module are time sequence data; the module contains many gated recurrent units, and the outputs of all these units are connected with the fully connected layer. Finally, the load prediction results are obtained by calculating the mean value of all neurons in the fully connected layers. The flow chart of the GRU-CNN method is shown in Figure 5.
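The fusion step described above, in which the outputs of both modules are connected with a fully connected layer and the prediction is the mean of its neurons, can be sketched as follows. The function names, the simple concatenation of the two feature vectors, and the omission of an activation function are assumptions for illustration; the layer sizes and the exact fusion used in Figure 4 are not given in the text.

```python
def fully_connected(features, weights, bias):
    """Return W f + b, one value per neuron of the fully connected layer."""
    return [sum(w * f for w, f in zip(row, features)) + b
            for row, b in zip(weights, bias)]


def fuse_and_predict(gru_features, cnn_features, weights, bias):
    """Merge both feature vectors and reduce the layer to one load value.

    Assumption: the two vectors are concatenated and no activation is
    applied before the mean is taken.
    """
    merged = list(gru_features) + list(cnn_features)
    if any(len(row) != len(merged) for row in weights):
        raise ValueError("each weight row must match the merged feature length")
    neurons = fully_connected(merged, weights, bias)
    return sum(neurons) / len(neurons)
```

With this arrangement the GRU features and the CNN features contribute to every neuron of the shared layer, so the weights learned there decide how much each module influences the final prediction.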

Datasets Description.
In this experiment, the electric load dataset is provided by a power distribution network in Wuwei, Gansu province. The dataset was collected in August 2018. It contains 44640 samples, which were recorded every minute for a total of 31 days. Of these, 31680 samples are selected as the training set. Also, 11520 samples are selected as the test set and divided into 8 tests, and each test contains 1440 samples. The last 1440 samples are selected as the dev set. Each sample contains time sequence data such as temperature, holidays, and weather in Wuwei, and spatiotemporal matrix data, which were collected by distributed smart electric meters in the Wuwei distribution network. The time sequence data of the samples were selected as the input of the GRU module, and the spatiotemporal matrix data of the samples were selected as the input of the CNN module.

Figure 2: Gated recurrent unit structure.
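The partition described above can be checked with a short sketch: 31 days at one sample per minute give 44640 samples, which split into 31680 training samples, eight tests of 1440 samples each, and a final dev set of 1440 samples. The function name and the assumption that the training samples come first, followed by the tests and then the dev set, are illustrative; the text only states that the dev set is taken from the last 1440 samples.

```python
SAMPLES_PER_DAY = 24 * 60  # one sample per minute


def split_dataset(samples, n_train=31680, n_tests=8,
                  test_size=SAMPLES_PER_DAY, dev_size=SAMPLES_PER_DAY):
    """Split samples into (train, tests, dev) in their recorded order.

    Assumption: training samples come first, then the tests, then the dev
    set; only the position of the dev set is stated in the text.
    """
    expected = n_train + n_tests * test_size + dev_size
    if len(samples) != expected:
        raise ValueError(f"expected {expected} samples, got {len(samples)}")
    train = samples[:n_train]
    tests = [samples[n_train + i * test_size:n_train + (i + 1) * test_size]
             for i in range(n_tests)]
    dev = samples[-dev_size:]
    return train, tests, dev
```

Each test and the dev set cover exactly one day of minute-level data, which matches the 24-hour forecasting horizon used in the experiments.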

Model Evaluation Indexes.
To evaluate the performance of different prediction models, the absolute percentage error (APE), the mean absolute percentage error (MAPE), and the root mean square error (RMSE) are introduced. The formulas of APE, MAPE, and RMSE are as follows:

$$\mathrm{APE} = \left|\frac{y(i) - \hat{y}(i)}{y(i)}\right| \times 100\%,$$

$$\mathrm{MAPE} = \frac{1}{n}\sum_{i=1}^{n}\left|\frac{y(i) - \hat{y}(i)}{y(i)}\right| \times 100\%,$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y(i) - \hat{y}(i)\right)^2},$$

where $n$ is the size of the training and test samples, and $y(i)$ and $\hat{y}(i)$ are the actual value and the predicted value, respectively. The APE represents the ratio between the error and the actual value at one prediction point, the MAPE represents the average of the APE over all the test sets, and the RMSE is the sample standard deviation of the differences between the predicted value and the actual value. The smaller the values of MAPE and RMSE, the better the prediction performance of the model.
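The three evaluation indexes can be computed with a few lines of pure Python. This is a minimal sketch using the standard definitions of APE, MAPE, and RMSE; the function names are hypothetical, and the sketch assumes that no actual load value is zero.

```python
import math


def ape(actual, predicted):
    """Absolute percentage error at one prediction point, in percent."""
    return abs(actual - predicted) / abs(actual) * 100.0


def mape(actuals, predictions):
    """Mean of the APE over all prediction points, in percent."""
    if len(actuals) != len(predictions):
        raise ValueError("actuals and predictions must have the same length")
    return sum(ape(a, p) for a, p in zip(actuals, predictions)) / len(actuals)


def rmse(actuals, predictions):
    """Root mean square error, in the same unit as the load."""
    if len(actuals) != len(predictions):
        raise ValueError("actuals and predictions must have the same length")
    squared = sum((a - p) ** 2 for a, p in zip(actuals, predictions))
    return math.sqrt(squared / len(actuals))
```

MAPE is scale-free and therefore comparable across tests with different load levels, whereas RMSE is expressed in the unit of the load and penalizes large individual errors more heavily.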

Experimental Results and
The cross-entropy function was selected as the loss function; it characterizes the distance between two probability distributions. The smaller the cross entropy, the closer the two probability distributions. The formula of cross entropy is as follows:

$$H(y, y') = -\sum_{i=1}^{n} y_i \log y_i',$$

where $n$ is the size of the training samples and $y_i$ and $y_i'$ are the actual value and the predicted value, respectively. The closer the loss value is to zero, the better the prediction model fits the training set. The loss values of the four load prediction models are shown in Figure 6.
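For illustration, the standard definition of cross entropy between an actual distribution and a predicted distribution can be sketched as follows. The function name and the small `eps` guard against taking the logarithm of zero are assumptions, and how the load values are mapped to probability distributions before this loss is applied is not stated in the text.

```python
import math


def cross_entropy(actual, predicted, eps=1e-12):
    """Standard cross entropy: minus the sum of y_i * log(y'_i).

    Assumption: both inputs are already probability distributions; eps
    only prevents log(0) and is not part of the definition.
    """
    if len(actual) != len(predicted):
        raise ValueError("both distributions must have the same length")
    return -sum(y * math.log(max(p, eps)) for y, p in zip(actual, predicted))
```

When the predicted distribution puts all of its mass where the actual distribution does, the loss is zero; spreading the predicted mass evenly over two outcomes when only one occurs gives a loss of log 2.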
As seen from Figure 6, the BPNN, GRU, and GRU-CNN models converge fast, and their loss values are lower than 0.1 after 200 iterations, whereas the loss value of the CNN model is about 0.2 at the 200th iteration and shows slight fluctuations. All of these methods achieve a very small loss value after 500 iterations, which means that all of them can learn the training set well and achieve good performance on the test set. Beyond 500 iterations, the loss value decreases extremely slowly. If the number of iterations is increased further, all of the models still perform well on the test set, but they cannot predict well on the dev set, which indicates that the models are overfitted. The number of iterations was therefore set to 500 in this experiment.
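The iteration count above was fixed by inspecting the loss curves and the dev set performance. One common way to automate such a choice is early stopping on the dev loss, sketched below; this is a swapped-in illustration rather than the procedure used in this experiment, and the function name and the `patience` parameter are hypothetical.

```python
def pick_iteration_count(dev_losses, patience=3):
    """Return the 1-based iteration with the best dev loss seen before stopping.

    Scanning stops once the dev loss has failed to improve for `patience`
    consecutive iterations, which is taken as a sign of overfitting.
    """
    best_iteration, best_loss, waited = 0, float("inf"), 0
    for iteration, loss in enumerate(dev_losses, start=1):
        if loss < best_loss:
            best_iteration, best_loss, waited = iteration, loss, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_iteration
```

A larger `patience` tolerates longer plateaus in the dev loss before stopping, at the cost of more training iterations.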
Eight groups of data were trained and tested for the 24-hour load forecast with the BPNN, GRU, CNN, and GRU-CNN models, and the MAPE and RMSE of the load prediction results of the four models are shown in Table 1 and Table 2, respectively.
Here, test avg represents the average value of the eight testing results. As seen in Tables 1 and 2, the BPNN model performs a little better than the GRU model, while the CNN model performs much better than both the GRU model and the BPNN model. Because most of the data in this paper are spatiotemporal matrix data, flattening them into one-dimensional time sequence data loses some information, so the BPNN and GRU models cannot fully learn the features of the dataset. CNN, however, is well suited to handling high-dimensional data and can process the spatiotemporal matrixes fully and rapidly. The proposed GRU-CNN model, which fuses the GRU model and the CNN model, provides the best forecasting results. The test avg indexes of the GRU-CNN model are the lowest among the four models. The GRU-CNN model can learn both the time sequence data and the spatiotemporal matrixes well and extract more features from the dataset. The average values of MAPE and RMSE of the GRU-CNN model are 2.8839% and 1203.23, respectively. The proposed model improves on the BPNN model by 1.5842%, on the GRU model by 1.7538%, and on the CNN model by 0.5051%. For better visualization, the MAPE and RMSE lines of the four models are drawn in Figure 7.
As shown in Figure 7, the MAPE and RMSE lines of BPNN and GRU are very close, but the BPNN model performs better than the GRU model in test 3, test 4, test 5, test 6, and test 8. The lines of the CNN model fluctuate between the BPNN lines and the GRU-CNN lines, which shows that the CNN model always performs better than the BPNN model and the GRU model but is inferior to the GRU-CNN model.

Figure 7: MAPE and RMSE lines of the four models (horizontal axis: the tests of datasets, Test1 to Test8).

In addition, the dev set was fed as input to the four trained models to predict the actual load. The prediction results of the four models were then compared with the actual load, as shown in Figure 8.

As shown in Figure 8, the prediction curves of all four models fit well; at both the peaks and the troughs, the blue and orange lines almost overlap. Because the load varies over a large range, it is difficult to see the details of the prediction results, so the absolute percentage errors of the four models are drawn separately in Figure 9.
As can be seen from Figure 9(b), the APE of the GRU model has the largest fluctuation, ranging from 0% to 10%; its max APE is 9.8578%, and its MAPE is 4.8781%. Figure 9(d) shows that the APE of the GRU-CNN model fluctuates the least, ranging from 0% to 7%; its max APE is 6.418%, and its MAPE is 2.3034%. The max APEs of the BPNN model and the CNN model are 8.1306% and 7.605%, respectively, and their MAPEs are 4.0942% and 3.4581%, respectively. In the dev set experiment, the proposed GRU-CNN model achieves the most accurate and stable prediction result. The load forecasting results of the four models and the actual load are drawn together in Figure 10.
In Figure 10, all four models predict well on the dev set, and the proposed GRU-CNN method fits best. When predicting peaks, BPNN, GRU, and CNN do not perform very well: these models cannot accurately analyze the load fluctuation law, which decreases their prediction accuracy. In contrast, the GRU-CNN model can effectively learn the load changing trend and accurately analyze the influence of the input characteristics on the load as it approaches the peaks, which preserves the prediction accuracy.

Conclusions
To improve the accuracy of STLF, the GRU-CNN hybrid neural network, which is based on the GRU model and the CNN model, was proposed in this paper. The GRU module of the GRU-CNN model is dedicated to processing time sequence data, and the CNN module is good at processing spatiotemporal matrixes. The proposed model can predict electrical load quickly and accurately by extracting features from the variable data that affect the power system. In the real-world experiments on electrical load forecasting for the Wuwei area, the proposed GRU-CNN model was compared with the BPNN, GRU, and CNN models. The results show that the proposed GRU-CNN model can process both time sequence data and spatiotemporal matrix data well and can effectively extract the hidden features of the datasets. The GRU-CNN model has the lowest values of MAPE and RMSE, which demonstrates that it achieves the best performance among the four models.
In further study, the proposed model can be improved by collecting and adding more relevant factors that may affect load changes. At the same time, to train the neural network more effectively, a new loss function training method can be designed to reduce the training time of the model. The model can also be considered for forecasting random loads, such as electric vehicles.

Data Availability
The xls data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.