Inductive Transfer and Deep Neural Network Learning-Based Cross-Model Method for Short-Term Load Forecasting in Smart Grids

In a real-world scenario of load forecasting, it is crucial to determine the energy consumption in electrical networks. The energy consumption data exhibit high variability between historical data and newly arriving data streams. To keep the forecasting models updated with the current trends, it is important to fine-tune the models in a timely manner. This article proposes a reliable inductive transfer learning (ITL) method, to use the knowledge from existing deep learning (DL) load forecasting models, to innovatively develop highly accurate ITL models at a large number of other distribution nodes reducing model training time. The outlier-insensitive clustering-based technique is adopted to group similar distribution nodes into clusters. ITL is considered in the setting of homogeneous inductive transfer. To solve overfitting that exists with ITL, a novel weight regularized optimization approach is implemented. The proposed novel cross-model methodology is evaluated on a real-world case study of 1000 distribution nodes of an electrical grid for one-day ahead hourly forecasting. Experimental results demonstrate that overfitting and negative learning in ITL can be avoided by the dissociated weight regularization (DWR) optimizer and that the proposed methodology delivers a reduction in training time by almost 85.6% and has no noticeable accuracy losses.


Index Terms: Clustering models, inductive transfer learning (ITL), load forecasting, predictive models, smart grids.

I. INTRODUCTION
Recently, electrical energy forecasting has received significant attention with developments in the areas of computational sciences and machine learning (ML). Accurate energy forecasting is crucial to the long- and short-term capacity planning of an electrical utility. It also provides benefits such as avoiding overgeneration and undergeneration of energy and assisting in efficient and sustainable energy generation. It helps utilities in operational decisions such as load switching, infrastructure development, enhancing reliability, providing predictability, and scheduling maintenance of power systems such that there is minimal effect on the services delivered to the customers.
Data-driven methodologies have been used in different works to forecast energy over different time horizons, leading to three branches: long-, medium-, and short-term forecasting [1]. Training ML models to a high prediction accuracy requires a huge amount of historical energy consumption data. ML algorithms are mainly categorized into three types: supervised, unsupervised, and reinforcement learning models [2].
In smart grids, data are generated at a very high rate [13]. At the distribution level of a nationwide grid, there are hundreds of thousands of distribution transformers. To provide hourly short-term load forecasting (STLF), these hundreds of thousands of ML models must be trained within the STLF forecasting horizon. The proposed methodology tackles this challenge with a clustering and inductive transfer learning (ITL) framework. In addition, an adequate amount of historical data may not be available at newly installed distribution nodes. When large amounts of historical energy consumption data are unavailable, the prediction models must be trained with limited data and still achieve sufficiently high accuracy. Furthermore, supervised ML algorithms commonly presume that the training points and testing points belong to the same statistical data distribution and that large amounts of historical data are available [14]. However, the statistical distribution and the patterns of energy consumption vary considerably between historical and future data points. Hence, it is crucial to transfer the knowledge obtained from models trained on historical data when developing and training ML models on current energy consumption data. In this work, a methodology for such knowledge transfer is presented. It uses ITL to transfer the knowledge from existing trained models to newer models or newer applications. When little data is available, transfer learning (TL) increases data variance and fills the voids left by missing records, leading to more accurate predictions.
With TL, a model trained on data following one statistical distribution can be adapted to perform with high accuracy on data following a different distribution, unlike conventional ML models, which perform effectively only when training and testing data follow the same distribution. TL leverages the knowledge from past experience and applies it to a different domain or to a new statistical distribution. The capabilities of TL have previously been utilized in diverse fields and have also been introduced in works on time series forecasting [15], [16], [17]. Nevertheless, these works did not consider the possibility of overfitting in TL models. When TL is applied, it is generally observed that the optimizer converges to a poor local minimum rather than to the global minimum or to a local minimum that provides near-true solutions. NN optimization is known to be nonconvex. Although it is not always possible to converge to a global minimum, convergence to a near-optimum solution is essential. Models built with TL are more prone to overfitting and poor generalization. In this work, we have implemented dissociated weight regularization (DWR) in the weight update rule to break out of local minima, which in turn eliminates negative learning between different models. In addition, TL is integrated with an unsupervised clustering technique, primarily to reduce the model training time by a large factor.
To the best of our knowledge, there has been no previous work that proposed the hybrid multistage approach involving outlier-insensitive clustering and overfitting-eliminating ITL. This work proposes a weight-regularized technique to eliminate negative learning and avoid overfitting while applying TL. The contributions of this work are summarized as follows.
1) STLF in very large electrical systems, such as nationwide grids, requires hundreds of thousands of models to be trained in a very short time. To overcome this challenge, a novel hybrid deep-learning (DL) and clustering-based ITL methodology is proposed to forecast short-term energy consumption at distribution nodes with faster convergence and in short times. This methodology aims to identify the distribution nodes that have similar trends of energy consumption, cluster these nodes together, and execute TL across different clusters.
2) TL models are prone to overfitting. The possibility of curbing the negative transfer of knowledge has been investigated. It was observed that the clustering-based approach and the proposed ITL between similar distribution units within a cluster eliminate the negative transfer of knowledge.
3) Furthermore, to avoid overfitting of TL models and to eliminate negative TL between dissimilar distribution units or across clusters, a novel DWR technique is proposed. Applying DWR while optimizing the cost function during the training of DL models eliminates overfitting.
4) Different from the conventional method of developing one model for each of a large number of distribution nodes, the proposed multistage methodology provides enhanced scalability with reduced training time and no loss in accuracy. The proposed approach decreases the count of models required for forecasting in a large grid network. The methodology can be scaled to any larger sized grid.
5) The ITL-based forecasting approach aims to alleviate the data absence problem that exists only at newly installed electric distribution nodes.
6) The performance evaluation demonstrates that the presented approach achieves an 85.6% reduction in training time.
The remainder of this article is structured as follows. Section II discusses the related work in DL, clustering-based load forecasting, and TL. Section III presents the proposed methodology for utilizing transfer ML on DL models. Section IV presents a case study that has been conducted to validate the proposed methodology and discusses the results. Finally, Section V presents the conclusion of the research and future work.

II. RELATED WORK
A wide range of approaches used to anticipate short-term load is presented in the literature. They may be roughly categorized into three groups: statistics, ML, and DL techniques. Table I includes a selection of the most noteworthy publications from each category, outlining the proposed forecasting methodologies, as well as the evaluation metrics to validate the proposed methodology and the forecasting accuracies. Table I also shows whether TL was used in the actual research or not in addition to the limitations of the work.
In addition, the subsequent parts of this section present the related work in two divisions. The first division discusses the STLF methods based on DL and clustering approaches. The second division presents the literature on TL for load forecasting applications.

A. Load Forecasting
The need for highly accurate forecasting models and the advent of smart meters led to sensor-based forecasting models. Load forecasting models have been trained at various levels of a grid, such as the household level, substation level, feeder level, and distribution node level. Different features, such as lag hour values of power demand, season, and weather variables, including temperature, cloud cover, humidity, and precipitation intensity, have been used to forecast energy consumption over short-, medium-, or long-term horizons [18]. Since the models are data-driven, they require large volumes of data to reach high accuracy. Because smart meters have been installed only recently, huge volumes of data are not available at newly installed nodes of the smart grid. Hence, it is necessary to explore how the knowledge from trained models at nodes with extensive historical data can be transferred to models at nodes with less data available.
Deep neural networks (DNNs) have been the forerunners in generating highly accurate forecasting models. Shi et al. [19] showed that the uncertainty in energy consumption can be modeled with DL. It is also crucial to avoid the overfitting that generally arises with a high number of layers in DNNs. The authors proposed a novel pooling-based deep recurrent NN (RNN) that addresses overfitting by increasing data variety and size. The case study was performed at the household level with a bespoke DL application built on the TensorFlow framework, and they reported that their proposed model performs up to 6.5% better in terms of root-mean-square error (RMSE) than classical deep RNNs.
Kong et al. [12] addressed the uncertainty of load at the household level using long short-term memory (LSTM) networks, a type of DNN with inherent long-term memory capabilities. Their work also included the energy consumption of electrical appliances in the training data and found that the accuracy improved, curbing the uncertainty in load predictions. In addition, DL models have been used as part of ensemble models in various works to forecast energy consumption with higher accuracy. Cao et al. [20] used a deep belief network with bagging and boosting variants in an ensemble model.
Moreover, clustering techniques have been utilized in previous works to group similar customers, days, or weather conditions. The clustering techniques provide merits of reducing the variance of uncertainties within each cluster and, also, these decrease the count of models to be built for the same number of units when compared to nonclustering techniques. Goehry et al. [21] presented a methodology based on random forests and sequential expert aggregation showing that their proposed methodology performs better than the classical hierarchical clustering strategy. Wang et al. [22] employed a k-means clustering algorithm with better results when compared to nonclustering strategies. Other clustering algorithms employed for load forecasting include k-Medoids clustering [23] for similar day clustering, expectation-maximization clustering [24], Gaussian mixture clustering [25], density-based spatial clustering of applications with noise (DBSCAN) [26], and hierarchical clustering [25].

B. Transfer Learning
In the past decade, TL has gained widespread interest from researchers in different fields of study due to its inherent capability of transferring the knowledge gained while training from one application to another. Ribeiro et al. [28] used TL with seasonal and trend adjustment to enhance the forecasts of energy used in a building with the aid of models trained on data from similar buildings. An improvement of 11.2% in mean absolute percentage error (MAPE) of predictions was reported after the use of TL. Their work assumes that the buildings are similar in terms of energy consumption in order to apply TL, and it did not employ clustering-based techniques to group different buildings. Their case study also limits the application of TL to similar buildings. In this article, clustering-based techniques are employed. In addition, TL is applied between similar distribution nodes, improving training time and testing accuracy, and between dissimilar clusters using weight regularization, also improving time and accuracy.
In [15], energy predictive models based on convolutional NNs (CNNs) and TL are proposed. In that work, the energy predictive models were tested on a case study of 23 customers against the seasonal ARIMA (SARIMA) model and a freshly trained CNN model. The results showed that the accuracy improves when the models are pretrained using TL.
Ye and Dai [16] proposed an ensemble model of online TL kernel-based extreme learning machines. The results presented in their work show that the use of TL improves accuracy compared to standard ML models. Their work utilizes extreme learning machines, which are essentially NNs with one hidden layer. The developed approach using extreme learning machines provided many benefits, such as eliminating the need for optimizing the number of hidden layers and optimizing a smaller number of parameters [29]. As is evident from numerous and diverse research works, DL models display high accuracy on time-dependent energy forecasting problems if the tendency to overfit is controlled [30]. Hence, in our work, the use of TL is extended to DL models.
Qureshi et al. [17] proposed a two-stage prediction model for wind power based on an ensemble of nine deep autoencoders in the first phase and deep belief networks in the second phase. The work was based on five datasets from wind farms. TL was utilized in the training of deep autoencoders two to nine using the knowledge obtained during the training of the first deep autoencoder. Their results indicate that the use of TL outperforms the baseline regression models based on ARIMA and support vector regression (SVR). However, the performance of their ensemble model without the use of TL for autoencoders 2-9 has not been discussed. It is unclear whether the improvement in performance is due to the ensemble of the optimized deep autoencoders or due to TL. In our work, the comparison is performed between the same DL model with and without TL, so that the improvement in accuracy and training time can be attributed specifically to TL. In addition, the performance of our proposed methodology is evaluated against several benchmark forecasting models.

III. PROPOSED METHODOLOGY
In this section, a detailed introduction of the proposed methodology for STLF using ITL on clustering-based DL models is presented.
The aim of the methodology is multifold. The main objective of this work is to increase the accuracy of predictions of hourly energy consumption in a reasonable time frame. The methodology is applicable to many prediction tasks, such as photovoltaic (PV) power forecasting, wind energy forecasting, and energy consumption forecasting. Importantly, the aim of the methodology is to apply TL so that the knowledge, network structure, and network parameters are transferred from already existing trained models to newer models or tasks. For tasks with insufficient data to efficiently train a model, TL provides enhanced accuracy in forecasting. For other tasks, TL provides faster convergence of models, reducing model training time.
A. Data Acquisition and Processing
1) Data Acquisition: In this work, real-world datasets from the nationwide Spanish electrical grid are utilized. The acquired data are power consumption records at 1000 distribution nodes in the grid. The data consist of 24 072 709 hourly energy consumption samples recorded between 1 January 2017 and 28 September 2019. The lag hour values of energy, i.e., past energy consumption values, and the season are added as features for all of the models that are developed. In addition, the time series features, such as year, month, day, and hour, are appended as data attributes. The feature domain remains the same across all tasks. The optimal number of lag hour values for predicting the hourly load one day ahead was found to be 24 in our previous work [6] on the same dataset. The dataset of the target model or target task is split into 80% for training, 10% for validation, and 10% for testing. Adding the sliding-window lag hour features and the time series features eliminates the autocorrelation between consecutive records and makes cross validation possible on this time series energy consumption data. Tenfold cross validation has been utilized in this work for performance evaluation.
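The construction of the lag hour features and the 80/10/10 split can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function names `make_lag_features` and `split_80_10_10` are hypothetical, the target is assumed to lie `horizon` hours after the most recent lag value, and the split is assumed to be taken in row order. The season and calendar attributes would be appended as extra columns in the same way.

```python
import numpy as np


def make_lag_features(load, n_lags=24, horizon=24):
    """Turn an hourly load series into a supervised dataset.

    Row i holds the n_lags consecutive values load[i : i + n_lags]; the
    target is the load `horizon` hours after the last value in that row.
    Both defaults are assumptions made for illustration.
    """
    load = np.asarray(load, dtype=float)
    n_rows = len(load) - n_lags - horizon + 1
    if n_rows <= 0:
        raise ValueError("series too short for the requested lags and horizon")
    X = np.stack([load[i:i + n_lags] for i in range(n_rows)])
    first_target = n_lags + horizon - 1
    y = load[first_target:first_target + n_rows]
    return X, y


def split_80_10_10(X, y):
    """Split rows into 80% training, 10% validation, and 10% testing, in row order."""
    n = len(X)
    i_train, i_val = (8 * n) // 10, (9 * n) // 10
    return ((X[:i_train], y[:i_train]),
            (X[i_train:i_val], y[i_train:i_val]),
            (X[i_val:], y[i_val:]))
```

With a series of 100 hourly values and the defaults above, the sketch yields 53 rows of 24 lag values each.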
For the benchmarking case study, an open-source electricity load diagrams dataset [31] is utilized to provide a comparative evaluation of our proposed methodology against benchmark and state-of-the-art models. The data are available for 370 users and contain 140 256 power usage instances for each user.
2) Data Processing: For the given attributes $A_1$ and $A_2$, the normalization function $\varphi$ is a linear transformation such that the values $\varphi(A_1)$ and $\varphi(A_2)$ are in the same domain and possess a similar scale. The normalization function maps values recorded at distinct scales to an identical scale within a uniform domain so that these values can be compared and processed in conjunction. Data normalization enhances the accuracy of ML models, and hence, it is considered a crucial preprocessing technique. In this work, minimum-maximum (min-max) feature scaling is incorporated to bring attribute values within the range [0, 1]. The min-max scaling in the range [0, 1] is represented by the following equation:
$$a' = \frac{a - A_{\min}}{A_{\max} - A_{\min}}$$
where $a'$ is the transformed value, $a$ is the primitive value, $A_{\min}$ is the minimum value, and $A_{\max}$ is the maximum value of the attribute.
If unnormalized input features are fed to ML models, the loss function is likely to have elongated valleys [32]. Optimizing such a cost function is difficult because the gradient is steep with respect to a few parameters. This in turn causes large oscillations in the search space of weights, as the updates bounce between steep slopes. One way to compensate for this is to optimize with a small learning rate. This in turn raises the problems of slower convergence, longer training time, and disproportionate weight assessment. Normalizing the inputs makes the loss function more symmetrical and, in turn, makes optimization easier. The gradients tend to point toward a global minimum even with a larger learning rate, thereby increasing accuracy, achieving faster convergence, and reducing training time.
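The min-max scaling described above can be illustrated with a short sketch. The function names `min_max_fit` and `min_max_transform` are hypothetical. Taking the minimum and maximum from the training portion only, and mapping constant attributes to zero, are assumptions made here for illustration.

```python
import numpy as np


def min_max_fit(A):
    """Return the per-attribute minimum and maximum of a (samples x attributes) array."""
    A = np.asarray(A, dtype=float)
    return A.min(axis=0), A.max(axis=0)


def min_max_transform(A, a_min, a_max):
    """Apply a' = (a - A_min) / (A_max - A_min) column by column."""
    A = np.asarray(A, dtype=float)
    # Assumption: a constant attribute (A_max == A_min) is mapped to 0
    # instead of dividing by zero.
    span = np.where(a_max > a_min, a_max - a_min, 1.0)
    return (A - a_min) / span
```

In use, `min_max_fit` would be called once on the training rows, and the resulting pair would be reused to transform the validation and testing rows, so that all three portions share one scale.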

B. Model Construction Stage
The proposed solution utilizes the outlier-insensitive clustering and ITL-based DNN model with weight regularization to improve the accuracy of predictions. The details of the proposed clustering-based and DWR ITL methodology for load forecasting are presented in Algorithm 1.
1) Clustering Phase: As depicted in Algorithm 1, the first phase of the proposed methodology is the clustering phase. Initially, the energy matrix $E$ is constructed using the energy consumption data available from the smart meter at each distribution transformer. The constructed energy matrix $E$ can be notated as follows:
$$E = \begin{bmatrix} e_{1,1} & e_{1,2} & \cdots & e_{1,h} \\ e_{2,1} & e_{2,2} & \cdots & e_{2,h} \\ \vdots & \vdots & \ddots & \vdots \\ e_{\tau,1} & e_{\tau,2} & \cdots & e_{\tau,h} \end{bmatrix}$$
where $E$ is a matrix of size $\tau \times h$ with entries $e_{i,j}$, $h$ denotes the number of hours for which the data are available, and $\tau$ denotes the number of distribution transformers. For the 1000 distribution transformers dataset, the size of the matrix $E$ is 1000 × 24 024.
In the next step of Algorithm 1, the dissimilarity matrix is constructed based on the criterion function of Minkowski dissimilarity. The constructed square and symmetric dissimilarity matrix is notated as follows:
$$D = \begin{bmatrix} 0 & d(1,2) & \cdots & d(1,\tau) \\ d(2,1) & 0 & \cdots & d(2,\tau) \\ \vdots & \vdots & \ddots & \vdots \\ d(\tau,1) & d(\tau,2) & \cdots & 0 \end{bmatrix}$$
The pairwise dissimilarity between any two distribution transformers $i$ and $k$ is calculated as follows [6]:
$$d(i,k) = \left( \sum_{j=1}^{h} \left| e_{i,j} - e_{k,j} \right|^{q} \right)^{1/q}$$
To reduce the computational burden of the NP-hard optimization, the Minkowski order ($q$) is fixed to first order.
Once constructed, the dissimilarity matrix is passed as an argument to the clustering algorithm along with the optimized hyperparameter $k_{opt}$ that represents the optimal number of clusters. The hyperparameter $k_{opt}$ in the k-Medoids algorithm cannot be learned directly, and hence, the elbow curve method is employed to discover the optimal value of $k$, which yields the least within-cluster error [6]. The elbow curve plots the within-cluster sum of squares error against the number of clusters ($k$), and the elbow or dip in the curve reveals the optimal number of clusters $k_{opt}$. An outlier-insensitive k-Medoids clustering algorithm is adopted to group similar distribution nodes into clusters. At the low level, the clustered models for the grouped profiles and the individual models for each transformer are devised using a DNN framework.
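The clustering phase can be sketched in a few lines of NumPy. This is an illustrative sketch only: the function names are hypothetical, the k-medoids routine is a plain alternating (assign, then update) variant rather than the exact procedure of Algorithm 1, and the elbow curve here uses the sum of dissimilarities to the assigned medoid as a stand-in for the within-cluster error.

```python
import numpy as np


def manhattan_dissimilarity(E):
    """First-order Minkowski (q = 1) dissimilarity between the rows of E."""
    E = np.asarray(E, dtype=float)
    tau = len(E)
    D = np.empty((tau, tau))
    for i in range(tau):          # one row at a time keeps memory use modest
        D[i] = np.abs(E - E[i]).sum(axis=1)
    return D


def k_medoids(D, k, n_iter=100, seed=0):
    """Alternating k-medoids on a precomputed dissimilarity matrix D."""
    rng = np.random.default_rng(seed)
    medoids = rng.choice(len(D), size=k, replace=False)
    for _ in range(n_iter):
        labels = np.argmin(D[:, medoids], axis=1)   # assign to nearest medoid
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.flatnonzero(labels == c)
            if members.size:
                # New medoid: the member with the smallest total
                # dissimilarity to the other members of its cluster.
                within = D[np.ix_(members, members)].sum(axis=1)
                new_medoids[c] = members[np.argmin(within)]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    labels = np.argmin(D[:, medoids], axis=1)
    cost = D[np.arange(len(D)), medoids[labels]].sum()
    return medoids, labels, cost


def elbow_costs(D, k_values, seed=0):
    """Clustering cost for each candidate k, for plotting an elbow curve."""
    return {k: k_medoids(D, k, seed=seed)[2] for k in k_values}
```

Because each cluster centre is an actual row of $E$ rather than a mean, a single extreme profile cannot drag the centre away from the bulk of the cluster, which is the sense in which the medoid-based approach is less sensitive to outliers.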
2) Transfer Learning: TL is an ML technique in which the knowledge gained while training a model on one domain of features is leveraged to improve the training of another model or task on the same or a different domain of features [33]. TL removes the assumption that the training data and testing data follow the same data distribution. The merits of TL are the following: training can be done with little data, training becomes faster, and model accuracy increases. Consider that feature domain $F_s$, label $V_s$, and task $T_s$ correspond to the source application, and feature domain $F_t$, label $V_t$, and task $T_t$ correspond to the target application. TL aims to improve the performance of task $T_t$ using the knowledge obtained in task $T_s$, where $T_s \neq T_t$. Fig. 1 shows the process of traditional ML, where the knowledge gained after training one model is not retained or reused in further models. The training of a newer model or task is executed from scratch. Fig. 2 shows the process of transfer ML, where the knowledge gained after training one model (trained model 1 in Fig. 2) is transferred to further models (model 2). The weights, knowledge of features, and the network structure are transferred to the training stage for the new task.
The TL process has the benefits of improving the baseline performance of predictions and improving the time to train an ML model [14]. The following are the multiple types of TL algorithms.
1) Transductive TL (Data Features Are Not the Same Between the Different Tasks) [34]: If the tasks $T_s$ and $T_t$ are different and the source domain $F_s$ and the target domain $F_t$ are also different, then it is called transductive TL.
2) ITL (Data Features Are the Same Between Different Tasks) [35]: If the tasks $T_s$ and $T_t$ are different and the source domain $F_s$ and the target domain $F_t$ are the same, then it is called ITL. If the source label $V_s$ exists, then this learning is called multitask learning. The learning is unsupervised in the absence of labels in the tasks, and in such cases, the algorithm is called self-taught TL.
3) Unsupervised TL [36]: In this type of learning, the tasks $T_s$ and $T_t$ are different, the domains $F_s$ and $F_t$ are similar, and labels are not available in either task.
a) Theoretical perspective of TL in cross-model load forecasting using neural networks: Consider a trained NN structure with three layers, as shown in Fig. 3. The network has an input layer with $I+1$ inputs, where the $(I+1)$th node is the bias node, $H+1$ hidden units, where the $(H+1)$th node is the bias node, and $P$ outputs. Consider that the NN model is already trained on training data with $N$ records, i.e., $\{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$. Since the training is complete, it is safe to assume that the weights have been optimized to minimize the training error. Let the weights of the input-hidden connections and the hidden-output connections be $w_{ih}$ and $v_{hp}$, respectively, where $1 \le i \le I+1$, $1 \le h \le H+1$, and $1 \le p \le P$. With TL, the model is to be trained with training record $N+1$ (a training record from the new dataset) with input $x_{N+1}$ such that the value predicted by the model equals the true output value, i.e., $y_{N+1} = \hat{y}_{N+1}$. Training with data from a new dataset should minimize the effect on the training errors $E_n$ ($1 \le n \le N$) of the previous historical data, i.e., minimize the weight sensitivity. The cost objective for weight sensitivity can be given by
$$T \triangleq \frac{1}{8} \sum_{n=1}^{N} \sum_{p=1}^{P} E_{np}^{2}.$$
The goal of TL is to determine the weights $\Delta w_{ih}(N+1)$ and $v_{hp}(N+1)$ such that they have minimal effect on the weight sensitivity. This is expressed through the objective function $S$, which balances the tradeoff between the weight sensitivity objective function $T$ and the error of prediction for instance $N+1$. The objective function $S$ is given by the following:
$$S \triangleq T + \frac{\lambda}{2} \sum_{p=1}^{P} \left[ y_p(N+1) - \hat{y}_p(N+1) \right]^{2}$$
where $\lambda$ is the tradeoff coefficient that balances the evolutionary training error and the preevolutionary training error. The weight sensitivities of the change in error are given by (10) and (13).
Equation (7) is then modified using (10) and (13). Let $u_h(n) = f(u_h^{*}(n))$, where $f$ is the activation function of the hidden layer neuron and $u_h^{*}(n) \triangleq \sum_{i=1}^{I+1} x_i(n) w_{ih}$. It is important to note that $x_{I+1}(n) = 1$ and $u_{H+1}(n) = 1$ since the input node $I+1$ and the hidden node $H+1$ are bias neurons in the artificial NN considered. The change of prediction with respect to the weights in the hidden-to-output layer connections is then obtained for $1 \le h \le H+1$ and $1 \le p \le P$.
The change of prediction with respect to the weights in the input-to-hidden layer connections is obtained in the same way for $1 \le i \le I+1$, $1 \le h \le H$, and $1 \le p \le P$. In TL, we try to minimize the objective function $S$ that balances the tradeoff between minimizing the weight sensitivity $T$ on a historically trained model and the error of predictions on data from a new dataset, i.e.,
$$S \triangleq T + \frac{\lambda}{2} \sum_{p=1}^{P} \left[ y_p(N+1) - \hat{y}_p(N+1) \right]^{2}.$$
b) Inductive TL in the proposed methodology: In this work, we use homogeneous ITL by fine-tuning through all layers for target tasks. The homogeneous TL is shown in Fig. 4. As shown in Fig. 4, dataset 1 is employed to train model 1 from scratch, i.e., the weights of the hidden layers in the base model are optimized. During the development of model x, the base layers from model 1 are reused without freezing, and fine-tuning is performed through all layers.
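As a complement to the theoretical perspective in a), the following sketch evaluates the change of prediction with respect to both groups of weights for a small three-layer network and allows it to be checked numerically. It is not the article's derivation: a linear output unit and a tanh hidden activation are assumed purely for illustration, and the function names are hypothetical. Under these assumptions, the change of $\hat{y}_p$ with respect to $v_{hp}$ is $u_h(n)$, and the change with respect to $w_{ih}$ is $v_{hp} f'(u_h^{*}(n)) x_i(n)$.

```python
import numpy as np


def forward(x, W, V):
    """Three-layer network with bias nodes.

    x: (I,) inputs, W: (I+1, H) input-to-hidden weights,
    V: (H+1, P) hidden-to-output weights.
    """
    x_b = np.append(x, 1.0)                  # x_{I+1} = 1 (bias input)
    u_star = x_b @ W                         # u*_h = sum_i x_i w_ih
    u = np.append(np.tanh(u_star), 1.0)      # u_{H+1} = 1 (bias hidden unit)
    y_hat = u @ V                            # linear output (assumption)
    return x_b, u_star, u, y_hat


def prediction_gradients(x, W, V):
    """Change of the prediction with respect to the weights of both layers."""
    x_b, u_star, u, _ = forward(x, W, V)
    dy_dV = u                                # d y_hat_p / d v_hp = u_h, for every p
    f_prime = 1.0 - np.tanh(u_star) ** 2     # derivative of tanh
    # d y_hat_p / d w_ih = x_i * f'(u*_h) * v_hp, arranged as (I+1, H, P)
    dy_dW = x_b[:, None, None] * f_prime[None, :, None] * V[:-1][None, :, :]
    return dy_dV, dy_dW
```

Comparing these expressions against finite differences of `forward` is a quick way to confirm the index ranges: the bias hidden unit $H+1$ has an outgoing weight but no incoming weights, so the input-to-hidden gradient only covers $1 \le h \le H$.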
The overall methodology for constructing the load forecasting models is shown in Fig. 5. The data of the 1000 distribution nodes are passed through the clustering stage, which groups similar distribution nodes into clusters. The optimal number of clusters is determined to be 93 [6].
In the next stage of the methodology, one forecasting model per cluster is developed using TL. First, a forecasting model (model 0) is trained from scratch on the source dataset (cluster 0). Second, the model is retrained on the target datasets (cluster 1, cluster 2, . . . , cluster n) by fine-tuning all the layers in the NN. For convenience, the clustered models formed using TL are denoted as Clus-TL-DNN and the clustered models formed without the TL framework are denoted as Clus-DNN, where DNN indicates the underlying DL NN model. The accuracies of Clus-TL-DNN are compared with those of Clus-DNN. The next stage of the proposed methodology involves the creation of models within clusters. These are individual models developed for each dataset. The datasets with similar energy consumption patterns have already been clustered together in the previous stage. Now, the knowledge transfer is performed only between the distribution datasets within the same cluster to eliminate any negative transfer of knowledge. In the first subset of experiments, TL is used to construct the subsequent models within a cluster using knowledge transfer from the source domain within the same cluster. For convenience, these models are denoted as Ind-TL-DNN. To develop the source domains from cluster 1 onward, we utilize a weight regularization optimizer to transfer knowledge from the source domain within cluster 0. The use of weight regularization eliminates negative learning when knowledge transfer occurs between clusters. In another subset of experiments, the individual models are developed without the use of any TL. For convenience, these models are denoted as Ind-DNN.
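The update rule of the weight regularization optimizer is not spelled out in this section. As a rough illustration only, the sketch below assumes that it behaves like a decoupled weight decay applied alongside the Adam step (in the style of AdamW), so that the shrinkage of the weights is kept separate from the gradient-based update. This is an assumption, not the article's DWR definition, and the function names, the state dictionary, and the `weight_decay` value are hypothetical.

```python
import numpy as np


def init_state(w):
    """Optimizer state for one weight array: step count and moment estimates."""
    return {"t": 0, "m": np.zeros_like(w), "v": np.zeros_like(w)}


def adam_decoupled_step(w, grad, state, lr=1e-3, beta1=0.9, beta2=0.999,
                        eps=1e-8, weight_decay=1e-2):
    """One Adam step with the weight decay applied outside the gradient path."""
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1.0 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1.0 - beta2) * grad ** 2
    m_hat = state["m"] / (1.0 - beta1 ** state["t"])   # bias-corrected first moment
    v_hat = state["v"] / (1.0 - beta2 ** state["t"])   # bias-corrected second moment
    adam_update = lr * m_hat / (np.sqrt(v_hat) + eps)
    # Assumed regularization: the decay term is not mixed into the gradient,
    # so it is not rescaled by the adaptive denominator.
    return w - adam_update - lr * weight_decay * w
```

Under this assumption, weights transferred from cluster 0 are pulled gently toward zero on every step regardless of the gradient scale, which is one plausible way a regularized optimizer could keep a fine-tuned model from clinging to a poor starting point.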
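The models are compared using the RMSE or MAPE for accuracy and the training time for execution time. RMSE is a metric of forecasting accuracy in statistics and is given by (21). Also, MAPE is represented by (22)
$$\mathrm{RMSE} = \sqrt{\frac{1}{M} \sum_{i=1}^{M} \left( P_i - P'_i \right)^{2}} \qquad (21)$$
$$\mathrm{MAPE} = \frac{100\%}{M} \sum_{i=1}^{M} \left| \frac{P_i - P'_i}{P_i} \right| \qquad (22)$$
where $P'_i$ is the forecast load demand, $P_i$ is the actual load, and $M$ denotes the number of data points.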
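The two accuracy metrics can be computed as in the following sketch, which uses the standard definitions of RMSE and MAPE. The function names are hypothetical, and the MAPE helper assumes that no actual load value is zero.

```python
import numpy as np


def rmse(actual, forecast):
    """Root-mean-square error between actual and forecast load."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    return float(np.sqrt(np.mean((actual - forecast) ** 2)))


def mape(actual, forecast):
    """Mean absolute percentage error, in percent (actual values must be nonzero)."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    return float(100.0 * np.mean(np.abs((actual - forecast) / actual)))
```

For example, actual loads of 100 and 200 with forecasts of 110 and 180 give a MAPE of 10%, since both forecasts are off by one tenth of the actual value.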
3) Deep Neural Network: To obtain an efficient and accurate DNN model, a search space for hyperparameters, such as weight initialization strategy, number of hidden layers, number of neurons, activation function, batch size, training epoch, and learning rate, was defined. After the search space was defined, a halving randomized search cross-validation (CV) method was employed for hyperparameter tuning. Multiple comparative experiments were performed to confirm the model hyperparameters that are mentioned in Table II. The best performing DL model, determined considering training time and accuracy, was a DNN consisting of one input layer, four hidden layers, and one output layer. The number of neurons in the input layer equaled the number of independent data attributes. The number of neurons in the hidden layers was set to 75, 50, 40, and 30. The output layer consisted of one node because the model tackled regression. The rectified linear unit (ReLU) activation function [37] was used as activation in the hidden layers, whereas the identity function was used as activation in the output layer. The loss function utilized was the mean-squared error. The Adam optimizer and its proposed DWR variant were employed for optimization. The models were implemented in the Keras framework. The training process used the Xavier normal weight initialization strategy. Based on the epoch-convergence history graph, the optimal number of epochs was set to 50 with a batch size of 128. After training, the models are saved as .pkl files for later use. The old models are used as starting points for training newer models with the help of TL to achieve faster convergence.
[Table III: Testing of clustered models on cluster data with and without TL framework applied between clusters]
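The network described above, and the reuse of an old model as the starting point for a new one, can be sketched without any DL framework. This is a simplified illustration rather than the Keras implementation used in the article: it uses full-batch gradient descent instead of Adam with mini-batches of 128, and all function names are hypothetical.

```python
import copy
import numpy as np


def init_mlp(layer_sizes, seed=0):
    """Xavier-normal weights and zero biases for a fully connected network."""
    rng = np.random.default_rng(seed)
    params = []
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        std = np.sqrt(2.0 / (n_in + n_out))
        params.append([rng.normal(0.0, std, size=(n_in, n_out)), np.zeros(n_out)])
    return params


def forward(params, X):
    """Activations of every layer (ReLU in hidden layers, identity at the output)."""
    acts = [X]
    for k, (W, b) in enumerate(params):
        z = acts[-1] @ W + b
        acts.append(z if k == len(params) - 1 else np.maximum(z, 0.0))
    return acts


def train(params, X, y, epochs=50, lr=1e-3):
    """Gradient descent on the mean-squared error; every layer is updated."""
    for _ in range(epochs):
        acts = forward(params, X)
        delta = 2.0 * (acts[-1] - y) / len(X)           # dLoss / d(output)
        for k in reversed(range(len(params))):
            W, b = params[k]
            grad_W, grad_b = acts[k].T @ delta, delta.sum(axis=0)
            if k > 0:                                    # backpropagate through ReLU
                delta = (delta @ W.T) * (acts[k] > 0)
            params[k][0] = W - lr * grad_W
            params[k][1] = b - lr * grad_b
    return params


def transfer_and_finetune(source_params, X_target, y_target, epochs=10, lr=1e-3):
    """Start the target model from a copy of the source weights; no layer is frozen."""
    target_params = copy.deepcopy(source_params)
    return train(target_params, X_target, y_target, epochs=epochs, lr=lr)
```

A source model for a dataset with, say, eight attributes would be created with `init_mlp([8, 75, 50, 40, 30, 1])` and trained on the source data, after which `transfer_and_finetune` produces a target model while leaving the source weights untouched for reuse on further targets.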

IV. EXPERIMENTAL RESULTS
Extensive experiments were performed to evaluate the performance of the ITL-based methodology. The utilized datasets are the power consumption records at ten and 1000 distribution nodes in the electrical network.
In one set of experiments, individual models are developed using the individual datasets. In another set of experiments, a clustering approach is applied to group similar distribution nodes into clusters based on the similarity of their hourly power usage.
The k-Medoid clustering technique is employed because it is insensitive to outliers in the data. According to the within-cluster error elbow curve, the ideal number of clusters is 3 for the ten distribution nodes dataset and 93 for the 1000 distribution nodes dataset [6].
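The clustering step can be sketched as follows. This is a minimal alternating k-medoids routine under several assumptions that are not taken from the source: Euclidean distance as the similarity metric, a naive initialization with the first k points, and small synthetic 2-D points standing in for hourly usage profiles. It is meant only to show why medoids, being actual data points, are not dragged toward an outlier.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def assign(points, medoids):
    """Label each point with the index of its nearest medoid."""
    return [min(range(len(medoids)), key=lambda c: euclidean(p, points[medoids[c]]))
            for p in points]

def k_medoids(points, k, max_iter=100):
    """Alternate between assignment and medoid update until the medoids stop changing."""
    medoids = list(range(k))                   # naive initialization (assumption)
    for _ in range(max_iter):
        labels = assign(points, medoids)
        new_medoids = []
        for c in range(k):
            members = [i for i, lab in enumerate(labels) if lab == c]
            if not members:                    # keep the old medoid for an empty cluster
                new_medoids.append(medoids[c])
                continue
            # The new medoid is the member with the smallest total distance to its cluster.
            new_medoids.append(min(
                members,
                key=lambda i: sum(euclidean(points[i], points[j]) for j in members)))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    labels = assign(points, medoids)
    cost = sum(euclidean(p, points[medoids[lab]]) for p, lab in zip(points, labels))
    return medoids, labels, cost

# Synthetic data: two tight groups plus one outlier at (9, 9).
points = [(1.0, 1.0), (5.0, 5.0), (1.2, 0.9), (5.1, 4.9),
          (0.9, 1.1), (4.9, 5.2), (9.0, 9.0)]

medoids, labels, cost = k_medoids(points, k=2)
# Within-cluster cost for several k, as would be plotted in an elbow curve.
costs = {k: k_medoids(points, k)[2] for k in (1, 2, 3)}
print(medoids, labels, costs)
```

In this toy run the outlier is assigned to the second cluster, but the medoid of that cluster remains one of the tightly grouped points rather than shifting toward the outlier as a mean would.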
The initial cluster (cluster 0) is trained in the conventional way without any TL. The other clusters are trained with the help of TL from cluster 0, and fine-tuning is performed using the corresponding dataset of each cluster. In other words, the knowledge from the training of cluster 0 is used for training cluster 1, cluster 2, and so on.
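The effect of starting from a trained source model can be illustrated with a deliberately small example. The sketch below replaces the DNN with a one-variable linear model trained by gradient descent, and uses synthetic data for a hypothetical source cluster and a similar target cluster; the function names, learning rate, and tolerance are assumptions chosen for illustration only.

```python
def mse_and_grads(w, b, xs, ys):
    """Mean-squared error of y = w*x + b and its gradients."""
    n = len(xs)
    errs = [w * x + b - y for x, y in zip(xs, ys)]
    loss = sum(e * e for e in errs) / n
    grad_w = 2.0 * sum(e * x for e, x in zip(errs, xs)) / n
    grad_b = 2.0 * sum(errs) / n
    return loss, grad_w, grad_b

def train(xs, ys, w=0.0, b=0.0, lr=0.5, tol=1e-6, max_steps=10_000):
    """Gradient descent until the MSE drops below tol; returns (w, b, steps used)."""
    for step in range(max_steps):
        loss, gw, gb = mse_and_grads(w, b, xs, ys)
        if loss < tol:
            return w, b, step
        w -= lr * gw
        b -= lr * gb
    return w, b, max_steps

xs = [i / 10 for i in range(11)]
source_ys = [2.0 * x + 1.0 for x in xs]   # synthetic stand-in for cluster 0
target_ys = [2.2 * x + 0.9 for x in xs]   # synthetic stand-in for a similar cluster

w0, b0, _ = train(xs, source_ys)                          # source model, trained from scratch
_, _, steps_scratch = train(xs, target_ys)                # target model without TL
_, _, steps_transfer = train(xs, target_ys, w=w0, b=b0)   # target model initialized from source

print(steps_scratch, steps_transfer)
```

Because the transferred weights already lie close to the target optimum, the fine-tuned model needs fewer gradient steps to reach the same loss threshold, which is the mechanism behind the reduced training time.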

A. Results on Ten Distribution Nodes Dataset
The performance of traditional learning and of TL between dissimilar clusters on clustered models for the ten distribution nodes dataset is shown in Table III. The RMSE of cluster 1 improves significantly after the transfer of knowledge. However, the model for cluster 2 shows a negative transfer of learning, indicating that the model converged to a local minimum rather than the global optimum. The negative learning arises because TL is performed between dissimilar distribution nodes belonging to different clusters. A few potential solutions to avoid convergence to local minima are the following [38], [39]: 1) using a cyclic learning rate; 2) using stochastic gradient descent (SGD) with warm restarts; 3) using high values for the learning rate; 4) using metaheuristic algorithms such as the gray-wolf algorithm, ant colony optimization, and harmony search; and 5) using variants of optimizers such as vanilla gradient descent, QHAdam, YellowFin, AggMo, QHM, and Demon. Negative TL can be avoided when the transfer of knowledge happens between similar distribution nodes, as observed in subsequent tables. Moreover, the improvement with TL is more pronounced when the data for the target tasks are not sufficiently large. In this work, weight regularization is used along with the Adam optimizer to eliminate negative learning when knowledge is transferred between dissimilar clusters.
The ten distribution nodes are grouped into three clusters. The k-Medoid clustering algorithm determined the three clusters of distribution nodes to be {0, 1, 2, 6}, {5}, and {3, 4, 7, 8, 9}. One clustered model based on DNNs was developed for each cluster. The three clustered models were then tested on the individual datasets of the distribution nodes, and the results with and without TL are shown in Table IV.
The first column in Table IV represents the distribution node number or transformer number (tf). Similar distribution nodes are grouped into the same cluster; however, any two clusters are assumed to be dissimilar. When knowledge is transferred between dissimilar clusters, the transfer can be either positive or slightly negative. However, the gain in execution (training) time is always positive. The gain in time is shown in Table V, from which it is clear that training the models with TL takes much less time than training them without TL.
If TL is performed between similar distribution nodes, there is no negative knowledge transfer. Based on the criterion of similar energy usage, the k-Medoid clustering algorithm grouped the ten distribution nodes into the clusters {0, 1, 2, 6}, {5}, and {3, 4, 7, 8, 9}.

B. Results on 1000 Distribution Nodes Dataset
The performance of TL with respect to training time has also been verified with a second case study on 1000 distribution nodes that, according to the elbow curve and k-Medoid clustering, were grouped into 93 clusters; the models were developed using DNNs. As shown in Fig. 6, the time to train the clustered models using TL is always less than the time taken to train the clustered models without TL. This confirms that TL allows for faster convergence of the models.

C. Weight Regularization to Eliminate Negative Learning Between Dissimilar Datasets
For TL between dissimilar clusters, an improved Adam optimizer is proposed to eliminate any negative learning and to break out of local convergence. The first optimization step involves a cyclical learning rate, in which the learning rate is initialized to a larger value and is scheduled to decrease subsequently so that the optimizer does not miss the global minimum. The proposed optimizer variant combines DWR with the cyclical learning rate to eliminate overfitting and to break out of local minima toward the global minimum.
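One possible realization of such a schedule is sketched below. The source does not specify the shape or period of the cycle, so cosine decay with periodic restarts is used here purely as an assumption; `cyclical_lr`, `cycle_len`, `lr_max`, and `lr_min` are hypothetical names and values.

```python
import math

def cyclical_lr(step, cycle_len, lr_max, lr_min):
    """Start each cycle at lr_max and decay toward lr_min along a half cosine."""
    position = step % cycle_len                 # position inside the current cycle
    fraction = position / cycle_len             # 0.0 at the start of a cycle
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * fraction))

CYCLE_LEN = 10      # illustrative cycle length in steps
LR_MAX = 1e-2       # illustrative upper bound
LR_MIN = 1e-4       # illustrative lower bound

schedule = [cyclical_lr(s, CYCLE_LEN, LR_MAX, LR_MIN) for s in range(2 * CYCLE_LEN)]
for step, lr in enumerate(schedule):
    print(step, round(lr, 6))
```

The large rate at the start of each cycle gives the optimizer a chance to leave a poor local minimum inherited from the source model, while the decaying tail lets it settle once a better region has been found.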
The weight update rule in the general Adam optimizer is given by the following:
$$ w_t = w_{t-1} - \alpha f $$
where w_t is the weight at iteration t, f is the effective gradient term, and α is the learning rate.
The general Adam optimizer takes a larger step when the gradient changes slowly and a smaller step when the gradient changes rapidly. This adaptability in step size is achieved by maintaining moving averages (called moments) of the gradient over the steps.
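For reference, the moving averages mentioned above can be written out in the standard formulation of Adam. The symbols g_t (gradient at step t), β_1, β_2 (decay rates of the moments), and ε (a small stabilizing constant) are introduced here and do not appear in the text; identifying the effective gradient term f with the bias-corrected ratio in the last line is an assumption.

```latex
\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1-\beta_1)\, g_t, \\
v_t &= \beta_2 v_{t-1} + (1-\beta_2)\, g_t^{2}, \\
\hat{m}_t &= \frac{m_t}{1-\beta_1^{t}}, \qquad \hat{v}_t = \frac{v_t}{1-\beta_2^{t}}, \\
w_t &= w_{t-1} - \alpha\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}.
\end{aligned}
```

When the gradient fluctuates strongly, the second moment grows relative to the first, so the ratio and hence the step shrink; when the gradient is steady, the ratio approaches unit magnitude and the step approaches α.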
The implemented optimizer variant employs DWR. This allows for weight regularization in which the hyperparameters, namely the learning rate (α) and the weight decay factor (δ), are dissociated from each other.
In the weight update rule of the proposed DWR-Adam optimizer variant, the weight decay factor δ is introduced as a coefficient on the weight of the previous iteration and lies between 0 and 1. This forces the learned weights to be small, and thus, the model generalizes better. For convenience, the clustered models using DWR are denoted by Clus-TL-DWR-DNN.
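A minimal sketch of a single update step with a dissociated decay is given below. It assumes that the decay coefficient multiplies the previous weight directly and that the effective gradient term is the usual bias-corrected Adam ratio; the exact form used by the authors may differ. The function `dwr_adam_step`, its parameters, and the numeric values are hypothetical.

```python
import math

def new_state():
    """Fresh optimizer state for a single scalar weight."""
    return {"t": 0, "m": 0.0, "v": 0.0}

def dwr_adam_step(w, grad, state, lr=0.1, decay_coeff=0.9,
                  beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step in which the previous weight is shrunk by decay_coeff.

    decay_coeff = 1.0 recovers plain Adam; values in (0, 1) shrink the weight
    independently of the learning rate (assumed form, not taken from the paper).
    """
    state["t"] += 1
    t = state["t"]
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad          # first moment
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad * grad   # second moment
    m_hat = state["m"] / (1 - beta1 ** t)                         # bias correction
    v_hat = state["v"] / (1 - beta2 ** t)
    f = m_hat / (math.sqrt(v_hat) + eps)                          # effective gradient term
    return decay_coeff * w - lr * f

w_adam = dwr_adam_step(1.0, 0.5, new_state(), decay_coeff=1.0)    # plain Adam step
w_dwr = dwr_adam_step(1.0, 0.5, new_state(), decay_coeff=0.9)     # step with decay
w_shrink = dwr_adam_step(1.0, 0.0, new_state(), decay_coeff=0.9)  # zero gradient: decay only

print(w_adam, w_dwr, w_shrink)
```

The zero-gradient case shows the dissociation: the weight still shrinks by the decay coefficient, and the amount of shrinkage does not depend on the learning rate.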
1) Weight Regularization on Ten Nodes Dataset: Figs. 7 and 8 show the performance of TL after weight regularization on the ten distribution nodes dataset. The results obtained when the clustered models are tested on cluster data are shown in Fig. 7. For clusters 1 and 2, the curve for TL with weight decay regularization lies at the lower bound of error when compared to model development without TL. At no point is the error higher for the models developed with TL. This indicates that the negative learning has been eliminated by the use of weight regularization in the optimizer.
The results obtained when the clustered models are tested on the individual transformers' data are shown in Fig. 8. The curve for TL with weight decay regularization lies at the lower bound of error when compared to model development without TL for all the transformers, including tf 1, tf 8, tf 3, tf 6, tf 7, and tf 9. At no point is the error higher for the models developed with TL. This corroborates that the negative learning has been eliminated by the use of weight regularization in the optimizer.
2) Weight Regularization on 1000 Nodes Dataset: The performance of TL after weight regularization on the 1000 distribution nodes dataset is presented in Table VII. To analyze the performance of the proposed weight regularization TL modeling (Clus-TL-DWR-DNN), several state-of-the-art benchmark models, including LR, ARIMA, and deep LSTMs, are selected as comparative methods, as shown in Table VII. The weight regularization utilized during objective function optimization in the proposed model eliminates negative knowledge transfer.
Fig. 9. TL results when the data availability is low.

TABLE IX COMPETITIVE EVALUATION AGAINST STATE-OF-THE-ART MODELS
The proposed Clus-TL-DWR-DNN has a higher overall development time (20.17 min) than clustering-based TL modeling (3.23 min), but it reduces the average MAPE to 7.20%, compared with an average MAPE of 31.96% for clustering-based TL modeling.

D. Results on Targets With Smaller Datasets
In addition, the effect of TL has been analyzed with smaller datasets. As observed in Fig. 9, for smaller datasets, the model developed from scratch has lower accuracy than the model with knowledge transferred from a similar distribution point. As the size of the dataset increases, the accuracy of both models, with and without TL, increases, and once a threshold size is reached, the two models have very close accuracy values. The performance of TL when data availability is low is verified on the available dataset (see Table VIII). As shown in Table VIII, the model with TL performs 38% better than the model without TL when the data size for the second model is 5% of the original dataset. In all the cases of data availability, the TL model outperforms the conventional model by 13%-43%.

E. Benchmark Case Study for Competitive Evaluation
The proposed multilayer methodology is compared against the state-of-the-art models generated on a normalized benchmark dataset and the results are tabulated in Table IX. As shown in Table IX, the proposed model has the least training time of 10.2506 s and a highly competitive accuracy with an nRMSE of 0.1057.

V. CONCLUSION AND FUTURE WORK
This article proposed a methodology to develop highly accurate trained models even when large quantities of historical data are unavailable. The methodology employs an ITL mechanism to improve the accuracy of newer models using the knowledge gained during the training of a similar task in the past. The proposed TL model not only improves the accuracy for smaller datasets but also reduces the execution time to reach convergence for any size of training data. The effectiveness of the proposed methodology is verified through a case study of hourly energy forecasting in which the model predicts the hourly load 24 h ahead of time, and the features used are 24 past lag values, season, and time series extracted features. The experiments were executed for multiple distribution energy datasets, both with and without clustering. DNNs are used for training the forecasting regression models. The proposed methodology enables trained models from a scenario where large quantities of historical energy consumption data are available to be reused in a scenario where the available data are small. To eliminate the negative transfer of knowledge, TL is employed between datasets with similar energy consumption patterns, and the similar datasets are determined by the first stage of clustering in the proposed methodology. In cases of knowledge transfer between dissimilar clusters, the proposed weight regularization-based TL approach eliminates negative learning. The overall results indicate that knowledge transfer using the proposed methodology improves the accuracy of newer models and reduces both the time to convergence and the training time for DL models compared with models developed without TL.
In future studies, we plan to utilize different correlation coefficients instead of clustering techniques to determine the similarity between distribution nodes before employing TL between similar nodes. In this work, only one dataset is considered as a source dataset disregarding the fact that the other datasets may contain useful patterns for the target task. Hence, in the future, we plan to perform multisource TL to enhance the accuracy performance.