A Novel Combined DenseNet and Gated Recurrent Unit Approach to Detect Energy Thefts in Smart Grids

Due to the illegal use of electricity, non-technical losses in electricity distribution systems are increasing day by day. With the debut of smart meters in the smart grid, new types of electricity theft attacks have emerged. Investigating abnormal electricity consumption patterns helps in detecting electricity thieves. However, existing methods have poor electricity theft detection (ETD) accuracy due to the imbalanced datasets provided by the utilities. They also fail to capture both the periodicity and non-periodicity of 1-D daily electricity usage data. We first propose a novel sampling technique to balance the dataset, named random oversampling using both classes (ROBC). This technique performs oversampling using both the theft and normal classes, and thereby resolves the problem of low accuracy. We also propose a unique ETD model that combines a DenseNet-fully convolutional network (DenseNet-FCN) and a gated recurrent unit (GRU) with a light gradient boosting machine (LightGBM), known as DenseNet-GRU-LightGBM, to address the aforementioned concerns. The DenseNet-FCN module precisely extracts periodic and non-periodic patterns from 2-D electricity consumption data, whereas the GRU module captures and memorizes features from 1-D consumption data. Afterwards, the LightGBM module is used as an ensemble classifier to give the final ETD results. Comprehensive simulations indicate that the proposed scheme outperforms other existing ETD methods.


I. INTRODUCTION
Electricity theft has had negative impacts on electric utilities around the globe over the past few years. The stealing of electricity at the distribution side results in billions of dollars of losses, known as non-technical losses (NTLs) [1]. This misconduct includes tampering with the electricity meter to change readings, bypassing the meter, faulty meters, hacking meters, illegal or reversed connections and non-payment of tariffs. NTLs were worth approximately $13 million in the Central American nation of Honduras over the past three years [2].
(The associate editor coordinating the review of this manuscript and approving it for publication was Ayaz Ahmad.)
For a traditional grid, the methods used for electricity theft detection (ETD) include manually checking a meter for tampering or misconfiguration, analyzing bypassed power transmission lines, and using sensors or other hardware devices to detect fraud. In general, the manipulated meter readings are compared with the normal readings [3]. However, these approaches are not sufficient to accurately determine whether a person has tampered with a meter or a fault has occurred on its own. Moreover, these approaches are also very expensive, extremely inefficient, require manual effort and incur additional cost for hiring an inspection team. With the evolution of the smart grid (SG), some of the previous bottlenecks are resolved. SG is a combination of the traditional grid with communication networks, which works intelligently in transferring information between supplier and consumer [4]. One of the main aims of SG is to increase effective economic distribution and reduce the physical as well as financial losses of both the utility and consumers. However, the debut of advanced metering infrastructure (AMI) in SG increases the possibility of new electricity theft attacks, primarily in the form of cyber-attacks [5]. Such attacks cannot be identified with conventional methods. Therefore, data-driven methods (either classification or clustering based) are widely used to detect them. These methods require labeled (inspected) or unlabeled (uninspected) data for ETD. Classification methods usually require a labeled dataset, whereas clustering based methods demand an unlabeled dataset. A labeled dataset consists of consumers that have already been inspected and classified as abnormal or normal. On the contrary, an unlabeled dataset contains consumers that have not been inspected for a long time or have never been inspected.
By analyzing a consumer's previous electricity consumption behavior, it becomes easy for classification methods to detect electricity theft. However, there are many limitations in the current methods used to detect electricity fraud. The primary issue is that imbalanced data is available for training, which limits the ETD rate of machine learning methods. Many of them require particular tools or devices and manual intervention to extract features. Moreover, these methods have low accuracy and are vulnerable to contamination attacks. An attacker may change and pollute the data, which misleads the learning process of the models into considering the contaminated data as normal data [6]. These factors negatively affect the model's performance and increase the false positive rate (FPR). The problem of imbalanced data is a major concern in supervised machine learning methods for ETD [7]. These methods experience low performance due to the underrepresentation of the minority (theft) class during training. Some authors apply sampling approaches to minimize the imbalance ratio of honest and theft classes before passing the data to supervised learning methods. However, these approaches have the limitations of information loss, overfitting and divergence from actual theft cases. In this regard, little attention has been paid to resolving the data imbalance problem for improving NTL detection. This raises the need for an efficient and cost-effective solution for ETD.
The significant research contributions of this paper are as follows.
1. Novel technique: To overcome the limitations of the existing sampling approaches, a new random oversampling using both classes (ROBC) is proposed to balance the data. The minority (fraud) and majority (honest) classes are considered for oversampling (OS). This technique not only resolves overfitting problem but also considers resemblance with the realistic theft data and enhances the learning ability of the supervised models to improve the detection rate.
2. New model: A hybrid model is proposed in which a new supervised deep learning method, i.e., densenet-fully convolutional network (DenseNet-FCN) and existing gated recurrent unit (GRU) are applied in parallel. The outputs of these methods are further given as an input to light gradient boosting machine (LightGBM) for ETD. 1-D daily data is passed as an input to GRU module for memorization and 2-D data is given to the DenseNet-FCN for better generalization. LightGBM is then applied to give final ETD results and to improve the learning ability of weak learners on the basis of loss function.
3. Inclusive simulations: Extensive simulations are performed to find the appropriate hyperparameter values on which the proposed model performs best. The same strategy is applied to the benchmark methods. Afterwards, the performance comparison of the proposed model against various state-of-the-art methods is performed using two performance metrics: the area under the receiver operating characteristic curve (AUC-ROC) and the precision-recall area under the curve (PR-AUC).
The remaining paper is arranged as follows. Section II elaborates the previous ETD related work. Afterwards, Section III discusses our proposed methodology for ETD. Simulation settings and results are presented in Section IV. Finally, Section V concludes the paper and provides future directions. Table 1 provides the acronyms and nomenclature.

II. RELATED WORK
This section discusses the conventional NTL detection methods and their limitations. Section II-A presents the ETD methods and an overview of the sampling methods is discussed in Section II-B. Section II-C elaborates the problem statement of this paper.

A. ENERGY THEFT DETECTION METHODS
ETD solutions used in the literature are broadly grouped into two types: hardware based and data based. Both have their pros and cons. Specifically, hardware based solutions involve designing sensors, smart metering devices or electrical equipment for detecting fraudulent consumers. However, the predominant limitations of these solutions include the high cost associated with developing these devices, maintenance difficulty and failure of such devices due to manipulation or changes in weather conditions. On the other hand, data based solutions have received considerable attention in NTL detection. The literature also teems with supervised, unsupervised and semi-supervised machine learning solutions to detect electricity fraud. Supervised machine learning solutions include the wide and deep convolutional neural network (W&D-CNN) [3], [8], multiple linear regression (MLR) [9], convolutional neural network with random forest (CNN-RF) [5] or CNN with long short-term memory (CNN-LSTM) [10], maximal overlap discrete wavelet packet transform-adaptive boosting (MODWPT-AdaBoost) [11], [12], ensemble bagged tree [13], gradient boosting theft (GBT) detector [14], a simple neural network with a boosting method [15] and support vector machine (SVM) [16].
A local matrix reconstruction based detection algorithm is proposed in [17] to identify abnormal electricity consumption patterns in power systems. The authors of [18] propose an approach for identifying illegal loads connected to distribution systems in which load buses are modeled as QV buses. This approach highlights theft on the basis of the difference between the calculated and measured active powers. The authors in [19] propose a suspicion assessment scheme using a binary tree method that probes deviations in consumption. The authors of [20] propose a theft detection technique with LSTM and a bat based random undersampling boosting method that yields better results than traditional LSTM. Another, more advanced extension of the recurrent neural network (RNN), known as bidirectional GRU [21], is used in the literature to achieve better ETD results. Hu et al. [22] propose a theft detection model using a bidirectional Wasserstein generative adversarial network to overcome the unavailability of theft samples and inefficiency in handling high dimensionality. Deep neural networks have thus achieved excellent performance not only in theft detection [23] but also in other areas [24], [25], [26] of SG. Moreover, the point forecasting based methods in [27], [28], [29], [30], [31], [32], and [33] can also be used for ETD.

B. SAMPLING METHODS
Whenever we discuss ETD or anomaly detection, there is always a class imbalance issue while training supervised learning methods [14]. It is very difficult for these methods to learn and intelligently predict the cases of a class if insufficient data is given for that class. However, theft cases occur far less frequently than honest cases in the real world. Therefore, we need to balance the theft and honest consumer classes for better generalization. Only a few authors attempt to solve the imbalance problem using OS [8], undersampling (US) [11] or the synthetic minority oversampling technique (SMOTE) [5], [10], [14], [43], [46] before applying supervised machine learning methods to predict NTLs. Still, each of these solutions has some drawbacks, as described in the problem statement in Section II-C.

C. PROBLEM STATEMENT
The model bias problem of supervised learning models trained on imbalanced data is still under consideration. To deal with this problem, mainly OS, US and SMOTE have been applied in [3], [11], and [44], respectively. All of these methods have their constraints. On one side, random generation of data makes exact copies of the existing samples, which are likely to overfit the model. On the other side, US discards useful information from the majority class, which could be necessary to train a classifier. So, from both aspects, the samples chosen for OS or US lead to biased results. In [10] and [47], the authors have used SMOTE to balance the minority and majority classes by synthesizing new cases. However, the newly created samples do not belong to actual residential consumers because SMOTE only takes the minority class cases into account. This mechanism introduces additional noise due to the overlapping of the same class. Moreover, SMOTE is less efficient for high dimensional data. In addition, these studies use 1-D data to detect electricity thefts, which is less effective at capturing non-periodicity (a key characteristic of thieves). In [3], [10] and [29], the authors have proposed hybrid models, such as W&D-CNN, CNN-LSTM and LSTM-multilayer perceptron (MLP), for detecting electricity thefts. However, final classification results based on a single feedforward neural network (SFNN) lead to overfitting and poor generalization, i.e., misclassified results.

III. PROPOSED METHODOLOGY
Electricity theft is one of the primary issues in smart metering infrastructure as it leads to high NTLs [48]. Therefore, a more efficient and reliable model is proposed to perform ETD and is tested using a real world dataset provided by the utility, as shown in Figure 1. The collected data is based on the historical load consumption (i.e., on-field inspected data) obtained from residential consumers in populated areas of China.

A. DATA PREPROCESSING AND CLEANSING
To apply DenseNet-FCN and GRU for extracting and memorizing features effectively, we first clean the raw data. For data preprocessing, we perform several tasks, such as missing value imputation, outlier removal and data normalization. Energy consumption data usually contains erroneous or missing values caused by several reasons, such as unscheduled maintenance of the power system, failure of smart meters and transmission issues. A consumer whose complete history has zero values is considered an outlier and removed. A useful record, on the other hand, may still fluctuate; for example, if one reading is zero kW, the next reading can be greater than zero. Two types of missing values were found in the smart meter readings: missing interval data, which is associated with transmission problems, and missing channel data. The basic solution to handle this problem is either to eliminate these samples, in the case of a very large dataset, or to replace the missing values using fill-in methods.
The existing fill-in methods replace a missing value with the most probable value [49], [50]. Here, we employ the linear interpolation mechanism f(x_{i,t}) to identify and restore missing values based on the following equation (1) [3]:

f(x_{i,t}) = { g / 2,    if x_{i,t} ∈ NaN and x_{i,t−1}, x_{i,t+1} ∉ NaN;
               0,        if x_{i,t} ∈ NaN and x_{i,t−1} or x_{i,t+1} ∈ NaN;
               x_{i,t},  otherwise                                          (1)

where x_{i,t} is the consumption record of consumer i at time interval t, x_{i,t+1} and x_{i,t−1} are the next and previous electricity consumption values, respectively, and g = x_{i,t−1} + x_{i,t+1} is the sum of the previous and next readings available in the data. If a consumption value is either non-numeric or null, it is represented as NaN [3]. Moreover, there are anomalous values (outliers) found in the data. Hence, these outliers are handled using the "Three-sigma rule of thumb". The following formula (2) [3] is defined as:

f(x_{i,t}) = { h,        if x_{i,t} > h;
               x_{i,t},  otherwise                                          (2)

where X is the vector comprised of x_{i,t} computed for each time interval by days for each week and h = avg(X) + 2σ(X). In the equation, σ(X) and avg(X) indicate the standard deviation and average value of X, respectively. If x_{i,t} lies outside this bound, it is considered an outlier and capped at h. The three-sigma mechanism efficiently streamlines the electricity consumption data by minimizing the outliers. Afterwards, the min-max scaling method is applied to the cleaned data, as neural networks are sensitive to diverse data. This method takes a single observation at a time, subtracts the smallest value in the data from it and divides by the difference between the largest and smallest data points, which rescales the dataset into the range [0, 1].
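The cleansing pipeline above can be sketched as follows. This is a minimal numpy illustration of the interpolation rule of equation (1), the three-sigma capping and min-max scaling; the helper names and the toy reading vector are ours, and the three-sigma bound is computed per record for simplicity:

```python
import numpy as np

def fill_missing(x):
    """Equation (1): a NaN reading is replaced by the mean of its
    neighbours, g / 2; if a neighbour is also missing, it becomes 0."""
    x = x.astype(float).copy()
    for t in range(len(x)):
        if np.isnan(x[t]):
            prev_v = x[t - 1] if t > 0 else np.nan
            next_v = x[t + 1] if t < len(x) - 1 else np.nan
            if not np.isnan(prev_v) and not np.isnan(next_v):
                x[t] = (prev_v + next_v) / 2.0   # g / 2
            else:
                x[t] = 0.0                        # neighbour missing -> 0
    return x

def clip_outliers(x):
    """Equation (2): readings above h = avg(X) + 2*sigma(X) are capped at h."""
    h = x.mean() + 2 * x.std()
    return np.minimum(x, h)

def min_max_scale(x):
    """Rescale the cleaned readings into [0, 1]."""
    return (x - x.min()) / (x.max() - x.min())

# toy consumption record with missing values and one spike
readings = np.array([1.0, np.nan, 3.0, 2.0, 50.0, 2.5, np.nan])
cleaned = min_max_scale(clip_outliers(fill_missing(readings)))
```

The same effect can be obtained at scale with pandas interpolation and Scikit-learn's MinMaxScaler, which the paper uses in its simulations.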

B. PROPOSED SAMPLING TECHNIQUE
In order to handle the imbalanced data issue (L.1), data sampling is the most common method [1]. The two broader strategies are OS and US, as mentioned in the literature. The prime intention is to balance the honest and theft class ratio by either lessening the majority class, i.e., US, or duplicating minority class samples, i.e., OS. The data imbalance problem can be handled at two stages: the preprocessing stage and the classification stage. In other words, a sampling technique can be applied after the data is normalized; besides this, classification based sampling can also be done. A new sampling technique, ROBC (S.1), is proposed in this work, which is applied at the preprocessing stage. This technique competes with the existing techniques and overcomes their issues.
The pseudocode of ROBC is given in Algorithm 1 (that validates S.1), in which input variables are given as dataset S, minority class y and majority class z. In this paper, minority class and majority class are interchangeably used as theft class and honest class, respectively.
The total number of consumers (cases) is calculated from the dataset such that S contains 42,372 consumers. Afterwards, the numbers of honest and theft consumers are determined and their difference is stored in u_i, i.e., 38,757 (honest) − 3,615 (theft) = 35,142 samples that need to be created. So, we first make the distributions of both y and z.
These distributions are divided into percentiles Pr_1, Pr_2 and Pr_3. Then, we calculate the percentage (%) of the data residing in the percentiles of both distributions. We randomly select 5% and 95% of data points from the honest and theft consumers' distributions, respectively, to perform OS. The reason is that choosing data points only from the theft class leads to overfitting, while synthetically creating data points makes the data diverge from the actual data. After selection, we merge the selected data points to make a new distribution. Finally, we integrate the newly created samples at the end of the existing dataset while preserving the timestamps. The process is repeated until y == z. This mechanism describes how theft samples are created. Through ROBC, the classifiers are able to better learn theft or suspicious consumption. This type of data is essential for training a classification model to detect consumers with suspicious consumption.
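A simplified sketch of this oversampling idea is given below. It builds each synthetic theft profile from 95% theft-distribution points and 5% honest-distribution points, as described above, but samples days independently and omits the percentile-wise (Pr_1, Pr_2, Pr_3) stratification of Algorithm 1, so it is an illustration of the principle rather than the exact technique:

```python
import numpy as np

rng = np.random.default_rng(0)

def robc_oversample(theft, honest, rng):
    """Create synthetic theft profiles until the theft class matches the
    honest class in size. Each day of a new profile is drawn from the
    theft class with probability 0.95 and from the honest class with
    probability 0.05, so new samples stay close to real theft behaviour
    without being exact copies. `theft`/`honest` are (n_samples, n_days)."""
    n_days = theft.shape[1]
    new_samples = []
    while theft.shape[0] + len(new_samples) < honest.shape[0]:
        # decide, per day, which class distribution the value comes from
        from_theft = rng.random(n_days) < 0.95
        sample = np.where(
            from_theft,
            theft[rng.integers(0, theft.shape[0], n_days), np.arange(n_days)],
            honest[rng.integers(0, honest.shape[0], n_days), np.arange(n_days)],
        )
        new_samples.append(sample)
    return np.vstack([theft, np.array(new_samples)])

honest = rng.random((40, 7))          # toy majority class
theft = rng.random((5, 7)) * 0.3      # toy minority class (low consumption)
balanced_theft = robc_oversample(theft, honest, rng)
```

After this step, the theft class has the same number of samples as the honest class, while the original theft samples are kept unchanged at the front of the array.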

Algorithm 1 ROBC Technique
Initialize: theft consumers x_m, honest consumers v_m, 25th percentile Pr_1, 50th percentile Pr_2, 75th percentile Pr_3
Given: S, preprocessed dataset with minority class y labeled as 1 and majority class z labeled as 0
Output: Balanced dataset S′
1: Calculate the total number of consumers in S
2: Determine the numbers of honest and theft consumers
3: Store their difference in u_i
4: Make the distributions of y and z
5: Divide each distribution into percentiles Pr_1, Pr_2 and Pr_3
6: Calculate the percentage of data residing in each percentile
7: For each iteration until y == z do
8: Randomly select 95% data points from y such that 15% is from the positive side of the distribution, 15% from the negative side and 65% from the mean
9: Randomly select 5% data points from z such that 1% is from the positive side of the distribution, 1% from the negative side and 3% from the mean
10: Merge the data points (a new distribution is created)
11: Store this distribution in dataset S
12: End for
13: Balanced dataset S′

C. DENSENET-GRU-LIGHTGBM MODEL
Motivated by [3], a preliminary analysis of the electricity consumption data is conducted to gain insights. The dataset is made available by the State Grid Corporation of China (SGCC)¹ and contains 42,372 consumers' data over 1035 days. The data shows daily fluctuations in consumers' electricity consumption, which makes it hard to obtain the key attributes (periodicity and non-periodicity) possessed by the honest and theft consumption data. Instead of daily based 1-D consumption data, data arranged in a 2-D manner according to weeks can better reveal its periodicity and non-periodicity. Therefore, the proposed framework for ETD, as depicted in Figure 2, consists of three main modules: the DenseNet-FCN module, the GRU module and the LightGBM module. DenseNet is a variation of ResNet, which is why it performs well on 2-D data. Deep learning models have better generalization ability as they help to derive new features from the existing ones. So, we transform the daily 1-D electricity consumption data into 2-D weekly data. Then, we pass the transformed data to DenseNet-FCN as an input to further extract important features and increase data diversity. However, for better memorization of the data patterns, the 1-D data is passed to the GRU module. GRU captures the properties of both the honest and theft consumption patterns with less computational time.
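The 1-D daily to 2-D weekly transformation can be sketched as follows. This is a minimal numpy illustration; the paper does not state how the six leftover days of the 1035-day series are handled, so they are simply dropped here:

```python
import numpy as np

def to_weekly(daily, days_per_week=7):
    """Reshape a 1-D daily consumption series into a 2-D (weeks x days)
    matrix; trailing days that do not fill a whole week are dropped.
    For the SGCC data, 1035 days yield 147 full weeks."""
    n_weeks = len(daily) // days_per_week
    return daily[: n_weeks * days_per_week].reshape(n_weeks, days_per_week)

daily = np.arange(1035, dtype=float)   # stand-in for one consumer's record
weekly = to_weekly(daily)
```

In the 2-D form, each row is one week, so weekly periodicity appears as similarity between rows and non-periodic (suspicious) behaviour appears as rows that break the pattern.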
In short, 1-D (daily) data is fed to the GRU module and weekly, i.e., 2-D, data is passed as an input to the DenseNet-FCN module. Finally, rather than using a sigmoid that acts as an SFNN, we employ LightGBM to give the final ETD results and improve the model's performance. Figure 2 demonstrates the architecture of the DenseNet-FCN model for feature extraction. DenseNet-FCN [51] is built upon the ideas of feature reuse, dense connectivity and improved flow of data and gradients throughout the method. The information flows from a downsampling (condensing) to an upsampling (expanding) path. Downsampling is the encoding part and upsampling is the decoding part. Downsampling contains convolution, dense and transition down blocks. Conversely, upsampling contains convolution, dense and transition up blocks. The whole method is composed of an input layer, four transition up and down blocks with five layers, dense blocks with 26 layers at the encoding and decoding sides and finally an output layer. Each layer takes its input from the preceding layers. The DenseNet-FCN model does not strictly follow the rules of dense connection due to the introduction of several downsampling operations. It is noteworthy that we combine feature maps through concatenation rather than summation. The feature maps' sizes need to be consistent for concatenation, which means the convolutional layer's output size is the same as its input. Feature maps are stored in memory after concatenation, which is memory consuming. Moreover, the network can become complex, consume more resources and halt if every layer follows the dense connection.

¹ State Grid Corporation of China, http://www.sgcc.com.cn/

1) DENSENET-FCN
Keeping this bottleneck in mind, we use 64 convolutional kernels with a 3×3 filter size in the initial layer to extract potential features. The convolution stride is set to 2, which minimizes both the computational complexity of the model and the size of the feature map. Our fully convolutional network mainly comes under the dense block. Each dense block acts as a feature extractor, which includes a list of operations: dropout with a value of 0.2, batch normalization (BN), conv (1×1) and rectified linear unit (ReLU). It is notable that the dense connection is strictly followed between layers within a dense block. A transition up or down layer also contains ReLU, BN, conv (1×1), dropout with a value of 0.1 and a (2×2) pooling operation. We use conv (1×1) in the transition layer to minimize the feature map's size, avoid overfitting and increase the depth rate at the same time. The max pooling operation is used for dimensionality reduction and a dropout layer is added to avoid overfitting in such a densely connected network.
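The effect of dense connectivity on feature-map width can be illustrated with a toy numpy sketch, where a random linear map followed by ReLU stands in for the BN-conv-dropout sequence of one dense-block layer (the layer count and growth rate below are illustrative, not the paper's configuration):

```python
import numpy as np

rng = np.random.default_rng(1)

def dense_block(x, n_layers, growth_rate):
    """Dense connectivity: each layer receives the concatenation of all
    preceding feature maps and contributes `growth_rate` new channels."""
    features = [x]
    for _ in range(n_layers):
        inp = np.concatenate(features, axis=-1)      # concat, not sum
        w = rng.standard_normal((inp.shape[-1], growth_rate))
        out = np.maximum(inp @ w, 0.0)               # ReLU stand-in
        features.append(out)
    return np.concatenate(features, axis=-1)

x = rng.standard_normal((4, 64))                     # 4 positions, 64 channels
out = dense_block(x, n_layers=5, growth_rate=12)
# channel count grows linearly: 64 + 5 * 12 new channels
```

The linear channel growth is exactly why the concatenated feature maps become memory consuming, motivating the 1×1 convolutions in the transition layers that compress them back down.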
2) GRU MODULE
GRU is newer than, but quite similar to, LSTM, an enhanced variant of RNN. It uses the hidden state to transfer information instead of a cell state. Also, it has only two gates: a reset gate and an update gate. The former decides how much information to forget from the past. The latter makes the decision either to discard or retain new information. A sigmoid σ gate scales the data between 0 and 1, where 0 means no information is allowed to pass through the hidden state and 1 means the information should be fed as an input to the next state. A tanh gate is the activation function of the candidate state, which squashes the values between −1 and 1. GRU has the following equations (3)-(6) [52]:

u_t = σ(w_z · [h_{t−1}, x_t])                    (3)
r_t = σ(w_r · [h_{t−1}, x_t])                    (4)
h̃_t = tanh(w · [r_t ∗ h_{t−1}, x_t])            (5)
h_t = (1 − u_t) ∗ h_{t−1} + u_t ∗ h̃_t           (6)

where t is the timestamp, x_t is the input value and h_t represents the hidden state. w_z and w_r are the weights of the update gate u_t and reset gate r_t, respectively, whereas w is the weight of the candidate output h̃_t [52]. Some important advantages of GRU over LSTM or other temporal learning methods are efficient learning with fewer parameters, insensitivity to noise and more distributed information. GRU is used to find temporal correlations and detect fluctuations in consumption, as it compares the value at one timestamp with the others and performs prediction on this basis. Moreover, it is efficient in terms of time and memory because of its two gates. So, the 1-D data is fed into GRU for learning sequential information.
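A minimal numpy implementation of one GRU step following equations (3)-(6) is shown below; bias terms are omitted and the weights are random, purely for illustration:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, wz, wr, w):
    """One GRU step: update gate u_t (3), reset gate r_t (4),
    candidate state h~_t (5) and new hidden state h_t (6)."""
    concat = np.concatenate([h_prev, x_t])
    u_t = sigmoid(wz @ concat)                                  # (3)
    r_t = sigmoid(wr @ concat)                                  # (4)
    h_cand = np.tanh(w @ np.concatenate([r_t * h_prev, x_t]))   # (5)
    h_t = (1 - u_t) * h_prev + u_t * h_cand                     # (6)
    return h_t

rng = np.random.default_rng(2)
hidden, n_in = 8, 1          # 1-D daily readings, 8 hidden units
wz = rng.standard_normal((hidden, hidden + n_in))
wr = rng.standard_normal((hidden, hidden + n_in))
w = rng.standard_normal((hidden, hidden + n_in))

h = np.zeros(hidden)
for reading in [0.2, 0.0, 0.9, 0.1]:   # toy consumption sequence
    h = gru_step(np.array([reading]), h, wz, wr, w)
```

In practice, the Keras GRU layer used in the simulations implements these same gates (with biases and trained weights); the loop above only makes the recurrence explicit.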

3) LIGHTGBM MODULE
LightGBM is based on GBT and was proposed by Microsoft in 2017. It forms a strong learner by combining several weak learners to boost performance. It uses a leaf-wise tree growth strategy, which reduces more loss than level-wise growth for the same number of leaves. We use it as a classifier for the final results as well as for improving the learning ability of the weak learners on the basis of the calculated loss. It outputs 0 for honest consumption and 1 if theft is detected.
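The additive idea behind gradient boosting can be illustrated with a toy sketch that repeatedly fits a decision stump to the current residuals. LightGBM follows this same principle, with leaf-wise tree growth and many engineering refinements that are not reproduced here; the stump fitting and toy data below are entirely ours:

```python
import numpy as np

def fit_stump(x, residual):
    """Find the (feature, threshold, leaf values) stump that best
    matches the current residuals in squared error."""
    best = (np.inf, None)
    for j in range(x.shape[1]):
        for thr in np.unique(x[:, j]):
            left = x[:, j] <= thr
            if left.all() or not left.any():
                continue
            lv, rv = residual[left].mean(), residual[~left].mean()
            err = ((residual - np.where(left, lv, rv)) ** 2).sum()
            if err < best[0]:
                best = (err, (j, thr, lv, rv))
    return best[1]

def boost(x, y, n_rounds=20, lr=0.5):
    """Toy boosting: each round fits a weak stump to the residual
    y - F(x) and adds it to the ensemble with shrinkage lr."""
    pred = np.full(len(y), y.mean())
    stumps = []
    for _ in range(n_rounds):
        j, thr, lv, rv = fit_stump(x, y - pred)
        pred += lr * np.where(x[:, j] <= thr, lv, rv)
        stumps.append((j, thr, lv, rv))
    return pred, stumps

# toy data: theft label 1 when the first feature drops below 0.3
rng = np.random.default_rng(3)
x = rng.random((60, 2))
y = (x[:, 0] < 0.3).astype(float)
pred, stumps = boost(x, y)
acc = ((pred > 0.5) == y).mean()
```

Each added stump corrects what the ensemble so far gets wrong, which is the "improving the weak learners on the basis of loss" behaviour described above.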

IV. SIMULATION RESULTS
This section presents the simulation results of the proposed model, which is implemented using Python as follows.
1. Theft profiles are generated using the ROBC strategy to balance the positive (theft) and negative (honest) classes. The missing data values are recovered using the linear interpolation method. After regaining the values, the algorithm adopted for outlier removal is the "Three-sigma rule of thumb" [53]. Furthermore, we apply the min-max normalization of the Scikit-learn library for scaling the data after removing the erroneous values. Finally, we take two types of data, 1-D and 2-D, to capture both periodicity and non-periodicity.
2. The GRU model is trained on 1-D data by means of the Keras library. The 1-D data is transformed into 2-D data and passed into DenseNet-FCN for learning the significant consumption samples. The neural networks are formed and trained using the TensorFlow framework, which is an open source deep learning package [54]. LightGBM is used as an ensemble classifier to predict the final honest and theft consumers. The proposed NTL detection solution is simulated in Google Colab.

A. DATASET AVAILABILITY
To train and test the proposed solution, the real smart meter data of SGCC is used, which is labeled data. Specifically, this dataset contains 42,372 electricity consumers' data over 1035 days (January 1, 2014 to October 31, 2016). The data is sorted by date. Days are the features in columns, which means that it is multivariate data. Consumers' consumption values are represented as observations in rows. The data is imbalanced, as there are 38,757 honest electricity consumers and 3,615 electricity thieves. It contains erroneous values as well. Therefore, we have applied methods for dealing with outliers and imbalanced data in the preprocessing stage, as described in Sections III-A and III-B.

B. PERFORMANCE METRICS
One of the most difficult challenges in tackling electricity theft with a supervised learning approach is choosing a suitable metric for the class imbalance problem. In the confusion matrix, a true positive (T+) corresponds to a correctly identified fraud case and a true negative (T−) refers to a correctly identified no-fraud case. Moreover, a false positive (F+) and a false negative (F−) refer to a no-fraud case falsely identified as fraud and an undetected fraud case, respectively. In this paper, we conduct simulations using the two most suitable performance metrics: the AUC-ROC curve and PR-AUC [55]. The loss function, i.e., binary cross entropy (BCE) [44], of the proposed hybrid solution is calculated for the DenseNet and GRU methods. AUC is often used to evaluate classification performance, as it indicates the model's capability to distinguish between classes. It measures the quality of the model's separability, i.e., it is a numerical representation of a binary classifier. A moderate classifier has an AUC value of 0.5, whereas the value for a perfect classifier is equal to 1. AUC is desirable due to its scale and classification threshold invariance. AUC is calculated using the formula [56]:

AUC = (Σ_{i ∈ positive class} rank_i − p(p + 1)/2) / (p · n)      (7)

where rank_i represents the rank of sample i, and p and n are the numbers of positive and negative cases, respectively. Moreover, ROC gives the visual representation of a binary classifier, i.e., the curve of probability. It shows the true positive rate (TPR) on the y-axis and FPR on the x-axis to give a visual understanding of the model's performance. TPR is also known as recall or hit rate and is given as:

TPR = T+ / (T+ + F−)                                              (8)

Recall is the ratio of truly identified fraud consumers to the total number of actual fraud consumers. On the other hand, FPR is defined in equation (9) as:

FPR = F+ / (F+ + T−)                                              (9)

Ideally, the ROC curve gradually moves towards the top-left, which means that the model is correctly predicting positive and negative cases. The objective of ETD is to increase TPR and decrease FPR.
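The rank-based AUC formula can be checked with a few lines of numpy; tie-handling of ranked scores is ignored in this sketch:

```python
import numpy as np

def auc_from_ranks(scores, labels):
    """AUC via the rank formula:
    AUC = (sum of positive-sample ranks - p(p+1)/2) / (p * n),
    where rank 1 is assigned to the lowest score."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    p = int(labels.sum())
    n = len(labels) - p
    return (ranks[labels == 1].sum() - p * (p + 1) / 2) / (p * n)

scores = np.array([0.1, 0.4, 0.35, 0.8])   # classifier outputs
labels = np.array([0, 0, 1, 1])            # 1 = theft, 0 = honest
auc = auc_from_ranks(scores, labels)
# here 3 of the 4 (positive, negative) pairs are ranked correctly -> 0.75
```

The result matches the pairwise interpretation of AUC: the probability that a randomly chosen theft sample is scored higher than a randomly chosen honest sample.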

C. SIMULATION SETTINGS
The dataset is divided into 70% for training and 30% for testing of the proposed model. As neural networks strongly depend on their hyperparameters, we tune the network of the proposed DenseNet-GRU-LightGBM model by controlling the size of its hidden layers. We set the parameter values using grid search and then monitor the performance on the validation dataset. Table 2 shows the range of values for each parameter and the optimal value found among them. Dropout is used as a hidden layer for every block of the DenseNet-FCN and GRU modules. Dropout values of 0.1 and 0.2 were tried; the model performs well with a dropout value of 0.1, whereas with the greater value it tends to overfit. The number of layers selected for DenseNet-FCN is 67. With 56 layers, it performs only moderately in terms of ETD. Increasing the number of layers increases the number of parameters, the demand for large data and the computational time, and thus leads to overfitting. So, to avoid these issues, we choose DenseNet-FCN with 67 layers, which performs better than the 56 and 103 layer variants.
On the other hand, as GRU requires less computational time than LSTM, the number of layers is set to 10 with 30 units for each GRU layer. A dropout layer is also used with each GRU layer to reduce overfitting. The max depth set for LightGBM is 5, where lambda_l1 and lambda_l2 are 0.1 and 0.01, respectively. Alpha is the learning rate (step size) of the Adam optimizer; a larger value of alpha results in faster initial training. Adam is well suited to large data sizes or large numbers of model parameters, and it is a robust technique that is computationally inexpensive and requires little memory. Beta_1 and Beta_2 are the exponential decay rates of the optimizer's first and second moments, respectively. Values closer to 1 mean that the model deals better with the sparse gradient problem. Epsilon is a small number that prevents division by zero during implementation. Beta_1, Beta_2 and epsilon are kept at the default values of the Adam optimizer. Table 2 presents the configuration of the model's parameters and the Adam optimizer's learning_rate.

D. SIMPLE DENSENET RESULTS
In order to assess the performance of the proposed solution for ETD, extensive simulations are performed. Firstly, the Val_loss of the simple models, i.e., the DenseNet model and the GRU model, is observed separately, and then the performance of these models as a hybrid approach is evaluated. Figure 3 shows the performance of the simple DenseNet model when weekly profiles (2-D) are given as input, which is validated as V.4. Here, loss is the training loss and Val_loss is the validation loss, which shows the learning capability of the model. We set the number of epochs to 50 to show a clear visualization of the loss at each iteration. An early stopping technique that monitors loss and learning rate is employed for the model.

E. GRU MODEL RESULTS
The performance contribution of the GRU module is depicted in Figure 4 and validated as V.4. Likewise, its loss also gradually decreases with the increase in iterations on both the training and validation sets. However, at iterations 10, 14 and 20, there is a sudden increase in loss on the validation data. This is due to overfitting, which can occur when either the same batch of data is selected and given to the model during training or the samples given as input contain 0 or NaN values. The training loss decreases from 3.0 to 0.61 and the Val_loss of GRU decreases from 0.8 to 0.3. It performs well at the 30th iteration with a Val_loss value of 0.34. We can observe (through validation V.4) that the GRU model gives satisfying results when 1-D profiles are given to it.

F. THE PROPOSED DENSENET-GRU-LIGHTGBM MODEL RESULTS
The electricity consumption of a normal consumer is typically consistent and within a normal range based on household size, appliances used and daily routines; it is metered and recorded accurately by the utility company. In contrast, a theft consumer steals electricity by tampering with the electricity meter or by using illegal connections to bypass the meter altogether. The actual electricity usage of a theft consumer is typically much higher than that of a normal consumer and is inconsistent, varying from day to day. Due to tampering, most of the recorded monthly consumption values are either 0 kW or lower than the actual consumption, which is why such a consumer is regarded as a suspicious or theft user.
An example of theft samples created by ROBC is shown in Figure 5. Here, three random theft cases, i.e., manipulated electricity consumption profiles (Cases 1, 2 and 3) from the dataset in October 2015, are shown and compared with three random theft cases (Cases 4, 5 and 6) generated by ROBC for the same month. If we look at theft Case 6 created by ROBC, most of the user's electricity consumption values are 0 kW or very low, which makes the user suspicious as a theft user. Traditional sampling methods, as discussed earlier in Section II-C, replicate the samples in the dataset, which causes overfitting while training machine learning models. It can be clearly seen that the samples generated by ROBC do not replicate existing theft samples, which makes ROBC more acceptable than traditional sampling techniques like SMOTE. It is also depicted that the theft cases created by ROBC show irregular patterns and are consistent with the existing samples in the dataset, validated as V.3.
Figure 6 depicts the performance of the proposed combined solution in terms of Val_loss. It is evident that the proposed solution smoothly minimizes both loss and Val_loss because it efficiently captures the periodicity and non-periodicity of consumers' behavior through GRU and DenseNet-FCN. Furthermore, to demonstrate the significance of the proposed ROBC technique, the ETD performance of the proposed model is also compared with the existing SMOTE technique in Figure 7. AUC values range from 0 to 1, where zero means poor classification ability and one means good detection ability of the model. In the AUC-ROC and PR-AUC curves, there is a dotted no-skill line that shows the baseline prediction performance; this line is located at the 0.5 value, and the closer the AUC value is to 1, the better the model's results. It is clear that the model has low performance with SMOTE because SMOTE introduces additional noise through overlapping samples of the same class.
With SMOTE, the model only achieves an AUC-ROC value of 0.79. On the other hand, the AUC-ROC value is 0.84 for the proposed solution using the ROBC technique, as depicted in Figure 7. These results demonstrate that the ROBC technique is more appropriate for ETD than the existing SMOTE.
The PR-AUC curve with SMOTE is also shown in Figure 8, where the value is 0.80 for our model. In contrast, the proposed model performs best when applying ROBC, with a PR-AUC value of 0.87. The PR-AUC results indicate that the proposed sampling technique reduces the overfitting problem and noise, which increases the precision and recall values of the proposed ETD model.
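For reference, the two metrics used throughout these comparisons can be computed directly; the following is a minimal numpy sketch of AUC-ROC (as the probability that a positive is ranked above a negative) and PR-AUC (as average precision), on an illustrative toy label/score set, not on the SGCC data:

```python
import numpy as np

def auc_roc(labels, scores):
    """Rank-based AUC-ROC: probability that a random positive sample
    is scored above a random negative sample (ties count half)."""
    labels, scores = np.asarray(labels), np.asarray(scores)
    pos, neg = scores[labels == 1], scores[labels == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

def pr_auc(labels, scores):
    """Average precision: area under the precision-recall curve,
    scanning predictions from highest to lowest score."""
    order = np.argsort(-np.asarray(scores))
    labels = np.asarray(labels)[order]
    tp = np.cumsum(labels)
    precision = tp / np.arange(1, len(labels) + 1)
    # Sum the precision at each true positive, normalised by the positive count
    return float(np.sum(precision * labels) / labels.sum())

# Illustrative labels (1 = theft) and model scores
y = [1, 1, 0, 1, 0, 0]
s = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]
```

A no-skill classifier yields an AUC-ROC of 0.5 (the dotted line in the figures), while a PR-AUC baseline sits at the positive-class prevalence, which is why PR-AUC is the more informative metric on imbalanced theft data.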

G. PARAMETERS SETTING OF BENCHMARK METHODS
To evaluate the performance of the proposed model for ETD, a comparison is performed with other state-of-the-art benchmark approaches on the same dataset: XGBoost, SVM, W&D-CNN, CNN-LSTM, LSTM-MLP and GRU.

1) EXTREME GRADIENT BOOSTING (XGBOOST)
XGBoost is a popular algorithm in machine learning due to its fast execution speed and improved performance. It has already been successfully implemented for ETD [8] and is now widely applied in different fields for classification and regression purposes. It combines a number of decision trees to make more accurate predictions. Generally, it boosts weak learners into strong learners by increasing the weights of the features on which the weak learners perform poorly. Table 3 shows the hyperparameter values set for this method during validation. The objective function of this method is the sum of the loss functions over all predictions plus the sum of the regularization terms over all learners.
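The boosting principle described above (an additive ensemble of weak tree learners, each fitted to correct the current errors) can be sketched with regression stumps; this is a minimal illustration of gradient boosting on toy data under squared loss, not the actual XGBoost implementation, which additionally includes the regularization terms in its objective:

```python
import numpy as np

def fit_stump(x, residual):
    """Weak learner: find the threshold split that best fits the
    residuals under squared loss."""
    best = None
    for t in np.unique(x):
        left, right = residual[x <= t], residual[x > t]
        if len(left) == 0 or len(right) == 0:
            continue
        pred = np.where(x <= t, left.mean(), right.mean())
        sse = ((residual - pred) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    return best[1:]

def boost(x, y, rounds=20, lr=0.3):
    """Additively combine stumps, each fitted to the current residuals
    (the negative gradient of the squared loss)."""
    pred = np.full_like(y, y.mean(), dtype=float)
    for _ in range(rounds):
        t, lmean, rmean = fit_stump(x, y - pred)
        pred += lr * np.where(x <= t, lmean, rmean)
    return pred

# Illustrative 1-D toy data
x = np.array([0., 1., 2., 3., 4., 5.])
y = np.array([0., 0., 0., 1., 1., 1.])
pred = boost(x, y)
```

Each round shrinks the residuals by a factor controlled by the learning rate, which is how a sequence of weak stumps becomes a strong ensemble.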

2) SVM
Many previous studies have applied SVM to detect the presence of electricity theft [57], [58]. The goal of this algorithm is to create a hyperplane that maximizes the margin between classes in order to detect theft more accurately. A sigmoid kernel is used for SVM. The value of the penalty hyperparameter C is given in Table 4.
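A minimal scikit-learn sketch of the setup described above (sigmoid kernel, penalty C) follows; the toy features, labels and the C value are illustrative assumptions, since the paper's actual C value appears only in Table 4:

```python
import numpy as np
from sklearn.svm import SVC

# Illustrative consumption-like features and theft labels (1 = theft)
X = np.array([[0.1, 0.2], [0.2, 0.1], [0.9, 1.0], [1.0, 0.8],
              [0.15, 0.05], [0.85, 0.95]])
y = np.array([0, 0, 1, 1, 0, 1])

# Sigmoid kernel as in the paper; C penalizes margin violations,
# so larger C fits the training data more tightly.
clf = SVC(kernel="sigmoid", C=1.0)
clf.fit(X, y)
preds = clf.predict(X)
acc = clf.score(X, y)
```

The choice of C trades off margin width against training error, which is why it is the hyperparameter reported for this benchmark.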

3) W&D-CNN
This architecture was proposed by the authors in [3] for NTL detection. Raw 1-D daily electricity consumption data is fed to the wide network and 2-D data is passed to the deep CNN. We use a similar setup to fairly compare the performance of W&D-CNN with our model. Sigmoid, ReLU and Leaky ReLU activation functions are alternatively selected to find the model's best performance. An MLP module has also been implemented as the classifier module within the architecture to obtain the final classification results. The parameter settings of this model are stated in Table 5, where a and b denote the numbers of neurons in the densely connected layers of the W&D-CNN components, R represents the number of layers in the deep CNN network, and the filter value in the convolutional layers is set to 20.

4) CNN-LSTM
The parameter settings of the CNN are the same as those of the deep CNN described above. However, the difference is the use of an LSTM module in place of the wide CNN. LSTM is an extension of RNN that connects previous information to the current task. It is able to retain and utilize past information and resolves the gradient vanishing problem. However, it is computationally expensive and memory consuming [10]. The set of hyperparameter values for this model is given in Table 6.

5) HYBRID LSTM
LSTM performs similarly to RNN, but with better memorizing ability, and prevents the gradient vanishing problem. The parameter values of LSTM are the same as described above. However, an SFNN, also known as MLP, is used to produce the final results. A sigmoid kernel is used for the MLP-based model in [44]. The parameter settings are given in Table 7.

6) GRU
GRU is already described in Section III-C2. It performs similarly to LSTM, although it is more computationally efficient. Since it has no cell state, it is less able to maintain long-term sequential information. GRU is implemented with a simple MLP layer [26] in order to compare it with LSTM-MLP and the proposed model for ETD. A separate table for GRU's hyperparameters is not needed, as its values are the same as those of the proposed GRU module, with an additional MLP layer using a sigmoid kernel function.
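The reason GRU is lighter than LSTM, two gates and no separate cell state, can be seen from a single GRU step, sketched below in numpy using the standard update-gate formulation (the sizes and random weights are illustrative, not the paper's 10-layer, 30-unit configuration; gate-interpolation conventions vary slightly between references):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x, h, W, U, b):
    """One GRU step: update gate z, reset gate r, candidate state h_tilde.
    Unlike LSTM, there is no separate cell state -- only the hidden state h."""
    z = sigmoid(W["z"] @ x + U["z"] @ h + b["z"])               # update gate
    r = sigmoid(W["r"] @ x + U["r"] @ h + b["r"])               # reset gate
    h_tilde = np.tanh(W["h"] @ x + U["h"] @ (r * h) + b["h"])   # candidate state
    return (1.0 - z) * h + z * h_tilde                          # interpolated new state

rng = np.random.default_rng(0)
n_in, n_hid = 4, 3   # illustrative sizes
W = {k: rng.standard_normal((n_hid, n_in)) * 0.1 for k in "zrh"}
U = {k: rng.standard_normal((n_hid, n_hid)) * 0.1 for k in "zrh"}
b = {k: np.zeros(n_hid) for k in "zrh"}

h = np.zeros(n_hid)
for x in rng.standard_normal((5, n_in)):   # a length-5 input sequence
    h = gru_step(x, h, W, U, b)
```

With only three weight matrices per gate group instead of LSTM's four, and no cell state to carry, the GRU step needs fewer parameters and operations, which matches the computational-efficiency comparison above.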

7) COMPARISON AND RESULTS DISCUSSION
In this section, the proposed model is compared with the most recent ETD methods, as mentioned above. The parameter settings used by each model are described earlier. All these methods are provided with the same SGCC dataset as input for preprocessing and training to fairly check each one's performance. As can be seen from Figure 9, the proposed solution performs better than the other existing models on the validation dataset. The AUC-ROC score of the proposed model is about 0.92, whereas XGBoost and CNN-LSTM have an AUC-ROC score of 0.82 with a lower FPR. The reason is that the proposed sampling technique synthesizes more efficient samples than the traditional techniques, which reduces the overfitting problem. Moreover, the combined use of GRU, DenseNet-FCN and LightGBM for ETD efficiently learns the periodicity and non-periodicity of the electricity consumption data. The rest of the models, such as W&D-CNN, GRU and LSTM, have a high FPR with AUC-ROC scores of 0.78, 0.75 and 0.72, respectively, because these methods suffer from overfitting.
On the other hand, if we observe the PR-AUC curve of the proposed method against the other methods in Figure 10, its score is nearly equal to 0.9, which means that the proposed model has the lowest FPR and the highest TPR. Other methods have scores ranging between 0.70 and 0.83, except CNN-LSTM, which performs worst with the lowest precision and recall. It is validated from AUC-ROC and PR-AUC, as V.5 and V.6, that with the LightGBM method the proposed solution outperforms the single classification models and the other hybrid models. Figure 11 briefly depicts that SVM does not perform well on the imbalanced dataset; even when balanced data is given to improve its performance, it still performs poorly.
Apart from this, the other benchmark methods have precision values ranging between 0.65 and 0.80. Still, the proposed hybrid model beats all of the benchmark methods. Table 8 provides a mapping between the limitations and the proposed solutions, along with their validation. The validation of the solutions is done through the simulations shown in the figures. Hence, the proposed model contributes well to solving the limitations and enhancing the performance of ETD.

V. CONCLUSION AND FUTURE WORK
A novel ROBC technique is presented in this work so that supervised machine learning models can learn both honest and theft cases equally, thereby improving ETD. This technique resolves the overfitting problem, preserves resemblance to the actual theft cases and enhances the learning capability of the models to improve the ETD rate. This paper also presents a methodology for ETD known as the DenseNet-GRU-LightGBM model. The methodology performs feature engineering using DenseNet-FCN and GRU, where the former is employed to extract potential features from 2-D weekly data in order to better learn non-periodic patterns, while the latter extracts features from daily 1-D electricity consumption records for better memorization. LightGBM is then used to give the final classification results. We conduct comprehensive simulations on realistic electricity consumption data provided by SGCC. The simulation results demonstrate that the proposed solution outperforms existing methods, such as W&D-CNN, CNN-LSTM, LSTM, GRU, XGBoost and SVM, in terms of AUC-ROC and PR-AUC. The proposed solution is also applicable to scenarios other than ETD, such as industrial or economic applications involving anomaly or intrusion detection. For future work, we plan to apply electricity theft datasets from different countries to the proposed framework to determine its resilience and efficiency regarding ETD in the electricity distribution system. We will also consider more state-of-the-art deep learning models and analyze their computational efficiency.
ZEESHAN ASLAM is currently a Lecturer with the Department of Computer Science, Bahria University, Islamabad Campus, Pakistan. His current research interests include data science, machine learning, deep learning, computer vision, artificial intelligence, software engineering, privacy preservation, and smart grids.
TAMARA AL SHLOUL is currently an Assistant Professor (humanities) with the Liwa College of Technology. She has vast experience of teaching education and humanities courses, along with experience in school supervision, thinking skills, and higher education improvement ability. Her current research interests include teacher socialization and professional development.
AQDAS NAZ received the bachelor's degree in software engineering from the University of Engineering and Technology, Taxila, Pakistan, the master's degree in software engineering from the National University of Sciences and Technology, Pakistan, and the Ph.D. degree from the Research Laboratory (ComSens), COMSATS University Islamabad, Islamabad Campus. Her current research interests include data science, machine learning, deep learning, optimization, security and privacy, energy trading, and smart grids.
MUHAMMAD IMRAN NADEEM received the M.S. degree in software engineering from the National University of Sciences and Technology, Pakistan. He is currently pursuing the Ph.D. degree in software engineering with Zhengzhou University, Zhengzhou, China. His current research interests include machine learning, deep learning, natural language processing (NLP), data analysis, and optimization.
MOSLEH HMOUD AL-ADHAILEH received the Ph.D. degree in computer science (AI). He is currently the Director of e-learning and distance education for operation with King Faisal University. His current research interests include artificial intelligence (AI), machine learning (ML), natural language processing, robotics programming, knowledge representation, and e-learning strategies and technologies.
YAZEED YASIN GHADI received the Ph.D. degree in electrical and computer engineering from Queensland University. His dissertation focused on developing novel hybrid plasmonic photonic on-chip biochemical sensors. He was a Postdoctoral Researcher with Queensland University. He is currently an Assistant Professor of software engineering with Al Ain University. He has published more than 80 peer-reviewed journals and conference papers and he holds three pending patents. His current research interests include developing novel electroacoustic-optic neural interfaces for large-scale high-resolution electrophysiology and distributed optogenetic stimulation. He was a recipient of several awards. He received the Sigma Xi Best Ph.D. Thesis Award for his Ph.D. degree.
HEBA G. MOHAMED was born in Alexandria, Egypt, in 1984. She received the B.Sc. and M.Sc. degrees in electrical engineering from Arab Academy for Science and Technology, in 2007 and 2012, respectively, and the Ph.D. degree in electrical engineering from Alexandria University, Egypt, in 2016. In 2016, she was an Assistant Professor with the Alexandria Higher Institute of Engineering and Technology, Ministry of Higher Education, Egypt. Since 2019, she has been an Assistant Professor with the Faculty of Engineering, Communication Department, Princess Nourah bint Abdulrahman University, Saudi Arabia. In 2022, she was an associate professor in Egypt. Her current research interests include cryptography, wireless communication, mobile data communication, the Internet of Things, and computer vision.