Simulation of electricity consumption data using multiple artificial intelligence models and cross validation techniques

Worldwide, electricity production exceeds its consumption which leads to wasted financial and energy resources. Machine learning models can be utilized to predict the future consumption to avoid these significant losses. This paper presents the data for the monthly electricity consumption on the community level during May 2017–December 2019 in Dubai, United Arab Emirates. It was acquired from Dubai Pulse, an online repository containing consumption data from Dubai Electricity and Water Authority which provides utility services to the Emirate. Multiple parameters, such as population and number of buildings, were acquired from Dubai Statistics Center in addition to temperature which was obtained from Dubai International Airport. Additional features, such as expatriate ratio, number of customers, and building occupancy, were computed from the available data and utilized to generate a dataset towards accurate prediction. Various linear regression variants, support vector machines, decision tree models, ensemble models, and neural networks were implemented to forecast electricity consumption. The models were trained on two different formats of the same dataset, which were generated by sorting the data with respect to time, named as temporally ordered dataset, and by randomly dividing the data, labelled as randomly split dataset. In addition, the dependence of the models on the amount of data was identified by varying the size of the testing data. Moreover, two cross-validation (CV) procedures, namely rolling CV method and moving CV method, were applied to assess the reliability of the models. All analyses were evaluated by utilizing several performance metrics, namely root mean squared error, coefficient of determination, i.e., R2, 10-fold CV score, mean absolute error, median absolute error, and computational time. 
Furthermore, this data could be utilized to analyze the effect of coronavirus disease 2019 (COVID-19) prevention measures in Dubai on electricity usage as well as evaluate the consumption patterns at the consumer level.


Value of the Data
• This data is highly detailed, with more than 50 million observations providing the monthly electricity consumption per customer. In addition, essential clarifications are presented in this paper about the consumption data and its behavior.
• Utility service providers can use the data to plan ahead for the required energy supply, reduce losses in the grid, and improve the overall efficiency towards energy savings and sustainability. They can achieve this by identifying the communities with fluctuating consumption to prepare plans for load balancing, as well as by relating the varying consumption patterns to the different end users, i.e., rate category, to create and implement strict regulations towards reduced consumption.
• The present study analyzes the effect of external factors, such as ambient temperature, annual population and number of buildings per community, number of sites, expatriate ratio, and building occupancy, on electricity consumption patterns in fast-growing cities and determines the best-performing machine learning model with the most stability and reliability.
• The historical records of electricity consumption presented in this paper can help the scientific community build robust models that generate accurate forecasts for a specific city and for other cities with similar demographic features.
• The data can be further analyzed at the household level to gain insight into customer behavior and consumption patterns by extracting the observations and aggregating them with respect to each household, e.g., contract account.
• This data can be used to train various machine learning models to estimate electricity consumption, such as linear regression variants, support vector machines, decision tree models, ensemble models, and neural networks.

Data Description
The dataset details the monthly electricity consumption at the community level in Dubai, United Arab Emirates, from May 2017 to December 2019. It consists of electricity consumption records as well as socio-demographic and climatic factors obtained from three repositories. The first repository is Dubai Pulse, to which Dubai Electricity and Water Authority (DEWA), the electricity provider in Dubai, uploads monthly electricity consumption records for each household account [3, 4]. The number of observations obtained from each file is listed in Table 1. As the study aims to evaluate the consumption patterns in Dubai, accounts enrolled in the renewable energy generation program using solar panels, i.e., Shams Dubai, were omitted. This was due to the lack of any numerical relation between electricity consumption and Shams Dubai consumption.
Table 2 lists the attributes of both data sources. Moreover, DEWA has nine rate categories based on the site type, e.g., commercial, industrial, and expatriate residential.
Table 3 describes the present rate categories, as well as the annual number of accounts and energy consumption for each rate category. Fig. 1 presents a heat map for the aggregated electricity consumption of all consumers per community.
The second repository is Dubai International Airport, where the monthly values of the ambient temperature were acquired for the entire Emirate during the study period [5]. Moreover, the annual population and the number of buildings per community were obtained from the third repository, Dubai Statistics Center, which provides various statistical data from 24 different governmental sectors in Dubai, including demographic, economic, and climatic records [6]. Fig. 2 shows a heat map for the population per community. Table 4 presents a list of the total number of buildings per building type for the entire city. Overall, Dubai encompasses more than 250 communities, which were divided into nine sectors in this study, as shown in Fig. 3, to facilitate the presentation of the accumulated annual consumption and population in Table 5.
Additional features were extracted and computed using the acquired data. Pre-processing methods were carried out to aggregate electricity consumption with respect to time and community. Afterwards, the obtained features, such as temperature, population, number of buildings, number of sites, expatriate ratio, and building occupancy, were added to the dataset. Next, the necessary data cleaning, integration, and transformation processes were applied, in which the lower and upper one percentile were removed and the electricity consumption values were normalized. Fig. 4 shows the flowchart for the methodology followed to obtain the dataset. The dataset presented in this article is included in two versions. The first one is "Aggregated Dataset.csv", which consists of the aggregated monthly electricity consumption records on the community level and the exogenous parameters. This dataset is generated using the simulation software [2]. The other is "Final Dataset -Processed.csv", which contains the final dataset generated after applying multiple pre-processing and feature importance steps. This dataset is used to model the electricity consumption predictors presented in the related manuscript [1]. Table 6 shows a brief description and a statistical summary of the mean and range values of all the features in the aggregated dataset after pre-processing. Moreover, the repository includes a supporting file called "Aggregated Electricity and Temperature on City Level.csv", which consists of the aggregated electricity records on the level of Dubai city.
The month values were represented by a trigonometric function, i.e., cosine, as displayed in Fig. 5.
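A minimal sketch of such a cosine encoding is given below. The 12-month period and the phase that maps January to 1 are assumptions made for illustration, and `month_cosine` is a hypothetical helper name; the exact transformation behind Fig. 5 may differ.

```python
import math


def month_cosine(month: int) -> float:
    """Map a calendar month (1-12) onto a cosine wave with a 12-month period.

    Assumed phase: January -> 1.0 and July -> -1.0.
    """
    if not 1 <= month <= 12:
        raise ValueError("month must be in 1..12")
    return math.cos(2 * math.pi * (month - 1) / 12)


# Encoded value for every month of the year, rounded for readability.
encoded = {m: round(month_cosine(m), 3) for m in range(1, 13)}
print(encoded)
```

Under this assumed phase, December and February receive the same value, so months that are equally far from January sit next to each other instead of December (12) and January (1) being treated as numerically distant.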
Table 7 lists the number of communities per month in the final dataset prior to modeling. Fig. 6 presents the effect of extreme value removal, whereas Fig. 7 additionally illustrates this effect with the monthly variation of temperature and Fig. 8 demonstrates this impact with the annual deviation of population. Box plots of the monthly electricity consumption are shown in Fig. 9. The processed dataset was arranged in two forms to represent two datasets, namely 1) a temporally ordered dataset, where the electricity consumption values were ordered with respect to time, and 2) a randomly split dataset, where the observations were in random order when the models were trained.
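The difference between the two arrangements can be illustrated with a short sketch. The record layout `(year, month, community)`, the helper names, and the 25% test fraction are hypothetical; only the ordering-versus-shuffling idea is taken from the description above.

```python
import random


def temporal_split(rows, test_fraction):
    """Order rows by (year, month) and hold out the most recent ones for testing."""
    ordered = sorted(rows, key=lambda r: (r[0], r[1]))
    cut = len(ordered) - round(len(ordered) * test_fraction)
    return ordered[:cut], ordered[cut:]


def random_split(rows, test_fraction, seed=0):
    """Shuffle rows with a fixed seed, then hold out the same fraction for testing."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) - round(len(shuffled) * test_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical records: one row per community for each month from May 2017 to December 2019.
months = [(2017, m) for m in range(5, 13)]
months += [(y, m) for y in (2018, 2019) for m in range(1, 13)]
rows = [(y, m, c) for (y, m) in months for c in ("A", "B")]

train_t, test_t = temporal_split(rows, 0.25)
train_r, test_r = random_split(rows, 0.25)
print(len(train_t), len(test_t), len(train_r), len(test_r))
```

In the temporal arrangement every test month lies after every training month, whereas the random arrangement mixes months across both parts.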
Table 8 and Table 9 present the evaluation results of several models on the temporally ordered and randomly split datasets, respectively. Additionally, Table 10 presents the results of assessing different modeling techniques using moving and rolling cross validation, while Fig. 10 presents the models' performance under moving cross validation with varied dataset sizes. Furthermore, Table 11 displays the best-performing models employed for investigating the feature importance in predicting electricity consumption. It compares all features to a base scenario that includes climate.

Experimental Design, Materials and Methods
This section describes the utilized data sources, obtained and computed external factors, pre-processing steps, applied prediction analyses, and evaluation metrics. In addition, the following research questions were addressed: 1) What is the effect of exogenous features on electricity ... The present study aimed to examine the effect of external climatic and demographic factors on electricity consumption patterns, which provided insight into the first research question. In addition, the monthly electricity consumption was forecasted on the community level in Dubai using various machine learning models, which tackled the second research question. Moreover, the third research question was addressed by examining the performance and robustness of the developed models, in which the size of the training dataset was varied and moving and rolling cross validation methods were implemented.

Data sources
This section presents the utilized data sources, namely, Dubai Electricity and Water Authority, Dubai Statistics Center, and Dubai International Airport. They represent reliable governmental sources for the required datasets of electricity consumption and external factors in Dubai.

Consumption records
Data for electricity consumption and Shams Dubai, in which household owners generate and sell solar energy through a smart power grid, were obtained for May 2017 to December 2019 from an online repository, Dubai Pulse [3, 4]. Table 1 shows the number of observations of the electricity consumption in each data file, ranging between 1.4 and 1.8 million observations per file. Each file contained eight attributes, namely billing portion, community, rate category, consumption period, calendar month, contract account, business partner, and consumption unit, which are described in Table 2. The contract account represents a unique value of a residential, commercial, industrial, or governmental site, owned by a business partner who can be associated with multiple contract accounts. In addition, an individual billing portion may contain multiple communities. Moreover, the data files also contained contract accounts enrolled in the Shams Dubai initiative, in which households utilize solar panels as well as the electricity grid as an energy source. The data for these accounts contained two attributes called import unit and export unit instead of consumption unit, as indicated in Table 2. There was no numerical relation between the electricity consumption and Shams Dubai consumption. Furthermore, rate categories indicate the type of the contract account, e.g., residential, commercial, industrial, and governmental. Rate categories as well as the number of accounts and total consumption for each rate category are shown in Table 3. Furthermore, Fig. 1 shows a heat map for the electricity consumption in all communities, which was generated by computing the annual average of the electricity consumption with respect to the community.

External factors
Various external features reported in the literature to affect electricity consumption were identified and curated to facilitate modeling. For instance, monthly values of the ambient temperature were acquired from Dubai International Airport for the entire Emirate during the study period [5]. Fig. 3 shows the monthly variation of climate in Dubai. In addition, the annual population and number of buildings per community were obtained from Dubai Statistics Center [6]. Population data was available from 2017 to 2019, whereas the number of buildings was provided for 2019 only. Fig. 2 shows a heat map for the population of all communities in Dubai. The number of buildings was provided for nine different types of buildings, as shown in Table 4, which also summarizes the total number and respective percentage of each type of building. Multi-storey ratio buildings consist of two or more storeys and are built in nonconventional shapes, e.g., pyramid, cone, L-shape. Fig. 3 shows a graphical representation of the allocation of communities to sectors. The allocation of communities to sectors was implemented in this study to present a statistical summary for each sector. Table 5 summarizes the number of communities, population, and electricity consumption for each sector. Multiple features were extracted and computed from the various acquired data, such as number of sites, expatriate ratio, and building occupancy. For instance, number of sites represents the number of contract accounts with a positive consumption. Expatriate ratio was determined by dividing the number of expatriate residential accounts by the total number of residential accounts, whereas the ratio of the population to the number of buildings was utilized to obtain building occupancy. The computation of these factors was necessary to increase the accuracy of the studied models.
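The three computed features follow directly from the definitions above and can be sketched as below. The function names and the sample figures are hypothetical and are not taken from the dataset.

```python
def number_of_sites(consumptions):
    """Count contract accounts with a positive consumption."""
    return sum(1 for kwh in consumptions if kwh > 0)


def expatriate_ratio(expat_residential_accounts, total_residential_accounts):
    """Expatriate residential accounts divided by all residential accounts."""
    if total_residential_accounts == 0:
        return None  # undefined for a community without residential accounts
    return expat_residential_accounts / total_residential_accounts


def building_occupancy(population, buildings):
    """Population divided by the number of buildings in the community."""
    if buildings == 0:
        return None  # undefined for a community without recorded buildings
    return population / buildings


# Hypothetical community used purely for illustration.
sites = number_of_sites([120.0, 0.0, 45.5, 300.0])
ratio = expatriate_ratio(750, 1000)
occupancy = building_occupancy(12000, 400)
print(sites, ratio, occupancy)
```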

Dataset aggregation
All observations were stacked in one file; then the billing portion, business partner, and consumption period attributes were removed. Next, duplicate records were discarded to avoid any bias that may occur within the dataset. Afterwards, Shams Dubai contract accounts were excluded as there was no numerical relation between electricity consumption and Shams Dubai consumption. In addition, observations with a negative consumption were discarded to avoid the cancelation of actual instances when the data is aggregated. Additionally, observations prior to May 2017 were eliminated to limit the dataset to the study period. Moreover, all observations were aggregated with respect to the contract account. For instance, each data file contained two observations for the same contract account, where the first consumption was for a portion of the prior month, whereas the other belonged to the current month. The instances for the same month were added from two different files for the exact contract account. Instances of the same contract account in an individual file were not aggregated as they belong to two different months. Afterwards, observations with zero consumption and missing community data were discarded. The entire database containing more than 50 million observations was aggregated with respect to community and calendar month. Multiple external factors were joined to the dataset, namely temperature, population, number of buildings, number of sites, expatriate ratio, and building occupancy. The same temperature data was utilized across all communities for each month, whereas the annual values for population and number of buildings were utilized across all months of the respective year in the dataset. Next, all instances with missing data were removed. Furthermore, various percentages of extreme values, representing outliers, were discarded from the dataset to determine the optimum dataset size with the highest accuracy. For instance, the initial and final one percentile were excluded to ensure the elimination of any errors, then the consumption data was normalized [7]. Fig. 4 summarizes the data sources and the applied pre-processing methods. Next, the correlation coefficient was computed and all variables with a coefficient within the range of −0.2 to 0.2 were eliminated. This was implemented to ensure the elimination of any potential sources of errors.
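The trimming, normalization, and correlation screening steps could look roughly like the sketch below. Several details are assumptions rather than facts from the source: min-max scaling stands in for the normalization, the correlation is taken as the Pearson coefficient between a feature and consumption, and all function names are hypothetical.

```python
import math


def trim_extremes(rows, key, pct=1.0):
    """Drop rows whose `key` value lies in the lowest or highest `pct` percent."""
    ordered = sorted(r[key] for r in rows)
    k = int(len(ordered) * pct / 100)
    if k == 0:
        return list(rows)
    lo, hi = ordered[k], ordered[-k - 1]
    return [r for r in rows if lo <= r[key] <= hi]


def min_max(values):
    """Scale values to [0, 1] (assumed normalization method)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def keep_feature(feature, target, threshold=0.2):
    """Keep a feature only if its correlation lies outside [-threshold, threshold]."""
    return abs(pearson(feature, target)) > threshold


# Hypothetical consumption values used purely for illustration.
rows = [{"kwh": float(i)} for i in range(200)]
trimmed = trim_extremes(rows, "kwh")
scaled = min_max([r["kwh"] for r in trimmed])
print(len(trimmed), scaled[0], scaled[-1])
```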

Consumption prediction
The United Arab Emirates (UAE) and other Gulf Cooperation Council (GCC) countries have reported a significant increase in energy consumption over the past decades [1]. Despite the unique consumption patterns in the region, there has been a lack of research on large-scale energy consumption at the district or city level. Therefore, this paper investigates a selection of machine learning models, some of which have not been previously deployed for consumption prediction in this region at the mentioned scale. The models explored in this study include:
• Simple regression models, such as multiple linear regression (MLR), lasso regression (LR), and ridge regression (RR), deployed as a baseline for consumption prediction.
• Support vector regression (SVR), which can model the non-linearity of the data; however, it may not perform well when there is noise in the dataset.
• Decision Tree (DT) regressor as a tree-based baseline model.
• Ensemble models, including random forest regression (RF), gradient boosting machine (GBM), extreme gradient boosting (XGB), light gradient boosting (LightGBM), and stacking regressor. Ensemble models are generally capable of efficiently modeling complex data relationships regardless of the degree of correlation or linearity. Bagging and boosting ensemble techniques comprise many decision tree regressors as base estimators. However, bagging, as used in random forest models, reduces the prediction variance, while boosting ensemble techniques, as in gradient boosting machine, extreme gradient boosting, and light gradient boosting, can reduce the prediction bias. In contrast, stacking ensemble models can include baseline or complex models as base estimators and can improve the model prediction accuracy. Nevertheless, depending on the complexity of the models, they might require more computation resources and training time.
• The feedforward artificial neural network (ANN) model, which can extract complicated data relationships and produce high-performance models. It consists of neurons and requires additional computational resources compared to baseline models.
In addition, several evaluation metrics were utilized to assess the performance of the applied machine learning models, which addresses the research questions of the present study, including the median absolute error (MedAE):

$$\mathrm{MedAE} = \operatorname{median}\left( \left| X_1 - \hat{X}_1 \right|, \ldots, \left| X_N - \hat{X}_N \right| \right) \tag{4}$$

where $N$ is the data size, $X_i$ and $\hat{X}_i$ are the observed and forecasted values, respectively, and $\bar{X}$ and $\bar{\hat{X}}$ are the observed and forecasted means, respectively.

In addition, varying percentages of both datasets were utilized to test the load forecasting models. Tables 8 and 9 show the performance of ANN, GBM, DT, MLR, LR, RR, and SVR, whereas the behavior of the remaining models, namely RF, XGB, LightGBM, and stacking regressor, was presented by Abdallah et al. [1].

Moreover, two CV schemes, i.e., moving and rolling, were utilized to evaluate the ability of the models to train on new data accurately. In moving CV, the models were tested using varying dataset sizes to define the optimum length which yields an accurate estimation. Next, the models were trained using the optimum dataset length while sliding by an individual month at every prediction instant. On the other hand, rolling CV used a 12-month-long dataset to forecast the first instance, then the dataset length was increased by an individual month at every prediction instant. A detailed elaboration of the utilized machine learning models as well as the moving and rolling cross validation methods was provided by Abdallah et al. [1], which was followed in the present study.

Fig. 10 illustrates the average RMSE obtained by RF, stacking regressor, XGB, and LightGBM applying the moving CV scheme when tested on various dataset lengths. Moreover, the average and standard deviation of the forecasting accuracy for the remaining models implementing rolling and moving CV procedures are shown in Table 10. The feature importance was investigated by utilizing the best-performing models to predict the electricity consumption with all features versus a base scenario consisting of climate, as shown in Table 11.
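The evaluation metrics and the two CV schemes described above can be sketched in a few lines. This is an illustration under stated assumptions and not the implementation from [1]: the R² form (one minus the ratio of residual to total sum of squares) is the common textbook definition and is assumed here, the 12-month moving window is a placeholder and not the reported optimum length, and all function names are hypothetical.

```python
import math
import statistics


def rmse(obs, pred):
    """Root mean squared error."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))


def mae(obs, pred):
    """Mean absolute error."""
    return sum(abs(o - p) for o, p in zip(obs, pred)) / len(obs)


def medae(obs, pred):
    """Median absolute error, as in Eq. (4)."""
    return statistics.median(abs(o - p) for o, p in zip(obs, pred))


def r2(obs, pred):
    """Coefficient of determination (assumed form: 1 - SS_res / SS_tot)."""
    mean_obs = sum(obs) / len(obs)
    ss_res = sum((o - p) ** 2 for o, p in zip(obs, pred))
    ss_tot = sum((o - mean_obs) ** 2 for o in obs)
    return 1 - ss_res / ss_tot


def moving_cv(n_months, window):
    """Fixed-length training window that slides forward by one month per forecast."""
    return [(list(range(s, s + window)), s + window)
            for s in range(n_months - window)]


def rolling_cv(n_months, initial=12):
    """Training window that starts at `initial` months and grows by one month per forecast."""
    return [(list(range(end)), end) for end in range(initial, n_months)]


N_MONTHS = 32  # May 2017 to December 2019
moving_folds = moving_cv(N_MONTHS, window=12)  # placeholder window length
rolling_folds = rolling_cv(N_MONTHS)
print(len(moving_folds), len(rolling_folds))
```

Each fold pairs a list of training month indices with the single month index to be forecasted, so the moving scheme keeps the training length constant while the rolling scheme lets it grow.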

Limitations
During the data collection, a few limitations were observed which may impact the interpretation of the presented data. The climatic features were collected from a single location, i.e., Dubai International Airport, which may not fully represent the various geographical locations of the different communities in Dubai. In addition, the population records were obtained for each community on an annual basis; hence, the same value was utilized across all months of the year for each community.

Fig. 1. Heat map of the electricity consumption in all communities.

Fig. 2. Heat map of the population in all communities.

Fig. 3. Satellite image for the sectoral distribution of all communities.

Fig. 6. Box plots of electricity consumption pattern a) before removal of extreme values and b) after removal of extreme values.

Fig. 7. Electricity consumption pattern before and after the removal of extreme values.

Fig. 8. Annual electricity consumption pattern before and after the removal of extreme values.

Fig. 9. Box plots of the electricity consumption data for each month.

Fig. 10. Assessment of moving cross validation method on the top four models using varying dataset size.

Table 1
Description of the number of observations for each data file. Total number of observations: 55,886,785.

Table 2
Description of the attributes present in DEWA electricity consumption and Shams Dubai files.

Table 3
Annual number of accounts and electricity consumption for each rate category.

Table 4
Summary of the total number of each type of building in 2019.

Table 5
Summary for the annual sum and average of the population and electricity consumption per sector.

Table 6
Descriptive statistical summary of all utilized features in the aggregated cleaned dataset.

Table 7
Description of number of communities each month post aggregation.

Table 8
Evaluation of the remaining models using different test sizes of temporally ordered dataset.

Table 9
Evaluation of the remaining models using different test sizes of randomly split dataset.

Table 11
Comparison of best-performing models against base scenario (including climate) for predicting electricity consumption.