Big Data Processing for Commercial Buildings and Assessing Flexibility in the Context of Citizen Energy Communities

In this paper, we propose a cloud-based big data processing approach to evaluate the flexibility potential of commercial buildings by type and benefits for the owners. The pandemic times changed electricity consumption patterns with a substantial impact on energy markets. Many activities moved from large commercial offices and schools to residential buildings. With machine learning algorithms, the flexibility forecast can be improved to help energy suppliers, grid operators, and traders better calculate the flexibility potential of commercial buildings. With better forecasts, grid operators can identify and mitigate risks, prevent malfunctions, and schedule maintenance works in advance. Using flexibility forecast as input and results from previous studies regarding flexibility coefficient by state and demand response programs, we propose an original method to assess load flexibility of commercial buildings and calculate the benefits for their owner. The exemplification is done with an extensive hourly dataset from the U.S.A. of 14,976 comma-separated values files with a total of 131.18 million records showcasing the electricity and gas consumptions and their breakdown for one year.


I. INTRODUCTION
In July 2019, the European Union (EU) introduced Citizen Energy Communities (CECs), which include residential consumers, prosumers, and local entities such as distributed energy resources, storage facilities, and industrial and commercial buildings. Such communities generate large volumes of smart meter data that can be analyzed to extract useful insights into flexibility potential and to assess the benefits and Enabling Technology Costs (ETCs) of Demand Response (DR) programs. Load flexibility helps CECs handle the fluctuations and high volatility of load, wind speed, and solar radiation. Smart metering data has multiple applications such as billing, cluster identification, tariff setting, load forecasting, optimization, market simulation via blockchain, and flexibility assessment, especially when data is provided at the level of an appliance or group of appliances. While some studies discuss the applications of smart meter data [1] and load forecasting based on Big Data [2], many challenges [3] still need to be addressed, as a large volume of data does not directly provide useful insights or hints regarding future trends. The EU policy envisions an energy transition that allows prosumers and CECs to share, trade, aggregate, and sell the electricity surplus [4], [5] and even own and manage the grid. These activities are accompanied by large volumes of data that can offer useful feedback for energy suppliers, grid operators, traders, consumers, prosumers, and aggregators [6]. DR programs aim to extract and reward flexibilities and use them to manage the variable output of Renewable Energy Sources (RES) and the load of Electric Vehicle (EV) charging stations [7]. Flexibilities are defined in terms of type, size (quantity), duration, and control technology, considering the specific sector [8]. For instance, [9] defined enabling technologies for DR in the residential, commercial, and industrial sectors, as well as DR service types (shift, shed, and shimmy).
Thus, commercial building data can be grouped by type and Independent System Operator (ISO) affiliation, and correlated with several DR services, the flexibility coefficient by state, and ETCs to assess the flexibility potential and benefits [10]. Furthermore, the COVID-19 pandemic influenced the load pattern, shifting load from commercial buildings such as large offices and schools to residential consumption [11]-[14], so an improved data-driven architecture for building data exchange is needed [15]. Big data technologies and IoT are frequently used in consumer-oriented energy optimization and prediction, including the day-ahead forecast for buildings [16] or total energy consumption [17]. This paper stems from the research outlined in [18], which focused on simple data analytics performed with large volumes of data. It also takes into account our previous research on data models for flexibility and DR assessment [19]. Comparing with the method proposed in [19], we identify and emphasize the similarities and differences described in section II. Flexibilities and Direct Load Control (DLC) are also studied in [20]. The novelty of the current study consists in:
• Data centers (DCs) are usually the primary choice for storing large quantities of data. Some studies considered DCs as computing facilities of interest for exploiting their energy flexibility and have proposed an Energy Marketplace to allow DCs to act as active energy players integrated into the smart grid [21]. However, that study and other papers on energy flexibility [22], [23] lack comprehensive analyses of data lake solutions as powerful backends for running DR assessment. To address this gap, we compared two data lake architectures (Hadoop-based vs. AWS), and different approaches within these architectures, as far as costs and speed are concerned. The cost-wise and speed-wise comparisons are conducted to identify the strengths and weaknesses of the two solutions for storing and preparing the data (e.g., performing data reduction) before running the algorithms;
• a different approach to DR assessment, considering not only the shift DR service but also the combination of shed and shift. Furthermore, when estimating the DR, we updated the analyses by rethinking the implementation of DR programs. The analyses take into account the results of previous studies [9], [10]. They include the flexibility forecast performed with machine learning, namely LSTM (Long Short-Term Memory) recurrent neural networks, and with regression analysis, aiming to determine future consumption values in order to better evaluate the flexibility potential and estimate the efficiency of DR programs;
• a flexibility assessment method that relies on a step-by-step approach consisting in:
a) Dividing appliances into programmable and non-programmable to separate flexible and fixed consumption;
b) Calculating the total forecasted consumption of programmable appliances at hour h;
c) Calculating the daily mean consumption using the load profile;
d) Extracting peak hours for the analyzed interval: one month, year, etc.;
e) Identifying the start and stop peak hours;
f) Applying the flexibility coefficient to obtain the shiftable consumption;
g) Obtaining the total consumption to shift from peak to off-peak hours;
h) Calculating the gain or benefit that can be obtained by shifting and by shifting/shedding programmable appliances;
i) Comparing the gains and choosing the DR program.
The large-scale rollout of smart metering systems taking place in most European countries generates numerous datasets. Therefore, we propose to extract meaningful insights from this data and to raise consumers' awareness of the potential of DR programs to bring savings and encourage pro-environmental behavior, by assessing the flexibilities in terms of quantities and monetary benefits.
The input data comes from numerous smart meters as CSV files grouped by state and building type. Our objective is to store it in S3 (AWS) and HDFS, reduce it with Athena and Sagemaker or Hive, and further process it in Python to obtain the flexibility forecast. Then, the flexibility potential is assessed by implementing the method proposed in section II. The input data flow and processes proposed in this paper are graphically described in Figure 1. The paper is structured as follows. In section II, we start with the Big Data processing approach, describing and analyzing the dataset, continue by comparing the data lake architectures, propose a flexibility assessment method, and close with a subsection comparing two forecast methods. In section III, the simulations using the forecast methods on the dataset are conducted and the results of implementing the flexibility method are presented. In section IV, the conclusions are drawn.

A. Cloud vs. Hadoop Distributed File System (HDFS) storage
To store and process the data needed for analyses, we needed a solution that can reliably offer access to various services to multiple types of users, as shown in Figure 2. Different applications (including IoT devices) can usually interact directly with the data lake by writing semi-structured data. Application developers access the data lake directly or through query engines that offer different views on the data (e.g., by providing SQL functionality). Data scientists prefer Jupyter notebooks [24] to interact with the data through various libraries, using SQL-enabled query engines, data dictionaries, or reading the files directly (e.g., in Pandas DataFrames). Some data scientists prefer the flexibility and standardization of SQL to query their data. Consequently, some schema can be applied (on-read/on-write) to the data, which might lead to the loss of certain nuances of the raw data. Because of this loss of nuances, it is often advisable to work with the initial raw data.

FIGURE 2. The interactions between the actors and the applications
Standard databases usually employ a schema-on-write approach: the schema of the data is defined before the data is stored (i.e., CREATE TABLE before INSERT rows). This is more rigid but works well for dense, structured data and where constraints or indexes are necessary. Big Data oriented solutions enable a more flexible schema-on-read approach, where semi-structured or unstructured data is stored as it is and the schema is applied (i.e., the data is parsed) later, when the data is consumed. Sometimes, after the schema is applied and the data analyzed, the resulting output is loaded into a schema-enabled database for future use. A repository that combines all these types of data so that they can be stored, governed, and processed at scale is what we consider a data lake [25]. Such lakes are often supplied with data ingested from real-time flows. Data lakes are used in many domains, including smart grids [26], and can provide data scientists with a plethora of inputs to train better models. Of course, having access to lots of data does not by itself solve the problem of bad or inaccurate data [27].
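The schema-on-read idea can be illustrated with a minimal Python sketch. The column names, the sample values, and the helper read_with_schema are hypothetical and serve only as an illustration; they are not part of the pipeline described in this paper.

```python
import csv
import io

# Raw payload stored as-is: nothing is validated when the data is written.
# Column names and values are hypothetical.
RAW = """timestamp,fans_kw,cooling_kw
01/01 01:00:00,1.5,0.0
01/01 02:00:00,1.7,n/a
"""

# The schema is declared by the consumer of the data, not by the store.
SCHEMA = {"timestamp": str, "fans_kw": float, "cooling_kw": float}


def read_with_schema(raw_text, schema):
    """Parse raw CSV text, applying the schema only when the data is read."""
    rows = []
    for record in csv.DictReader(io.StringIO(raw_text)):
        parsed = {}
        for column, cast in schema.items():
            try:
                parsed[column] = cast(record[column])
            except (KeyError, TypeError, ValueError):
                # Bad or missing values only surface at read time.
                parsed[column] = None
        rows.append(parsed)
    return rows


rows = read_with_schema(RAW, SCHEMA)
```

In this sketch the malformed value "n/a" is accepted at storage time and only becomes visible (as None) when the schema is applied, which is the trade-off of schema-on-read discussed above.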
As previously discussed, our dataset is made of multiple CSV files. Useful data comes from the payload of each file, but also from the names of the file and of the folder. The file and folder names contain information such as country, state, city, location type, and date (the payload contains the day, month, and hour, and the filename contains the year). Columnar storage formats such as Parquet [28] or ORC (Optimized Row Columnar) can greatly decrease query time and costs [29], especially for aggregations (AVG, COUNT, SUM), as some cloud services such as Amazon Athena or Redshift Spectrum charge by the amount of data scanned. Therefore, if the CSVs are scanned multiple times, it makes sense to convert them into Parquet.
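Extracting metadata from file and folder names can be sketched in Python with a regular expression. The folder layout, the file naming pattern, and the example path below are assumptions made for illustration; the actual naming convention of the dataset is not reproduced here.

```python
import re

# Hypothetical layout, assumed for illustration only:
#   <country>/<state>/<building_type>_<city>_<year>.csv
PATH_PATTERN = re.compile(
    r"^(?P<country>[^/]+)/(?P<state>[^/]+)/"
    r"(?P<building_type>[A-Za-z]+)_(?P<city>[A-Za-z.\-]+)_(?P<year>\d{4})\.csv$"
)


def metadata_from_path(path):
    """Return the metadata encoded in a file path as a dictionary."""
    match = PATH_PATTERN.match(path)
    if match is None:
        raise ValueError(f"path does not follow the assumed layout: {path}")
    meta = match.groupdict()
    meta["year"] = int(meta["year"])  # the year is carried by the filename
    return meta


meta = metadata_from_path("USA/TX/Hospital_Houston_2004.csv")
```

The resulting dictionary can be attached to every record read from the file, so that the payload (day, month, hour) and the path-derived fields end up in the same row.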
The second stack from Table 1 is comprised of Amazon Web Services cloud solutions. Even though similar solutions exist from other cloud providers, AWS tends to be the standard for heavy-lifting machine learning engineering. For comparison, an Azure Studio-centered solution from Microsoft is more suited to business users who want to train a model in a drag-and-drop manner using little, if any, code. Cloud solutions are far from free but benefit from running on a mature platform that offers various services. Using the cloud for data science projects enables an easy transition from testing and prototyping to production, on top of data durability guarantees. It can also help accelerate the training phase (easy access to GPUs and to multiple machines for horizontal scaling) and handle almost unlimited concurrent connections.
S3 is the object storage service from AWS. While you cannot run an operating system or a database management system on S3 (you would need EBS, Elastic Block Store, for that), it is the standard solution for hosting a data lake. It offers multiple tiers, each with its own advantages, disadvantages, and pricing. Storing 1 TB will cost roughly $23/month on S3 classic, $12.5 on S3 Infrequently Accessed, $10 on S3 IA One Zone, $4 on Glacier, and $0.99 on Glacier Deep Archive (long retrieval time, up to 12 h). There are also costs related to data access (for PUT, COPY, POST, LIST, GET, SELECT: $0.005 per 1,000 requests), monitoring (for the Intelligent Tier), data transfer, transfer acceleration, and cross-region replication.
With Amazon Athena, we run SQL queries directly on the S3 data and use the AWS Glue Data Catalog to store the metadata (table definitions). It is interesting to note that Amazon Athena uses Apache Hive DDL to define tables. As with EXTERNAL TABLES in Hive, no data movement is required. For Athena, the LOCATION changes from an HDFS folder to an S3 folder in a bucket. Payment is based on the amount of data scanned per query.
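To make the tier comparison concrete, the storage prices quoted above can be turned into a small Python estimate. This is a rough sketch: it uses only the per-TB and per-request figures given in the text, applies the single request price to all requests, and ignores retrieval, transfer, and monitoring costs. The stored volume and the number of requests are hypothetical.

```python
# Monthly storage price per TB in USD, as quoted in the text.
PRICE_PER_TB = {
    "S3 classic": 23.0,
    "S3 Infrequently Accessed": 12.5,
    "S3 IA One Zone": 10.0,
    "Glacier": 4.0,
    "Glacier Deep Archive": 0.99,
}
# Price per 1,000 requests in USD, as quoted in the text.
REQUEST_PRICE_PER_1000 = 0.005


def monthly_cost(tier, stored_tb, requests=0):
    """Storage plus request cost for one month, rounded to cents."""
    storage = PRICE_PER_TB[tier] * stored_tb
    access = REQUEST_PRICE_PER_1000 * requests / 1000
    return round(storage + access, 2)


# Hypothetical workload: 2 TB stored, 100,000 requests per month.
costs = {tier: monthly_cost(tier, stored_tb=2.0, requests=100_000)
         for tier in PRICE_PER_TB}
```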
Redshift is the main AWS OLAP solution that easily integrates with S3 and the Glue Data Catalog, whereas Aurora is the AWS-built OLTP solution for the cloud. Both use SQL natively. In addition, Redshift Spectrum can spread queries across multiple stores such as Redshift and S3, similar to how Phoenix can query across HDFS, Hive, and HBase. AWS offers Sagemaker, a Jupyter environment that makes available the libraries and the infrastructure needed to make use of the data lake. For example, to run a 5-node Spark cluster from a Sagemaker Notebook (e.g., to assess data quality) we use:
The files can now be queried using standard SQL. To extract information from the file and folder names, we use split_part and regexp_extract on $PATH (similar to INPUT__FILE__NAME from HiveQL1), as part of a view based on a query. The information can be retrieved by querying the view, as shown in SQL1 from Table 3. As in HDFS, no pre-processing of the files is needed (e.g., concatenating files, adding columns, etc.), and when new files are added to the S3 folder, the view will see the new information.
Whereas Hive offers easy conversion between TEXTFILE tables and ORC, AWS offers, among others, easy conversion from Redshift, Aurora, or Athena to Parquet. This can be done using SQL statements, as shown in SQL2 from Table 3. SQL2 moves and compresses the data from one S3 folder to another, creating a subfolder for each distinct value of the partitioned_by clause (e.g., if there is data for 40 states, 40 subfolders will be created in /energy_parquet). As shown in SQL3, we can do a full query of energy_parquet and get the same result as in SQL1. This improves query time and lowers costs by reducing the data scanned. One problem here is that, as CSV files are added or updated, the Parquet table has to be updated as well (e.g., by using INSERT+SELECT), as it does not automatically see the new data.
The run times from the tables are given for comparison purposes (speed changes when adding more data or switching from one format to another). They can easily vary by up to 10% depending on many factors, including the general AWS load. We notice that the SQL implementation of Athena is similar to the one from Hive (there are some differences in the functions and in the accepted regular expressions). By studying Table 3, it is interesting to notice that the conversion from CSV to Parquet was done in less than 15 seconds and that the size of the dataset was reduced by a factor of 3.5 (due to compression). If new CSV files are added to S3, the Parquet table can be easily updated by using a Lambda function [30] and a CloudWatch Events rule.
The results are similar to those obtained when converting from CSV to ORC (Table 2) with respect to size reduction and improved query time. Even though the full scan from SQL3 took almost the same amount of time as SQL1, the amount of data scanned was much smaller, so the costs will be more than three times lower (on Athena you pay by the amount of data scanned per query). The results suggest that it makes sense to convert to Parquet if you have to run more than one full table query on the CSV data. To further our research, we also ran aggregation queries on the view constructed directly on the CSV data and on the Parquet table (SQL4-SQL7). We can observe that with the CSV-based view, all the data gets scanned for the GROUP BY, while with the Parquet table much less data is scanned, hence the lower costs. Using a LIMIT clause to get only a subset of the rows has little impact on query time or on the amount of data scanned in either case. If we add to the SQL4-SQL7 queries a sum function on a different column (e.g., sum(cooling)), the amount of scanned data remains the same for the first two queries and increases by 55% for the latter two.
In this section, we have shown that the two architectures described in Table 1 are similar in many ways. The conversion from CSV to a columnar format brings important speed benefits in both, but comparable queries run considerably faster on AWS (see HiveQL6 vs. SQL6 and HiveQL7 vs. SQL7). We chose the AWS solution because of the speed benefits and better integration with Sagemaker for building the models discussed in the next section. On the other hand, the AWS solution incurs monthly costs that are reasonable for our data preparation needs but will grow as the volumes increase (e.g., running 30 queries/day in Athena, each query scanning 500 GB, the monthly cost would be $2299, not including the tier-dependent S3 costs).
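The order of magnitude of such a pay-per-scan bill can be checked with a short Python sketch. The price of $5 per TB scanned, the 30-day month, and the decimal GB-to-TB conversion are assumptions that are not stated in this paper, so the result will not necessarily reproduce the figure quoted above. The second call assumes that converting to Parquet shrinks the scanned volume by the same factor of 3.5 as the dataset size.

```python
PRICE_PER_TB_SCANNED = 5.0  # USD per TB scanned; an assumption, not from the text
GB_PER_TB = 1000            # decimal convention; also an assumption


def monthly_scan_cost(queries_per_day, gb_scanned_per_query, days=30,
                      price_per_tb=PRICE_PER_TB_SCANNED):
    """Monthly cost of a query service that bills by the amount of data scanned."""
    tb_scanned = queries_per_day * days * gb_scanned_per_query / GB_PER_TB
    return tb_scanned * price_per_tb


# Workload from the text: 30 queries/day, each scanning 500 GB of CSV data.
csv_cost = monthly_scan_cost(30, 500)

# Same workload if each query scanned 3.5 times less data (Parquet).
parquet_cost = monthly_scan_cost(30, 500 / 3.5)
```

Under these assumptions the CSV workload costs a little over two thousand dollars per month, and the columnar format cuts the bill by the same factor as the scanned volume.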

B. Load flexibility forecast and assessment method
Forecasting flexible energy consumption can play a very important role at a global level for the creators of DR program events, such as energy suppliers, grid and system operators, and aggregators [31], [32]. The assessment of the flexibility potential can be improved by performing the forecast for programmable appliances, which allows suppliers and grid operators to make the right decisions at the right time. Thus, the prediction of future consumption allows a better assessment of the flexibility and efficiency of DR programs.
There are many methods to predict consumption [33]-[35], and most of them depend on historical data. Historical data series can be used to build prediction patterns for forecasting future consumption. Long Short-Term Memory (LSTM) is a type of recurrent neural network used in deep learning that can be used to forecast consumption. Another method by which future flexible consumption can be forecasted is regression analysis [36], a set of statistical processes for estimating the relationships between a dependent variable and one or more independent variables. Both methods are suitable for time series forecasting. The linear regression algorithm attempts to minimize the sum of the squares of the differences between the observed and the predicted values, whereas recurrent neural networks hold a hidden layer that acts as a memory, taking previous time steps into account when calculating the next value in the sequence. Unlike linear regression, which deals mainly with linear dependencies, recurrent neural networks can also cope with nonlinearities. Regarding the dataset, a recurrent neural network, being a deep learning method, requires a large set of data to train the model, whereas a regression model needs only somewhat more observations than it has parameters. In Table 4, the most important steps for building the forecast models are briefly described.
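The least-squares criterion and the idea of predicting the next value from previous time steps can be combined in a minimal Python sketch. It frames a series as (previous value, current value) pairs and fits a one-variable linear regression in closed form. The series and the function names are hypothetical, and this toy model stands in for neither the regressors nor the LSTM network used in this paper.

```python
def to_supervised(series, lag=1):
    """Pair each observation x(t-lag) with the target x(t)."""
    return [(series[i - lag], series[i]) for i in range(lag, len(series))]


def fit_least_squares(pairs):
    """Closed-form fit of y = intercept + slope * x minimizing squared error."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical hourly consumption values (kW).
series = [2.0, 4.0, 6.0, 8.0, 10.0]
pairs = to_supervised(series)
intercept, slope = fit_least_squares(pairs)

# One-step-ahead forecast from the last observed value.
next_value = intercept + slope * series[-1]
```

With two parameters and four training pairs, the sketch also illustrates the remark above that a regression model needs only somewhat more observations than parameters.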
A flexibility assessment method is envisioned, starting from the results of previous studies that identify the flexibility coefficient by state [37] and that propose and fully describe the DR enabling costs and DR programs such as SHIFT, SHED, and SHIMMY to control appliances and flatten the load curve [9]. The most important aspect of this method is to update the load profile at least monthly, because the load can undergo seasonal changes and be affected by unexpected phenomena such as the pandemic, which led to significant changes in consumption patterns from offices to residences.
The main steps of the proposed method are presented in Table 5. First, the appliances are classified into programmable (PA) and non-programmable (NPA) to separate the flexible consumption from the fixed one. Then, the forecasted consumption of the PA is summed up for each hour. Using a valid load profile, the average consumption is calculated to identify the start and stop peak hours. By comparing the consumption of each hour with the average, one or more peak intervals are extracted. Then, the flexibility coefficient (gamma flex) is applied to the total hourly consumption of the forecasted PA, giving the consumption that is available for DR programs, i.e., the reliable flexibility potential, which is summed up to get the total. By multiplying the total by the difference between the peak and off-peak rates, we obtain the gain from shifting PA. The same approach is carried out for the SHED DR program, but usually not all PAs are appropriate for shedding, thus a combination of SHED and SHIFT is proposed instead. In the case of shedding, the gain is obtained by multiplying the total by the peak rate, as that consumption is canceled rather than shifted. The proposed method is implemented with the consumption data of commercial buildings from the U.S. in the next section, Simulation and Results.

Step: Data preparation
LSTM model: Transforming the raw data: ✓ Data Cleaning ✓ Feature Selection ✓ Feature Engineering ✓ Dimensionality Reduction
Regression: The same steps are performed.

Step: Setting up models
LSTM model: The dataset is divided into two sets: one for training and one for testing. The models are developed using the training dataset and make predictions on the test dataset.
Regression: To demonstrate the predict_model() function on unseen data, a sample of records is withheld from the original dataset for predictions.

Step: Data transformation
LSTM model:
✓ Time Series to Supervised Learning: to transform the time series into a supervised learning problem, a function can be defined that uses the observation from the last time step (t-1) as input and the observation at the current time step (t) as output.
✓ Time Series to Scale: the time series is rescaled using the MinMaxScaler.
✓ Process Reversal: to bring the forecasts on the differenced series back to their original scale.
Regression: Data will be used without being transformed.

Step: Model development
LSTM model: The following must be specified for the model: ✓ a loss function and an optimization algorithm; ✓ the number of neurons in the first visible layer and in the output layer; ✓ the training period and the batch size.
Regression: The following functions are used:
✓ setup() initializes the environment and creates the transformation pipeline to prepare the data for modeling and deployment.
✓ compare_models() trains all models in the model library and scores them using k-fold cross-validation (more than 20 models are trained and evaluated using cross-validation).
✓ predict_model() is used to make predictions with the model on an unseen dataset.

1) Divide appliances into programmable (PA) and non-programmable (NPA) to separate flexible and fixed consumption.
2) Calculate the total forecasted consumption of programmable appliances at hour h.
3) Calculate daily mean consumption using the load profile (real values from the monthly updated profile), n=24.
4) Extract peak hours for the analyzed interval: one month, year, etc. Identify the start peak hour and stop peak hour: an hour belongs to a peak interval if its consumption is higher than the daily mean.
5) Apply flexibility coefficient to obtain the shiftable consumption.
6) Obtain total consumption to shift from peak to off-peak.
7) Calculate the gain or benefit that can be obtained by shifting PA using the difference between peak and off-peak rates.
8) Calculate the gain or benefit that can be obtained by shedding PA.
8) Calculate the total gain or benefit that can be obtained by shifting and shedding PA. First, calculate the share of shedding appliances q and of shifting appliances 1-q of PA.
9) Compare the gains and choose the DR program (i.e., ALLSHIFT, SHIFT & SHED, etc.).
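The steps of the method can be sketched in Python as follows. This is one possible reading of the description given in this section, not the original formulas: the function name, the argument names, the sample profiles, the rates, the flexibility coefficient, and the shedding share q are all hypothetical, and the combined gain is assumed to be q times the shedding gain plus (1-q) times the shifting gain.

```python
def assess_flexibility(pa_forecast, load_profile, gamma_flex,
                       peak_rate, offpeak_rate, shed_share):
    """Sketch of the assessment steps for one day of hourly values."""
    n = len(load_profile)
    # Daily mean consumption from the load profile.
    mean_load = sum(load_profile) / n
    # Peak hours: hours whose consumption exceeds the daily mean.
    peak_hours = [h for h in range(n) if load_profile[h] > mean_load]
    # Shiftable consumption: flexibility coefficient applied to forecasted PA.
    shiftable = [gamma_flex * pa_forecast[h] for h in peak_hours]
    # Total consumption to move from peak to off-peak hours.
    total = sum(shiftable)
    # Gain from shifting: the energy is still consumed, at the off-peak rate.
    gain_shift = total * (peak_rate - offpeak_rate)
    # Gain from shedding: the energy is not consumed at all.
    gain_shed = total * peak_rate
    # Assumed combination: share q is shed, share 1-q is shifted.
    gain_mixed = shed_share * gain_shed + (1 - shed_share) * gain_shift
    gains = {"ALLSHIFT": gain_shift, "SHIFT & SHED": gain_mixed}
    # Choose the DR program with the highest gain.
    program = max(gains, key=gains.get)
    return {"peak_hours": peak_hours, "total_shiftable": total,
            "gains": gains, "program": program}


# Hypothetical 24-hour profiles (kWh per hour) and rates (currency per kWh).
load_profile = [1.0] * 8 + [3.0] * 10 + [1.0] * 6
pa_forecast = [0.5] * 8 + [2.0] * 10 + [0.5] * 6
result = assess_flexibility(pa_forecast, load_profile, gamma_flex=0.1,
                            peak_rate=0.20, offpeak_rate=0.10, shed_share=0.25)
```

In this example the peak interval runs from hour 8 to hour 17. Because shedding is valued at the full peak rate, any positive shedding share makes the combined program more profitable than shifting alone, which is why the choice of q should reflect which appliances can realistically be shed.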
Comparing with the method proposed in [19], we identify the similarities and differences described in Table 6:

TABLE 6. SIMILARITIES AND DIFFERENCES BETWEEN THE CURRENT METHOD AND THE METHOD PROPOSED IN [19]

No. 1
Differences: In the current method, we consider the flexibility forecast as input, not the historical data as in [19], which provides more accurate results.
Similarities: Both compute flexibility potential and savings.

No. 2
Differences: Peak and off-peak hours are selected differently. In the current method, the selection considers the average consumption and has to be updated periodically, whereas in [19] the peak and off-peak hours are selected following the load curve without a precise rule. The current approach provides more accurate and realistic results, since the load curve changes seasonally or for other reasons.
Similarities: Both take into account the same appliances as programmable (PA) and non-programmable (NPA).

No. 3
Differences: The current method calculates the energy for SHIFT and SHED considering the share of the appliances in each category, whereas in [19] the consumption of appliances by SHIFT and SHED is calculated separately. This does not impact the output significantly.
Similarities: Both consider two DR programs: 1) all shift and 2) the combination of SHIFT and SHED.

Thus, the current method is structured in 9 steps, providing more precision, and the results are more accurate as it starts from the proposed flexibility forecast using one of the regressors or LSTM.

A. Analysis of the datasets and extracting insights from big data
As discussed in the previous section, the files have been stored in S3, the object storage service from Amazon Web Services (AWS), and the data preparation, model training, and deployment have been done mainly in Amazon Sagemaker (a Jupyter environment). We used the AWS Data Wrangler library to connect to the files stored in S3 and read them into Pandas DataFrames. To analyze the data and extract valuable insights, we used Athena to query the data and to build a table-to-S3 dictionary in AWS Glue. For the current study, the data for the preliminary analysis comes in .csv format. The dataset contains 14,976 .csv files with a total of 131.18 million records. The files contain both electricity and gas consumption and their corresponding breakdown, but as our study is related to flexibility potential in terms of electricity consumption, we focus on electricity. The programmable load consists of fans, cooling, and heating appliances, whereas the non-programmable load consists of interior lighting and equipment. Comparing programmable (flexible) and non-programmable loads in Table 7, we notice that at some intervals the flexibility is zero (Minimum = 0) and that the total consumption of programmable appliances is smaller than that of interior lights and equipment, meaning that there would be more room for flexibility if lighting and interior equipment became more flexible in the future. This aspect could be considered for buildings depending on their type. Furthermore, in Table 8, we analyze the statistical indicators for peak and off-peak hours, since their particularities are significant for DR programs. Peak hours could be further split into mild-peak and critical-peak hours depending on the season, schedules, and working shifts, in order to analyze load flexibility potential. Thus, we calculate the statistical indicators by splitting the appliances into programmable and non-programmable ones operating at peak and off-peak hours. Therefore, we aim to apply DR programs to programmable appliances that operate at peak, whereas non-programmable appliances remain unchanged.
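The split into programmable and non-programmable load and the descriptive statistics discussed above can be sketched in Python. The column names and the three hourly records are hypothetical; the grouping of appliances follows the description in this subsection.

```python
from statistics import mean

# Grouping of appliances as described in the text; column names are hypothetical.
PROGRAMMABLE = ("fans", "cooling", "heating")
NON_PROGRAMMABLE = ("interior_lights", "interior_equipment")

# Hypothetical hourly records (kW).
records = [
    {"fans": 0.0, "cooling": 0.0, "heating": 0.0,
     "interior_lights": 2.0, "interior_equipment": 3.0},
    {"fans": 1.0, "cooling": 2.0, "heating": 0.5,
     "interior_lights": 4.0, "interior_equipment": 3.5},
    {"fans": 1.5, "cooling": 3.0, "heating": 0.0,
     "interior_lights": 4.0, "interior_equipment": 3.5},
]


def group_load(record, columns):
    """Total load of one appliance group in a single hourly record."""
    return sum(record[column] for column in columns)


def summarize(values):
    """Minimum, mean, and maximum of a list of hourly loads."""
    return {"min": min(values), "mean": mean(values), "max": max(values)}


pa_stats = summarize([group_load(r, PROGRAMMABLE) for r in records])
npa_stats = summarize([group_load(r, NON_PROGRAMMABLE) for r in records])
```

A minimum of zero for the programmable group, as in the first record, corresponds to the intervals with no available flexibility mentioned above.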
Using the reduced data approach proposed in section II.A, we can extract valuable insights regarding commercial building operation and flexibility potential. Considering the state and ISO affiliation (the nine regional Independent System Operators that control the load in the U.S. are depicted in Figure 3(a) as CAISO, ERCOT, ISO-NE, MISO, NORTHWEST, NYISO, SOUTHEAST, SOUTHWEST, and SPP) and DR [37], we calculate the average flexibility potential, in percent, for our dataset of commercial buildings at the ISO level (as in Figure 3(b)). The ISO affiliation is added to the initial dataset by grouping the states into their corresponding ISOs, since it is important for control areas to know the flexibility potential and envision strategies to use it. There is a significant difference between ISO-NE and NORTHWEST, with around 2%, and MISO, with over 13%. Thus, the flexibility of buildings is not uniformly distributed among ISOs. In Figure 4, the breakdown of electricity consumption is shown for each ISO. The appliances that can provide flexibility are heating, cooling, and fans, whereas interior equipment and lights are considered less flexible. The electricity consumption by appliance type is provided in Figure 5. Cooling systems and fans are the most numerous programmable appliances. They represent 76% of the total programmable appliances. The electricity load curves for the 16 types of buildings are provided in Figure 6. These curves are interesting as they reveal the buildings with the highest consumption at peak hours and in general. For instance, hospitals, large offices, secondary schools, and large hotels are the buildings with the highest consumption. However, some complementarity in terms of consumption hours could be identified among the buildings with the highest consumption. Large hotels have two peaks, in the late evening and in the morning, whereas the others show more activity between 6:00 and 18:00.
However, gas load curves are much different (as in Figure 7), and they can be considered in additional analyses to reflect the possible transfer between electricity and gas to relieve grid stress. Load profiles are drawn for the total electricity and gas consumption of all buildings, as in Figure 8, and broken down by appliance type, as in Figures 9 and 10. Most of the heating is done by gas, probably because the gas price was more convenient. Therefore, it is reasonable that heating by electricity is small. Furthermore, the load profile for electrical appliances is very relevant for strategy makers, as some rules have to be taken into account. For instance, cooling, even though it is the highest programmable load, cannot be shifted entirely to the night hours. Thus, a mix of strategies in terms of DR programs should be considered.

FIGURE 10. Electricity load profile breakdown by appliance type
Moreover, Figure 11 is very relevant, showing the potential of each building type. Hospitals appear to have the highest flexibility potential, but they are sensitive in terms of patients' comfort and safety, so the DR program should be adjusted according to the building type. Thus, large offices and secondary schools could be targeted more in terms of DR program penetration, as they allow students and employees to work in shifts or even from home, which transfers part of the consumption from commercial to residential buildings, especially during pandemic times.
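The ISO-level aggregation used earlier in this subsection can be sketched in Python as a group-by over a state-to-ISO mapping. The mapping covers only a few states for illustration, the flexibility percentages are hypothetical placeholders rather than values from [37] or from the dataset, and an unweighted mean per ISO is assumed.

```python
from collections import defaultdict

# Partial, illustrative state-to-ISO mapping.
STATE_TO_ISO = {"TX": "ERCOT", "CA": "CAISO", "NY": "NYISO",
                "MA": "ISO-NE", "CT": "ISO-NE"}

# Hypothetical flexibility potential by state, in percent.
FLEX_BY_STATE = {"TX": 10.0, "CA": 8.0, "NY": 5.0, "MA": 2.0, "CT": 3.0}


def average_by_iso(flex_by_state, state_to_iso):
    """Unweighted mean flexibility per ISO (an assumption of this sketch)."""
    grouped = defaultdict(list)
    for state, flex in flex_by_state.items():
        grouped[state_to_iso[state]].append(flex)
    return {iso: sum(values) / len(values) for iso, values in grouped.items()}


flex_by_iso = average_by_iso(FLEX_BY_STATE, STATE_TO_ISO)
```

A consumption-weighted mean would be a natural alternative when the states within one ISO differ strongly in load.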

B. Analysis of the datasets and extracting insights from big data
Using the reduced dataset according to section II.A, we intend to perform the flexibility forecast that will be the input data for the proposed flexibility assessment method, in order to evaluate the flexibility potential and the efficiency of DR programs. From the analysis of the data, we discovered that states such as Texas and California have the highest consumption values on the flexible component (fans, cooling, and heating), and states such as Delaware and Alaska have the lowest ones (as in Figure 12). For forecast exemplification and analysis, we used the consumption data recorded in Texas, as it is one of the most representative states, aggregated at the hourly level, as follows:
• combined with weather data from NOAA (National Centers for Environmental Information, https://www.ncdc.noaa.gov/crn/qcdatasets.html) for regression analysis using PyCaret;
• without enhancing the data (no added weather information) for the LSTM model.
The weather values taken from NOAA are hourly values and contain information such as air temperature, precipitation, global solar radiation, relative humidity, surface infrared temperature, soil moisture, and soil temperature. They are the predictor variables used as input data in the regression model.
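Combining hourly consumption with weather observations amounts to an inner join on the timestamp: only the hours present in both sources are kept. A minimal Python sketch is given below. The (month, day, hour) keys, the column names, and the values are hypothetical and do not come from the dataset or from NOAA.

```python
# Hypothetical hourly consumption keyed by (month, day, hour), in kW.
consumption = {
    (7, 15, 1): 11.2,
    (7, 15, 2): 10.9,
    (7, 15, 3): 10.7,
}

# Hypothetical weather observations, available only for some hours.
weather = {
    (7, 15, 2): {"air_temperature": 25.0, "relative_humidity": 80.0},
    (7, 15, 6): {"air_temperature": 24.0, "relative_humidity": 85.0},
}


def join_on_timestamp(consumption, weather):
    """Inner join: keep only timestamps present in both mappings."""
    joined = []
    for timestamp in sorted(consumption.keys() & weather.keys()):
        row = {"timestamp": timestamp, "fans_kw": consumption[timestamp]}
        row.update(weather[timestamp])
        joined.append(row)
    return joined


joined = join_on_timestamp(consumption, weather)
```

Because the join drops every hour without a weather observation, the combined dataset is smaller than the consumption series alone.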
As a proof of concept, we aim to predict the future consumption of one of the flexible components, namely "Fans: Electricity [kW](Hourly)". The same models can be applied to any other component in the dataset.
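The state-level comparison of the flexible component can be illustrated with a minimal sketch. It is not the aggregation query used in the study: the record layout, the "state" key, and the cooling and heating column names are assumptions modelled on the fans column quoted above, and the values are toy numbers.

```python
from collections import defaultdict

# Column names: only the fans column is quoted in the text;
# the cooling and heating names are assumed by analogy.
FLEXIBLE_COLUMNS = [
    "Fans: Electricity [kW](Hourly)",
    "Cooling: Electricity [kW](Hourly)",  # assumed name
    "Heating: Electricity [kW](Hourly)",  # assumed name
]


def flexible_consumption_by_state(records):
    """Sum the flexible components (fans, cooling, heating) per state.

    records: iterable of dicts with a hypothetical "state" key and one
    key per flexible column; missing columns count as zero.
    """
    totals = defaultdict(float)
    for row in records:
        totals[row["state"]] += sum(
            float(row.get(col, 0.0)) for col in FLEXIBLE_COLUMNS
        )
    return dict(totals)


def rank_states(totals):
    """Return (state, total) pairs sorted from highest to lowest total."""
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Toy records (hypothetical values, not taken from the dataset)
fans, cooling, heating = FLEXIBLE_COLUMNS
records = [
    {"state": "TX", fans: 5.0, cooling: 20.0, heating: 1.0},
    {"state": "TX", fans: 4.0, cooling: 18.0, heating: 0.5},
    {"state": "DE", fans: 1.0, cooling: 3.0, heating: 0.5},
]
ranking = rank_states(flexible_consumption_by_state(records))
```

Selecting a different forecast target would only require reading another column from the same records.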
For regression analysis (using different regressors), we follow several steps: 1) Data Preparation: The first step was to prepare the data. From the initial dataset of 131 million records, the 8.5 million records related to consumption in the state of Texas were extracted using aggregation queries as in section II.A. The new dataset was then aggregated at the hourly level to be combined with the weather dataset. Because weather data is provided only for specific time slots (hours 2, 6, 8, 11, 13, 14, 15, 17, 19, 21, and 23), combining the consumption records with the weather time slots resulted in a final set of 7,270 records.
2) Setting up the model: From the resulting dataset, a 10% sample was withheld as unseen data to be used for prediction.

3) Model implementation and comparison:
After the environment was initialized, all the regression models available in the library were run and compared. Out of them, we chose to further analyze and plot the results of Random Forest Regressor, Linear Regression, and AdaBoost Regressor to determine the accuracy of different regressors. The forecast results with 14 regressors are presented in Table 9. A model was created for each of the selected regressors and used to predict consumption for 24 hours. From the dataset, we selected July 15, a summer day on which flexible consumption usually increases significantly, and applied the above-mentioned models to that day. The results are shown in Figures 13-15. For Random Forest Regressor and Linear Regression, the forecasted consumption represents 96% of the actual consumption (in total, the prediction is 4% lower than the actual consumption). In contrast, with AdaBoost Regressor, the prediction is 9% higher than the actual consumption.
For the LSTM model, we use the data with the initial variables, without weather data, representing the consumption in Texas aggregated at date level. The number of observations in the dataset is 8,395. Unlike the regression simulation, where we considered variables related to the weather conditions, for this model we intend to determine the future consumption based on the previous consumption recorded for the other components (flexible or not). The data preparation also differs between the two approaches: for the LSTM model, the dataset is divided into a training set and a test set, the time series is transformed into a supervised learning problem, and the values are scaled. To build the model, the following parameters are used: training data: 6,000 observations; loss function: "mean_squared_error"; optimization algorithm: Adam; neurons in the first visible layer: 100; epochs: 100; batch size: 70. As in the case of regression, we intended to forecast the consumption for 24 hours.
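The LSTM data preparation (chronological split, scaling, and conversion from time series to supervised learning) can be sketched as follows. This is an illustrative sketch in plain Python, not the code used in the study: the number of lags, the min-max scaler, and the choice to fit the scaler on the training portion only are assumptions, and the load values are toy numbers.

```python
from typing import List, Tuple


def fit_min_max(train_values: List[float]) -> Tuple[float, float]:
    """Return the minimum and maximum of the training portion only."""
    return min(train_values), max(train_values)


def scale(values: List[float], lo: float, hi: float) -> List[float]:
    """Min-max scale values with bounds learned from the training portion."""
    span = (hi - lo) or 1.0  # guard against a constant series
    return [(v - lo) / span for v in values]


def to_supervised(series: List[float], n_lags: int):
    """Turn a series into (previous n_lags values, next value) pairs."""
    inputs, targets = [], []
    for t in range(n_lags, len(series)):
        inputs.append(series[t - n_lags:t])
        targets.append(series[t])
    return inputs, targets


def chronological_split(inputs, targets, n_train: int):
    """Keep time order: first n_train samples train, the rest test."""
    return (inputs[:n_train], targets[:n_train]), (inputs[n_train:], targets[n_train:])


# Toy hourly load values (hypothetical, not from the dataset)
load_kw = [10.0, 12.0, 15.0, 20.0, 18.0, 14.0, 11.0, 10.0]
N_LAGS = 3           # assumed look-back window
N_TRAIN_SAMPLES = 3  # the study uses 6,000 training observations

# The training samples cover the first N_LAGS + N_TRAIN_SAMPLES points.
lo, hi = fit_min_max(load_kw[:N_LAGS + N_TRAIN_SAMPLES])
scaled = scale(load_kw, lo, hi)
X, y = to_supervised(scaled, N_LAGS)
(train_X, train_y), (test_X, test_y) = chronological_split(X, y, N_TRAIN_SAMPLES)
```

Fitting the scaler on the training portion avoids leaking information from the test period into training; test values may then fall slightly outside the [0, 1] range.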
The results are presented in Figures 16 and 17. It can be observed from Figure 17 that the training and test performances are close, from which we infer that the model converges quite quickly. Based on this observation, LSTM appears to be a better choice for our problem.
Compared with the results in Table 10, LSTM performs better, with an RMSE of 1,234.40. While performing the analysis, after several iterations, we found that the larger the training set, the better the result obtained: for LSTM models, increasingly larger datasets ensure better training. Thus, we can infer that this model is more suitable for time series problems that do not require additional calculation effort, considering that the initial dataset (without weather variables) is used. The computing power available must nevertheless be considered when using large datasets.
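The two comparison measures used above, the RMSE and the share of actual consumption covered by the forecast over the 24-hour horizon, can be written as a short sketch. The function names and the toy values are illustrative assumptions and do not reproduce the results in Tables 9 and 10.

```python
import math
from typing import Sequence


def rmse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Root mean squared error between two equally long sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("sequences must be non-empty and of equal length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def forecast_share(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Total forecasted consumption as a fraction of total actual consumption.

    A value of 0.96 means the forecast is 4% lower than the actual total.
    """
    return sum(predicted) / sum(actual)


# Toy values (hypothetical, not taken from the study)
actual_kw = [100.0, 120.0, 150.0, 130.0]
forecast_kw = [96.0, 115.0, 145.0, 124.0]

error = rmse(actual_kw, forecast_kw)
share = forecast_share(actual_kw, forecast_kw)
```

Because RMSE is expressed in the units of the target, the scores of two models are comparable only when both are evaluated on the same target and at the same aggregation level.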

C. DR programs and flexibility assessment implementation
The much greater variety of energy sources in use today, compared with the more predictable large power plants and global load, requires more back-up on both the generation and the load side. The generation reserve consists of fast generating units (gas and hydro) that can rapidly change their output or even consume at night (pumped-hydro power plants). Demand Side Management (DSM) includes the DR concept, which targets load flexibility to balance generation and consumption. This concept is progressing considerably due to the advancement of Information and Communications Technology (ICT) and the increasing awareness and motivation on the demand side. As mentioned, the flexible loads in our dataset are heating, cooling, and fans, which could be engaged in at least one of the DR services (shift, shed, or shimmy) with a specific enabling technology. Shift DR services consist in rearranging the loads: the operation of flexible loads (programmable appliances) is shifted from high-rate to low-rate hours (using a Time-of-Use (ToU) tariff), with the benefit given by the price difference (the peak rate could be 0.38 Euro cents/kWh and the off-peak rate 0.09 Euro cents/kWh). Shed DR services rely on the capacity of some appliances to temporarily reduce the load at peak intervals. For instance, the consumption of cooling, fans, or heating systems is reduced for short intervals without disturbing the consumers' comfort in the residential sector or the commercial activities. Even if the reduction lasts only for short intervals, such as 5, 10, or up to 15 minutes, the system benefits from multiple loads being reduced simultaneously, which lowers the consumption curve. The perception of DR services is very important: if consumers perceive the DR program as negative, altering their commercial activities, they will not participate in DR.
Shimmy DR services are more complex, remotely controllable, and usually imply enabling technologies that allow the appliance to follow a precise dispatch signal, increasing and decreasing the load more often. The DR services also vary in terms of implementation time horizon: shift is applied on the long and medium term, shed on the medium and short term, and shimmy on the short and ultra-short term. However, the implementation of DR services depends on the building type, the physical infrastructure, and the existence of the enabling technologies, which involve considerable costs. According to [37], [38], for controlling Heating, Ventilation, and Air Conditioning (HVAC) with shed & shift, the ETC is around $242, with shed only it is $169, whereas with shimmy it is considerable: $2376. The cost includes control technology, communication, and hardware. There are different costs for lighting, refrigerated warehouses, and water heaters, but these loads are not included in our dataset. Thus, when estimating the benefits from DR services, we have to subtract the ETCs. As the estimated DR capability is usually below 10%, we assumed that such percentages of the consumption of programmable appliances are flexible, which results in 4 scenarios for each DR service for which the ETCs are available. However, it is not possible to simulate the shimmy DR service, as it implies fine up/down tuning depending on the real-time balancing requirements of the system, which can be paid at a fixed price, as in Florida Power & Light, or at a variable price, as in auctions. Therefore, we analyze the shed & shift DR service, shedding the cooling systems at peak and partially shifting the heating and fans systems, which total 24% of the programmable appliances, as in Figure 18. For the simulation, we start by identifying the programmable appliances (the flexibility potential of the commercial buildings), the peak (6-21) and off-peak hours, and the possible DR programs, or their efficient combination, that can be implemented for our dataset.
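The first simulation step, separating peak from off-peak hours and applying a flexibility coefficient to the programmable consumption, can be sketched as below. Several details are assumptions made for illustration: the peak window is read as the half-open interval from hour 6 to hour 21, the coefficient values are hypothetical picks from the 1-10% range, and the hourly values are toy numbers.

```python
from typing import Dict, Iterable, List, Tuple

# Assumption: "peak (6-21)" is read as hours 6 <= h < 21.
PEAK_START, PEAK_END = 6, 21


def is_peak(hour: int) -> bool:
    """True if the hour of day (0-23) falls in the assumed peak window."""
    return PEAK_START <= hour < PEAK_END


def peak_energy(hourly_kwh: Iterable[Tuple[int, float]]) -> float:
    """Sum programmable-appliance energy consumed during peak hours."""
    return sum(kwh for hour, kwh in hourly_kwh if is_peak(hour))


def flexible_energy_by_scenario(
    hourly_kwh: List[Tuple[int, float]], coefficients: List[float]
) -> Dict[float, float]:
    """Flexible peak energy for each flexibility coefficient (fraction)."""
    total_peak = peak_energy(hourly_kwh)
    return {c: c * total_peak for c in coefficients}


# Toy day: 2 kWh of programmable load in every hour (hypothetical)
day = [(hour, 2.0) for hour in range(24)]

# Hypothetical coefficients within the 1-10% range mentioned in the text
scenarios = flexible_energy_by_scenario(day, [0.01, 0.03, 0.05, 0.10])
```

Each scenario value is the energy that a DR program could shed or shift out of the peak window, before any enabling technology cost is taken into account.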
Reasonable DR capability percentages, or flexibility coefficients (between 1 and 10%), of the programmable appliances are considered according to previous studies. The results of the DR programs simulation are provided in Table 11 and Figure 19, using the flexibility forecast results described in the previous section. Benefits are calculated considering the price difference: we consider that the peak rate is $0.38 and the off-peak rate is $0.09. When shedding the cooling systems, the entire shed energy is saved.

FIGURE 19. Results of the DR programs simulation per year
Comparing the two main DR programs that rely on 5% of the programmable appliances, the SHED & SHIFT combination is better than ALLSHIFT by 19%. A further advantage of the SHED & SHIFT combination is given by the different time horizons of implementation. For ALLSHIFT, the benefits are proportional to the shifted energy: the more energy is shifted, the higher the benefits, as they depend on the difference between the peak and off-peak rates. Shedding, however, also implies load reduction without rescheduling the appliances, which represents not only money savings but also energy savings, with social implications in terms of CO2, deforestation reduction, standard coal avoidance, etc.
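The benefit logic described above can be summarized in a minimal sketch: shifted energy earns the difference between the peak and off-peak rates, shed energy saves the full peak rate, and the ETC is subtracted from the result. The rates and ETC values are those quoted in the text; treating the rates as per-kWh prices, treating the ETC as a single cost per site, and the 76%/24% split between shed cooling and shifted heating and fans are assumptions, and the flexible energy is a toy value, so the figures do not reproduce Table 11.

```python
PEAK_RATE = 0.38      # $ per kWh (assumed unit), from the text
OFF_PEAK_RATE = 0.09  # $ per kWh (assumed unit), from the text

# Enabling Technology Costs for HVAC control quoted in the text,
# assumed here to be a single cost per site.
ETC_USD = {"shed": 169.0, "shed_and_shift": 242.0, "shimmy": 2376.0}


def shift_benefit(shifted_kwh: float) -> float:
    """Shifted energy is still consumed, but at the off-peak rate."""
    return shifted_kwh * (PEAK_RATE - OFF_PEAK_RATE)


def shed_benefit(shed_kwh: float) -> float:
    """Shed energy is not consumed at all, so the full peak rate is saved."""
    return shed_kwh * PEAK_RATE


def net_benefit(gross_usd: float, etc_usd: float) -> float:
    """Benefit remaining after the enabling technology cost is subtracted."""
    return gross_usd - etc_usd


# Toy yearly flexible peak energy (hypothetical value)
flexible_kwh = 10_000.0

# ALLSHIFT: all flexible energy is moved to off-peak hours.
all_shift_gross = shift_benefit(flexible_kwh)

# SHED & SHIFT: cooling is shed, heating and fans are shifted.
# Assumed split: 24% heating and fans, 76% cooling.
shifted_kwh = 0.24 * flexible_kwh
shed_kwh = flexible_kwh - shifted_kwh
shed_and_shift_gross = shed_benefit(shed_kwh) + shift_benefit(shifted_kwh)
shed_and_shift_net = net_benefit(shed_and_shift_gross, ETC_USD["shed_and_shift"])
```

With any positive off-peak rate, a shed kilowatt-hour is worth more than a shifted one, which is why the combination outperforms shifting alone in this sketch; the size of the gap depends on how much of the load can actually be shed.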

IV. CONCLUSION
Starting from a large consumption dataset of commercial buildings of different types, we tested two approaches to store and reduce the data. We chose the AWS approach over the Hadoop one, because of lower processing time and a more mature and ready-to-use environment centered around SageMaker and S3.
After choosing the data lake architecture for big data, we applied two machine learning methods to forecast the flexibility: Long Short-Term Memory (LSTM) and regression analysis using the PyCaret library. Both methods provided promising results, since both estimations were fairly close to the actual consumption values. However, it can be concluded that the LSTM model is more suitable for the time series analysis, as it provides a better RMSE score than the regressors. An additional insight from the analysis is that the LSTM model requires a larger dataset for training in order to obtain a high-accuracy forecast. In contrast, the regression model also offers good results on aggregated datasets, so it does not depend on the availability of big datasets to return relevant forecasts.
With the reduced dataset and a reliable forecast, we proposed a flexibility assessment method to identify the flexibility potential of commercial buildings and the efficiency of several DR programs. The merit of our method is twofold: first, it uses the results from previous studies regarding the flexibility coefficients and DR programs; second, it is a novel contribution, as it provides an approach to evaluate the flexibility that can be used by grid operators to balance the systems and by suppliers to enhance their market acquisition strategies. The results are better when combining the DR services SHIFT & SHED. As future work, we consider enhancing the triplet of big data, forecast, and flexibility assessment. Furthermore, we aim to improve the forecasting models featured in this study to obtain an even more accurate forecast. To do this, the scope will be broadened, and we plan to use an extended period. This extension will allow the models to train better and, as a consequence, increase accuracy. Additional enhancements are to be investigated, such as finding the best parameters for the models, as well as identifying and selecting the most relevant variables for the newly created context.