Data-driven control of room temperature and bidirectional EV charging using deep reinforcement learning: simulations and experiments

This work presents a fully data-driven, black-box pipeline to obtain an optimal control policy for a multi-loop building control problem based on historical building and weather data, thus without the need for complex physics-based modelling. We demonstrate the method for joint control of room temperature and bidirectional electric vehicle (EV) charging to maximize the occupant thermal comfort and energy savings while leaving enough energy in the EV battery for the next trip. We modelled the room temperature with a recurrent neural network and EV charging with a piece-wise linear function. Using these models as a simulation environment, we applied a deep reinforcement learning (DRL) algorithm to obtain an optimal control policy. The learnt policy achieves on average 17% energy savings over the heating season and 19% better comfort satisfaction than a standard rule-based (RB) room temperature controller. When a bidirectional EV is additionally connected and a two-tariff electricity pricing is applied, the multi-input-multi-output (MIMO) DRL policy successfully leverages the battery and decreases the overall cost of electricity compared to two standard RB controllers, one controlling the room temperature and another controlling the bidirectional EV (dis-)charging. Finally, we demonstrate a successful transfer of the learnt DRL policy from simulation onto a real building, the DFAB HOUSE at Empa Duebendorf in Switzerland, achieving up to 30% energy savings while maintaining similar comfort levels compared to a conventional RB room temperature controller over three weeks during the heating season.


Introduction
Buildings account for one-third of global primary energy consumption and one-quarter of greenhouse gas (GHG) emissions. Consequently, they have been identified as a critical element to enable climate change mitigation [1]. When looking at the energy use during a building's life-cycle, about 80% of it stems from building operation [2]. However, over the last two decades, buildings have become much more complex to operate optimally due to the integration of renewable energy generation, transformation, and storage devices [3]. Also, due to the electrification of the mobility sector, electric vehicle (EV) chargers are installed in buildings, thus further increasing the control complexity [4]. Today, multiple energy-flows are possible within a single building, which gives rise to the need for system-wide optimal energy management. At the same time, the users' needs for comfort, such as indoor thermal and visual comfort, and having enough energy in the EV battery for the next trip, shall be satisfied.
This multi-loop, multi-criteria control problem has raised specific needs within the building automation (BA) industry related to delivering optimal performance at low development and commissioning costs. In the following text, we first provide an overview of the BA industry requirements for an optimal controller for modern buildings. Then, we list the limitations of the current widespread rule-based (RB) controllers and the advanced, state-of-the-art model-based controllers, of which Model Predictive Control (MPC) is the most famous representative. We describe why both control methods fail to satisfy the current BA industry requirements. Following that, we motivate the potential of deep reinforcement learning (DRL) algorithms for the BA industry. We briefly review the current work on DRL applied to room temperature and EV charging control, and we close the Introduction with an overview of this work and a summary of main contributions.

Current BA industry requirements
BA requirement I - Multi-loop control policy: Compared to the situation before the 2000s, renewable energy generation, transformation, and storage devices have now been vastly integrated into new or retrofitted buildings, allowing for more energy-efficient and cleaner operations in terms of GHG and, in particular, CO2 emissions [5,6]. A typical set of these devices could include photovoltaic (PV) panels, battery storage, a heat pump, and thermal storage. Hence, the number of possible energy flows and the number of decision variables have increased. For example, electricity could be obtained either from the grid, a stationary battery, or PV panels. Similarly, when and which electricity source to use to charge an EV depends on several factors, such as weather prediction and price of the electricity. Therefore, control of a modern building is a multi-input-multi-output (MIMO), i.e., multi-loop, energy management problem, which requires finding a control policy for several controlled variables simultaneously while considering several external factors that influence it.
BA requirement II - Building-EV coupling: The building-mobility sector coupling allows for more efficient control solutions than when these two sectors are addressed separately [7]. For example, when the electricity price is low, the building management system (BMS) could decide to heat the room, charge the EV, or store the electricity in a stationary battery for later use. On the other hand, this coupling also brings challenges. The charging of EVs causes additional energy consumption for a building, increasing its total, and possibly peak, energy consumption. Furthermore, most EV chargers start charging with full power as soon as an EV is connected. Therefore, if multiple EVs are charged at the same time in a neighbourhood, the aggregated demand can be very high, potentially causing energy dispatching and grid stability issues.
A particularly interesting symbiosis between a building and an EV arises when the latter is bidirectional, i.e., the EV battery can be both charged and discharged. In that case, the stored energy could be used as a source of electricity for a building [8], and the EV battery expands the capacity of the stationary battery, if one is installed. The difference from the stationary battery lies in its availability: the battery of a bidirectional EV is only available when the EV is connected to the building. Therefore, a BMS can use the battery of a bidirectional EV for energy management only while the EV is connected, and it must also ensure that the EV battery is charged to a satisfactory level before the next trip.
BA requirement III - Occupants' comfort: In developed countries, people spend on average 80-90% of their time indoors. Therefore, the influence of building systems on occupants' well-being is deemed critical [9]. Consequently, occupants place increasingly stringent comfort requirements on facility managers, which are passed on indirectly to the BA industry. Therefore, the value of a building controller is measured not only in terms of saved energy but also in terms of how comfortable the indoor environment is for the occupants.
BA requirement IV - Transferability: Buildings differ from each other in terms of construction properties (e.g. floor layout, geometry, materials used, age), installed building services (e.g. heating, ventilation, and air-conditioning (HVAC) systems), outside conditions (climatic region, orientation), and occupancy profiles. Therefore, an ideal building controller shall be able to provide optimal performance not only for the building it is designed for, but also for other similar buildings. If the engineering effort to apply such a controller to a similar building is small or negligible in terms of expert knowledge and time required, then the controller is considered transferable.
BA requirement V - Adaptability and continuous commissioning: The dynamics of a building can change significantly during its lifetime for several reasons, such as a retrofit, a change in the occupancy profile, or ageing. An ideal building controller shall detect a change in the building operation performance, e.g. if a building starts to consume more energy than it used to, and readjust its parameters, i.e. adapt to the new situation. This capability of a controller is also called continuous commissioning [10,11].
Overall, the control of a modern building is a complex MIMO control problem with the objective to provide the desired thermal comfort to the occupants and simultaneously ensure the EV is charged to a satisfactory level for the next trip, all while minimizing the overall energy consumption to reduce the costs.

Limitations of RB controllers
Traditionally, more than 90% of industrial BA controllers are RB, such as bang-bang or proportional-integral-derivative (PID) controllers. They have fixed predefined rules, simple architectures with straightforward implementation, and several parameters with clear guidance on how to tune them. Even though RB controllers (RBCs) are widely adopted in the BA industry, there are several limitations on their use for achieving optimal control of modern buildings.
RBCs limitation I - Difficulty to achieve system-wide optimal performance: RBCs are suitable for single-output control loops, whether single-input-single-output (SISO) or multi-input-single-output (MISO). Applying these single-output controllers to solve a multi-output (MIMO) control problem is a challenging and often infeasible task in practice, as MIMO systems typically have dependencies between their sub-systems that cannot be neglected [10,12,13]. A MIMO system typically cannot be addressed as a collection of individual SISO/MISO systems [14]. For example, the temperature of the thermal storage determines the available heating capacity over the next couple of hours, while the output heating power of a heat pump connected to the thermal storage determines at what pace the temperature of the storage could be increased. A similar analogy could be drawn for the EV battery and its charging and discharging power. Indeed, optimal control of MIMO systems requires applying advanced MIMO control techniques [14].
RBCs limitation II - Absence of optimality guarantees: Even for single-output problems, manual tuning cannot provide optimality guarantees: control experts could tune the RBC, in particular PID, to provide close-to-optimal regulation performance by looking at the overshoot, rise time, stability margins, and disturbance rejection, but there is no mathematical optimization involved in the tuning of the parameters. Therefore, most of the RB controlled loops in buildings perform sub-optimally [10,12,13].
RBCs limitation III - Difficulty to include prediction rules: RBCs do not typically involve any prediction rule. A prediction rule could be defined, for example, for pre-scheduled dynamic comfort bounds, which change between narrow, e.g. [22°C, 24°C], and wider, e.g. [20°C, 26°C], constraints. Such dynamic bounds are typical for office buildings, where wider comfort bounds are allowed outside of office hours to save energy. However, as an RBC would react to the change of the bounds only at the time of their change, such control will violate comfort. On the other hand, a predictive controller would pre-heat the room for some time before the narrow bounds need to be reached. Defining and tuning the prediction rule in an RBC would require experimenting with a building and determining the dominant time constant of a particular room so that the pre-heating interval could be defined precisely. However, this interval depends on the day of the year, the state of the room, i.e., accumulated heat in walls, and weather prediction. Hence, determining it precisely for all combinations of these parameters over the year is a challenging task [12].
RBCs limitation IV - Absence of self-adaptation: RBCs need to be re-tuned after a change in building dynamics to regain the previous performance, which requires expert knowledge and incurs costs [10] (see BA requirement V).
Overall, RBCs fail to satisfy all the needs of the BA industry for an efficient way of obtaining an optimal controller for a modern building: they can only provide sub-optimal performance and require expert knowledge during commissioning and maintenance.

Limitations of MPC controllers
Advanced controllers, on the other hand, in their classical and non-adaptive form, can overcome the first three limitations of RBCs. The most well-known representative of this type of controllers is MPC, which can calculate optimal MIMO control signals for several steps ahead while respecting the state, input, and/or output constraints. However, the performance of an MPC controller strongly depends on the quality of the underlying building model used to develop this controller. A building model of poor quality, which does not represent the true building dynamics well, e.g. a simple grey-box model with some generic parameters, will lead to unacceptable control performance [15]. On the other hand, obtaining a high-quality building model is a complex and time-consuming task requiring expert knowledge. Therefore, the costs of developing and implementing an MPC controller are justifiable only for well-defined systems, where the same controller could be used on many instances of the same system. However, as buildings differ substantially from each other, the costs of developing and deploying MPC controllers outweigh the cost benefits, and, therefore, classical versions of MPC controllers have not yet been widely adopted in the BA industry [15-17].
Over time, stochastic [18], robust [19], and adaptive [20] MPC controllers have been developed to address or circumvent the need for a high-quality building model. Even though some directions are promising, in particular those of adaptive MPC controllers with online system identification [20], they have only been applied to single-zone temperature control problems and validated in simulation. Validation on real buildings and solving of more complex building problems is needed for these methods to be accepted by the BA industry.

State-of-the-art data-driven RB and MPC controllers
In recent years, due to the increased availability of stored sensor and actuator data in buildings, researchers have started exploiting the information contained in this past data to develop improved building controllers. Two trends can be observed: first, using data to improve classical control strategies, such as RB and MPC, and second, applying pure data-driven methods from the machine learning (ML) domain and adapting them to building control.
The first direction, data-driven autotuning of RBCs, even though interesting from the industry perspective due to potential direct applicability, has not yet been widely addressed in the literature -only some recent preliminary results exist [21,22].
Considerably more work has been published in the domain of learning-based MPC (LB-MPC) recently [23,24]. The most widespread approach is to model the building dynamics with a neural network (NN) and use it as a model in the MPC framework. However, as NNs are non-linear models, the main challenge is to use them in a linear or convex fashion so that efficient solvers could be applied. One option is to design a NN that can be used within MPC by constraining the output of the model to be convex with respect to the control inputs [25]. Besides NNs, Jain et al. [16] use Gaussian processes to learn a model, which is then used within MPC. Another approach uses random forests for modelling [26]. Recently, this approach has also been validated experimentally, and preliminary results are promising [27]. Even though initial results on data-driven MPC are promising, what is missing is a discussion of the scalability and transferability of these approaches across different buildings (see BA requirement IV).

Potential of DRL for building control
In terms of pure data-driven ML methods, reinforcement learning (RL), and in particular DRL, have emerged in recent years as approaches that can fulfil all the requirements for modern building control. Even though RL was established in the 1960s [28], complex problems remained out of reach until recently, when researchers started using NNs within the RL framework. Together with the increased availability of large data sets and extensive computational power on demand, this led to the popularisation of DRL methods and the demonstration of successful solutions for complex real-world problems [29,30]. DRL algorithms were shown to achieve human-level or even super-human performance in playing Atari games [31] and the game of Go [32]. Since then, other problems have been solved at human or super-human levels in image recognition [29], natural language processing [33], and medicine [34]. Motivated by these achievements in DL and DRL, building and control engineers started applying these methods to building control [35-37]. There are several reasons why DRL is a promising framework to fulfil all the BA industry requirements for control of modern buildings.
DRL potential I: DRL algorithms operating on a continuous state space, such as deep deterministic policy gradient (DDPG) [38], can learn a continuous control policy to maximize a given reward function through interactions with the environment. As DDPG requires many interactions with the building, learning directly on the real building is not feasible in practice, and one has to rely on models to learn the optimal policy. There are no particular requirements on the underlying model, such as the convexity condition needed in MPC. As a building model, one could use any kernel-type model. NNs are particularly popular as they can capture the non-linear dynamics of the building [39,40]. After fitting the model to the past data, it is used as the simulation environment in the RL framework.
Figure 1: Overview of the room model, bidirectional EV model, and joint deep reinforcement learning controller
DRL potential II: There are no restrictions on how the reward function could be defined. Not only a single criterion but also multi-criteria reward functions and trading off requirements could be used. Hence, complex MIMO control policies could be obtained at once (see BA requirements I, II, and III).
DRL potential III: Once the method is working for a certain room or building, it could also be applied to other rooms or buildings. The main part of the algorithm could be reused directly, thus demonstrating the transferability of the method (see BA requirement IV). This problem is known as transfer learning, and it has been already considerably addressed in general reinforcement learning [41]. However, only limited prior work was published recently on the transferability of DRL algorithms for building control [42].
DRL potential IV: Finally, if updated with the newly received measurement data, the DRL algorithm could be updated online to adapt to the new building dynamics, thus fulfilling the BA requirement V [43].

State-of-the-art DRL-based room temperature and EV charging control
Most previous works on RL and DRL consider either controlling the building energy system, e.g. [27,44-49], or EV charging, e.g. [50-55]. There are a few works that control both the charging of an EV and a building energy system, e.g. [56-59]. In [56], for example, a building equipped with PV, an EV and an energy storage system is considered as a smart grid system, but no temperature control is addressed. The authors of [57] minimize the costs of electricity through improved operations of an HVAC system, an EV, a washing machine and a dryer. In [58,59], one-day ahead planning is used for building control, including an EV supporting bidirectional charging.

Novelty and contribution of this work
In this work, we describe a fully black-box, data-driven, DRL-based method for the joint control of a room temperature and bidirectional EV charging (see Fig. 1). The main contributions of this work are the following.
First of all, the proposed data-driven pipeline requires only historical data to learn an optimal building control policy, and thus avoids the need for complex physics-based modelling required to develop advanced, model-based controllers (see Sec. 1.3) or expensive fine-tuning of the conventional, rule-based controller (see Sec. 1.2). As a black-box simulation environment that does not require any physics-based prior knowledge to train the policy, we use a recurrent neural network (RNN) model of the room thermal dynamics and a piece-wise linear model of the EV battery. We applied Deep Deterministic Policy Gradient (DDPG), which is a DRL algorithm in the continuous domain, to learn the control policy. Hence, this pipeline is a cost-effective way to obtain an optimal MIMO building control policy by only using available historical data of a building.
Secondly, we use the historical data from a real building, the DFAB HOUSE at Empa Duebendorf in Switzerland, to obtain a close-to-reality simulation environment. We analysed the simulation results of the DRL policy over a heating season in terms of energy savings and occupant comfort and showed that it delivers better performance than a standard industrial RB controller. Furthermore, we considered an extended problem in which a bidirectional EV is connected to the building and the electricity price has two tariffs. We analysed the simulation results of the simultaneous control of room temperature and bidirectional EV (dis-)charging in terms of cost savings while minimizing the comfort violations for the desired comfort bounds and providing enough energy to the EV battery for the next trip. The obtained DRL-based control policy showed better performance compared with two standard industrial RB controllers, one for temperature regulation and another for EV (dis-)charging.
Thirdly, we validated experimentally the learnt DRL policy for room temperature control during the heating season. The DRL policy was directly transferred from simulation onto the real building, the DFAB HOUSE, and it was successfully regulating the temperature from the initial time of deployment, achieving up to 30% energy savings and better comfort satisfaction compared to a conventional, rule-based controller.
Fourthly, we discuss throughout the paper the potential of this approach to satisfy all the BA Requirements (I-V).

Structure
The paper is structured as follows: In Section 2, the case study used to showcase the proposed data-driven building control methodology and the data collection process are described. In Section 3, we present the methods used to model the room temperature and the SoC of the bidirectional EV. Further, we describe the definition of the RL environment and the reward functions for two different problems: i) room temperature control and ii) joint control of the room temperature and bidirectional EV charging. The simulation and experimental results are illustrated in Section 4. Finally, Section 5 provides an overview and concluding remarks of this work, as well as directions for future research.

Case study and data collection
The case study is the DFAB HOUSE, a three-storey residential building of the Empa demonstrator NEST in Duebendorf, Switzerland [60] (Fig. 2). NEST [61] (Fig. 2c) is a vertically integrated neighbourhood and a living lab. The DFAB HOUSE has been operational since March 2019, and the corresponding sensor and actuator data are collected at 1 min resolution. We chose one bedroom (room 471, Fig. 2a) to apply our control algorithm. In this room, we can control the opening and closing of the valve that regulates the water flow into the floor heating system. As a bidirectional EV was not available at the time of this work, we emulated it based on the past charging/discharging data of the stationary battery at NEST. For information on data preparation see Appendix C.

Methodology
In the following subsections, first, the overview of the control problem and the corresponding model are provided. Then, the data-driven pipeline consisting of the data-driven modelling, RL environment, and DRL algorithm is described.

The control problem and model overview
The overview of the system to be controlled is illustrated in Fig. 1. It consists of two parts: the room temperature model and the EV battery charging/discharging model. These two models are mainly independent, as they serve two different needs of the building occupants, namely to provide indoor comfort and enough battery capacity for the next trip, respectively. They are, however, linked through the overall building electricity demand. If the EV is being charged, the energy used represents additional building energy demand. If the electric energy for heating/cooling is sourced from the bidirectional EV battery instead of from the grid, then the overall building demand is reduced.
We can therefore formulate the control problem as follows: given the energy stored in the bidirectional EV battery, what are the optimal room temperature control (heating or cooling) and the optimal EV (dis-)charging strategy such that the overall cost of energy is minimized while satisfying the indoor comfort bounds and the minimum state of charge (SoC) of the EV at the moment of leaving? We assume that the EV leaves at 7:00 with a minimum of 60% SoC and returns at 17:00 with 30% SoC. The energy price is assumed to follow a standard two-tariff profile, with a high price between 8:00 and 20:00 and a low price outside of this interval.
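As a minimal sketch of the assumptions stated above, the tariff window and the EV availability schedule can be written as functions of the hour of day. The function names, the `"high"`/`"low"` labels, and the treatment of the interval boundaries (8:00 counted as high, 20:00 as low) are illustrative assumptions; the actual price levels are not specified here.

```python
# Illustrative encoding of the assumed tariff windows and EV schedule.
# All names are hypothetical; hours are floats in [0, 24).

def tariff(hour: float) -> str:
    """High tariff between 8:00 and 20:00, low tariff otherwise."""
    return "high" if 8.0 <= hour < 20.0 else "low"

def ev_connected(hour: float) -> bool:
    """The EV leaves at 7:00 and returns at 17:00, so it is away during [7, 17)."""
    return not (7.0 <= hour < 17.0)

MIN_SOC_AT_DEPARTURE = 60.0  # percent, required at 7:00
SOC_AT_RETURN = 30.0         # percent, assumed at 17:00

# Count the hours (hourly resolution) in which the connected EV sees each tariff.
overlap = {"high": 0, "low": 0}
for h in range(24):
    if ev_connected(h):
        overlap[tariff(h)] += 1
print(overlap)
```

Under this hourly discretization, the EV is connected during 3 high-tariff hours (17:00 to 20:00) and 11 low-tariff hours, which is the window in which a controller can shift charging to the low tariff or discharge during the high tariff.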

Data-driven modelling of the room temperature and bidirectional EV charging
In this section, the models of the room temperature and the weather are described. Then, the two models are combined to obtain the final room temperature model, and we provide details on the RNN architecture, model training, and hyperparameter tuning. Finally, we describe the bidirectional EV charging/discharging model.

Room temperature model
We consider the temperature control of a single room (a single zone, i.e. room 471) at the DFAB HOUSE. The room temperature $r_t \in \mathcal{R}$ depends on the outside temperature $o_t \in \mathcal{O}$, the solar irradiance $i_t \in \mathcal{I}$, the in-/out-flowing water temperatures of the pipes $h^{\mathrm{in}}_t, h^{\mathrm{out}}_t \in \mathcal{T}_h$, and the valve position $u_t \in \mathcal{U}$ (see Fig. 1). Here, the index $t$ denotes the time of the measurement. Since we will be using the room temperature model as a simulation environment for the RL agent, we need a model that predicts all uncontrollable (independent) variables. These are all of the above variables apart from the state of the valve $u_t$. Therefore, we define the state space of the room $\mathcal{X}$ as the space of all non-controllable variables:
$$\mathcal{X} := \mathcal{R} \times \mathcal{O} \times \mathcal{I} \times \mathcal{T}_h \times \mathcal{T}_h$$
(see Table 1 for definitions of all spaces used).
One way to solve the modelling task would be to fit the data with a multivariate time-series prediction model in an end-to-end fashion. This would allow predicting the evolution of all the variables based on their past values. Since the data collection at the DFAB HOUSE only started in March 2019, there was less than a year of operation and available historical data at the time of this work. To make the most out of this limited amount of data, we took a few more considerations into account that led us to partition the room model into different sub-models. They are discussed in the subsequent sections.
Remark 1. The control framework described here could also be applied to different types of heating and cooling systems, where heating and cooling are provided by two different devices, e.g. an electric heater and an AC unit.
To get a smooth time variable, we use $\tau^{s}_t = \sin(\tilde{\tau}_t) \in \mathcal{D}^{s}$ and $\tau^{c}_t = \cos(\tilde{\tau}_t) \in \mathcal{D}^{c}$, where $\tilde{\tau}_t \in \tilde{\mathcal{D}}$ goes linearly from 0 to $2\pi$ during each day. To simplify the notation, we define $\tau_t := (\tau^{s}_t, \tau^{c}_t) \in \mathcal{D} := \mathcal{D}^{s} \times \mathcal{D}^{c}$ as the combined time variable. Note that one could also define the time in a linear fashion, numbering the time intervals during each day. However, this induces jumps at midnight from the last to the first interval. In other words, two extreme values are given to two adjacent intervals. Introducing the smooth sine and cosine time variables allows us to convey to the model that these intervals are close to each other.
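The effect of the sine/cosine encoding can be checked numerically. The following sketch assumes 96 intervals of 15 minutes per day (an assumption made for this illustration) and uses hypothetical function names; it shows that the last and the first interval of a day are exactly as close in the encoded space as any other pair of neighbouring intervals.

```python
import math

STEPS_PER_DAY = 96  # 15-minute intervals; an assumption for this illustration

def time_features(step: int) -> tuple[float, float]:
    """Map an interval index within the day to (sin, cos) of its angle in [0, 2*pi)."""
    angle = 2.0 * math.pi * (step % STEPS_PER_DAY) / STEPS_PER_DAY
    return math.sin(angle), math.cos(angle)

def feature_distance(step_a: int, step_b: int) -> float:
    """Euclidean distance between the encodings of two intervals."""
    sa, ca = time_features(step_a)
    sb, cb = time_features(step_b)
    return math.hypot(sa - sb, ca - cb)

# With linear numbering, intervals 95 and 0 would differ by 95.
# With the cyclic encoding they are neighbours like any other adjacent pair.
d_midnight = feature_distance(95, 0)
d_daytime = feature_distance(10, 11)
print(d_midnight, d_daytime)
```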

Weather model
While there is a correlation between, e.g., the room temperature and the outside temperature, the room temperature has no influence on the weather. Therefore, to prevent the output of the weather model from depending on the room state variables, we train a separate model of the weather. Such a model is useful if no weather prediction data is available on site and only past observed weather data can be taken as inputs. This model predicts the weather variables (outside temperature and irradiance) based on the past values of those variables and the time of day. Let $\mathcal{W} := \mathcal{O} \times \mathcal{I}$ and $w_t := (o_t, i_t)$ denote the combined weather data, where $o_t$ is the outside temperature and $i_t$ the irradiance. The weather model is then defined as the following mapping:
$$m^{\mathrm{weather}}: (w_{t-n+1:t}, \tau_{t-n+1:t}) \mapsto \hat{w}_{t+1} \qquad (1)$$
Note that the weather model takes the previous $n$ values of the input series, i.e. $w_{t-n+1:t}$ and $\tau_{t-n+1:t}$, into account to produce the output. Here, $\tau_t$ denotes the time-of-day variable, and the "hat" notation denotes a predicted variable.
The temperatures of the water entering and leaving the cooling/heating system over a few weeks in summer are shown in Figure 3. It can be seen that the water temperature coming from the heat pump is kept almost constant, but not always at the same level, which depends on the average outside temperature over a day. Since we are only interested in predictions with a rather short horizon of one day at most, we decided to use a constant predictor for the water temperature variables. While this is a valid assumption for the inflow temperature, the outflow temperature is much more dynamic. However, we retained this assumption for the sake of the simplicity of this model.

Final room temperature model
The final room temperature model can now be defined. This model takes the previous values of the state variables $x_t \in \mathcal{X}$ and of the controllable variable $u_t$ (the valve position) to predict the room temperature $\hat{r}_{t+1}$ at the next time step, i.e.:
$$m^{\mathrm{temp}}: (x_{t-n+1:t}, u_{t-n+2:t+1}) \mapsto \hat{r}_{t+1} \qquad (2)$$
Note that we use $u_{t+1}$ to make the prediction. This is done deliberately since the model should give us the next state $\hat{x}_{t+1}$ given the next control input $u_{t+1}$.
Table 1: Overview of variables used in the model and their corresponding mathematical spaces
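Putting everything together, we can now build the full model of the room, $m^{\mathrm{room}}$, by combining the previously defined sub-models: the weather model $m^{\mathrm{weather}}$ (1), the water temperature model $m^{\mathrm{water}}$, and the room temperature prediction model $m^{\mathrm{temp}}$ (2), as follows:
$$m^{\mathrm{room}}: \bar{x}_{t-n+1:t} \mapsto \hat{x}_{t+1} = (\hat{r}_{t+1}, \hat{w}_{t+1}, \hat{h}^{\mathrm{in}}_{t+1}, \hat{h}^{\mathrm{out}}_{t+1})$$
with $x_t \in \mathcal{X}$ and $\bar{x}_t = (x_t, u_{t+1}) \in \mathcal{X} \times \mathcal{U}$. As mentioned previously, this model takes into account the previous values of the input series $x_{t-n+1:t}$ and the same number of control inputs $u_{t-n+2:t+1}$ to compute the output. By feeding each model the correct input, we can put together the desired output $\hat{x}_{t+1}$.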
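The composition of the sub-models can be sketched as follows. This is an illustration under stated assumptions, not the implementation used in this work: the state layout, the function names, and the two toy stand-in models are hypothetical, the time-of-day inputs are omitted for brevity, and only the constant water temperature predictor follows directly from the text.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical state layout: (room temp, outside temp, irradiance, water in, water out)
State = Tuple[float, float, float, float, float]

def water_model(x_hist: Sequence[State]) -> Tuple[float, float]:
    """Constant predictor: water temperatures stay at their last observed values."""
    return x_hist[-1][3], x_hist[-1][4]

def room_model_step(
    x_hist: Sequence[State],
    u_hist: Sequence[float],
    weather_model: Callable[[Sequence[State]], Tuple[float, float]],
    temp_model: Callable[[Sequence[State], Sequence[float]], float],
) -> State:
    """Assemble the next full state from the outputs of the three sub-models."""
    o_next, i_next = weather_model(x_hist)
    h_in_next, h_out_next = water_model(x_hist)
    r_next = temp_model(x_hist, u_hist)
    return (r_next, o_next, i_next, h_in_next, h_out_next)

# Toy stand-ins for the trained networks (NOT the RNNs described in the text).
def toy_weather(x_hist):
    return x_hist[-1][1], x_hist[-1][2]  # persistence forecast

def toy_temp(x_hist, u_hist):
    r, o, _, h_in, _ = x_hist[-1]
    u_next = u_hist[-1]  # last entry is the control applied over the next step
    return r + 0.1 * u_next * (h_in - r) + 0.01 * (o - r)

history = [(21.0, 5.0, 0.0, 30.0, 27.0)]
x_next = room_model_step(history, [1.0], toy_weather, toy_temp)
print(x_next)
```

Keeping the weather model separate from the room temperature model means its output cannot depend on the room state, which mirrors the partitioning argument made above.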

RNN model
RNNs are commonly used in time series prediction to capture time dependencies and tendencies [63]. Fig. 4 illustrates how a single step prediction is made and this scheme is naturally expanded to multi-step predictions. In that setting, part of the input is unknown and relies on the previous outputs of the model. It is then merged together with the known input part and fed to the RNN to build the next prediction. Repeating this procedure allows one to get predictions for any number of steps for weather and room temperature models. Note that in practice, we train the actual recurrent model to only predict the difference in the prediction state, not the absolute state.
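The multi-step scheme described above can be sketched in a few lines. The following is a minimal illustration with a scalar state and a toy difference model standing in for the trained RNN; the function names and the toy dynamics are assumptions made for this example.

```python
from typing import Callable, List, Sequence

def rollout(
    delta_model: Callable[[Sequence[float], Sequence[float]], float],
    past_states: Sequence[float],
    known_inputs: Sequence[float],
    n_steps: int,
) -> List[float]:
    """Predict n_steps ahead by feeding each prediction back as an input.

    delta_model(window, known) returns the predicted *change* of the state;
    the absolute prediction is recovered by adding it to the last state.
    known_inputs[k] is the externally known input for step k (e.g. the control).
    """
    window = list(past_states)
    predictions = []
    for k in range(n_steps):
        delta = delta_model(window, known_inputs[k : k + 1])
        next_state = window[-1] + delta
        predictions.append(next_state)
        window = window[1:] + [next_state]  # slide the window over the prediction
    return predictions

# Toy stand-in for the trained RNN: relax halfway towards the known input.
def toy_delta(window, known):
    return 0.5 * (known[0] - window[-1])

preds = rollout(toy_delta, past_states=[20.0, 20.0], known_inputs=[22.0] * 3, n_steps=3)
print(preds)  # [21.0, 21.5, 21.75]
```

Because each prediction becomes part of the next input window, errors can accumulate over the horizon, which is one reason to evaluate models on multi-step rather than single-step predictions.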
To optimize the loss, we use the ADAM [64] optimizer with a base learning rate $\eta$ to minimize the mean-square-error (MSE) between the predictions and the ground truth. The training of the model lasted for $n_{\mathrm{ep}}$ episodes (see Table 2). We also monitor the losses on the training and on the validation set to get an idea about the amount of overfitting. The data used to fit the model is shuffled to avoid seasonal dependencies between the data in consecutive batches.
Figure 4: On the left, the model uses the provided inputs to make a prediction $\hat{x}_{t+1}$. On the right, it extracts the true output $x_{t+1}$ from the data, which can then be compared to the prediction to compute the loss and train the network.
The hyperparameters that are used to tune the recurrent models are listed in Table 2. There are a few more parameters that we choose heuristically, for example, the number of neurons in each recurrent layer. To compare the performance of the models trained with different hyperparameters, we use the following objective: we predict 6 h (i.e. 24 timesteps of 15 minutes) into the future and take the MSE between this prediction and the ground truth, using the validation data. The main idea is to find a model that generalizes well over multiple consecutive predictions and over unseen data. For the actual optimization, a Tree-structured Parzen Estimator [65] is used, as implemented in the Python library hyperopt [66].
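The selection objective can be illustrated with a small sketch. It computes the MSE of 24-step predictions over all windows of a validation series and picks the candidate with the lowest score. The candidates here are two toy predictors on synthetic data, and a plain comparison replaces the Tree-structured Parzen Estimator search; all names are hypothetical.

```python
from typing import Callable, List, Sequence

HORIZON = 24  # 6 h at 15-minute resolution

def multi_step_mse(
    predict: Callable[[Sequence[float], int], List[float]],
    series: Sequence[float],
    history_len: int,
    horizon: int = HORIZON,
) -> float:
    """Mean squared error of horizon-step predictions over all windows of a series."""
    errors = []
    for start in range(history_len, len(series) - horizon + 1):
        history = series[start - history_len : start]
        truth = series[start : start + horizon]
        pred = predict(history, horizon)
        errors.extend((p - y) ** 2 for p, y in zip(pred, truth))
    return sum(errors) / len(errors)

# Toy candidates standing in for models trained with different hyperparameters.
def persistence(history, horizon):
    return [history[-1]] * horizon

def linear_trend(history, horizon):
    slope = history[-1] - history[-2]
    return [history[-1] + slope * (k + 1) for k in range(horizon)]

validation = [0.1 * k for k in range(60)]  # synthetic ramp, not real data
scores = {
    "persistence": multi_step_mse(persistence, validation, 4),
    "linear_trend": multi_step_mse(linear_trend, validation, 4),
}
best = min(scores, key=scores.get)
print(best)  # linear_trend
```

In a hyperparameter search, a function like `multi_step_mse` would serve as the loss returned to the optimizer for each trained candidate.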

Bidirectional EV charging/discharging model
We use a stationary battery available at NEST to emulate the battery of a bidirectional EV. This battery has a maximum capacity of 96 kWh at a SoC of 100%. However, we restrict the SoC to lie within the interval [20.0%, 80.0%] for safety reasons (the details on the safety are discussed later in Section 3.3.2). Furthermore, we limit the charge and discharge rate to ±100 kW. Both the stated maximum capacity and the maximum (dis-)charging rate are also found in the following EV models: Tesla Models S and X [67], BMW i3 [68], and Mercedes-Benz EQC [69].
The change in SoC is modelled to be proportional to the active power applied, but the proportionality factor can be different for charging and discharging. We also allow for a constant discharging rate when the battery is not used, i.e. if the applied active power is zero, the battery slowly decreases its SoC due to losses. Let $s_t \in \mathcal{S} := [20.0\%, 80.0\%]$ be the SoC at time $t$ and let $p_t \in \mathcal{P} := [-100\,\text{kW}, 100\,\text{kW}]$ be the average active power from time $t-1$ to time $t$. Finally, let $\Delta s_t := s_t - s_{t-1}$ be the change in SoC at time $t$ compared to time $t-1$. Therefore, we model the change in SoC, or charging/discharging of the EV battery, as:

$$\Delta s_t(p_t) = \alpha_0 + \alpha_1 p_t + \alpha_2 \max\{0, p_t\}, \tag{4}$$

where $\alpha_i$, $i = 0, 1, 2$, are the variable coefficients that can be fitted to the data using least squares. Finally, we can define the battery model as:

$$m(s_t, p_{t+1}) := s_t + \Delta s_{t+1}(p_{t+1}). \tag{5}$$

It models how the SoC evolves when an active power of $p_{t+1}$ is applied. We consider the battery to be charging if the active power is positive and discharging otherwise.
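As a minimal sketch of such a piece-wise linear battery model, assume the change in SoC takes the form a0 + a1·p + a2·max(0, p), i.e. a constant loss term plus one slope for discharging (a1) and another for charging (a1 + a2). The function names and the coefficient values below are hypothetical and chosen only for illustration; they are not the fitted values.

```python
def delta_soc(power, a0, a1, a2):
    """Change in SoC (in %) over one timestep for an average active power (kW)."""
    # a0: constant loss, a1: slope while discharging, a1 + a2: slope while charging.
    return a0 + a1 * power + a2 * max(0.0, power)


def battery_step(soc, power, coeffs):
    """Next SoC when `power` is applied over the next timestep."""
    a0, a1, a2 = coeffs
    return soc + delta_soc(power, a0, a1, a2)


# Hypothetical coefficients: small self-discharge, charging slightly less
# efficient than discharging.
COEFFS = (-0.02, 0.06, -0.01)

soc_idle = battery_step(50.0, 0.0, COEFFS)        # only the loss term acts
soc_charged = battery_step(50.0, 10.0, COEFFS)    # charging at 10 kW
soc_drained = battery_step(50.0, -10.0, COEFFS)   # discharging at 10 kW
```

Since the model is linear in the features 1, p and max(0, p), the three coefficients can be obtained with ordinary linear least squares on the measured pairs of active power and SoC change.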

RL environment
In RL, an agent learns a control policy through interaction with an environment. Let $\mathcal{X}$ be the state space and $\mathcal{A}$ the action space, and let $x_t \in \mathcal{X}$ and $a_t \in \mathcal{A}$ be the state and the action at time $t$, respectively. Then the environment, denoted by $\mathcal{E}$, is a mapping $\mathcal{E}: \mathcal{X} \times \mathcal{A} \to \mathcal{X} \times \mathbb{R} \times \{\text{true}, \text{false}\}$ with $(x_t, a_t) \mapsto (x_{t+1}, r_t, d_t)$, where $r_t \in \mathbb{R}$ is the reward received at time $t$ and $d_t$ is the boolean value which indicates if the current episode is over. In this work, we trained our agents in an episodic framework, with a fixed episode length of $N := 48$. With one timestep corresponding to 15 min, this corresponds to an episode length of 12 h. The episode termination indicator $d_t$ is thus true if $t = N$ and false otherwise.
In our case, we naturally use a transition model $m: \mathcal{X} \times \mathcal{A} \to \mathcal{X}$ with $(x_t, a_t) \mapsto x_{t+1}$, which corresponds to the form of our room model. All we additionally need to build the RL environment is a reward function $R: \mathcal{X} \times \mathcal{A} \times \mathcal{X} \to \mathbb{R}$. The reward function returns the reward $r_t = R(x_t, a_t, x_{t+1})$ that the agent gets when the selected action $a_t$ leads to a transition of the environment from state $x_t$ to the next state $x_{t+1}$. The general objective of any RL agent is to maximize the cumulative reward. Therefore, if one wants to minimize a certain cost function, one possibility is to use the negative of the cost as reward.
In the following sections, we define the environment of our particular problem using the previously described room temperature and EV (dis-)charging models.

Room temperature environment
The model (3) can predict all the variables needed to control the room temperature and is thus used as the environment in our case. We therefore use the state space of the model as the state space of the RL environment and the space of valve states as the action space, since the valve state is what can be controlled directly. The action at time $t$ is defined as the valve state applied from time $t$ to time $t+1$.
To initialize the environment in each episode, we sample an initial condition from the historical data in the database and then use the model to simulate the behaviour of the room under the agent's policy for the length of the episode. This episodic framework allows us to control the errors of the model, since we know how well it performs. Further, to incorporate stochasticity, a disturbance term $\epsilon(t)$ is added to the output of the deterministic model. This is assumed to help the agent find a policy that is robust to disturbances in the model. Denoting the deterministic room model by $m$, the state by $x_t$ and the action by $a_t$, the evolution of the states in the room temperature environment is thus defined as:

$$x_{t+1} = m(x_t, a_t) + \epsilon(t).$$

The disturbance itself is modelled by an auto-regressive (AR) process that was fitted to the residuals of the NN model. This ensures that the disturbance is realistic, i.e. similar to what was seen in the past data. The reward of the agent controlling the room temperature is defined as the negative of the energy usage plus the weighted comfort penalty:

$$r_t = -\left(E_t + \lambda \cdot \mathrm{pen}(T_{t+1})\right),$$

where we defined $E_t := a_t \cdot |h^{in}_t - h^{out}_t|$, with $h^{in}_t$ and $h^{out}_t$ the temperatures of the heating water flowing into and out of the room, $T_{t+1}$ is the room temperature, and $\mathrm{pen}$ denotes the penalty function for room temperatures that are outside the comfort bounds. The parameter $\lambda > 0$ determines the weight of the temperature bound violation compared to the energy usage. The penalty function is defined as follows:

$$\mathrm{pen}(T) := \max\{0,\; T_{\min} - T,\; T - T_{\max}\}.$$

Note that this function is always non-negative and increases linearly as $T \to \pm\infty$ as soon as the temperature leaves the defined comfort bound $[T_{\min}, T_{\max}]$.
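The reward, penalty and disturbance described above can be sketched as follows. This is an illustration under stated assumptions rather than the implementation used in this work: the energy term is taken as the valve state times the absolute difference of the water temperatures, the penalty is zero inside the comfort bounds and linear outside, and the disturbance is a first-order AR process (the order of the fitted AR process is not specified here). All names and numeric values are hypothetical.

```python
import random


def comfort_penalty(room_temp, t_min, t_max):
    """Zero inside [t_min, t_max], growing linearly with the distance outside."""
    return max(0.0, t_min - room_temp, room_temp - t_max)


def room_reward(valve, h_in, h_out, room_temp, t_min, t_max, weight):
    """Negative cost: energy proxy plus weighted comfort violation."""
    energy = valve * abs(h_in - h_out)
    return -(energy + weight * comfort_penalty(room_temp, t_min, t_max))


def ar1_disturbance(n_steps, phi, sigma, rng):
    """Sample an AR(1) disturbance sequence (assumed order, for illustration)."""
    eps, sequence = 0.0, []
    for _ in range(n_steps):
        eps = phi * eps + rng.gauss(0.0, sigma)
        sequence.append(eps)
    return sequence


# One 12 h episode of disturbances at 15 min resolution (48 steps).
disturbance = ar1_disturbance(48, phi=0.9, sigma=0.05, rng=random.Random(0))
reward_cold = room_reward(0.5, 30.0, 26.0, 21.0, 22.0, 24.0, weight=2.0)
```

A larger `weight` makes the agent favour comfort over energy savings, so the learnt policy depends directly on this choice.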

EV battery environment
To build the RL environment for the EV battery, the battery model (5) described in Section 3.2.5 is used to compute the next SoC from the current SoC and the applied active power. The SoC of the battery at a given time $t$, $s_t \in \mathcal{S}$, is used as the state of the environment and the space of the applied active power $\mathcal{P}$ is used as the action space, with the action defined as the active power $a_t := p_{t+1} \in \mathcal{P}$. Note that the subscripts do not match since we defined $p_{t+1}$ as the active power applied from $t$ to $t+1$, but the action is chosen at time $t$.
Besides restricting the active power, we also want to restrict the SoC of the battery to lie within a certain range. Since the battery model learnt from the data is piece-wise linear and strictly increasing, it can be inverted and used to build a fallback controller. We implemented two functions in the fallback controller. First, the fallback controller prevents the SoC from leaving the previously defined safety range, [20.0%, 80.0%]. The actions chosen by the agent are not used directly but are clipped by a safety function to the range required for the constraints to be fulfilled. More details on how this function is defined can be found in Appendix B.3. The constrained action, rather than the raw action chosen by the agent, is then fed to the learned battery model. Second, the fallback controller ensures that a specified SoC is reached at a desired future time $t_{goal}$ by forcing the battery to charge at high power when the SoC is too low as $t_{goal}$ approaches. This makes it easy to build an environment for RL: we can choose the reward as the negative active power applied per timestep and we do not need additional penalty terms in the reward for SoCs outside of the given bounds or for not reaching the SoC goal at time $t_{goal}$. In this way, we avoid choosing a heuristic factor for balancing the energy used and the SoC constraint violation (see Appendix B for details on the SoC constraint violation). Figure 5 shows how the resulting environment behaves under two different heuristic agents that apply a constant action. One is discharging and the other is charging at a constant rate. Note that in this case, we chose $t_{goal}$ as the end of the episode, i.e. $t_{goal} := 48$. One can see that the agent that constantly wants to discharge arrives at the minimum SoC after a few steps and needs to charge the battery at full capacity when approaching the end of the episode.
The safety controller built into the environment prevents the SoC from falling below the minimum and charges the battery before the end of the episode, even if the agents continue to discharge.

Joint room temperature and EV battery environment
The joint environment combines the room and the battery environments. The action space is the Cartesian product of the two action spaces, and the state space is combined in the same way. As both subsystems evolve independently, we simply use equations (6) and (10) to compute their next states, which we then concatenate to yield the next state of the joint system. Since the reward was one-dimensional in both cases, we combined the two in a weighted sum in which the energy terms are multiplied by $c(t)$, a suitable energy price function that may vary over the course of a day but is the same for different days. Note that, compared to the room temperature environment, we are no longer interested in energy minimization but in price minimization, while maximizing thermal comfort remains an objective. Note also that a coefficient is introduced in this sum to balance out the consumption of the battery and the room, which have different scales.
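One possible shape of such a joint reward is sketched below. The exact formula is not reproduced here, so everything in this block is an assumption made for illustration: the two-tariff price profile, its switching times, the tariff levels, and the way the balancing coefficient `balance` and the comfort weight enter the sum are all hypothetical.

```python
STEPS_PER_DAY = 96  # 15 min timesteps


def two_tariff_price(step, high=1.0, low=0.5, high_start=32, high_end=80):
    """Hypothetical price profile: high tariff during the day, low at night."""
    step_of_day = step % STEPS_PER_DAY
    return high if high_start <= step_of_day < high_end else low


def joint_reward(room_energy, battery_power, comfort_pen, step,
                 comfort_weight, balance):
    """Negative of the priced energy use plus the weighted comfort penalty."""
    # `balance` rescales the battery power to the scale of the room energy.
    priced_energy = two_tariff_price(step) * (room_energy + balance * battery_power)
    return -(priced_energy + comfort_weight * comfort_pen)


# Charging at 10 kW during the high tariff with a comfort violation of 1.0.
example_reward = joint_reward(2.0, 10.0, 1.0, step=40,
                              comfort_weight=3.0, balance=0.1)
```

With a negative `battery_power` (discharging), the priced energy term decreases, which is how such a reward would encourage the agent to use the stored energy when the tariff is high.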

DRL algorithm
In this work, we used the Deep Deterministic Policy Gradient (DDPG) algorithm [38]. It is model-free, off-policy, and uses an actor-critic setting. Unlike standard Q-learning, it naturally handles continuous state and action spaces, which was one of the main reasons this algorithm was chosen. This choice was also motivated by previous work using this algorithm, for example in [70][71][72][73]. An implementation of DDPG is available in the Keras-RL library [75], which is based on the Python deep learning library Keras [74].
Four neural networks are used within the DDPG algorithm: an actor (taking actions) and a critic network (evaluating these actions) and corresponding target networks for each of them. Note that the actor and its target network have the same architecture but different weights. The same applies to the critic and its target network. In our case, a fully connected neural network with two layers of 100 units and the Rectified Linear Unit (ReLU) activation function was used for both the actor and the critic. To perturb the actions chosen by the actor network with exploration noise, an Ornstein-Uhlenbeck process (see e.g. [76]) was used. As for the RNN training in the modelling section, we used the ADAM optimizer [64] to update the parameters of the neural networks. The discount factor was fixed to 0.99. Note that a few more hyperparameters, like the learning rate for the ADAM optimizer and the number of training episodes, were adjusted manually. This could be avoided using automatic hyperparameter tuning, as was done in the case of the neural network models in Section 3.2.4.

Figure 6: Piece-wise linear EV battery charging/discharging model (legend: piece-wise linear fit, measurements).
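The Ornstein-Uhlenbeck exploration noise mentioned above can be sketched with a simple Euler discretization. This is a generic illustration, not the Keras-RL implementation: the class name, the default parameters (`theta`, `sigma`, `dt`) and the clipping range of the action are assumptions.

```python
import math
import random


class OrnsteinUhlenbeckNoise:
    """Temporally correlated noise that reverts to the mean `mu`."""

    def __init__(self, theta=0.15, mu=0.0, sigma=0.2, dt=1.0, seed=None):
        self.theta, self.mu, self.sigma, self.dt = theta, mu, sigma, dt
        self.rng = random.Random(seed)
        self.x = mu

    def reset(self):
        """Restart the process at its mean, e.g. at the start of an episode."""
        self.x = self.mu

    def sample(self):
        # Mean-reverting drift plus a Gaussian increment scaled by sqrt(dt).
        drift = self.theta * (self.mu - self.x) * self.dt
        diffusion = self.sigma * math.sqrt(self.dt) * self.rng.gauss(0.0, 1.0)
        self.x += drift + diffusion
        return self.x


def noisy_action(action, noise, low=0.0, high=1.0):
    """Perturb the actor output and clip it to the admissible action range."""
    return min(max(action + noise.sample(), low), high)


noise = OrnsteinUhlenbeckNoise(seed=0)
explored = [noisy_action(0.5, noise) for _ in range(48)]
```

Compared to independent Gaussian noise, consecutive samples are correlated, which tends to produce smoother exploration in systems with slow dynamics such as room temperature.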

Results
In this section, the results of different elements of the proposed data-driven DRL-based control learning pipeline are presented. First, the evaluations of the room temperature and bidirectional EV (dis-)charging models are shown and analysed. Then, the simulation results of applying the DRL algorithm to the room temperature control are illustrated, followed by the results on the joint control of the room temperature and EV (dis-)charging operations. Finally, the experimental results demonstrating the DRL agent applied to the real building are presented.

Evaluation of the EV battery model
The piece-wise linear EV battery charging/discharging model, together with the real data collected at NEST used for fitting, can be seen in Fig. 6.
The 6 h ahead SoC prediction using the EV battery model described in Section 3.2.5 is shown in Fig. 7a. Note that the ground truth is shown for comparison and was not used to fit the model. We also performed a more detailed analysis of the prediction performance of the battery model by analysing the mean absolute error (MAE) and the maximum absolute error for different numbers of prediction steps, up to a 12 h prediction interval (Fig. 7b). The prediction captures the dynamics very well, with an MAE of the SoC of less than 0.75% after 6 h. On average, after 12 h, the prediction is less than 1% away from the true SoC.

Evaluation of the weather model
We compare two methods for the weather model: a linear model and a recurrent neural network model. As a linear model, we chose a 5-fold cross-validated multi-task Lasso estimator from scikit-learn [77]. For the RNN, we used the same configuration as the other RNNs in this study (see Sec. 3.2.4). Both models used the same inputs to make the predictions, i.e. data from the previous 19 steps. Further, we clipped the irradiance at 0 in both cases for a fair comparison. Note that this makes the model previously described as linear actually only piece-wise linear. Fig. 8 shows how the weather model performs when evaluated on the test set for one specific initial condition. It can be observed that the piece-wise linear model makes smoother predictions and diverges faster than the RNN model. The quality of the predictions drops as the horizon grows and, overall, the RNNs provide better predictions, even though the linear model is comparable on short horizons.
Note that, by investing more effort into the piece-wise linear model, e.g. through manual feature engineering, one might obtain a linear model that outperforms the RNN. On the other hand, as the dataset grows with time, it is easy to increase the size of the RNN to make it more powerful, which is not the case for the linear model. This is another reason the RNN was favoured.

Evaluation of room temperature model
The performance of the room temperature model is shown in Fig. 9. A quantitative evaluation of the model is shown in Fig. 9b, where the temperature prediction is done over a whole week. The MAE and maximum absolute errors are 0.5°C and 2.3°C after 12 h, respectively. As this RNN model showed a satisfactory fit, we selected it as an environment to train the DRL agent.
Note that the quality of the room model influences the final control performance. One known issue is that black-box models, i.e. non-physics-based models, do not extrapolate well to unseen data. In our case, the room temperature model could output physically inconsistent behaviour in the worst case. For example, on a winter day with low solar irradiance and the heating turned off, a black-box model might predict an increase of the room temperature. Such physically inconsistent outputs of the room temperature model can influence the control policy search negatively, as the DRL agent could learn that it could heat the room by closing the heating valves. Therefore, the more physically consistent the behaviour of a room model is on the test data, the better the expected control performance of the DRL agent. However, a detailed analysis of the physical inconsistency of the room temperature model for some input data is outside the scope of this work.

Evaluation of the DRL agent for room temperature control
We evaluated the DRL agent for both heating and cooling seasons, either by taking two different agents, one for each season, or by letting a single agent learn the global control policy. We tested both approaches and obtained better results for the separate agents. The reason for the better results of the heating-only or cooling-only agents is that the separation makes the problem less complex. In that way, the DDPG agent is able to find a better control policy.
It actually turned out that, for heating cases only, the optimization of the DDPG agent was much harder than in the case of searching for a global control policy and required some manual hyperparameter tuning to perform well. Therefore, we decided to switch to a reference tracking mode by setting both comfort bounds to the same value, $T_{\min} = T_{\max} = 22.5$°C. This makes it easier for the agent to know what actions are beneficial for temperature control, since the temperature bound violation is exactly zero only when the room temperature equals 22.5°C. As soon as the room temperature differs, the comfort violation increases and the agent is penalized. We trained the RL agent for 20'000 steps and the evaluation is shown in Fig. 12, where the agent is compared to the following three controllers: one always opening the valves, one always closing them, and a rule-based bang-bang controller without hysteresis, which is a standard industrial controller. One can observe that the DDPG agent achieves on average 17% energy savings and 19% better comfort satisfaction compared to the rule-based controller.
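For reference, the three baseline controllers used in this comparison can be written in a few lines. This is a minimal sketch for the heating case, assuming a valve state of 1.0 means fully open and 0.0 fully closed; the function names are hypothetical.

```python
SETPOINT = 22.5  # degC, reference temperature in tracking mode


def valves_open(room_temp):
    """Baseline that always heats."""
    return 1.0


def valves_closed(room_temp):
    """Baseline that never heats."""
    return 0.0


def bang_bang(room_temp, setpoint=SETPOINT):
    """Rule-based controller without hysteresis: heat fully below the setpoint."""
    return 1.0 if room_temp < setpoint else 0.0


actions = [bang_bang(temp) for temp in (21.0, 22.4, 22.5, 23.0)]
```

Because the bang-bang rule only reacts once the temperature has crossed the setpoint and always heats at full power, it tends to over- and undershoot, which is the behaviour a learnt policy can improve on by opening the valves earlier and only partially.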
A simulated case example is shown in Fig. 10. The DDPG agent can accurately control the room temperature by starting to open the valves before the RB controller does, i.e. before the temperature reaches the setpoint, and by opening them only a little to avoid overshooting. One can observe that the DDPG agent obtained the least comfort violations while using less energy than the rule-based agent. The quantitative analysis of this example shows 36% energy savings and 13% better comfort (see Fig. 11).

Evaluation of the joint room heating and EV charging control
As in the previous case of room temperature control, we again use three controllers as a comparison for the evaluation:
• Valves Open, Charge: This agent always leaves the valves open, as the Valves Open agent in the previous setting, but additionally always charges the battery at full power instantaneously upon arrival of the EV until it is full.
• Valves Closed, Discharge: This agent does the opposite of the previous one, i.e. it never opens the valves and constantly tries to discharge the battery at full power.
• Rule-Based: This agent does the same as the previous Rule-Based agent for the heating and constantly charges the battery at full power.
The performance of a MIMO DDPG agent trained on the joint environment is shown in Fig. 15. For the room temperature control, we used the same parameters as in Section 4.1.4 and we considered only heating cases. While again being able to reduce the comfort violations and the heating energy usage compared to the RB agent, the DDPG agent also achieved lower costs. As expected, the agent that never turns the heating on and discharges the battery uses the least energy, which also resulted in the lowest costs. Additionally, its comfort violations are less pronounced than in the case of constant heating (valves kept constantly open), but still worse than in both the RB and the DDPG controlled cases.
A simulated example is shown in Fig. 13, where the DDPG agent manages to regulate the comfort better by using the energy stored in the EV battery. Compared to the RB controller for heating, which heats at the maximum power while the temperature is lower than the reference temperature of 22.5°C, the DDPG controller actively regulates the valves so that better tracking is achieved. In terms of EV battery management, the energy from the EV battery is immediately used at the beginning of the interval until the minimum level of 20% of SoC is reached, which makes sense due to the lower electricity tariff at this time. Then, before the start of the next trip, the fallback controller charges the EV battery to the required SoC. The DDPG control output is shown as a solid red line, while the constrained DDPG output is shown as a dashed light red line. The quantitative analysis of this DDPG agent is shown in Fig. 14, where it achieves 63% energy savings, 71% better comfort, and 53% cost savings compared to two RB controllers, for a certain weighting factor between the energy cost savings and comfort satisfaction. Note that this result is specific to the weighting factor used in the reward function. On average, when tested over 10'000 historical intervals, the MIMO DDPG controller achieved 12% better comfort satisfaction, 11% energy savings, 63% less EV charging at home, and 42% energy cost savings compared to two standard RB controllers, for the same weighting factor.

Figure 14: Joint EV charging and room heating control agent evaluation - Statistics of the example plotted in Figure 13.

Experimental results
The DRL control agent, which was obtained in Section 4.1.4 for the heating season and tested in simulation, was applied on the real building, the DFAB HOUSE, in room 471, for two weeks in February 2020. The performance of the DRL controller was compared with the performance of the room temperature bang-bang RB controller implemented in the same room over a subsequent week. The time-series results are shown in Fig. 16. Both controllers aim at the setpoint of 22.5°C. Due to the chosen weighting factor emphasizing energy savings, the DDPG controller uses less energy, at the cost of comfort, keeping the temperature slightly under the setpoint (−0.3°C on average). On the other hand, the RBC stays closer to the setpoint (−0.1°C on average), but it uses more energy.

Figure 15: Joint EV charging and room heating control agent evaluation over a total of 10'000 steps.

As the ambient conditions were naturally different for both controllers, we compared them using the Heating Degree Days (HDD) as a normalization variable. By definition, the HDD of a given day represents how far the daily average temperature lies below 18°C [78]. In other words, higher heating degree days mean a lower average outside temperature, for which we naturally expect more energy to be needed. The outside temperature was indeed approximately 4°C lower during the DDPG experiment, which forced the controller to use more energy and made it hard to compare both experiments without a normalization procedure.
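The heating degree days used for this normalization can be computed as in the following sketch, which assumes the common definition of the daily HDD as the positive part of the difference between the 18°C base temperature and the daily mean outside temperature. The function names are hypothetical and the sample temperatures are made-up placeholders, not measurements.

```python
def heating_degree_days(daily_mean_outside_temp, base_temp=18.0):
    """HDD of one day: how far the daily mean lies below the base temperature."""
    return max(0.0, base_temp - daily_mean_outside_temp)


def daily_hdd_from_samples(outside_temps, base_temp=18.0):
    """HDD of one day from equally spaced outside temperature samples."""
    daily_mean = sum(outside_temps) / len(outside_temps)
    return heating_degree_days(daily_mean, base_temp)


# Placeholder values only: a cold day and a warm day.
hdd_cold = daily_hdd_from_samples([4.0, 8.0])    # mean 6 degC
hdd_warm = heating_degree_days(20.0)             # above the base temperature
```

Plotting the daily energy use against this quantity makes days with different weather comparable, since a day that is 4°C colder on average has an HDD that is 4 higher.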
The daily energy used by the DDPG and the RBC during five experimental days each is plotted against the corresponding HDD in Fig. 17. We can see that the DDPG controller outperforms the RBC: at HDD levels of around 7 and 12.5, the energy savings are 28% and 26%, respectively. On the other hand, we can also observe that while both controllers used between 6 and 8 kWh during three days, the average outside temperature was much lower (about 4°C colder) during the DDPG experiment. In other words, the DDPG algorithm was able to use the same energy budget and maintain similar comfort levels to the RB approach but in harsher conditions.
Additionally, the points in Figure 17 exhibit an approximately linear behaviour. To leverage that fact, we fitted a linear regression to the data of each controller to capture its global behaviour. This allowed us to clearly picture the gap between the RB algorithm and our proposed method, which on average saves around 25-30% energy.

Conclusion and Discussion
In this paper, we introduced a fully data-driven DRL-based method to obtain optimal control policies for MIMO building control problems. We demonstrated the method on the joint control of room temperature and bidirectional EV (dis-)charging to minimise the energy consumption and maximise the occupants' thermal comfort while ensuring that enough energy is stored in the EV upon leaving for the next trip. We demonstrated the method on a real building case study, the DFAB HOUSE at Empa Duebendorf in Switzerland, with less than a year of past operational data available.
We show in simulation that the trained DRL agents are capable of saving on average 17% energy over the whole heating season while providing 19% better comfort satisfaction compared to a classical rule-based controller. When an EV is additionally connected to the building and two-tariff electricity pricing is considered, the DRL agents can successfully leverage its battery and decrease the overall cost of electricity. The obtained DRL control agent achieved 12% better comfort satisfaction, 11% energy savings, and 42% energy cost savings compared to two standard RB controllers, one controlling the room temperature and another controlling the bidirectional EV (dis-)charging. This result is specific to the weighting factor used in the DRL algorithm to balance the energy cost savings and comfort satisfaction. Finally, we demonstrate a successful transfer of the learnt DRL policy from simulation onto the actual building, achieving up to 30% energy savings while maintaining similar comfort compared to a conventional RB room temperature controller over three weeks during the heating season.
The data-driven DRL-based control method proposed in this work is a viable approach to satisfy all the BA industry requirements for control of modern buildings, as defined in Section 1.1. We demonstrated that this method could match the first three BA industry requirements. In terms of the fourth requirement on transferability (and usability) for similar control problems in other buildings, we can argue in favour of the developed method: it is suitable for use on any other building to obtain a room temperature controller or a joint (MIMO) controller of room temperature and bidirectional EV charging. One can reuse the same NN and DRL architectures to obtain the control policies.
We applied the same methodology to another room at the DFAB HOUSE, and we obtained comparable results. We believe that this method has a strong potential to work for any building or room, and could thus be a stepping stone towards obtaining transferable model-free data-driven room temperature control policies. As such, we also believe it to be valuable for the BA industry due to its potential for transferability, as it minimises the engineering efforts to obtain a custom-tailored controller for each room and building of interest while optimizing the energy savings and occupant comfort satisfaction.
However, we still need to address a few points before this method can achieve widespread transferability to any building or room.
The availability and quality of the building model is the first point to be addressed and explored. As demonstrated in this paper, the building model could be built as an RNN model, which could be directly applied to another room with the same setting, i.e. the same HVAC equipment and the same number of sensors and actuators. However, rooms generally differ in terms of HVAC equipment and the number of sensors and actuators. Thus, to model a different room, a certain engineering effort needs to be invested into linking the new inputs and outputs to the RNN model and fitting it. This process could be simplified and even automated if a linked, i.e. semantic, database of a building exists.
Secondly, the availability of past building operational data is a requirement to apply our black-box pipeline. While this may not be an issue for existing buildings with operational data stored in databases, it could be an issue for new or retrofitted buildings. A potential solution to this could be to apply transfer learning to the modelling part of this method and learn the dynamics of a new building with fewer data. Similarly, transfer learning could be applied to "jump-start" the learning of the control policy for another building, given already existing proven policies in other buildings. This is also directly related to the last BA industry requirement on (self-)adaptability and continuous commissioning of building controllers. Transferring a controller to another, unseen building, or re-applying it to the same building after a retrofit or whenever a change of dynamics is observed, is in essence a very similar problem and a very interesting direction for future work.

Table 2: Hyperparameters of the RNN

B.1. Minimum and maximum SoC constraints
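We require the SoC of the battery to lie within predefined bounds $[s_{\min}, s_{\max}]$ at any time. Assuming we start from $s_t \in [s_{\min}, s_{\max}]$, it suffices to show that the next SoC, $s_{t+1}$, given the previous SoC, stays within the bounds, and then apply the argument recursively. For the maximum constraint, we have to make sure that:

$$s_{t+1} = s_t + \alpha_0 + \alpha_1 p_{t+1} + \alpha_2 \max\{0, p_{t+1}\} \le s_{\max}.$$

Let us define the following helper function:

$$h(p) := \begin{cases} \alpha_1 + \alpha_2 & p > 0 \\ \alpha_1 & \text{else} \end{cases}$$

Note that it is positive for all values of $p$ because of the properties of the coefficients $\alpha_i$. Now we can rewrite the equation above as:

$$h(p_{t+1}) \, p_{t+1} \le s_{\max} - s_t - \alpha_0.$$

To get a bound for $p_{t+1}$ from this equation that does not contain $p_{t+1}$ itself, we need to make a case distinction:
• Case 1: $s_{\max} - s_t - \alpha_0 > 0$. This means that the SoC at the next step will be lower than the maximum SoC when $p_{t+1} = 0$. Therefore, we can discharge as much as we want, i.e. we do not need to handle the case $p_{t+1} < 0$, so we only look at $p_{t+1} > 0$, and we have $h(p_{t+1}) = \alpha_1 + \alpha_2$.
• Case 2: $s_{\max} - s_t - \alpha_0 < 0$. This means that the SoC at the next step will be higher than the maximum SoC when $p_{t+1} = 0$. Therefore, we need to discharge in any case, i.e. $p_{t+1} < 0$, which means $h(p_{t+1}) = \alpha_1$.
Putting the two cases together, we get the following bound on the active power $p_{t+1}$:

$$p_{t+1} \le \frac{s_{\max} - s_t - \alpha_0}{h(s_{\max} - s_t - \alpha_0)}. \tag{17}$$

Note that in the edge case $s_{\max} - s_t - \alpha_0 = 0$ both cases return the same value, i.e. the bound is continuous. Applying the same chain of reasoning to the minimum constraint, one can derive the following:

$$p_{t+1} \ge \frac{s_{\min} - s_t - \alpha_0}{h(s_{\min} - s_t - \alpha_0)}. \tag{18}$$

Note that this case is using the exact same function $h$.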

B.2. Achieving the goal SoC
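We want to ensure that the battery is charged to at least some desired SoC $s_{goal}$ at a given time $t_{goal}$. Assume we are now at time $t$, i.e. the SoC is $s_t$, and that we can charge with a maximum power of $p_{\max}$. With the battery model $\Delta s = \alpha_0 + \alpha_1 p + \alpha_2 \max\{0, p\}$, the SoC can then increase by at most $\alpha_0 + (\alpha_1 + \alpha_2)\, p_{\max}$ per timestep. Hence, at the next timestep, the SoC has to be at least

$$s_{t+1} \ge s_{goal} - (t_{goal} - t - 1)\left(\alpha_0 + (\alpha_1 + \alpha_2)\, p_{\max}\right),$$

since otherwise the goal can no longer be reached in the remaining $t_{goal} - t - 1$ timesteps. Note that, if we start with an SoC that is already too low to achieve the goal SoC, the bounds will require an active power $p_{t+1} > p_{\max}$, which is not possible, and $p_{\max}$ would be applied.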

B.3. Constraining battery controller
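Now we can finally combine all the previous constraints to define the controller that constrains the active power for the battery charging and discharging. We consider the following constraints:
• Direct constraints: $p_{\min} \le p_t \le p_{\max}$
• SoC constraints: $s_{\min} \le s_t \le s_{\max}$
• Charging constraint: $s_t \ge s_{goal}$ for $t = t_{goal}$
Note that we still use $a_t := p_{t+1}$. Using the formulas defined before, the last two constraints can be converted to constraints on $p_{t+1}$ as shown in equations (17), (18) and (19). Combining these constraints with the direct constraints on $p_{t+1}$ and always choosing the tightest one yields an admissible interval $[p^{lo}_{t+1}, p^{hi}_{t+1}]$ for the active power. Finally, we can define our safety controller, which ensures that the chosen action, i.e. the active power, lies in the appropriate range:

$$\mathrm{safe}(a_t) := \mathrm{clip}_{[p^{lo}_{t+1},\, p^{hi}_{t+1}]}(a_t),$$

where $\mathrm{clip}$ is the clipping function defined as follows:

$$\mathrm{clip}_{[a, b]}(x) := \begin{cases} a & x \le a \\ b & x \ge b \\ x & \text{else} \end{cases} \tag{22}$$

Note that the function $\mathrm{safe}(\cdot)$ implicitly depends on a lot of parameters, i.e. $s_t$, $t_{goal}$, $s_{goal}$, $s_{\min}$, $s_{\max}$, $p_{\min}$, $p_{\max}$, and the parameters of the model $\alpha_i$, and not only on $a_t$.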
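The constraining controller of this appendix can be sketched as follows. The sketch assumes the piece-wise linear battery model with a change in SoC of a0 + a1·p + a2·max(0, p) per timestep and inverts it to obtain the power bounds; the function names, the default limits and the coefficients used in the example are hypothetical.

```python
def slope(x, a1, a2):
    """Helper h: charging slope for positive arguments, discharging slope otherwise."""
    return a1 + a2 if x > 0 else a1


def power_for_soc(target_soc, soc, coeffs):
    """Active power that moves the SoC from `soc` to `target_soc` in one step."""
    a0, a1, a2 = coeffs
    needed = target_soc - soc - a0
    return needed / slope(needed, a1, a2)


def safe_power(power, soc, t, coeffs, s_min=20.0, s_max=80.0,
               p_min=-100.0, p_max=100.0, s_goal=None, t_goal=None):
    """Clip the requested power so that all constraints stay satisfied."""
    a0, a1, a2 = coeffs
    upper = min(p_max, power_for_soc(s_max, soc, coeffs))
    lower = max(p_min, power_for_soc(s_min, soc, coeffs))
    if s_goal is not None and t_goal is not None and t < t_goal:
        # SoC needed at t + 1 so that charging at p_max still reaches the goal.
        max_gain = a0 + (a1 + a2) * p_max
        needed_next = s_goal - (t_goal - t - 1) * max_gain
        lower = max(lower, power_for_soc(needed_next, soc, coeffs))
    # If the goal is unreachable, fall back to the largest admissible power.
    lower = min(lower, upper)
    return min(max(power, lower), upper)


COEFFS = (0.0, 0.05, 0.0)  # hypothetical: 0.05 % SoC per kW and timestep
near_full = safe_power(100.0, 79.0, t=0, coeffs=COEFFS)
near_empty = safe_power(-100.0, 21.0, t=0, coeffs=COEFFS)
last_step = safe_power(-100.0, 20.0, t=47, coeffs=COEFFS, s_goal=25.0, t_goal=48)
```

In the last example the agent asks to discharge at full power one step before the goal time, and the controller overrides this with full-power charging, which mirrors the behaviour of the discharging agent near the end of an episode.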

C. Data preparation

C.1. DFAB data
The following variables are measured inside the DFAB unit and are processed as follows, before their usage in the data-driven learning process.
• Room temperature: The room temperature contained a few data points at exactly 0°C, which were removed. Furthermore, sequences of constant temperature that lasted for at least one day were removed, too. In the next step, spikes in the temperature of a magnitude of at least 1.5°C were extracted and deleted. Finally, we applied Gaussian smoothing with a standard deviation of 5.0.
• Water temperatures ($h^{in}$, $h^{out}$): The temperatures of the heating water flowing into and out of the rooms were processed by removing all data points that did not lie in the range [10.0°C, 50.0°C] and then smoothing with a Gaussian filter with a standard deviation of 5.0.

C.2. Weather data
Outside temperature and solar irradiance are measured by the weather station at NEST. They were processed in the following way.
• Outside temperature: First, we remove values that are constant for more than 30 minutes. In the next step, we fill values that are missing by linear interpolation between the last and the next known value, but only if the time interval of missing values is less than 45 minutes. Finally, we smooth the data with a Gaussian filter with a standard deviation of 2.0.
• Irradiance: Since the irradiance data series naturally contains values that are constant for a long time, e.g. zero at night, we only remove a series of data points if they are constant for at least 20 h. Then again we fill missing data points by interpolation and smooth the data as was done with the temperature data.
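The first two cleaning steps listed above, removing constant sequences and filling short gaps by linear interpolation, can be sketched as follows. This is an illustrative sketch on plain Python lists with missing values marked as `None`; the function names are hypothetical, the thresholds are given in numbers of samples rather than minutes, and the final Gaussian smoothing step is omitted.

```python
def remove_constant_runs(values, min_len):
    """Mark runs of at least `min_len` identical consecutive values as missing."""
    cleaned = list(values)
    n = len(values)
    start = 0
    while start < n:
        end = start
        while (end + 1 < n and values[start] is not None
               and values[end + 1] == values[start]):
            end += 1
        if end - start + 1 >= min_len:
            for i in range(start, end + 1):
                cleaned[i] = None
        start = end + 1
    return cleaned


def interpolate_gaps(values, max_gap):
    """Linearly fill interior gaps of at most `max_gap` missing samples."""
    filled = list(values)
    n = len(filled)
    i = 0
    while i < n:
        if filled[i] is not None:
            i += 1
            continue
        j = i
        while j < n and filled[j] is None:
            j += 1
        gap = j - i
        # Only fill gaps that have a known value on both sides.
        if i > 0 and j < n and gap <= max_gap:
            left, right = filled[i - 1], filled[j]
            for k in range(i, j):
                filled[k] = left + (k - i + 1) / (gap + 1) * (right - left)
        i = j
    return filled


raw = [1.0, 2.0, 2.0, 2.0, 3.0]
with_gaps = remove_constant_runs(raw, min_len=3)
repaired = interpolate_gaps(with_gaps, max_gap=3)
```

Keeping the two steps separate allows different thresholds per signal, e.g. a much longer constant-run threshold for the irradiance, which is legitimately constant at night.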

C.3. EV battery data
The data of the battery consists of the state of charge (SoC) and the active power used to charge or discharge the battery. The two time series were processed as follows.
• State of charge ($s$): Since the SoC cannot lie outside of the interval [0.0%, 100.0%], we remove all values that lie outside that range, including the boundary values. Further, if the data is exactly constant for at least 24 h, we assume something went wrong with the data collection and remove the data of that time interval.
• Active power ($p$): In this case, we do not have strict boundaries for the values, so we only remove values where the series was constant for at least 6 h.

D. Implementation
The work was implemented in Python version 3.6.6 and is not compatible with versions 3.5 and lower since f-strings were used. The main libraries that were used are listed in Table 3. Note that the most recent version of all libraries was used, except for TensorFlow [79] because of a dependency on another library, Keras-RL [75]. In most cases, the produced code follows PEP 8. The actual code can be accessed at https://github.com/chbauman/MasterThesis. There is also information available on how to run the code.

D.1. Data whitening
As another data processing step, we whitened the data, i.e. it was scaled to have mean 0.0 and variance 1.0 before training the models. This is a standard procedure in machine learning and helps to avoid a bias in the feature importance while also allowing task-independent weight initialization in the neural network training. Since this was done manually, without the use of an existing library, this resulted in a few complications. For example, the reinforcement learning environment took the original actions as input and then had to scale them, feed them to the model and scale the output of the model back to the original domain to get the output for the agent.
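The scaling round trip described above can be sketched as follows. This is a simplified illustration with scalar signals, not the code of this work: the `Scaler` class, the `env_step` function and the identity stand-in for the trained model are hypothetical.

```python
import math
import statistics


class Scaler:
    """Scale a signal to mean 0 and variance 1 and back to the original domain."""

    def __init__(self, data):
        self.mean = statistics.mean(data)
        # Population standard deviation; assumes the signal is not constant.
        self.std = statistics.pstdev(data)

    def scale(self, x):
        return (x - self.mean) / self.std

    def unscale(self, z):
        return z * self.std + self.mean


def env_step(action, action_scaler, output_scaler, scaled_model):
    """Take an action in original units and return the output in original units."""
    scaled_output = scaled_model(action_scaler.scale(action))
    return output_scaler.unscale(scaled_output)


valve_scaler = Scaler([0.0, 0.5, 1.0])
temp_scaler = Scaler([20.0, 22.0, 24.0])
# Identity stand-in for a model that was trained on whitened data.
next_temp = env_step(0.5, valve_scaler, temp_scaler, lambda z: z)
```

Wrapping the scaling inside the environment keeps the agent's interface in physical units, so that actions and observations remain interpretable and the learnt policy can be applied to the real building without further conversion.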